
[feat](compaction) Support ADMIN COMPACT TABLE type='FULL' and enable it in cloud mode #62596

Merged
luwei16 merged 3 commits into apache:master from Yukang-Lian:feat-admin-compact-full-cloud
Apr 21, 2026

Conversation

@Yukang-Lian
Collaborator

Summary

Unify the ADMIN COMPACT TABLE ... WHERE type = '...' SQL entry so that BASE, CUMULATIVE and FULL all work on both the local and cloud (storage-compute split) engines. Before this change, FULL was rejected in FE and cloud mode was blocked outright, forcing users to curl each BE's HTTP API to run manual compaction.

What changes

FE

  • AdminCompactTableCommand: add FULL to CompactionType, relax analyzeWhere, map FULL -> "full" in getCompactionType, and drop the cloud-mode gate so the command dispatches via the normal Agent RPC path.
  • CloudEnv.compactTable: override Env.compactTable to walk partitions/indices/tablets, resolve the primary BE via Replica.getBackendId(), skip tablets whose compute group has no available BE, and raise DdlException if nothing was dispatched.
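
The FE-side mapping described in the first bullet can be sketched as follows. This is a minimal standalone illustration, not the actual Doris source: the class and method names mirror AdminCompactTableCommand, but everything here is hypothetical.

```java
// Hypothetical sketch of the FE mapping: the SQL-level enum gains FULL, and
// getCompactionType emits the lower-case string carried in TCompactionReq.type.
public class AdminCompactTypeMapping {
    enum CompactionType { BASE, CUMULATIVE, FULL }

    static String getCompactionType(CompactionType type) {
        switch (type) {
            case BASE: return "base";
            case CUMULATIVE: return "cumulative";
            case FULL: return "full";  // new mapping added by this change
        }
        throw new IllegalStateException("unreachable");
    }

    public static void main(String[] args) {
        System.out.println(getCompactionType(CompactionType.FULL));
    }
}
```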

BE

  • submit_table_compaction_callback: strictly match base/cumulative/full instead of silently downgrading any non-"base" value to CUMULATIVE_COMPACTION. FULL goes through submit_compaction_task(force=false, eager=true, trigger_method=MANUAL) after setting last_full_compaction_schedule_time, matching the HTTP API default.
  • cloud_submit_table_compaction_callback: new callback mirroring the local one but calling CloudStorageEngine::submit_compaction_task; honors the cloud HTTP convention that base/cumu need sync_delete_bitmap while full does not.
  • agent_server::cloud_start_workers: register the previously stubbed-out SUBMIT_TABLE_COMPACTION worker (drops the // TODO(plat1ko) comment).
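
The strict-match change in the first BE bullet can be contrasted with the old behavior in a small sketch. The real callback is C++ (submit_table_compaction_callback); this Java version uses hypothetical names purely to illustrate the before/after dispatch logic.

```java
// Hypothetical illustration of the BE type dispatch (the real code is C++).
public class CompactionTypeDispatch {
    enum CompactionType { BASE, CUMULATIVE, FULL }

    // Old behavior: anything that was not "base" silently became CUMULATIVE,
    // so a "full" request ran the wrong compaction with no error.
    static CompactionType parseTypeOld(String type) {
        return "base".equals(type) ? CompactionType.BASE : CompactionType.CUMULATIVE;
    }

    // New behavior: match base/cumulative/full exactly and reject anything else.
    static CompactionType parseTypeStrict(String type) {
        switch (type) {
            case "base": return CompactionType.BASE;
            case "cumulative": return CompactionType.CUMULATIVE;
            case "full": return CompactionType.FULL;
            default:
                throw new IllegalArgumentException("unknown compaction type: " + type);
        }
    }
}
```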

Thrift / wire compatibility

TCompactionReq.type remains optional string, so no IDL change. A new FE against an old BE would see "full" silently downgraded (local) or dropped (cloud), so the release order is all BEs upgraded first, then FE. FE rollback can lead BE since the FE simply stops sending "full".

Test plan

  • mvn -pl fe-core -am compile passes.
  • New regression suite regression-test/suites/compaction/test_admin_compact_table.groovy covers base / cumulative / full end-to-end on the local engine, plus negative cases (unknown type, missing WHERE).
  • New regression suite regression-test/suites/cloud_p0/compaction/test_cloud_admin_compact_table.groovy covers the same matrix on the cloud engine.
  • Design document and code reviewed independently by Codex (two rounds).

… it in cloud mode

ADMIN COMPACT TABLE previously only accepted BASE and CUMULATIVE in local
mode and threw Unsupported operation in cloud mode, forcing users to curl
each BE's HTTP API to run manual compaction. This change unifies the FE SQL
entry so that base/cumulative/full all work on both local and cloud.

FE
- AdminCompactTableCommand: add FULL to CompactionType, relax analyzeWhere,
  map FULL->"full" in getCompactionType, drop the cloud-mode gate so the
  command dispatches via the normal path.
- CloudEnv.compactTable: override Env.compactTable to walk partitions/
  indices/tablets, resolve the primary BE via Replica.getBackendId, and
  raise a DdlException if nothing was dispatched (e.g. the current compute
  group has no available BE).

BE
- submit_table_compaction_callback: strictly match base/cumulative/full and
  reject unknown types instead of silently downgrading non-"base" to
  cumulative. Full goes through submit_compaction_task(force=false,
  eager=true, trigger_method=MANUAL) after setting
  last_full_compaction_schedule_time, matching the HTTP API default.
- cloud_submit_table_compaction_callback: new callback mirroring the local
  one but calling CloudStorageEngine::submit_compaction_task; honors the
  cloud HTTP convention that base/cumu need sync_delete_bitmap while full
  does not.
- agent_server::cloud_start_workers: register the previously stubbed-out
  SUBMIT_TABLE_COMPACTION worker (drops the plat1ko TODO).

Thrift is untouched: TCompactionReq.type is already optional string, so the
existing "base"/"cumulative"/"full" wire format remains compatible. This
implies a release order of "all BEs upgraded first, then FE" - a new FE
against an old BE would see "full" silently downgraded (local) or dropped
(cloud).

Tests
- regression-test/suites/compaction/test_admin_compact_table.groovy:
  end-to-end coverage for base/cumulative/full under the local engine plus
  negative cases (unknown type, missing WHERE).
- regression-test/suites/cloud_p0/compaction/test_cloud_admin_compact_table.groovy:
  same matrix under the cloud engine, with a SELECT before each poll to
  force BE to sync rowsets from Meta Service.
@hello-stephen
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

…ack changes

Reflow the new submit_compaction_task calls and the cloud callback's log
messages to satisfy clang-format.
@Yukang-Lian
Collaborator Author

run buildall

@hello-stephen
Contributor

FE UT Coverage Report

Increment line coverage 0.00% (0/42) 🎉
Increment coverage report
Complete coverage report

@hello-stephen
Contributor

BE UT Coverage Report

Increment line coverage 0.00% (0/81) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 53.27% (20297/38101)
Line Coverage 36.76% (191187/520029)
Region Coverage 33.06% (148618/449528)
Branch Coverage 34.14% (64920/190136)

@hello-stephen
Contributor

BE Regression && UT Coverage Report

Increment line coverage 45.24% (38/84) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 73.74% (27512/37311)
Line Coverage 57.47% (297933/518417)
Region Coverage 54.81% (248648/453640)
Branch Coverage 56.32% (107402/190706)

@hello-stephen
Contributor

FE Regression Coverage Report

Increment line coverage 73.81% (31/42) 🎉
Increment coverage report
Complete coverage report

@Yukang-Lian
Collaborator Author

run nonConcurrent

@Yukang-Lian
Collaborator Author

/review

Contributor

@github-actions Bot left a comment


Summary

Requesting changes for two functional blockers in the new cloud FE path and one rolling-upgrade compatibility blocker.

Findings

  1. CloudEnv.compactTable() resolves cloud backends while olapTable.readLock() is held. In cloud mode CloudReplica.getBackendId() can auto-start/wait for a compute group via Meta Service, so this can hold the table metadata lock for seconds or minutes.
  2. CloudEnv.compactTable() now swallows every UserException from backend resolution and degrades it to skip tablet ..., which hides real user-facing failures such as missing USAGE privilege, manual-shutdown compute groups, or missing compute groups.
  3. The new FULL / cloud SQL path is not rolling-upgrade safe. Old local BEs still treat full as cumulative compaction, and old cloud BEs still reject TTaskType::COMPACTION; FE does not surface that submit failure.

Critical Checkpoints

  • Goal of current task / correctness: Partially met. The PR adds FULL and cloud SQL coverage on fully upgraded clusters, but the cloud FE path and mixed-version behavior still have blocking issues. Regression tests cover the happy path only.
  • Scope / minimality: Mostly focused, but the cloud FE override folds backend resolution, auto-start behavior, and dispatch into one locked section, which expands the risk surface.
  • Concurrency / locking: BE changes look consistent with existing compaction queues. FE has a locking issue: CloudEnv.compactTable() performs compute-group resolution under olapTable.readLock(), and that path can wait on external state.
  • Lifecycle / static init: No special lifecycle or static-init problems found in the changed code.
  • Config changes: None.
  • Compatibility / rolling upgrade: Not safe today. New FE can send requests that old BEs either execute as the wrong compaction type (local) or reject silently (cloud).
  • Parallel code paths: I checked both local/cloud SQL paths and the existing HTTP compaction paths. The cloud SQL path mirrors the cloud HTTP BE behavior better than before, but the FE-side error handling and compatibility gates are still incomplete.
  • Special conditions / error handling: The new catch-all UserException handling is too broad and loses actionable user errors.
  • Test coverage: New regression tests are good for fully upgraded local/cloud happy paths and basic syntax errors, but they do not cover cloud failure cases (permission/manual shutdown/no BE) or mixed-version behavior.
  • Test result files: The new .out files are consistent with the new suites.
  • Observability: BE logs are reasonable, but FE currently suppresses the most useful cloud-side error context before it reaches the user.
  • Transaction / persistence / data writes: No new edit-log or storage-format changes. The risk here is dispatch correctness, not persistence.
  • FE-BE parameter passing: Reuses existing TCompactionReq.type string; no new wire field.
  • Performance: The main performance/concurrency concern is repeated backend resolution and potential auto-start wait under the table read lock.
  • Other issues: No additional blockers beyond the points above.

Comment thread fe/fe-core/src/main/java/org/apache/doris/cloud/catalog/CloudEnv.java Outdated
Two follow-ups on the AI review:

1. Do not hold the table read lock while resolving backends.
   CloudReplica.getBackendId() can call into Meta Service and wait for
   compute-group auto-start, which could block unrelated DDL on this
   table for the full wait. Split the method into two passes: collect
   tablet metadata (partition/index/tablet/schemaHash/replica refs)
   under the read lock, then release it and resolve backends outside.

2. Do not swallow UserException as a skip. The previous catch dropped
   real errors (missing USAGE privilege, manually shut down compute
   group, missing compute group, ...) to a LOG.warn, making failed
   ADMIN COMPACT TABLE look like a no-op. Now any UserException raised
   during backend resolution is surfaced to the user as DdlException.

Compile-only verified; existing regression tests cover the happy path.
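
The two-pass shape from point 1 can be sketched as below: snapshot tablet metadata in a short critical section under the table read lock, then do the potentially slow backend resolution with the lock released. All names here (TabletRef, collectTablets, resolveBackend) are illustrative stand-ins, not Doris APIs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical two-pass dispatch: lock held only while copying references,
// released before any work that may wait on external state.
public class TwoPassCompactDispatch {
    static class TabletRef {
        final long tabletId;
        final long replicaId;
        TabletRef(long tabletId, long replicaId) {
            this.tabletId = tabletId;
            this.replicaId = replicaId;
        }
    }

    static final ReentrantReadWriteLock tableLock = new ReentrantReadWriteLock();

    // Pass 1: hold the read lock only long enough to snapshot tablet refs.
    static List<TabletRef> collectTablets() {
        tableLock.readLock().lock();
        try {
            List<TabletRef> refs = new ArrayList<>();
            refs.add(new TabletRef(1001L, 1L));
            refs.add(new TabletRef(1002L, 2L));
            return refs;
        } finally {
            tableLock.readLock().unlock();
        }
    }

    // Pass 2: runs unlocked; in the real code this is where backend
    // resolution may wait on Meta Service / compute-group auto-start.
    static long resolveBackend(TabletRef ref) {
        return 9000L + ref.replicaId;
    }

    public static void main(String[] args) {
        int dispatched = 0;
        for (TabletRef ref : collectTablets()) {
            if (resolveBackend(ref) > 0) {
                dispatched++;
            }
        }
        if (dispatched == 0) {
            // Mirrors the DdlException raised when nothing was dispatched.
            throw new IllegalStateException("no tablet was dispatched");
        }
    }
}
```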
@Yukang-Lian
Collaborator Author

run buildall

@Yukang-Lian
Collaborator Author

/review

Contributor

@github-actions Bot left a comment


Summary

Requesting changes for one remaining blocker in the current head revision. The cloud lock-scope and error-propagation problems from the previous revision are fixed, but the new FE SQL entry is still not mixed-version safe against older BEs.

Findings

  1. Mixed-version compatibility blocker: enabling FULL and cloud SQL dispatch from FE is still unsafe before every BE is upgraded. On the previous local BE implementation, submit_table_compaction_callback() treats any non-"base" type as cumulative compaction, so a new FE can silently run the wrong operation for type='FULL'. On the previous cloud BE implementation, cloud_start_workers() still has no TTaskType::COMPACTION worker; submit_tasks() returns a non-OK TAgentResult, but AgentBatchTask.run() ignores that result, so the SQL can return success while nothing is actually dispatched.

Critical Checkpoints

  • Goal of current task / correctness: Mostly met on a fully upgraded cluster. The new local/cloud paths line up with the intended behavior there, and the added regression suites cover the happy path. It is not fully correct across mixed FE/BE versions because a new FE can still send requests that old BEs execute incorrectly or reject silently.
  • Scope / minimality: Focused. The current head also keeps the cloud backend-resolution work outside the table read lock, which is the right shape.
  • Concurrency / locking: No remaining locking or thread-safety problem found in the touched paths after the latest cloud fix.
  • Lifecycle / static init: No special lifecycle or static-initialization issue found.
  • Config changes: None.
  • Compatibility / rolling upgrade: Not satisfied; see the finding above.
  • Parallel code paths: I checked both local/cloud SQL paths and the existing HTTP compaction paths. They now match on new BEs, but the mixed-version behavior is still wrong.
  • Special conditions / error handling: Current head correctly surfaces cloud backend-resolution failures instead of swallowing them.
  • Test coverage: The new regression tests are good for fully upgraded local/cloud happy paths and syntax negatives, but they do not cover mixed-version behavior.
  • Test result files: The added .out files are consistent with the new suites.
  • Observability: Adequate for the touched paths.
  • Transaction / persistence / data writes: No persistence or transaction-correctness issue found in this change.
  • FE-BE parameter passing: Reuses TCompactionReq.type; the remaining problem is how older BEs interpret or reject the new FE behavior.
  • Performance: No remaining performance blocker after the cloud lock-scope fix.
  • Other issues: No additional blocker found beyond mixed-version compatibility.

Review basis: code inspection only; I did not run builds or tests in this review.

@hello-stephen
Contributor

BE UT Coverage Report

Increment line coverage 0.00% (0/81) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 53.25% (20340/38198)
Line Coverage 36.79% (191459/520424)
Region Coverage 33.11% (148923/449825)
Branch Coverage 34.19% (65054/190260)

@hello-stephen
Contributor

BE Regression && UT Coverage Report

Increment line coverage 45.24% (38/84) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 73.76% (27594/37412)
Line Coverage 57.47% (298198/518848)
Region Coverage 54.66% (248179/454009)
Branch Coverage 56.21% (107281/190850)

@hello-stephen
Contributor

FE Regression Coverage Report

Increment line coverage 80.39% (41/51) 🎉
Increment coverage report
Complete coverage report

@Yukang-Lian
Collaborator Author

run performance


@github-actions Bot added the approved label (Indicates a PR has been approved by one committer) on Apr 20, 2026
@github-actions
Contributor

PR approved by at least one committer and no changes requested.

@Yukang-Lian
Collaborator Author

run performance


@hello-stephen
Contributor

skip check_coverage

@luwei16 merged commit 2d8b326 into apache:master Apr 21, 2026
33 of 34 checks passed
github-actions Bot pushed a commit that referenced this pull request Apr 21, 2026
… it in cloud mode (#62596)

yiguolei pushed a commit that referenced this pull request Apr 23, 2026
Cherry-pick the following PRs to `branch-4.1`:

- #61696 [Feature](compaction) add CompactionTaskTracker with system
table and HTTP API
- #61621 [fix](metrics) Fix prepared statement QPS metrics not counted
when audit log disabled
- #61222 [refactor](compaction) submit manual full compaction task to
thread pool instead of detached thread
- #62596 [feat](compaction) Support ADMIN COMPACT TABLE type='FULL' and
enable it in cloud mode

Conflicts resolved for #61696 (schema table enum renumbering on
branch-4.1; `be_compaction_tasks` uses id 67) and #61621 (AuditLogHelper
context + QueueToken import).

Labels

approved (Indicates a PR has been approved by one committer), dev/4.1.1-merged


5 participants