[feat](compaction) Support ADMIN COMPACT TABLE type='FULL' and enable it in cloud mode #62596
Conversation
… it in cloud mode

`ADMIN COMPACT TABLE` previously only accepted BASE and CUMULATIVE in local mode and threw "Unsupported operation" in cloud mode, forcing users to curl each BE's HTTP API to run manual compaction. This change unifies the FE SQL entry so that base/cumulative/full all work on both local and cloud.

FE
- `AdminCompactTableCommand`: add `FULL` to `CompactionType`, relax `analyzeWhere`, map `FULL -> "full"` in `getCompactionType`, and drop the cloud-mode gate so the command dispatches via the normal path.
- `CloudEnv.compactTable`: override `Env.compactTable` to walk partitions/indices/tablets, resolve the primary BE via `Replica.getBackendId`, and raise a `DdlException` if nothing was dispatched (e.g. the current compute group has no available BE).

BE
- `submit_table_compaction_callback`: strictly match base/cumulative/full and reject unknown types instead of silently downgrading non-`"base"` to cumulative. Full goes through `submit_compaction_task(force=false, eager=true, trigger_method=MANUAL)` after setting `last_full_compaction_schedule_time`, matching the HTTP API default.
- `cloud_submit_table_compaction_callback`: new callback mirroring the local one but calling `CloudStorageEngine::submit_compaction_task`; honors the cloud HTTP convention that base/cumu need `sync_delete_bitmap` while full does not.
- `agent_server::cloud_start_workers`: register the previously stubbed-out `SUBMIT_TABLE_COMPACTION` worker (drops the plat1ko TODO).

Thrift is untouched: `TCompactionReq.type` is already an optional string, so the existing `"base"`/`"cumulative"`/`"full"` wire format remains compatible. This implies a release order of "all BEs upgraded first, then FE": a new FE against an old BE would see `"full"` silently downgraded (local) or dropped (cloud).

Tests
- regression-test/suites/compaction/test_admin_compact_table.groovy: end-to-end coverage for base/cumulative/full under the local engine plus negative cases (unknown type, missing WHERE).
- regression-test/suites/cloud_p0/compaction/test_cloud_admin_compact_table.groovy: same matrix under the cloud engine, with a SELECT before each poll to force BE to sync rowsets from Meta Service.
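The strict matching described above lives in C++ (`submit_table_compaction_callback`), but the shape of the change can be sketched in Java for illustration; `CompactionTypeParser` and `parse` below are invented names, not Doris APIs.

```java
// Illustrative sketch (not Doris code): the BE-side change replaces a
// lenient "anything not 'base' is cumulative" rule with an exact match
// that rejects unknown types instead of silently downgrading them.
enum CompactionType { BASE, CUMULATIVE, FULL }

final class CompactionTypeParser {
    static CompactionType parse(String type) {
        switch (type) {
            case "base":       return CompactionType.BASE;
            case "cumulative": return CompactionType.CUMULATIVE;
            case "full":       return CompactionType.FULL; // previously fell through to CUMULATIVE
            default:
                // New behavior: surface the error rather than guess.
                throw new IllegalArgumentException("unknown compaction type: " + type);
        }
    }
}
```

Under the old rule, a `type='FULL'` request would have run a cumulative compaction instead, which is exactly the rolling-upgrade hazard discussed later in the thread.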
…ack changes

Reflow the new `submit_compaction_task` calls and the cloud callback's log messages to satisfy clang-format.
run buildall

FE UT Coverage Report: Increment line coverage

BE UT Coverage Report: Increment line coverage; Increment coverage report

BE Regression && UT Coverage Report: Increment line coverage; Increment coverage report

FE Regression Coverage Report: Increment line coverage

run nonConcurrent

/review
Summary
Requesting changes for two functional blockers in the new cloud FE path and one rolling-upgrade compatibility blocker.
Findings
- `CloudEnv.compactTable()` resolves cloud backends while `olapTable.readLock()` is held. In cloud mode `CloudReplica.getBackendId()` can auto-start/wait for a compute group via Meta Service, so this can hold the table metadata lock for seconds or minutes.
- `CloudEnv.compactTable()` now swallows every `UserException` from backend resolution and degrades it to `skip tablet ...`, which hides real user-facing failures such as a missing `USAGE` privilege, manually shut down compute groups, or missing compute groups.
- The new `FULL` / cloud SQL path is not rolling-upgrade safe. Old local BEs still treat `full` as cumulative compaction, and old cloud BEs still reject `TTaskType::COMPACTION`; FE does not surface that submit failure.
Critical Checkpoints
- Goal of current task / correctness: Partially met. The PR adds `FULL` and cloud SQL coverage on fully upgraded clusters, but the cloud FE path and mixed-version behavior still have blocking issues. Regression tests cover the happy path only.
- Scope / minimality: Mostly focused, but the cloud FE override folds backend resolution, auto-start behavior, and dispatch into one locked section, which expands the risk surface.
- Concurrency / locking: BE changes look consistent with existing compaction queues. FE has a locking issue: `CloudEnv.compactTable()` performs compute-group resolution under `olapTable.readLock()`, and that path can wait on external state.
- Lifecycle / static init: No special lifecycle or static-init problems found in the changed code.
- Config changes: None.
- Compatibility / rolling upgrade: Not safe today. New FE can send requests that old BEs either execute as the wrong compaction type (local) or reject silently (cloud).
- Parallel code paths: I checked both local/cloud SQL paths and the existing HTTP compaction paths. The cloud SQL path mirrors the cloud HTTP BE behavior better than before, but the FE-side error handling and compatibility gates are still incomplete.
- Special conditions / error handling: The new catch-all `UserException` handling is too broad and loses actionable user errors.
- Test coverage: New regression tests are good for fully upgraded local/cloud happy paths and basic syntax errors, but they do not cover cloud failure cases (permission / manual shutdown / no BE) or mixed-version behavior.
- Test result files: The new `.out` files are consistent with the new suites.
- Observability: BE logs are reasonable, but FE currently suppresses the most useful cloud-side error context before it reaches the user.
- Transaction / persistence / data writes: No new edit-log or storage-format changes. The risk here is dispatch correctness, not persistence.
- FE-BE parameter passing: Reuses the existing `TCompactionReq.type` string; no new wire field.
- Performance: The main performance/concurrency concern is repeated backend resolution and potential auto-start wait under the table read lock.
- Other issues: No additional blockers beyond the points above.
Two follow-ups on the AI review:

1. Do not hold the table read lock while resolving backends. `CloudReplica.getBackendId()` can call into Meta Service and wait for compute-group auto-start, which could block unrelated DDL on this table for the full wait. Split the method into two passes: collect tablet metadata (partition/index/tablet/schemaHash/replica refs) under the read lock, then release it and resolve backends outside.
2. Do not swallow `UserException` as a skip. The previous catch dropped real errors (missing `USAGE` privilege, manually shut down compute group, missing compute group, ...) to a `LOG.warn`, making a failed `ADMIN COMPACT TABLE` look like a no-op. Now any `UserException` raised during backend resolution is surfaced to the user as a `DdlException`.

Compile-only verified; existing regression tests cover the happy path.
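The two-pass shape described in follow-up 1 can be sketched as below; `TabletRef`, `resolveBackendId`, and the lock field are hypothetical stand-ins for the actual `CloudEnv`/`CloudReplica` code, not the real API.

```java
// Hypothetical sketch of the two-pass dispatch: snapshot metadata under
// the table read lock, then resolve backends with no lock held so a
// Meta Service wait cannot block unrelated DDL on the table.
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

final class TwoPassCompactDispatch {
    static final class TabletRef {
        final long tabletId;
        final long replicaId;
        TabletRef(long tabletId, long replicaId) {
            this.tabletId = tabletId;
            this.replicaId = replicaId;
        }
    }

    private final ReadWriteLock tableLock = new ReentrantReadWriteLock();

    // Pass 1: copy the tablet refs we need while holding the read lock.
    List<TabletRef> collectTablets(List<TabletRef> tableState) {
        tableLock.readLock().lock();
        try {
            return new ArrayList<>(tableState);
        } finally {
            tableLock.readLock().unlock();
        }
    }

    // Pass 2: resolve backends outside the lock; any failure here can be
    // surfaced to the caller as a user-visible error (follow-up 2).
    List<Long> resolveBackends(List<TabletRef> snapshot) {
        List<Long> backendIds = new ArrayList<>();
        for (TabletRef t : snapshot) {
            backendIds.add(resolveBackendId(t)); // stand-in for CloudReplica.getBackendId()
        }
        return backendIds;
    }

    long resolveBackendId(TabletRef t) {
        return 10000 + t.replicaId; // illustrative mapping only
    }
}
```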
run buildall

/review
Summary
Requesting changes for one remaining blocker in the current head revision. The cloud lock-scope and error-propagation problems from the previous revision are fixed, but the new FE SQL entry is still not mixed-version safe against older BEs.
Findings
- Mixed-version compatibility blocker: enabling `FULL` and cloud SQL dispatch from FE is still unsafe before every BE is upgraded. On the previous local BE implementation, `submit_table_compaction_callback()` treats any non-`"base"` type as cumulative compaction, so a new FE can silently run the wrong operation for `type='FULL'`. On the previous cloud BE implementation, `cloud_start_workers()` still has no `TTaskType::COMPACTION` worker; `submit_tasks()` returns a non-OK `TAgentResult`, but `AgentBatchTask.run()` ignores that result, so the SQL can return success while nothing is actually dispatched.
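One possible shape for a mitigation, sketched here only to make the hazard concrete: gate FULL dispatch on every target BE being new enough. Nothing below is in the PR or in Doris; `FullCompactionGate`, `MIN_FULL_AWARE_VERSION`, and the version list are invented for illustration.

```java
// Hypothetical FE-side guard: fail fast instead of letting an old local
// BE silently run cumulative, or an old cloud BE drop the task.
import java.util.List;

final class FullCompactionGate {
    static final int MIN_FULL_AWARE_VERSION = 4; // assumed threshold, not a real Doris constant

    static void checkDispatchable(String type, List<Integer> beMajorVersions) {
        if (!"full".equals(type)) {
            return; // "base" and "cumulative" predate this change
        }
        for (int version : beMajorVersions) {
            if (version < MIN_FULL_AWARE_VERSION) {
                throw new IllegalStateException(
                        "BE on version " + version + " cannot run 'full'; "
                        + "upgrade all BEs before enabling FULL from FE");
            }
        }
    }
}
```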
Critical Checkpoints
- Goal of current task / correctness: Mostly met on a fully upgraded cluster. The new local/cloud paths line up with the intended behavior there, and the added regression suites cover the happy path. It is not fully correct across mixed FE/BE versions because a new FE can still send requests that old BEs execute incorrectly or reject silently.
- Scope / minimality: Focused. The current head also keeps the cloud backend-resolution work outside the table read lock, which is the right shape.
- Concurrency / locking: No remaining locking or thread-safety problem found in the touched paths after the latest cloud fix.
- Lifecycle / static init: No special lifecycle or static-initialization issue found.
- Config changes: None.
- Compatibility / rolling upgrade: Not satisfied; see the finding above.
- Parallel code paths: I checked both local/cloud SQL paths and the existing HTTP compaction paths. They now match on new BEs, but the mixed-version behavior is still wrong.
- Special conditions / error handling: The current head correctly surfaces cloud backend-resolution failures instead of swallowing them.
- Test coverage: The new regression tests are good for fully upgraded local/cloud happy paths and syntax negatives, but they do not cover mixed-version behavior.
- Test result files: The added `.out` files are consistent with the new suites.
- Observability: Adequate for the touched paths.
- Transaction / persistence / data writes: No persistence or transaction-correctness issue found in this change.
- FE-BE parameter passing: Reuses `TCompactionReq.type`; the remaining problem is how older BEs interpret or reject the new FE behavior.
- Performance: No remaining performance blocker after the cloud lock-scope fix.
- Other issues: No additional blocker found beyond mixed-version compatibility.
Review basis: code inspection only; I did not run builds or tests in this review.
BE UT Coverage Report: Increment line coverage; Increment coverage report

BE Regression && UT Coverage Report: Increment line coverage; Increment coverage report

FE Regression Coverage Report: Increment line coverage

run performance

run performance

PR approved by at least one committer and no changes requested.

run performance

run performance

skip check_coverage
… it in cloud mode (#62596)

## Summary

Unify the `ADMIN COMPACT TABLE ... WHERE type = '...'` SQL entry so that BASE, CUMULATIVE and FULL all work on both the local and cloud (storage-compute split) engines. Before this change, FULL was rejected in FE and cloud mode was blocked outright, forcing users to curl each BE's HTTP API to run manual compaction.

## What changes

### FE

- `AdminCompactTableCommand`: add `FULL` to `CompactionType`, relax `analyzeWhere`, map `FULL -> "full"` in `getCompactionType`, and drop the cloud-mode gate so the command dispatches via the normal Agent RPC path.
- `CloudEnv.compactTable`: override `Env.compactTable` to walk partitions/indices/tablets, resolve the primary BE via `Replica.getBackendId()`, skip tablets whose compute group has no available BE, and raise `DdlException` if nothing was dispatched.

### BE

- `submit_table_compaction_callback`: strictly match `base`/`cumulative`/`full` instead of silently downgrading any non-`"base"` value to `CUMULATIVE_COMPACTION`. FULL goes through `submit_compaction_task(force=false, eager=true, trigger_method=MANUAL)` after setting `last_full_compaction_schedule_time`, matching the HTTP API default.
- `cloud_submit_table_compaction_callback`: new callback mirroring the local one but calling `CloudStorageEngine::submit_compaction_task`; honors the cloud HTTP convention that base/cumu need `sync_delete_bitmap` while full does not.
- `agent_server::cloud_start_workers`: register the previously stubbed-out `SUBMIT_TABLE_COMPACTION` worker (drops the `// TODO(plat1ko)` comment).

### Thrift / wire compatibility

`TCompactionReq.type` remains `optional string`, so no IDL change. A new FE against an old BE would see `"full"` silently downgraded (local) or dropped (cloud), so the release order is **all BEs upgraded first, then FE**. FE rollback can lead BE since the FE simply stops sending `"full"`.

## Test plan

- [x] `mvn -pl fe-core -am compile` passes.
- [x] New regression suite `regression-test/suites/compaction/test_admin_compact_table.groovy` covers base / cumulative / full end-to-end on the local engine, plus negative cases (unknown type, missing WHERE).
- [x] New regression suite `regression-test/suites/cloud_p0/compaction/test_cloud_admin_compact_table.groovy` covers the same matrix on the cloud engine.
- [x] Design document and code reviewed independently by Codex (two rounds).
Cherry-pick the following PRs to `branch-4.1`:

- #61696 [Feature](compaction) add CompactionTaskTracker with system table and HTTP API
- #61621 [fix](metrics) Fix prepared statement QPS metrics not counted when audit log disabled
- #61222 [refactor](compaction) submit manual full compaction task to thread pool instead of detached thread
- #62596 [feat](compaction) Support ADMIN COMPACT TABLE type='FULL' and enable it in cloud mode

Conflicts resolved for #61696 (schema table enum renumbering on branch-4.1; `be_compaction_tasks` uses id 67) and #61621 (AuditLogHelper context + QueueToken import).