[AMORO-4253] AIP-5 Phase 1: dynamic allocation configuration and validation#4254
[AMORO-4253] AIP-5 Phase 1: dynamic allocation configuration and validation#4254j1wonpark wants to merge 7 commits into
Conversation
…dation First implementation phase of AIP-5 (Dynamic Resource Allocation for Optimizer, apache#4191). Configuration foundation only; no runtime scaling behavior. Dynamic allocation is opt-in and disabled by default, so existing groups are unaffected. - OptimizerProperties: declare the dynamic-allocation.* constants (with @SInCE); deprecate the flat min-parallelism in favor of the namespaced dynamic-allocation.min-parallelism, still honored as a fallback. - DynamicAllocationConfig: parse(ResourceGroup) + validate() enforcing the AIP-5 constraints only when enabled (max-parallelism mandatory, min <= max <= 1024, executor-idle-timeout >= 30s, positive durations, reject the external container). resolveMinParallelism centralizes the namespaced -> legacy -> 0 resolution. - OptimizerGroupController: reject an invalid DRA config with HTTP 400 on create/update before persisting. - DefaultOptimizingService: startup load is fail-safe (invalid config logs a warning and falls back to DRA-disabled instead of crashing AMS); the keeper min-parallelism auto-reset is skipped when DRA is effectively enabled, and otherwise writes the key the group actually reads (fixing an endless no-op loop for namespaced groups). Signed-off-by: Jiwon Park <jpark92@outlook.kr>
|
Hi @xxubai @czy006 @zhoujinsong — first PR for AIP-5 (#4191), config-only and disabled by default; would appreciate a review on whether the dynamic-allocation.* config surface and validation look right before later phases build on them. Thanks! |
|
Thanks for your contribution @j1wonpark ! Will take a look soon |
resolveMinParallelism() is intentionally lenient (unparsable -> 0) for the keeper hot path and legacy compatibility, but validate() reused that leniency for opted-in groups: an enabled group with 'dynamic-allocation.min-parallelism' set to a non-numeric value silently degraded to 0, and a negative value passed the max >= min check unchecked. This was asymmetric with max-parallelism, which is parsed strictly and rejected with HTTP 400. Retain the raw min-parallelism string at parse time so validate() can tell "explicitly configured but unparsable" from "unset" (the resolved int collapses both to 0). When DRA is enabled, an explicitly configured min-parallelism that is unparsable or negative is now rejected. The keeper hot path stays lenient. Also document the getMaxParallelism() unboxing precondition (nullable until validated) and leave a TODO for realistic lower bounds on the polling timeouts. Signed-off-by: Jiwon Park <jpark92@outlook.kr>
| OptimizerProperties.DYNAMIC_ALLOCATION_ENABLED, | ||
| OptimizerProperties.DYNAMIC_ALLOCATION_ENABLED_DEFAULT); | ||
| int minParallelism = resolveMinParallelism(group); | ||
| String minParallelismRaw = rawMinParallelism(group); |
There was a problem hiding this comment.
Why need additional raw min-parallelism
There was a problem hiding this comment.
min-parallelism wasn't parsed strictly, so I added a guard for that — keeping both a lenient int and the raw String so validate() could tell "unset" from "explicitly set but malformed" (the lenient int collapses both to 0). That turned out to be overkill.
Simplified by dropping the dual representation: min-parallelism is now a single nullable Integer, parsed strictly in parse() (mirroring max-parallelism), with null meaning "unset".
| * min-parallelism}. Intended for config-entry points (startup load, REST create/update), not the | ||
| * keeper hot path. | ||
| */ | ||
| public static void warnDeprecatedMinParallelism(ResourceGroup group) { |
There was a problem hiding this comment.
warnDeprecatedMinParallelism can be included in parse
There was a problem hiding this comment.
The reason it's separate: parse() is also reached off the keeper path via isEffectivelyEnabled() (called in the auto-reset branch), so folding the warning into parse() would re-log the deprecation on every keeper cycle instead of once. warnDeprecatedMinParallelism is meant to fire only at config-entry points (startup load, REST create/update).
I could fold it in by splitting parse() into a warning wrapper + a warn-free core that isEffectivelyEnabled() uses — happy to do that if you prefer the single call site, though it trades one extra line at two callers for an extra private method. Let me know which you'd rather have.
| group -> { | ||
| String groupName = group.getName(); | ||
| // Fail-safe: a persisted group carrying an invalid DRA config (e.g. manual DB edits) | ||
| // must not crash AMS. Surface it and fall back to DRA-disabled behavior. |
There was a problem hiding this comment.
Do you have a more implicit way to disable it? It doesn’t seem appropriate to validate the group's configuration when loading optimizing queues.
There was a problem hiding this comment.
The block was meant as a startup fail-safe — the AIP's "never silent" requirement, so a persisted invalid config would be surfaced rather than silently ignored.
But as you pointed out, validating inside the queue loader is the wrong place, and looking closer the block was behaviorally inert: the disable is already implicit (isEffectivelyEnabled() treats an invalid config as disabled wherever DRA is consumed), and in Phase 1 nothing strictly parses DRA config at load, so there was no crash to guard against either. Removed it — surfacing invalid configs as an alertable signal belongs with the metric that lands in the observability phase.
…nteger Drop the dual int/String representation of min-parallelism in favor of a single nullable Integer, mirroring max-parallelism: the value is strictly parsed in parse() (a malformed numeric throws there, honoring parse()'s contract, instead of being silently degraded to 0 by the lenient keeper resolver), and null encodes "unset". validate() then only checks the semantic bounds (non-negative, max >= min). Also align the trim semantics of the lenient resolveMinParallelism() path with the strict one so a whitespace-padded value (e.g. " 5 ") that validate() accepts resolves to the same floor the keeper reads, instead of silently degrading to 0 at runtime. Signed-off-by: Jiwon Park <jpark92@outlook.kr>
…error When min-parallelism is configured only via the deprecated flat key, a malformed value reported the namespaced 'dynamic-allocation.min-parallelism' the user never set. Pick the value and its source key together so the error names the key the value actually came from, folding the former rawMinParallelism helper into parseMinParallelism. Signed-off-by: Jiwon Park <jpark92@outlook.kr>
The startup validate/fallback block in loadOptimizingQueues was behaviorally inert: the disable is already implicit (isEffectivelyEnabled() treats an invalid config as disabled wherever DRA is consumed), and in Phase 1 nothing strictly parses DRA config at load, so there was no crash to guard against. Validating inside the queue loader also mixed an unrelated concern into it. Surfacing invalid configs as an alertable signal belongs with the metric that lands in the observability phase. Signed-off-by: Jiwon Park <jpark92@outlook.kr>
Why are the changes needed?
Close #4253.
This is the first implementation phase of AIP-5: Dynamic Resource Allocation for Optimizer (#4191). It lays the configuration foundation that the later scaling phases build on. There is no runtime scaling behavior in this PR — it is purely opt-in configuration plumbing, so it is safe to merge independently and has no effect on existing groups (
dynamic-allocation.enableddefaults tofalse).Splitting AIP-5 into self-contained PRs follows the design's five linear phases; demand-driven scale-up, idle tracking, scale-down + graceful drain, and the new metrics land in subsequent PRs. The full set of
dynamic-allocation.*properties is declared here (rather than drip-fed per phase) so the public configuration surface and its validation are settled in one place: a group created between phases cannot carry an unvalidated DRA property. A few getters (the timeout durations,max-parallelism) therefore have no caller until the phase that consumes them — this is intentional, not dead code.Brief change log
OptimizerProperties: declared thedynamic-allocation.*constants (each with a@sincetag) —enabled,min-parallelism,max-parallelism(+ the1024hard-cap constant),scheduler-backlog-timeout,sustained-backlog-timeout,executor-idle-timeout(+ its30sminimum),scale-down-cooldown,drain-timeout. Deprecated the flatmin-parallelismin favor of the namespaceddynamic-allocation.min-parallelism; it is still honored as a fallback.DynamicAllocationConfig(new):parse(ResourceGroup)builds a typed, immutable config from the group's properties;validate()enforces the AIP-5 constraints only when DRA is enabled —max-parallelismis mandatory,min-parallelism ≤ max-parallelism ≤ 1024(an over-limit value is rejected, not silently clamped),executor-idle-timeout ≥ 30s, all durations positive, and DRA is rejected on an externally-registered optimizer (conservative check this phase: theexternalcontainer, which AMS cannot scale).resolveMinParallelism()centralizes the namespaced → legacy →0resolution; a one-off deprecation warning is emitted at config-entry points, not on the keeper hot path.OptimizerGroupController:createResourceGroup/updateResourceGroupnow reject an invalid DRA config with HTTP 400 before persisting. (The update path previously had no property-value validation at all.)DefaultOptimizingService:optimizer_group_config_invalidgauge is wired in the observability phase.)getMinParallelism()now delegates toresolveMinParallelism(), preserving the existing lenient behavior while honoring the namespaced key.min-parallelismauto-reset (afteroptimizer-group.max-keeping-attemptsfailed creations) is adjusted for the new property model, per the AIP's Compatibility section:min-parallelismfloor and keeps retrying instead. An invalid config counts as disabled, matching the startup fail-safe.How was this patch tested?
Add some test cases that check the changes thoroughly including negative and positive cases if possible
TestDynamicAllocationConfig: rejection of enabled-without-max-parallelism,max < min, over-1024, sub-30sidle timeout, unparsable duration, zero/non-positive duration, and theexternalcontainer; correct parsing of a valid config; the namespaced → legacy →0resolution order;isEffectivelyEnabled(valid+enabled → true, disabled → false, enabled+invalid → false) andeffectiveMinParallelismKey.TestOptimizerGroupController: invalid create/update return 400, valid create succeeds.TestOptimizerGroupKeeper: auto-reset is skipped when DRA is enabled (the floor is preserved); a group using the namespaced key has the reset written to that key and no deprecated flat key is introduced; the existing legacy-group auto-reset scenarios still pass (no regression).Add screenshots for manual tests if appropriate (not applicable — no UI change)
Run test locally before making a pull request (affected
amoro-common+amoro-amstests pass — 35 tests green; spotless and checkstyle clean)Documentation
@sincetags andDynamicAllocationConfigdoc comments; the full design is in AIP-5 / [Feature]: Dynamic Resource Allocation for Optimizer #4191, and user-facing configuration docs land with the phase that first exposes runtime behavior)