Skip to content

[AMORO-4253] AIP-5 Phase 1: dynamic allocation configuration and validation#4254

Open
j1wonpark wants to merge 7 commits into
apache:masterfrom
j1wonpark:dra-phase1-config
Open

[AMORO-4253] AIP-5 Phase 1: dynamic allocation configuration and validation#4254
j1wonpark wants to merge 7 commits into
apache:masterfrom
j1wonpark:dra-phase1-config

Conversation

@j1wonpark

Copy link
Copy Markdown
Contributor

Why are the changes needed?

Close #4253.

This is the first implementation phase of AIP-5: Dynamic Resource Allocation for Optimizer (#4191). It lays the configuration foundation that the later scaling phases build on. There is no runtime scaling behavior in this PR — it is purely opt-in configuration plumbing, so it is safe to merge independently and has no effect on existing groups (dynamic-allocation.enabled defaults to false).

Splitting AIP-5 into self-contained PRs follows the design's five linear phases; demand-driven scale-up, idle tracking, scale-down + graceful drain, and the new metrics land in subsequent PRs. The full set of dynamic-allocation.* properties is declared here (rather than drip-fed per phase) so the public configuration surface and its validation are settled in one place: a group created between phases cannot carry an unvalidated DRA property. A few getters (the timeout durations, max-parallelism) therefore have no caller until the phase that consumes them — this is intentional, not dead code.

Brief change log

  • OptimizerProperties: declared the dynamic-allocation.* constants (each with a @since tag) — enabled, min-parallelism, max-parallelism (+ the 1024 hard-cap constant), scheduler-backlog-timeout, sustained-backlog-timeout, executor-idle-timeout (+ its 30s minimum), scale-down-cooldown, drain-timeout. Deprecated the flat min-parallelism in favor of the namespaced dynamic-allocation.min-parallelism; it is still honored as a fallback.

  • DynamicAllocationConfig (new): parse(ResourceGroup) builds a typed, immutable config from the group's properties; validate() enforces the AIP-5 constraints only when DRA is enabledmax-parallelism is mandatory, min-parallelism ≤ max-parallelism ≤ 1024 (an over-limit value is rejected, not silently clamped), executor-idle-timeout ≥ 30s, all durations positive, and DRA is rejected on an externally-registered optimizer (conservative check this phase: the external container, which AMS cannot scale). resolveMinParallelism() centralizes the namespaced → legacy → 0 resolution; a one-off deprecation warning is emitted at config-entry points, not on the keeper hot path.

  • OptimizerGroupController: createResourceGroup / updateResourceGroup now reject an invalid DRA config with HTTP 400 before persisting. (The update path previously had no property-value validation at all.)

  • DefaultOptimizingService:

    • Startup load is fail-safe, never silent: a persisted group with an invalid DRA config no longer crashes AMS — it logs a warning and falls back to DRA-disabled behavior. (The optimizer_group_config_invalid gauge is wired in the observability phase.)
    • getMinParallelism() now delegates to resolveMinParallelism(), preserving the existing lenient behavior while honoring the namespaced key.
    • The keeper's existing min-parallelism auto-reset (after optimizer-group.max-keeping-attempts failed creations) is adjusted for the new property model, per the AIP's Compatibility section:
      • It is skipped when DRA is effectively enabled (opted-in and valid) — DRA owns the group's scale decisions, so the keeper no longer erodes its min-parallelism floor and keeps retrying instead. An invalid config counts as disabled, matching the startup fail-safe.
      • When it does apply, it now writes whichever key the group actually reads (namespaced if present, else the legacy flat key). Previously it always wrote the flat key; for a group using the namespaced property that write was shadowed by resolution order, turning auto-reset into an endless no-op loop (repeated warning + DB update with no effect). This is fixed here.

How was this patch tested?

  • Add some test cases that check the changes thoroughly including negative and positive cases if possible

    • TestDynamicAllocationConfig: rejection of enabled-without-max-parallelism, max < min, over-1024, sub-30s idle timeout, unparsable duration, zero/non-positive duration, and the external container; correct parsing of a valid config; the namespaced → legacy → 0 resolution order; isEffectivelyEnabled (valid+enabled → true, disabled → false, enabled+invalid → false) and effectiveMinParallelismKey.
    • TestOptimizerGroupController: invalid create/update return 400, valid create succeeds.
    • TestOptimizerGroupKeeper: auto-reset is skipped when DRA is enabled (the floor is preserved); a group using the namespaced key has the reset written to that key and no deprecated flat key is introduced; the existing legacy-group auto-reset scenarios still pass (no regression).
  • Add screenshots for manual tests if appropriate (not applicable — no UI change)

  • Run test locally before making a pull request (affected amoro-common + amoro-ams tests pass — 35 tests green; spotless and checkstyle clean)

Documentation

  • Does this pull request introduce a new feature? (yes — AIP-5 foundation, opt-in and disabled by default)
  • If yes, how is the feature documented? (JavaDocs — @since tags and DynamicAllocationConfig doc comments; the full design is in AIP-5 / [Feature]: Dynamic Resource Allocation for Optimizer #4191, and user-facing configuration docs land with the phase that first exposes runtime behavior)

…dation

First implementation phase of AIP-5 (Dynamic Resource Allocation for
Optimizer, apache#4191). Configuration foundation only; no runtime scaling
behavior. Dynamic allocation is opt-in and disabled by default, so
existing groups are unaffected.

- OptimizerProperties: declare the dynamic-allocation.* constants (with
  @SInCE); deprecate the flat min-parallelism in favor of the namespaced
  dynamic-allocation.min-parallelism, still honored as a fallback.
- DynamicAllocationConfig: parse(ResourceGroup) + validate() enforcing
  the AIP-5 constraints only when enabled (max-parallelism mandatory,
  min <= max <= 1024, executor-idle-timeout >= 30s, positive durations,
  reject the external container). resolveMinParallelism centralizes the
  namespaced -> legacy -> 0 resolution.
- OptimizerGroupController: reject an invalid DRA config with HTTP 400 on
  create/update before persisting.
- DefaultOptimizingService: startup load is fail-safe (invalid config
  logs a warning and falls back to DRA-disabled instead of crashing AMS);
  the keeper min-parallelism auto-reset is skipped when DRA is
  effectively enabled, and otherwise writes the key the group actually
  reads (fixing an endless no-op loop for namespaced groups).

Signed-off-by: Jiwon Park <jpark92@outlook.kr>
@j1wonpark

Copy link
Copy Markdown
Contributor Author

Hi @xxubai @czy006 @zhoujinsong — first PR for AIP-5 (#4191), config-only and disabled by default; would appreciate a review on whether the dynamic-allocation.* config surface and validation look right before later phases build on them. Thanks!

@xxubai

xxubai commented Jun 15, 2026

Copy link
Copy Markdown
Contributor

Thanks for your contribution @j1wonpark ! Will take a look soon

resolveMinParallelism() is intentionally lenient (unparsable -> 0) for the
keeper hot path and legacy compatibility, but validate() reused that leniency
for opted-in groups: an enabled group with 'dynamic-allocation.min-parallelism'
set to a non-numeric value silently degraded to 0, and a negative value passed
the max >= min check unchecked. This was asymmetric with max-parallelism, which
is parsed strictly and rejected with HTTP 400.

Retain the raw min-parallelism string at parse time so validate() can tell
"explicitly configured but unparsable" from "unset" (the resolved int collapses
both to 0). When DRA is enabled, an explicitly configured min-parallelism that
is unparsable or negative is now rejected. The keeper hot path stays lenient.

Also document the getMaxParallelism() unboxing precondition (nullable until
validated) and leave a TODO for realistic lower bounds on the polling timeouts.

Signed-off-by: Jiwon Park <jpark92@outlook.kr>
OptimizerProperties.DYNAMIC_ALLOCATION_ENABLED,
OptimizerProperties.DYNAMIC_ALLOCATION_ENABLED_DEFAULT);
int minParallelism = resolveMinParallelism(group);
String minParallelismRaw = rawMinParallelism(group);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why need additional raw min-parallelism

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

min-parallelism wasn't parsed strictly, so I added a guard for that — keeping both a lenient int and the raw String so validate() could tell "unset" from "explicitly set but malformed" (the lenient int collapses both to 0). That turned out to be overkill.

Simplified by dropping the dual representation: min-parallelism is now a single nullable Integer, parsed strictly in parse() (mirroring max-parallelism), with null meaning "unset".

* min-parallelism}. Intended for config-entry points (startup load, REST create/update), not the
* keeper hot path.
*/
public static void warnDeprecatedMinParallelism(ResourceGroup group) {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

warnDeprecatedMinParallelism can be included in parse

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason it's separate: parse() is also reached off the keeper path via isEffectivelyEnabled() (called in the auto-reset branch), so folding the warning into parse() would re-log the deprecation on every keeper cycle instead of once. warnDeprecatedMinParallelism is meant to fire only at config-entry points (startup load, REST create/update).

I could fold it in by splitting parse() into a warning wrapper + a warn-free core that isEffectivelyEnabled() uses — happy to do that if you prefer the single call site, though it trades one extra line at two callers for an extra private method. Let me know which you'd rather have.

group -> {
String groupName = group.getName();
// Fail-safe: a persisted group carrying an invalid DRA config (e.g. manual DB edits)
// must not crash AMS. Surface it and fall back to DRA-disabled behavior.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you have a more implicit way to disable it? It doesn’t seem appropriate to validate the group's configuration when loading optimizing queues.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The block was meant as a startup fail-safe — the AIP's "never silent" requirement, so a persisted invalid config would be surfaced rather than silently ignored.

But as you pointed out, validating inside the queue loader is the wrong place, and looking closer the block was behaviorally inert: the disable is already implicit (isEffectivelyEnabled() treats an invalid config as disabled wherever DRA is consumed), and in Phase 1 nothing strictly parses DRA config at load, so there was no crash to guard against either. Removed it — surfacing invalid configs as an alertable signal belongs with the metric that lands in the observability phase.

…nteger

Drop the dual int/String representation of min-parallelism in favor of a
single nullable Integer, mirroring max-parallelism: the value is strictly
parsed in parse() (a malformed numeric throws there, honoring parse()'s
contract, instead of being silently degraded to 0 by the lenient keeper
resolver), and null encodes "unset". validate() then only checks the
semantic bounds (non-negative, max >= min).

Also align the trim semantics of the lenient resolveMinParallelism() path
with the strict one so a whitespace-padded value (e.g. " 5 ") that
validate() accepts resolves to the same floor the keeper reads, instead of
silently degrading to 0 at runtime.

Signed-off-by: Jiwon Park <jpark92@outlook.kr>
…error

When min-parallelism is configured only via the deprecated flat key, a
malformed value reported the namespaced 'dynamic-allocation.min-parallelism'
the user never set. Pick the value and its source key together so the error
names the key the value actually came from, folding the former
rawMinParallelism helper into parseMinParallelism.

Signed-off-by: Jiwon Park <jpark92@outlook.kr>
The startup validate/fallback block in loadOptimizingQueues was
behaviorally inert: the disable is already implicit (isEffectivelyEnabled()
treats an invalid config as disabled wherever DRA is consumed), and in
Phase 1 nothing strictly parses DRA config at load, so there was no crash
to guard against. Validating inside the queue loader also mixed an unrelated
concern into it. Surfacing invalid configs as an alertable signal belongs
with the metric that lands in the observability phase.

Signed-off-by: Jiwon Park <jpark92@outlook.kr>
@j1wonpark j1wonpark requested a review from xxubai June 28, 2026 00:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Subtask]: AIP-5 Phase 1 — Dynamic allocation configuration, validation, and min-parallelism deprecation

2 participants