docs(adr): leader-priority raft config change basement#437
Open
xiaoxichen wants to merge 1 commit into
Proposes the internal design for a CM-driven leader-priority change API. PG sb is the source of truth for CM intent, a new HS_CTRL_SET_LEADER_PRIORITY journal entry propagates that intent to every replica, and an enhanced reconcile_leader acts as the convergence engine that drives NuRaft cluster_config to match PG sb. Also includes a defense-in-depth fix for the replace_member PG-sb divergence: on_pg_start_replace_member propagates the priority of the outgoing member (out) to the incoming member (in), preserving leader intent across membership swaps.

Status: Proposed

Signed-off-by: Xiaoxi Chen <xiaoxchen@ebay.com>
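The convergence step described above (reconcile_leader driving cluster_config to match PG sb) can be pictured as a diff between two priority maps. The following is only a minimal sketch under stated assumptions: `PriorityMap`, `PriorityChange`, and `plan_reconcile` are hypothetical names for illustration and are not taken from the ADR, HomeObject, or NuRaft. The sketch computes which priority updates would be needed and leaves applying them to the caller.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Hypothetical types for illustration only; not from the ADR or NuRaft.
using ReplicaId = std::string;
using PriorityMap = std::map<ReplicaId, int32_t>;

struct PriorityChange {
    ReplicaId replica;
    int32_t from;  // priority currently seen in the live config
    int32_t to;    // priority recorded as durable intent
};

// Compare the durable intent (standing in for PG sb) with the live view
// (standing in for NuRaft cluster_config) and list the updates needed to
// make the live view match. An empty result means the two already agree.
std::vector<PriorityChange> plan_reconcile(const PriorityMap& desired,
                                           const PriorityMap& live) {
    std::vector<PriorityChange> plan;
    for (const auto& [replica, want] : desired) {
        auto it = live.find(replica);
        if (it == live.end()) {
            continue;  // assumption: membership changes are handled elsewhere
        }
        if (it->second != want) {
            plan.push_back({replica, it->second, want});
        }
    }
    return plan;
}
```

Because the plan is always re-derived from the durable side, running it again after a partial application yields only the remaining changes, which is the property the failure analysis relies on.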
yuwmao
reviewed
Jun 29, 2026
A new `journal_type_t::HS_CTRL_SET_LEADER_PRIORITY` value. The header payload
carries:
- `task_id` (string): for caller-side idempotency and tracking, matching the
Contributor
Do we also need a task sb to persist intent?
| Failure point | Outcome |
| --- | --- |
| Leader crash before Piece 2 propose returns | CM retries (idempotent on `task_id`). No state change. |
| Leader crash after journal entry committed on majority but before on_commit handler runs on the leader | The entry is replayed during recovery; on_commit fires; new leader's reconcile_leader converges NuRaft. |
Contributor
`join_group` is called after log replay, so I'm not sure whether an on_commit that triggers reconcile_leader during replay will actually work.
| Failure point | Outcome |
| --- | --- |
| Leader crash before Piece 2 propose returns | CM retries (idempotent on `task_id`). No state change. |
| Leader crash after journal entry committed on majority but before on_commit handler runs on the leader | The entry is replayed during recovery; on_commit fires; new leader's reconcile_leader converges NuRaft. |
| Leader crash between on_commit (Piece 3) and reconcile_leader completion (Piece 4) | PG sb is durable on majority; NuRaft cluster_config may be partly unconverged. Any subsequent reconcile_leader trigger (HTTP, CM retry, future periodic thread) re-derives desired state from PG sb and finishes the job. Transient inconsistency is bounded by reconcile trigger cadence. |
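The table leans on two properties: a retried or replayed entry must not change state twice, and the intent recorded by on_commit must be enough to re-derive the desired state later. One way to picture this is the sketch below. It assumes, purely for illustration, that applied `task_id` values are remembered next to the priorities. `PgSuperblock`, `SetLeaderPriorityEntry`, and `on_commit_set_priority` are hypothetical names; the ADR excerpt does not say how idempotency is actually implemented.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <set>
#include <string>

// Hypothetical stand-in for the PG superblock; the layout is an assumption.
struct PgSuperblock {
    std::map<std::string, int32_t> priority;  // replica id -> intended priority
    std::set<std::string> applied_tasks;      // task_ids already applied
};

// Hypothetical payload of an HS_CTRL_SET_LEADER_PRIORITY entry. Only
// task_id appears in the quoted ADR text; the other fields are assumed.
struct SetLeaderPriorityEntry {
    std::string task_id;
    std::string replica;
    int32_t priority;
};

// Apply a committed entry to the superblock. Returns true if the call
// changed state and false if the same task_id was already applied, so
// a CM retry or a replay during recovery is harmless.
bool on_commit_set_priority(PgSuperblock& sb, const SetLeaderPriorityEntry& e) {
    if (!sb.applied_tasks.insert(e.task_id).second) {
        return false;  // duplicate task_id: nothing to do
    }
    sb.priority[e.replica] = e.priority;
    return true;
}
```

Whether the applied-task record belongs in PG sb, in a separate task sb, or nowhere at all is a design decision this sketch does not settle.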
Contributor
How do we observe the pg sb priority and cluster_config priority?
4. Transient divergence between durable intent and live NuRaft state is bounded and
   recoverable without operator intervention.

This basement is the substrate for a separate gRPC API (`CmLeaderChange` →
Contributor
Actually, we need to define a new gRPC here. `CmLeaderChange` is meant for notifying about CM's leader change, although it is not in use yet.