Skip to content

consensus missed round metric#119

Open
Brindrajsinh-Chauhan wants to merge 1 commit into
circlefin:mainfrom
kaleido-io:measure-consensus-rounds-missed
Open

consensus missed round metric#119
Brindrajsinh-Chauhan wants to merge 1 commit into
circlefin:mainfrom
kaleido-io:measure-consensus-rounds-missed

Conversation

@Brindrajsinh-Chauhan

@Brindrajsinh-Chauhan Brindrajsinh-Chauhan commented Jun 7, 2026

Copy link
Copy Markdown

This adds a metrics to measure the missed rounds with labels including the proposer height and the round missed. This is helpful to create alerting rules for missed rounds and identify validators that might be having issues.

Screenshot 2026-06-07 at 7 36 13 PM

@ZhiyuCircle ZhiyuCircle added the pending-import Merged PR awaiting reverse-sync to upstream label Jun 9, 2026
@ZhiyuCircle

Copy link
Copy Markdown
Contributor

@Brindrajsinh-Chauhan thanks for contribution, could you fix the rust formatting/lint

Signed-off-by: Brindrajsinh-Chauhan <brindrajsinh@gmail.com>
@Brindrajsinh-Chauhan Brindrajsinh-Chauhan force-pushed the measure-consensus-rounds-missed branch from bcaccc9 to 66b9904 Compare June 9, 2026 16:48
@Brindrajsinh-Chauhan

Copy link
Copy Markdown
Author

@ZhiyuCircle Would be grateful to know the next step to get this in or will this be first merged in internal repos and then reverse synced to upstream

Comment on lines +550 to +555
#[derive(Clone, Copy, Debug, Hash, PartialEq, Eq, EncodeLabelSet)]
struct RoundMissedLabel {
proposer: AsLabelValue<Address>,
height: AsLabelValue<u64>,
round: AsLabelValue<i64>,
}

@romac romac Jun 11, 2026

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that height grows forever and round can spike during outages, this creates one time series per missed height/round/proposer for the process lifetime, which can grow unbounded very quickly if eg. a validator is down for an extended period of time.

We have actually seen such an issue on testnet with another metric which was recording outbound IP address with their ephemeral ports, causing problems when ingesting those metrics into Prometheus/Datadog.

I am not sure what the best solution here is, since I agree that having the height and round is useful information.


Curious to hear what you think!

In the meantime, I'll see if I can find a good solution for this.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the feedback, I will also think on this to find a solution for this

Comment on lines +100 to +111
if round.as_i64() > 0 {
let current_round = round.as_u32().expect("round is defined");
let missed_round = Round::new(current_round.saturating_sub(1));
let missed_proposer = state
.ctx
.proposer_selector
.select_proposer(state.validator_set(), height, missed_round)
.address;
state
.metrics()
.inc_consensus_round_missed(missed_proposer, height, missed_round);
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This undercounts missed rounds if consensus jumps from round 0 to round N via Tendermint skip-round handling.

Suggested change
if round.as_i64() > 0 {
let current_round = round.as_u32().expect("round is defined");
let missed_round = Round::new(current_round.saturating_sub(1));
let missed_proposer = state
.ctx
.proposer_selector
.select_proposer(state.validator_set(), height, missed_round)
.address;
state
.metrics()
.inc_consensus_round_missed(missed_proposer, height, missed_round);
}
let mut missed_round = state.current_round.or(Round::ZERO);
while missed_round < round {
let missed_proposer = state
.ctx
.proposer_selector
.select_proposer(state.validator_set(), height, missed_round)
.address;
warn!(%height, %missed_round, %missed_proposer, "Consensus round missed");
state
.metrics()
.inc_consensus_round_missed(missed_proposer, height, missed_round);
missed_round = missed_round.increment();
}

@ZhiyuCircle

Copy link
Copy Markdown
Contributor

@ZhiyuCircle Would be grateful to know the next step to get this in or will this be first merged in internal repos and then reverse synced to upstream

@Brindrajsinh-Chauhan after second pass review, we have few concerns like @romac mention above

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

pending-import Merged PR awaiting reverse-sync to upstream

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants