consensus missed round metric#119
Conversation
|
@Brindrajsinh-Chauhan thanks for contribution, could you fix the rust formatting/lint |
Signed-off-by: Brindrajsinh-Chauhan <brindrajsinh@gmail.com>
bcaccc9 to
66b9904
Compare
|
@ZhiyuCircle Would be grateful to know the next step to get this in or will this be first merged in internal repos and then reverse synced to upstream |
| #[derive(Clone, Copy, Debug, Hash, PartialEq, Eq, EncodeLabelSet)] | ||
| struct RoundMissedLabel { | ||
| proposer: AsLabelValue<Address>, | ||
| height: AsLabelValue<u64>, | ||
| round: AsLabelValue<i64>, | ||
| } |
There was a problem hiding this comment.
Given that height grows forever and round can spike during outages, this creates one time series per missed height/round/proposer for the process lifetime, which can grow unbounded very quickly if eg. a validator is down for an extended period of time.
We have actually seen such an issue on testnet with another metric which was recording outbound IP address with their ephemeral ports, causing problems when ingesting those metrics into Prometheus/Datadog.
I am not sure what the best solution here is, since I agree that having the height and round is useful information.
Curious to hear what you think!
In the meantime, I'll see if I can find a good solution for this.
There was a problem hiding this comment.
Thank you for the feedback, I will also think on this to find a solution for this
| if round.as_i64() > 0 { | ||
| let current_round = round.as_u32().expect("round is defined"); | ||
| let missed_round = Round::new(current_round.saturating_sub(1)); | ||
| let missed_proposer = state | ||
| .ctx | ||
| .proposer_selector | ||
| .select_proposer(state.validator_set(), height, missed_round) | ||
| .address; | ||
| state | ||
| .metrics() | ||
| .inc_consensus_round_missed(missed_proposer, height, missed_round); | ||
| } |
There was a problem hiding this comment.
This undercounts missed rounds if consensus jumps from round 0 to round N via Tendermint skip-round handling.
| if round.as_i64() > 0 { | |
| let current_round = round.as_u32().expect("round is defined"); | |
| let missed_round = Round::new(current_round.saturating_sub(1)); | |
| let missed_proposer = state | |
| .ctx | |
| .proposer_selector | |
| .select_proposer(state.validator_set(), height, missed_round) | |
| .address; | |
| state | |
| .metrics() | |
| .inc_consensus_round_missed(missed_proposer, height, missed_round); | |
| } | |
| let mut missed_round = state.current_round.or(Round::ZERO); | |
| while missed_round < round { | |
| let missed_proposer = state | |
| .ctx | |
| .proposer_selector | |
| .select_proposer(state.validator_set(), height, missed_round) | |
| .address; | |
| warn!(%height, %missed_round, %missed_proposer, "Consensus round missed"); | |
| state | |
| .metrics() | |
| .inc_consensus_round_missed(missed_proposer, height, missed_round); | |
| missed_round = missed_round.increment(); | |
| } |
@Brindrajsinh-Chauhan after second pass review, we have few concerns like @romac mention above |
This adds a metrics to measure the missed rounds with labels including the proposer height and the round missed. This is helpful to create alerting rules for missed rounds and identify validators that might be having issues.