improve raft test framework #444
Codecov Report
✅ All modified and coverable lines are covered by tests.

Coverage Diff

| Metric | stable/v4.x | #444 |
|---|---|---|
| Coverage | ? | 54.29% |
| Files | ? | 36 |
| Lines | ? | 5424 |
| Branches | ? | 684 |
| Hits | ? | 2945 |
| Misses | ? | 2179 |
| Partials | ? | 300 |
Pull request overview
This PR aims to reduce replication-unit-test hangs caused by unexpected Raft leader switches by adding retry-aware “run on leader” helpers and updating test fixture operations to be resilient to leadership churn.
Changes:
- Introduces `run_on_pg_leader_with_retry` and “not leader” error classification helpers to retry leader-only ops until completion or timeout.
- Updates shard/blob test-fixture operations (create shard, seal shard, put/delete blobs) to use retry logic and idempotent completion checks.
- Bumps Conan package version to `4.1.23`.
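
To make the first change concrete, here is a minimal sketch of a retry-aware wrapper with "not leader" error classification. It is not the code from this PR: the `OpError` enum, `is_not_leader_error`, and `run_with_leader_retry` are hypothetical names chosen for illustration, and the real `run_on_pg_leader_with_retry` in `homeobj_fixture.hpp` may have a different signature and error type.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <thread>

// Hypothetical error codes (assumption): the fixture's real error type is not shown here.
enum class OpError { OK, NOT_LEADER, RETRY_REQUEST, TIMEOUT, UNKNOWN };

// Classify errors that suggest leadership may have moved, so a retry is worthwhile.
inline bool is_not_leader_error(OpError e) {
    return e == OpError::NOT_LEADER || e == OpError::RETRY_REQUEST;
}

// Call `op` until it returns a non-retryable result or the deadline would be exceeded.
// Returns the op's own result, or OpError::TIMEOUT if only retryable errors were seen.
inline OpError run_with_leader_retry(const std::function<OpError()>& op,
                                     std::chrono::milliseconds timeout,
                                     std::chrono::milliseconds backoff = std::chrono::milliseconds{10}) {
    const auto deadline = std::chrono::steady_clock::now() + timeout;
    for (;;) {
        const OpError rc = op();
        if (!is_not_leader_error(rc)) { return rc; }  // success or a hard failure
        if (std::chrono::steady_clock::now() + backoff >= deadline) { return OpError::TIMEOUT; }
        std::this_thread::sleep_for(backoff);  // give the group time to elect a leader
    }
}
```

The point of the bounded deadline is that a test fails with a timeout instead of hanging when no replica ever accepts the op.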
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| `src/lib/homestore_backend/tests/homeobj_fixture.hpp` | Adds leader-retry helper + updates shard/blob fixture operations to tolerate leader switches and avoid deadlocks. |
| `conanfile.py` | Version bump to 4.1.23. |
|
TL, DR: |
491c05f to 63dcbed
|
It's not related to reconciling leadership. What I want in this PR is for an op (for example, put_blob) to eventually be executed by a replica and not be missed, even if an unexpected leader switch happens. From another perspective, if all three replicas think they are not the leader and are stuck waiting for some op (for example, blob_exist), who should schedule the leadership reconciliation, and how would it be done when the replicas are all stuck?
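
A rough sketch of the "eventually executed, never missed" idea: the caller keeps issuing the op and treats visibility of the result, not the return code, as the completion signal. Everything here is a hypothetical stand-in (`FakeStore`, `PutResult`, `put_until_visible`); it models a leader switch as a put that commits but reports a leadership error, and it is not the fixture's actual put_blob or blob_exist code.

```cpp
#include <cassert>
#include <set>
#include <string>

enum class PutResult { OK, NOT_LEADER };

// Hypothetical in-memory stand-in for replicated blob state (assumption, not HomeObject code).
struct FakeStore {
    std::set<std::string> blobs;
    int leader_switches_left = 0;  // attempts that lose leadership right after committing

    PutResult put(const std::string& id) {
        blobs.insert(id);  // the write lands in both cases
        if (leader_switches_left > 0) {
            --leader_switches_left;
            return PutResult::NOT_LEADER;  // caller only sees a leadership error
        }
        return PutResult::OK;
    }

    bool exists(const std::string& id) const { return blobs.count(id) != 0; }
};

// Issue puts until the blob is visible. Returns the number of put attempts made,
// or -1 if the blob is still missing after max_attempts.
inline int put_until_visible(FakeStore& store, const std::string& id, int max_attempts) {
    for (int attempt = 0; attempt < max_attempts; ++attempt) {
        if (store.exists(id)) { return attempt; }  // idempotent check: an earlier attempt landed
        store.put(id);                             // result ignored; visibility decides completion
    }
    return store.exists(id) ? max_attempts : -1;
}
```

Checking for existence before each attempt keeps the retry idempotent: an op that committed just before a leader switch is not issued a second time.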
63dcbed to 66213d6
Occasionally, CI gets stuck at homestore_test_pg/shard/blob. The root cause is an unexpected leader switch.
A follower waits for something to happen, but the leader thinks it is no longer the leader (because of the leader switch) and does not schedule the op. All members then sync and wait at some point, and the UT is stuck.
This PR tries to add more retries to avoid this case.