Skip to content

[CELEBORN-2334] Automatically restore RocksDB in case of failures#3695

Open
AmandeepSingh285 wants to merge 12 commits into
apache:mainfrom
AmandeepSingh285:auto-recover-rocks-db
Open

[CELEBORN-2334] Automatically restore RocksDB in case of failures#3695
AmandeepSingh285 wants to merge 12 commits into
apache:mainfrom
AmandeepSingh285:auto-recover-rocks-db

Conversation

@AmandeepSingh285

@AmandeepSingh285 AmandeepSingh285 commented May 18, 2026

Copy link
Copy Markdown
Contributor

What changes were proposed in this pull request?

The patch re-instantiates RocksDB in case of failures. In the current implementation, when RocksDB enters a read-only mode due to failures, Celeborn metadata operations fail and remain blocked until manual intervention or restart. This pull request adds logic to detect such RocksDB failures and re-instantiate the RocksDB instance so that metadata operations can recover automatically and continue functioning without prolonged disruption. RocksDB can enter a read-only or unusable state under scenarios such as: corruption in files, errors from underlying file system. In such cases, RocksDB prevents further writes to protect data consistency, which causes Celeborn metadata operations to fail.

Why are the changes needed?

Once RocksDB enters a read-only or error state, Celeborn metadata operations become unavailable because the existing RocksDB instance remains unusable, which could lead to failures in metadata updates.

Does this PR resolve a correctness bug?

No.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Unit tests.

@SteNicholas SteNicholas changed the title [CELEBORN-2334] Auto-recover rocks db in case of failures [CELEBORN-2334] Auto-recover RocksDB in case of failures May 19, 2026
@SteNicholas SteNicholas changed the title [CELEBORN-2334] Auto-recover RocksDB in case of failures [CELEBORN-2334] Automatically restore RocksDB in case of failures May 19, 2026
@SteNicholas SteNicholas requested a review from Copilot May 19, 2026 03:37

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds automatic recovery for the worker's RocksDB-backed metadata store: when a DB operation throws RocksDBException, the RocksDB wrapper closes the existing native instance and reopens it via RocksDBProvider.initRockDB, coordinated by a ReentrantReadWriteLock and a generation counter so only one thread performs the reopen.

Changes:

  • RocksDB now holds a volatile db reference plus dbFile/version and exposes a recreateDBInstance(generation) helper invoked from each operation's catch block.
  • All putInternal/getInternal/deleteInternal/newIterator/close methods are wrapped in read-/write-lock scopes around the generation snapshot.
  • DBProvider.initDB passes dbFile and version through to the new RocksDB constructor.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 8 comments.

File Description
worker/src/main/java/org/apache/celeborn/service/deploy/worker/shuffledb/RocksDB.java Adds locking, generation counter, and recreateDBInstance recovery in each DB operation.
worker/src/main/java/org/apache/celeborn/service/deploy/worker/shuffledb/DBProvider.java Threads dbFile/version into the updated RocksDB constructor.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@github-actions github-actions Bot added correctness Correctness bugfix and removed correctness Correctness bugfix labels May 19, 2026

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 9 comments.

Comment thread common/src/main/scala/org/apache/celeborn/common/CelebornConf.scala Outdated

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 4 comments.

Comment thread docs/configuration/worker.md Outdated

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 12 out of 12 changed files in this pull request and generated 2 comments.

@SteNicholas

Copy link
Copy Markdown
Member

@AmandeepSingh285, please update the ConfigOption to align with the changes of configuration/worker.md for failure of CI.

@AmandeepSingh285

Copy link
Copy Markdown
Contributor Author

@AmandeepSingh285, thanks for updates. I have left minor comments. PTAL.

Thanks @SteNicholas , made changes according to the comments

@RexXiong

Copy link
Copy Markdown
Contributor

Thanks for the work on this PR — the overall design (ManagedRocksDB lifecycle wrapper, generation + ReadWriteLock for concurrent recovery dedup, stale iterator detection) is solid. A couple of observations:

1. tryRecoverDBInstance: failed reopen leaves DB in an infinite recovery loop

If reopenRocksDB throws, db still references the already-closed ManagedRocksDB, and the generation has already been incremented:

try {
    if (db != null) {
        db.close();           // old DB closed
    }
} catch (Exception e) { ... }

dbGeneration.incrementAndGet();   // generation bumped

try {
    db = RocksDBProvider.reopenRocksDB(dbFile, conf);  // if this fails...
} catch (IOException e) {
    logger.error("Safe reopen failed ...", e);
    // db still points to the CLOSED ManagedRocksDB
    // generation already incremented → dedup won't block next attempt
}

Every subsequent operation will: use the closed DB → throw → trigger a new recovery (passes the dedup check since generation advanced) → close the already-closed DB → attempt reopen → fail again. Each operation pays the cost of a write-lock acquisition + a full reopen attempt, and it never converges.

Suggestion: mark the DB as terminally failed after reopen failure so that checkState() fast-fails with a clear message (e.g., "DB recovery failed, manual intervention required") instead of looping:

} catch (IOException e) {
    logger.error("Safe reopen failed for RocksDB at {}.", dbFile, e);
    db = null;
    closed = true;  // or a dedicated recoveryFailed flag
}

2. Minor: testNewIteratorReturnsUsableIterator assertion is fragile

The test puts 2 entries ("a", "b") but asserts seen >= 3, relying on the version entry written internally by initRockDB. If the version storage mechanism changes, this test will break. Consider asserting seen >= 2 or using an exact count.

Reviewed with Claude Code

@AmandeepSingh285

Copy link
Copy Markdown
Contributor Author

Thanks @RexXiong for the review.

  1. Recovery after failed reopen - The goal of the change was to have RocksDB automatically recovered rather than entering a terminal state. In the current state as well, in case of any failures when RocksDB enters a Read only mode, a manual intervention is required to have metadata updates unblocked which I tried to solve through this PR.
  2. This is true that the test is relying on the version entry written. I can update the same with a small change if required.

@AmandeepSingh285

Copy link
Copy Markdown
Contributor Author

@RexXiong have updated the test with the commented changes. Could you please help with review

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 13 out of 13 changed files in this pull request and generated 1 comment.

@SteNicholas SteNicholas requested a review from RexXiong June 8, 2026 19:23
@SteNicholas

Copy link
Copy Markdown
Member

Ping @RexXiong, PTAL.

@AmandeepSingh285

Copy link
Copy Markdown
Contributor Author

Hi @RexXiong , gentle reminder for review. Thanks!

@RexXiong RexXiong left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the late reply, I left a minor comment.

Comment thread common/src/main/scala/org/apache/celeborn/common/CelebornConf.scala Outdated
Comment thread common/src/main/scala/org/apache/celeborn/common/CelebornConf.scala Outdated
Comment thread common/src/main/scala/org/apache/celeborn/common/CelebornConf.scala Outdated
Comment thread common/src/main/scala/org/apache/celeborn/common/CelebornConf.scala Outdated
Comment thread common/src/main/scala/org/apache/celeborn/common/CelebornConf.scala Outdated
Comment thread common/src/main/scala/org/apache/celeborn/common/CelebornConf.scala Outdated
@AmandeepSingh285

Copy link
Copy Markdown
Contributor Author

Thanks @SteNicholas , @RexXiong for the review. Have made changes based on the comments.

@SteNicholas SteNicholas requested a review from RexXiong June 16, 2026 07:21
Comment thread docs/configuration/worker.md Outdated
| celeborn.master.endpoints.resolver | org.apache.celeborn.common.client.StaticMasterEndpointResolver | false | Resolver class that can be used for discovering and updating the master endpoints. This allows users to provide a custom master endpoint resolver implementation. This is useful in environments where the master nodes might change due to scaling operations or infrastructure updates. Clients need to ensure that provided resolver class should be present in the classpath. | 0.5.2 | |
| celeborn.master.estimatedPartitionSize.minSize | 8mb | false | Ignore partition size smaller than this configuration of partition size for estimation. | 0.3.0 | celeborn.shuffle.minPartitionSizeToEstimate |
| celeborn.master.internal.endpoints | &lt;localhost&gt;:8097 | false | Endpoints of master nodes just for celeborn workers to connect, allowed pattern is: `<host1>:<port1>[,<host2>:<port2>]*`, e.g. `clb1:8097,clb2:8097,clb3:8097`. If the port is omitted, 8097 will be used. | 0.5.0 | |
| celeborn.worker.graceful.shutdown.recoverDb.rocksdb.autoRecovery.enabled | false | false | If true, the metadata DB will automatically attempt to recover from RocksDBException errors during put/get/delete operations. Recovery tries a safe reopen. If false, RocksDBException errors are propagated directly to the caller. | 0.7.0 | |

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please fix the failure of CI for worker.md.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants