[CELEBORN-2334] Automatically restore RocksDB in case of failures by AmandeepSingh285 · Pull Request #3695 · apache/celeborn

AmandeepSingh285 · 2026-05-18T13:40:24Z

What changes were proposed in this pull request?

The patch re-instantiates RocksDB in case of failures. In the current implementation, when RocksDB enters a read-only mode due to failures, Celeborn metadata operations fail and remain blocked until manual intervention or restart. This pull request adds logic to detect such RocksDB failures and re-instantiate the RocksDB instance so that metadata operations can recover automatically and continue functioning without prolonged disruption. RocksDB can enter a read-only or unusable state under scenarios such as: corruption in files, errors from underlying file system. In such cases, RocksDB prevents further writes to protect data consistency, which causes Celeborn metadata operations to fail.

Why are the changes needed?

Once RocksDB enters a read-only or error state, Celeborn metadata operations become unavailable because the existing RocksDB instance remains unusable, which could lead to failures in metadata updates.

Does this PR resolve a correctness bug?

No.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Unit tests.

Copilot

Pull request overview

Adds automatic recovery for the worker's RocksDB-backed metadata store: when a DB operation throws RocksDBException, the RocksDB wrapper closes the existing native instance and reopens it via RocksDBProvider.initRockDB, coordinated by a ReentrantReadWriteLock and a generation counter so only one thread performs the reopen.

Changes:

RocksDB now holds a volatile db reference plus dbFile/version and exposes a recreateDBInstance(generation) helper invoked from each operation's catch block.
All putInternal/getInternal/deleteInternal/newIterator/close methods are wrapped in read-/write-lock scopes around the generation snapshot.
DBProvider.initDB passes dbFile and version through to the new RocksDB constructor.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 8 comments.

File	Description
worker/src/main/java/org/apache/celeborn/service/deploy/worker/shuffledb/RocksDB.java	Adds locking, generation counter, and `recreateDBInstance` recovery in each DB operation.
worker/src/main/java/org/apache/celeborn/service/deploy/worker/shuffledb/DBProvider.java	Threads `dbFile`/`version` into the updated `RocksDB` constructor.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

Copilot

Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 9 comments.

Copilot

Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 4 comments.

Copilot

Pull request overview

Copilot reviewed 12 out of 12 changed files in this pull request and generated 2 comments.

SteNicholas · 2026-05-21T12:05:47Z

@AmandeepSingh285, please update the ConfigOption to align with the changes of configuration/worker.md for failure of CI.

AmandeepSingh285 · 2026-05-25T07:11:51Z

@AmandeepSingh285, thanks for updates. I have left minor comments. PTAL.

Thanks @SteNicholas , made changes according to the comments

RexXiong · 2026-05-27T08:08:03Z

Thanks for the work on this PR — the overall design (ManagedRocksDB lifecycle wrapper, generation + ReadWriteLock for concurrent recovery dedup, stale iterator detection) is solid. A couple of observations:

1. tryRecoverDBInstance: failed reopen leaves DB in an infinite recovery loop

If reopenRocksDB throws, db still references the already-closed ManagedRocksDB, and the generation has already been incremented:

try {
    if (db != null) {
        db.close();           // old DB closed
    }
} catch (Exception e) { ... }

dbGeneration.incrementAndGet();   // generation bumped

try {
    db = RocksDBProvider.reopenRocksDB(dbFile, conf);  // if this fails...
} catch (IOException e) {
    logger.error("Safe reopen failed ...", e);
    // db still points to the CLOSED ManagedRocksDB
    // generation already incremented → dedup won't block next attempt
}

Every subsequent operation will: use the closed DB → throw → trigger a new recovery (passes the dedup check since generation advanced) → close the already-closed DB → attempt reopen → fail again. Each operation pays the cost of a write-lock acquisition + a full reopen attempt, and it never converges.

Suggestion: mark the DB as terminally failed after reopen failure so that checkState() fast-fails with a clear message (e.g., "DB recovery failed, manual intervention required") instead of looping:

} catch (IOException e) {
    logger.error("Safe reopen failed for RocksDB at {}.", dbFile, e);
    db = null;
    closed = true;  // or a dedicated recoveryFailed flag
}

2. Minor: testNewIteratorReturnsUsableIterator assertion is fragile

The test puts 2 entries ("a", "b") but asserts seen >= 3, relying on the version entry written internally by initRockDB. If the version storage mechanism changes, this test will break. Consider asserting seen >= 2 or using an exact count.

Reviewed with Claude Code

AmandeepSingh285 · 2026-05-27T12:09:47Z

Thanks @RexXiong for the review.

Recovery after failed reopen - The goal of the change was to have RocksDB automatically recovered rather than entering a terminal state. In the current state as well, in case of any failures when RocksDB enters a Read only mode, a manual intervention is required to have metadata updates unblocked which I tried to solve through this PR.
This is true that the test is relying on the version entry written. I can update the same with a small change if required.

AmandeepSingh285 · 2026-05-31T06:11:52Z

@RexXiong have updated the test with the commented changes. Could you please help with review

Copilot

Pull request overview

Copilot reviewed 13 out of 13 changed files in this pull request and generated 1 comment.

SteNicholas · 2026-06-12T02:34:10Z

Ping @RexXiong, PTAL.

AmandeepSingh285 · 2026-06-15T06:31:36Z

Hi @RexXiong , gentle reminder for review. Thanks!

RexXiong

Sorry for the late reply, I left a minor comment.

AmandeepSingh285 · 2026-06-16T06:09:30Z

Thanks @SteNicholas , @RexXiong for the review. Have made changes based on the comments.

SteNicholas · 2026-06-17T12:27:39Z

 | celeborn.master.endpoints.resolver | org.apache.celeborn.common.client.StaticMasterEndpointResolver | false | Resolver class that can be used for discovering and updating the master endpoints. This allows users to provide a custom master endpoint resolver implementation. This is useful in environments where the master nodes might change due to scaling operations or infrastructure updates. Clients need to ensure that provided resolver class should be present in the classpath. | 0.5.2 |  | 
 | celeborn.master.estimatedPartitionSize.minSize | 8mb | false | Ignore partition size smaller than this configuration of partition size for estimation. | 0.3.0 | celeborn.shuffle.minPartitionSizeToEstimate | 
 | celeborn.master.internal.endpoints | &lt;localhost&gt;:8097 | false | Endpoints of master nodes just for celeborn workers to connect, allowed pattern is: `<host1>:<port1>[,<host2>:<port2>]*`, e.g. `clb1:8097,clb2:8097,clb3:8097`. If the port is omitted, 8097 will be used. | 0.5.0 |  | 
+| celeborn.worker.graceful.shutdown.recoverDb.rocksdb.autoRecovery.enabled | false | false | If true, the metadata DB will automatically attempt to recover from RocksDBException errors during put/get/delete operations. Recovery tries a safe reopen. If false, RocksDBException errors are propagated directly to the caller. | 0.7.0 |  | 


Please fix the failure of CI for worker.md.

Auto-recover rocks db in case of failures

4462856

github-actions Bot added the module:worker label May 18, 2026

SteNicholas changed the title ~~[CELEBORN-2334] Auto-recover rocks db in case of failures~~ [CELEBORN-2334] Auto-recover RocksDB in case of failures May 19, 2026

SteNicholas changed the title ~~[CELEBORN-2334] Auto-recover RocksDB in case of failures~~ [CELEBORN-2334] Automatically restore RocksDB in case of failures May 19, 2026

SteNicholas requested a review from Copilot May 19, 2026 03:37

Copilot started reviewing on behalf of SteNicholas May 19, 2026 03:37 View session

Copilot AI reviewed May 19, 2026

View reviewed changes

github-actions Bot added correctness Correctness bugfix and removed correctness Correctness bugfix labels May 19, 2026

Addressed comments

fc731b2

github-actions Bot added kind:documentation module:common labels May 19, 2026

SteNicholas requested a review from Copilot May 19, 2026 18:18

Copilot started reviewing on behalf of SteNicholas May 19, 2026 18:18 View session

SteNicholas reviewed May 19, 2026

View reviewed changes

Comment thread worker/src/main/scala/org/apache/celeborn/service/deploy/worker/WorkerSource.scala Outdated

Copilot AI reviewed May 19, 2026

SteNicholas requested a review from Copilot May 20, 2026 02:29

Copilot started reviewing on behalf of SteNicholas May 20, 2026 02:29 View session

Copilot AI reviewed May 20, 2026

View reviewed changes

Addressed comments

426af32

SteNicholas requested a review from Copilot May 21, 2026 02:29

Copilot started reviewing on behalf of SteNicholas May 21, 2026 02:30 View session

Copilot AI reviewed May 21, 2026

View reviewed changes

Addressed comments

6bdc9f3

AmandeepSingh285 requested a review from SteNicholas May 21, 2026 09:53

SteNicholas requested a review from Copilot May 21, 2026 11:56

Copilot started reviewing on behalf of SteNicholas May 21, 2026 11:56 View session

Copilot AI reviewed May 21, 2026

View reviewed changes

Comment thread worker/src/main/java/org/apache/celeborn/service/deploy/worker/shuffledb/RocksDB.java

Comment thread ...src/test/java/org/apache/celeborn/service/deploy/worker/shuffledb/RocksDBRecoverySuiteJ.java Outdated

Addressed comments

9a78c0d

AmandeepSingh285 requested a review from SteNicholas May 25, 2026 07:08

SteNicholas approved these changes May 25, 2026

View reviewed changes

RexXiong mentioned this pull request May 25, 2026

[CELEBORN-2336] Support decommission shutdown for worker scale-down scenarios #3698

Open

2 tasks

SteNicholas force-pushed the main branch 2 times, most recently from 1d92a40 to cf8d472 Compare May 27, 2026 02:11

Fixed test to not use version pushed

f181891

SteNicholas requested a review from Copilot June 2, 2026 11:54

Copilot started reviewing on behalf of SteNicholas June 2, 2026 11:54 View session

Copilot AI reviewed Jun 2, 2026

View reviewed changes

Comment thread common/src/main/java/org/apache/celeborn/common/protocol/RocksDBCompressionType.java

SteNicholas requested a review from RexXiong June 8, 2026 19:23

RexXiong reviewed Jun 15, 2026

View reviewed changes