support checkpoint_no_save_replay_buffer#1899
Conversation
|
@claude review |
| @@ -979,12 +1007,17 @@ async def resume(self, checkpoint_path: Path | str) -> int: | |||
| checkpoint_path = Path(checkpoint_path) | |||
| for task in self.task_runners: | |||
| task.sampler.resume(self._task_checkpoint_path(checkpoint_path, task.task_name)) | |||
There was a problem hiding this comment.
Nit: Backward compatibility — old checkpoints (saved before this PR) won't have the replay_buffer_saved key. The get(..., True) default handles this correctly, so existing checkpoints will always attempt replay buffer restore. Good.
|
Claude: Code review completed for PR #1899 ( SummaryThis PR adds a IssuesWarning
Nit
ProduceBatchResult impact
RoutedExperts impactNot affected — no changes to routed-experts ownership, object-ref plumbing, or rollout response handling. VerdictAPPROVE — the core logic is sound, the backward-compat default is correct, and the test covers the main path. The one suggestion is to add a comment explaining the resume invariant. |
| task_names = list(consumed_samples) | ||
| next_consumer_step = int(progress_state["next_consumer_step"]) | ||
|
|
||
| progress_state["producer_future_step"] = next_consumer_step |
There was a problem hiding this comment.
ok, rollback producer_future_step and related states to next_consumer_step if drop replay_buffer
No description provided.