feat(validation): log reward subscores in validate_model#152
Open
go-sakayori wants to merge 5 commits into
Implements the validation-logging part of #130: wire compute_subscores_batch into the base-SFT validation loop so best-model selection can see the same rule-based subscores the training reward is built from. Additive only: avg_loss_* and best-model selection (still position_lat_loss) are unchanged.

- _reward_subscores_per_scene helper: compute_subscores_batch is single-scene / N-trajectory, so the helper loops over the B scenes (one ego prediction each) and stacks the continuous subscores. It uses the same metre ego-frame tensors the existing rb (road-border) / neighbor metrics use (prediction[:,0], neighbors_future, denorm_inputs[...]).
- Logged as ego_subscore_{safety,ttc,progress,comfort,centerline,red_light,feasibility} so train.py's mean_ego_loss picks them up as valid_loss/ego_subscore_*.

Tested: test_validation_subscores pins the per-scene batched helper against a direct single-scene compute_subscores_batch call; golden + parity + reward suite green. Subscore values should be sanity-checked on a real validation run.
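The per-scene loop described above can be sketched roughly as follows. This is a minimal stdlib-only illustration, not the PR's implementation: `subscores_per_scene`, `toy_score_fn`, and `VAL_SUBSCORE_KEYS` are hypothetical names, plain lists stand in for the real tensors, and the toy scorer only mimics the "one scene, N trajectories in, dict of per-trajectory scores out" shape that the description attributes to `compute_subscores_batch`.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Trajectory = List[Tuple[float, float]]  # (x, y) waypoints in metres, ego frame

# Hypothetical allowlist mirroring the logged key names in the description.
VAL_SUBSCORE_KEYS = (
    "safety", "ttc", "progress", "comfort",
    "centerline", "red_light", "feasibility",
)


def subscores_per_scene(
    ego_predictions: Sequence[Trajectory],
    scene_data: Sequence[dict],
    score_fn: Callable[[List[Trajectory], dict], Dict[str, List[float]]],
    keys: Sequence[str] = VAL_SUBSCORE_KEYS,
) -> Dict[str, List[float]]:
    """Call a single-scene scorer once per scene and stack the results per key."""
    if len(ego_predictions) != len(scene_data):
        raise ValueError("need exactly one scene dict per ego prediction")
    stacked: Dict[str, List[float]] = {f"ego_subscore_{k}": [] for k in keys}
    for trajectory, data in zip(ego_predictions, scene_data):
        # The scorer is single-scene / N-trajectory; here N is 1 (the ego prediction).
        scores = score_fn([trajectory], data)
        for k in keys:
            # Keep only allowlisted keys; anything else the scorer returns is dropped.
            stacked[f"ego_subscore_{k}"].append(scores[k][0])
    return stacked


def toy_score_fn(trajectories: List[Trajectory], data: dict) -> Dict[str, List[float]]:
    """Stand-in scorer (assumption): progress is the x-displacement, the rest are 0."""
    n = len(trajectories)
    out = {k: [0.0] * n for k in VAL_SUBSCORE_KEYS}
    out["progress"] = [traj[-1][0] - traj[0][0] for traj in trajectories]
    out["unused_extra"] = [1.0] * n  # not in the allowlist, so never logged
    return out


predictions = [[(0.0, 0.0), (2.0, 0.0)], [(0.0, 0.0), (5.0, 1.0)]]
result = subscores_per_scene(predictions, [{}, {}], toy_score_fn)
```

Each value in `result` has one entry per scene, which is the shape a per-batch mean would then reduce over.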
8c29ab8 to 98934df
Pull request overview
Adds logging of the rule-based reward’s component subscores during validate_model, so validation metrics expose the same physical quantities used by the training reward (without changing existing avg_loss_* reporting or best-model selection criteria).
Changes:
- Added `_reward_subscores_per_scene` helper in `validate_model.py` to call `planner_metrics.compute_subscores_batch` per scene and stack requested subscores.
- Logged subscores as additive validation metrics (`ego_subscore_{...}`) collected via the existing `total_result_dict` pathway.
- Added a unit test that checks the per-scene helper matches direct `compute_subscores_batch` calls and produces finite outputs.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| diffusion_planner/diffusion_planner/validate_model.py | Computes + logs reward subscores per validation scene as additional ego_subscore_* metrics. |
| diffusion_planner/tests/test_validation_subscores.py | New unit tests validating helper slicing/stacking correctness and finite outputs. |
Adds planner_metrics.epdms_like.epdms_like_aggregate: a single [0,1] EPDMS-structured proxy score = (product of binary gates NC/DAC/TLC/kinematic) × (weighted average of [0,1] quality terms ttc/progress/comfort/lane). It flips and bounds the raw reward subscores (0 = best, penalties negative) into a score where 1 = perfect.

Wired into validate_model.py as log-only metrics (ego_subscore_epdms_like plus gate_*/q_* components); checkpoint selection stays on L2 (mean_ego_loss). EPDMS keys are only emitted when gt_progress is passed, so the existing subscore-logging test is unaffected.

This is an EPDMS-structured proxy, NOT a faithful NAVSIM EPDMS port (the faithful one lives in OnePlanner, see #142): it scores raw predicted waypoints (no controller rollout), omits DDC, has no false-positive filtering, and uses mean|jerk| for comfort. Thresholds/weights in EPDMSLikeConfig still need calibrating so GT trajectories score ~1.0.

Includes unit tests in test_epdms_like_aggregate.py.
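To make the "gates × weighted quality" structure concrete, here is a small sketch under stated assumptions. It is not the code in `epdms_like.py`: `ProxyConfig`, `to_quality`, `to_gate`, and `proxy_score` are hypothetical names, and the linear flip `1 + raw / scale` and the gate tolerance are illustrative choices, since the commit message does not say how the raw subscores are mapped to [0, 1].

```python
from dataclasses import dataclass
from typing import Mapping


@dataclass
class ProxyConfig:
    """Hypothetical stand-in for EPDMSLikeConfig; all values are illustrative."""
    w_ttc: float = 1.0
    w_progress: float = 1.0
    w_comfort: float = 1.0
    w_lane: float = 1.0
    penalty_scale: float = 1.0   # raw penalty at which a quality term reaches 0
    gate_tolerance: float = 1e-6  # raw penalties within this of 0 pass the gate


def to_quality(raw: float, scale: float) -> float:
    """Flip a raw subscore (0 = best, negative = penalty) into [0, 1], 1 = perfect."""
    return min(1.0, max(0.0, 1.0 + raw / scale))


def to_gate(raw: float, tolerance: float) -> float:
    """Binary gate: 1.0 when there is effectively no penalty, else 0.0."""
    return 1.0 if raw >= -tolerance else 0.0


def proxy_score(
    gates_raw: Mapping[str, float],
    quality_raw: Mapping[str, float],
    cfg: ProxyConfig,
) -> float:
    """Product of binary gates times the weighted average of quality terms."""
    gate = 1.0
    for raw in gates_raw.values():
        gate *= to_gate(raw, cfg.gate_tolerance)
    weights = {
        "ttc": cfg.w_ttc,
        "progress": cfg.w_progress,
        "comfort": cfg.w_comfort,
        "lane": cfg.w_lane,
    }
    w_sum = sum(weights.values())  # assumed non-zero in this sketch
    quality = sum(
        weights[k] * to_quality(quality_raw[k], cfg.penalty_scale) for k in weights
    ) / w_sum
    return gate * quality


cfg = ProxyConfig()
clean_gates = {"nc": 0.0, "dac": 0.0, "tlc": 0.0, "kinematic": 0.0}
clean_quality = {"ttc": 0.0, "progress": 0.0, "comfort": 0.0, "lane": 0.0}
perfect = proxy_score(clean_gates, clean_quality, cfg)
gated_out = proxy_score({**clean_gates, "nc": -1.0}, clean_quality, cfg)
degraded = proxy_score(clean_gates, {**clean_quality, "ttc": -0.5}, cfg)
```

The multiplicative gate means any single hard violation zeroes the score regardless of quality, while the quality terms trade off against each other through the weights.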
Signed-off-by: Go Sakayori <gsakayori@gmail.com>
Comment on lines +236 to +240

    data_batched = {
        "ego_shape": inputs["ego_shape"],
        "neighbor_agents_future": neighbors_future,
        "neighbor_agents_past": denorm_inputs["neighbor_agents_past"],
    }
Comment on lines +167 to +174

    w_sum = cfg.w_ttc + cfg.w_progress + cfg.w_comfort + cfg.w_lane
    quality = (
        cfg.w_ttc * ttc_q
        + cfg.w_progress * progress_q
        + cfg.w_comfort * comfort_q
        + cfg.w_lane * lane_q
    ) / w_sum
- validate_model.py: include ego_agent_future in data_batched so compute_progress_score_batch falls back to the goal-directed GT endpoint instead of the predicted path length when goal_pose is absent/far (>100 m). Keeps the progress / q_progress term honest.
- epdms_like.py: guard against zero-sum quality weights (raise instead of producing NaN), and add the Apache 2.0 / TIER IV license header.
- test_epdms_like_aggregate.py, test_validation_subscores.py: add the license header to match repo convention.
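The zero-sum weight guard mentioned for epdms_like.py could look something like the sketch below. `weighted_quality` is a hypothetical helper written for illustration, not the function in the PR, and rejecting negative or non-finite sums as well is an extra assumption beyond the zero-sum case the commit message names.

```python
import math
from typing import Mapping


def weighted_quality(terms: Mapping[str, float], weights: Mapping[str, float]) -> float:
    """Weighted average of quality terms that raises instead of returning NaN."""
    if terms.keys() != weights.keys():
        raise ValueError("terms and weights must cover the same keys")
    w_sum = sum(weights.values())
    # Fail loudly on a degenerate config rather than letting 0/0 leak into the logs.
    if not math.isfinite(w_sum) or w_sum <= 0.0:
        raise ValueError(f"quality weights must sum to a positive value, got {w_sum}")
    return sum(weights[k] * terms[k] for k in terms) / w_sum


terms = {"ttc": 1.0, "progress": 0.5, "comfort": 1.0, "lane": 0.0}
equal_weights = {k: 1.0 for k in terms}
average = weighted_quality(terms, equal_weights)
```

Raising at aggregation time surfaces a misconfigured weight set on the first validation batch, which is usually easier to diagnose than a NaN metric discovered later in a dashboard.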
…egate

The EPDMS-like component keys were duplicated between this constant and the components dict returned by epdms_like_aggregate, a sync hazard. Accumulate whatever keys the aggregate returns instead, making epdms_like_aggregate the single source of truth. _VAL_SUBSCORE_KEYS stays -- it is a genuine allowlist filtering the ~20-key compute_subscores_batch output.
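The "accumulate whatever keys the aggregate returns" idea can be illustrated with a short sketch. The helper names are hypothetical, and the component keys `gate_nc` and `q_ttc` are made-up examples of the `gate_*` / `q_*` pattern; the point is only that no second list of keys is maintained on the caller's side.

```python
from collections import defaultdict
from typing import DefaultDict, Dict, List, Mapping


def accumulate_components(
    totals: DefaultDict[str, List[float]],
    components: Mapping[str, float],
) -> None:
    """Append every returned component under its own key; no hard-coded key list."""
    for key, value in components.items():
        totals[key].append(float(value))


def mean_per_key(totals: Mapping[str, List[float]]) -> Dict[str, float]:
    """Reduce the accumulated per-batch values to one mean per key."""
    return {key: sum(values) / len(values) for key, values in totals.items()}


totals: DefaultDict[str, List[float]] = defaultdict(list)
# Two pretend batches; the dicts stand in for the aggregate's components output.
accumulate_components(totals, {"ego_subscore_epdms_like": 1.0, "gate_nc": 1.0, "q_ttc": 0.5})
accumulate_components(totals, {"ego_subscore_epdms_like": 0.0, "gate_nc": 0.0, "q_ttc": 1.0})
means = mean_per_key(totals)
```

If the aggregate later gains or renames a component, the logged metrics follow automatically, which is the sync hazard the commit message is removing.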
Log reward subscores in `validate_model`

Surfaces the planner's rule-based reward subscores as additive validation metrics, so best-model selection can see the same physical quantities the training reward is built from. `avg_loss_*` and best-model selection (still `position_lat_loss`) are unchanged.

What changed: `diffusion_planner/diffusion_planner/validate_model.py`

- `_reward_subscores_per_scene` helper: `compute_subscores_batch` is single-scene / N-trajectory (its map + neighbor terms use one scene's tensors), so it loops over the `B` scenes (one ego prediction each) and stacks the continuous subscores. It reuses the same metre ego-frame tensors the existing road-border / neighbor metrics already use (`prediction[:, 0]`, the masked `neighbors_future`, `denorm_inputs[...]`).
- Logged as `ego_subscore_{safety,ttc,progress,comfort,centerline,red_light,feasibility}`, picked up by `train.py`'s `mean_ego_loss` as `valid_loss/ego_subscore_*`.

Why this is correct

- … `dataset.__getitem__` just `np.load`s it), so the subscores' tensor/channel conventions match, which is the same reason `compute_road_border_penalty` already runs in this loop.

Tests

- `test_validation_subscores` (new) pins the per-scene batched helper against a direct single-scene `compute_subscores_batch` call (slicing correctness + finite values).

Caveats

- Adds `B` `compute_subscores_batch` calls per batch to validation (per-epoch, not per-step); easy to gate behind a config flag if it proves too slow.