feat(validation): log reward subscores in validate_model #152

Open
go-sakayori wants to merge 5 commits into tier4-main from feat/validation-subscore-logging

go-sakayori commented Jun 18, 2026

Log reward subscores in validate_model

Surfaces the planner's rule-based reward subscores as additive validation metrics, so best-model selection can see the same physical quantities the training reward is built from. avg_loss_* and best-model selection (still position_lat_loss) are unchanged.

What changed — diffusion_planner/diffusion_planner/validate_model.py

  • _reward_subscores_per_scene helper: compute_subscores_batch is single-scene / N-trajectory (its map and neighbor terms use one scene's tensors), so the helper loops over the B scenes (one ego prediction each) and stacks the continuous subscores. It reuses the same metre-scale, ego-frame tensors that the existing road-border / neighbor metrics already use (prediction[:, 0], the masked neighbors_future, denorm_inputs[...]).
  • Logged as ego_subscore_{safety,ttc,progress,comfort,centerline,red_light,feasibility}, picked up by train.py's mean_ego_loss as valid_loss/ego_subscore_*.
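
The per-scene loop described above can be sketched roughly as follows. This is a pure-Python illustration, not the code in validate_model.py: `fake_subscores_single_scene` is a hypothetical stand-in for compute_subscores_batch (whose real signature is not shown in this PR text), the arithmetic inside it is made up, and plain lists stand in for tensors.

```python
# Allowlist of subscores to keep (a shortened subset of the names in the PR).
VAL_SUBSCORE_KEYS = ("safety", "ttc", "progress")


def fake_subscores_single_scene(trajectories, scene):
    """Hypothetical single-scene / N-trajectory scorer.

    Returns {name: [one value per trajectory]}. The formulas are invented
    only so that the example runs; they are not the real reward terms.
    """
    return {
        "safety": [-abs(t[-1] - scene["obstacle"]) for t in trajectories],
        "ttc": [0.0 for _ in trajectories],
        "progress": [t[-1] - t[0] for t in trajectories],
        "debug_only": [1.0 for _ in trajectories],  # dropped by the allowlist
    }


def subscores_per_scene(predictions, scenes, scorer, keys=VAL_SUBSCORE_KEYS):
    """Loop over the B scenes, score one ego trajectory each, stack per key."""
    stacked = {k: [] for k in keys}
    for ego_traj, scene in zip(predictions, scenes):
        # N = 1: wrap the single ego prediction as a one-trajectory batch.
        out = scorer([ego_traj], scene)
        for k in keys:
            stacked[k].append(out[k][0])
    return stacked


predictions = [[0.0, 1.0, 2.0], [0.0, 2.0, 5.0]]  # B = 2 ego trajectories
scenes = [{"obstacle": 2.0}, {"obstacle": 4.0}]   # one scene dict per sample
result = subscores_per_scene(predictions, scenes, fake_subscores_single_scene)
```

Each entry of `result` holds one value per scene, which is the shape a per-batch mean needs. The cost is B scorer calls per batch, as noted under Caveats.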

Why this is correct

  • The validation dataset feeds the same npz the reward consumes (dataset.__getitem__ simply calls np.load on it), so the subscores' tensor/channel conventions match. This is the same reason compute_road_border_penalty already runs in this loop.
  • The subscores are the exact same code the reward uses, guarded by the existing golden + parity tests.

Tests

  • test_validation_subscores (new) pins the per-scene batched helper against a direct single-scene compute_subscores_batch call (slicing correctness + finite values).
  • Golden + parity + full reward suite green (141 passed).

Caveats

  • The subscore values should be sanity-checked on a real validation run (effective from the next run only — in-flight training is unaffected).
  • The per-scene loop adds B compute_subscores_batch calls per batch to validation (per-epoch, not per-step); easy to gate behind a config flag if it proves too slow.

go-sakayori changed the title from "feat(validation): log reward subscores in validate_model (PR3 for #130)" to "feat(validation): log reward subscores in validate_model" on Jun 18, 2026
Implements the validation-logging part of #130: wire compute_subscores_batch into
the base-SFT validation loop so best-model selection can see the same rule-based
subscores the training reward is built from. Additive only — avg_loss_* and
best-model selection (still position_lat_loss) are unchanged.

- _reward_subscores_per_scene helper: compute_subscores_batch is single-scene /
  N-trajectory, so it loops over the B scenes (one ego prediction each) and stacks
  the continuous subscores. Uses the same metre ego-frame tensors the existing
  rb / neighbor metrics use (prediction[:,0], neighbors_future, denorm_inputs[...]).
- Logged as ego_subscore_{safety,ttc,progress,comfort,centerline,red_light,
  feasibility} so train.py's mean_ego_loss picks them up as valid_loss/ego_subscore_*.

Tested: test_validation_subscores pins the per-scene batched helper against a direct
single-scene compute_subscores_batch call; golden + parity + reward suite green.
Subscore values should be sanity-checked on a real validation run.

Copilot AI left a comment

Pull request overview

Adds logging of the rule-based reward’s component subscores during validate_model, so validation metrics expose the same physical quantities used by the training reward (without changing existing avg_loss_* reporting or best-model selection criteria).

Changes:

  • Added _reward_subscores_per_scene helper in validate_model.py to call planner_metrics.compute_subscores_batch per scene and stack requested subscores.
  • Logged subscores as additive validation metrics (ego_subscore_{...}) collected via the existing total_result_dict pathway.
  • Added a unit test that checks the per-scene helper matches direct compute_subscores_batch calls and produces finite outputs.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| diffusion_planner/diffusion_planner/validate_model.py | Computes + logs reward subscores per validation scene as additional ego_subscore_* metrics. |
| diffusion_planner/tests/test_validation_subscores.py | New unit tests validating helper slicing/stacking correctness and finite outputs. |

Comment thread diffusion_planner/tests/test_validation_subscores.py
Adds planner_metrics.epdms_like.epdms_like_aggregate: a single [0,1]
EPDMS-structured proxy score = (product of binary gates NC/DAC/TLC/kinematic)
x (weighted average of [0,1] quality terms ttc/progress/comfort/lane). It
flips and bounds the raw reward subscores (0 = best, penalties negative) into
a score where 1 = perfect.
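
Read literally, the description above corresponds to the form below. The symbols are introduced here only for illustration: $g_i \in \{0,1\}$ are the binary gates, $q_k \in [0,1]$ the quality terms, and $w_k$ their weights. How each raw subscore is flipped and bounded into $q_k$ is not spelled out in this message, so it is left out.

```latex
\[
S_{\text{EPDMS-like}}
  = \Bigl(\prod_{i \in \{\mathrm{NC},\,\mathrm{DAC},\,\mathrm{TLC},\,\mathrm{kinematic}\}} g_i\Bigr)
    \cdot
    \frac{\sum_{k \in \{\mathrm{ttc},\,\mathrm{progress},\,\mathrm{comfort},\,\mathrm{lane}\}} w_k\, q_k}
         {\sum_{k \in \{\mathrm{ttc},\,\mathrm{progress},\,\mathrm{comfort},\,\mathrm{lane}\}} w_k}
\]
```

Because the gates multiply, any single failed gate drives the score to 0 regardless of the quality terms.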

Wired into validate_model.py as log-only metrics (ego_subscore_epdms_like
plus gate_*/q_* components); checkpoint selection stays on L2 (mean_ego_loss).
EPDMS keys are only emitted when gt_progress is passed, so the existing
subscore-logging test is unaffected.

This is an EPDMS-structured proxy, NOT a faithful NAVSIM EPDMS port (the
faithful one lives in OnePlanner, see #142): it scores raw predicted
waypoints (no controller rollout), omits DDC, has no false-positive
filtering, and uses mean|jerk| for comfort. Thresholds/weights in
EPDMSLikeConfig still need calibrating so GT trajectories score ~1.0.

Includes unit tests in test_epdms_like_aggregate.py.

Signed-off-by: Go Sakayori <gsakayori@gmail.com>

Copilot AI left a comment

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.

Comment on lines +236 to +240
data_batched = {
    "ego_shape": inputs["ego_shape"],
    "neighbor_agents_future": neighbors_future,
    "neighbor_agents_past": denorm_inputs["neighbor_agents_past"],
}

Author

Fixed in 98f73b2

Comment on lines +167 to +174
w_sum = cfg.w_ttc + cfg.w_progress + cfg.w_comfort + cfg.w_lane
quality = (
    cfg.w_ttc * ttc_q
    + cfg.w_progress * progress_q
    + cfg.w_comfort * comfort_q
    + cfg.w_lane * lane_q
) / w_sum

Author

Fixed in 98f73b2

- validate_model.py: include ego_agent_future in data_batched so
  compute_progress_score_batch falls back to the goal-directed GT endpoint
  instead of the predicted path length when goal_pose is absent/far (>100 m).
  Keeps the progress / q_progress term honest.
- epdms_like.py: guard against zero-sum quality weights (raise instead of
  producing NaN), and add the Apache 2.0 / TIER IV license header.
- test_epdms_like_aggregate.py, test_validation_subscores.py: add the license
  header to match repo convention.
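
A minimal sketch of the zero-sum guard mentioned above, assuming scalar Python floats. `QualityWeights` is a hypothetical stand-in for the weight fields of EPDMSLikeConfig: the field names mirror the cfg.w_* names in the reviewed snippet, but the default values are placeholders and the real error type and message in epdms_like.py may differ.

```python
from dataclasses import dataclass


@dataclass
class QualityWeights:
    """Hypothetical weight container; defaults are placeholders, not calibrated."""
    w_ttc: float = 1.0
    w_progress: float = 1.0
    w_comfort: float = 1.0
    w_lane: float = 1.0


def weighted_quality(cfg, ttc_q, progress_q, comfort_q, lane_q):
    """Weighted average of the [0, 1] quality terms, refusing zero-sum weights."""
    w_sum = cfg.w_ttc + cfg.w_progress + cfg.w_comfort + cfg.w_lane
    if w_sum == 0.0:
        # With tensors, 0 / 0 would silently yield NaN; fail loudly instead.
        raise ValueError("quality weights sum to zero; cannot normalise")
    return (
        cfg.w_ttc * ttc_q
        + cfg.w_progress * progress_q
        + cfg.w_comfort * comfort_q
        + cfg.w_lane * lane_q
    ) / w_sum


quality = weighted_quality(QualityWeights(), 1.0, 0.5, 0.5, 0.0)
```

Raising at aggregation time surfaces a misconfigured weight set on the first validation batch rather than as NaN metrics in the logs.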
…egate

The EPDMS-like component keys were duplicated between this constant and the
components dict returned by epdms_like_aggregate, a sync hazard. Accumulate
whatever keys the aggregate returns instead, making epdms_like_aggregate the
single source of truth. _VAL_SUBSCORE_KEYS stays -- it is a genuine allowlist
filtering the ~20-key compute_subscores_batch output.
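
The two accumulation styles contrasted above might look like the sketch below. All key names and values in the example dicts (`gate_nc`, `q_ttc`, `unlogged_debug_term`) are hypothetical, the allowlist is shortened, and `accumulate` is an illustrative helper rather than a function from validate_model.py.

```python
from collections import defaultdict

# Genuine allowlist: picks a few entries out of a much larger subscore dict.
VAL_SUBSCORE_KEYS = ("safety", "ttc")


def accumulate(totals, subscores, aggregate_components):
    """Append one batch's metrics to the running per-key lists."""
    # Subscores: filter through the allowlist and add the logging prefix.
    for key in VAL_SUBSCORE_KEYS:
        totals[f"ego_subscore_{key}"].append(subscores[key])
    # Aggregate components: take whatever keys the aggregate returned, so
    # there is no second list of names to keep in sync.
    for key, value in aggregate_components.items():
        totals[key].append(value)


totals = defaultdict(list)
subscores = {"safety": -0.1, "ttc": 0.0, "unlogged_debug_term": 3.0}
components = {"ego_subscore_epdms_like": 0.8, "gate_nc": 1.0, "q_ttc": 0.9}
accumulate(totals, subscores, components)
```

If the aggregate later gains or renames a component, the logging side picks it up without any further change.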
