Improvements to the results analysis app #77

Merged
gabegma merged 24 commits into main from pr/fr/results_app on Apr 28, 2026

fanny-riols (Collaborator) commented on Apr 24, 2026

Summary

  • Cascading System filter: replaced the flat Model filter in the cross-run comparison with a System filter that shows a human-readable label and is constrained by the selected Provider, so irrelevant systems are hidden automatically.
  • Complete-runs-only & hide-incomplete toggles: two new sidebar toggles filter out partial runs from the comparison table and scatter plot; both are URL-bound so the view is shareable.
  • Per-sample heatmap: a new interactive heatmap below the comparison tables shows per-record metric scores across all runs, with a metric selector, swap-axes toggle, and correct RdYlGn / RdYlGn_r colorscale based on whether lower or higher is better.
  • Pinned columns & row numbers: the #, link, and System columns are now pinned in the dataframe so they stay visible when scrolling horizontally; the link column is labelled "Run".
  • URL-bound state: the latest-run toggle and show-sub-metrics toggle are now bound to query params, so deep-linking to a specific view works end-to-end.
  • Lower-is-better metric set: restricted the inverted colorscale to response_speed and stt_wer only (was applied too broadly).
  • Latest-run filter fix: timestamp-only folders now derive their system name from config.json so that different models sharing the same folder format are treated as distinct systems and deduplicated correctly.
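
The cascading System filter can be sketched roughly as follows. This is a minimal illustration, not the code from apps/analysis.py: the `Run` record, its field names, and both helper functions are hypothetical, and an empty Provider selection is assumed to mean "no constraint".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    provider: str      # hypothetical field: provider the run belongs to
    system_label: str  # hypothetical field: human-readable system label


def system_options(runs, selected_providers):
    """Return the sorted system labels offered under the selected providers.

    An empty provider selection is treated as "no constraint" (assumption).
    """
    selected = set(selected_providers)
    return sorted(
        {r.system_label for r in runs if not selected or r.provider in selected}
    )


def prune_selection(selected_systems, options):
    """Drop previously selected systems that are no longer offered."""
    allowed = set(options)
    return [s for s in selected_systems if s in allowed]
```

With runs from providers "alpha" and "beta", selecting only "alpha" would leave just the alpha systems in the System filter, and `prune_selection` would discard any stale beta selection.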

Two bugs in filter_latest_runs:
- Timestamp-only folders (no system suffix) used the full folder name as
  system identity, so multiple haiku runs would never deduplicate. Now loads
  config.json to derive the system name the same way the display does.
- When multiple output directories are entered, runs were concatenated
  per-directory without a global sort, so an older run from dir1 could win
  over a newer run from dir2. Now sorted globally by folder name before filtering.
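
The global-sort fix might look something like the sketch below. It assumes folder names begin with a timestamp whose lexicographic order matches chronological order; `filter_latest_runs_sketch` and the `system_of` callback are hypothetical stand-ins, not the real `filter_latest_runs` signature.

```python
def filter_latest_runs_sketch(run_folders, system_of):
    """Keep one folder per system: the one whose name sorts last.

    run_folders: folder names collected from all output directories.
    system_of:   callable mapping a folder name to its system identity.
    """
    latest = {}
    # Sort across all directories first, so a later timestamp always
    # overwrites an earlier one regardless of which directory it came from.
    for folder in sorted(run_folders):
        latest[system_of(folder)] = folder
    return sorted(latest.values())
```

Because the sort happens on the combined list, the result no longer depends on the order in which the output directories were entered.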

The previous change made timestamp-only folders (no system suffix) derive
their system name from config.json. This caused a newer timestamp-only run
to claim the same system name as an older suffixed run with a matching model
config, silently suppressing the suffixed run in filter_latest_runs.

Timestamp-only folders keep their full folder name as system name (unique),
so they never suppress suffixed runs with matching model configs.

Timestamp-only folders (no system suffix) derive their system identity
from config.json so that runs with different models (e.g. different LLMs
with the same STT/TTS) are treated as distinct systems, and multiple runs
of the same model are deduplicated to the latest one.
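
As a rough sketch of this identity rule: the folder-name pattern (`YYYY-MM-DD_HH-MM-SS` with an optional `_<system>` suffix) and the config.json layout (`model.llm`, `model.stt`, `model.tts`) below are assumptions made for illustration, and `system_identity` is a hypothetical helper rather than the function used in the app.

```python
import json
import re
from pathlib import Path

# Assumed folder formats (hypothetical):
#   "2026-04-20_10-00-00"          timestamp only
#   "2026-04-20_10-00-00_<system>" timestamp plus system suffix
_TIMESTAMP_ONLY = re.compile(r"^\d{4}-\d{2}-\d{2}_\d{2}-\d{2}-\d{2}$")
_WITH_SUFFIX = re.compile(r"^\d{4}-\d{2}-\d{2}_\d{2}-\d{2}-\d{2}_(?P<system>.+)$")


def system_identity(run_dir: Path) -> str:
    """Return the system identity used to deduplicate runs."""
    match = _WITH_SUFFIX.match(run_dir.name)
    if match:
        return match.group("system")

    config_path = run_dir / "config.json"
    if _TIMESTAMP_ONLY.match(run_dir.name) and config_path.is_file():
        config = json.loads(config_path.read_text(encoding="utf-8"))
        model = config.get("model", {})  # assumed key layout
        parts = [str(model[k]) for k in ("llm", "stt", "tts") if k in model]
        if parts:
            # Different LLMs with the same STT/TTS yield different identities.
            return "+".join(parts)

    # Fall back to the (unique) folder name when nothing better is known.
    return run_dir.name
```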

Bind 'Show sub-metrics' to query params so it persists in shared links.
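
One way to bind a toggle to query params is sketched below on a plain dict. In the app the mapping would presumably be Streamlit's dict-like `st.query_params`; the helper names and the `sub_metrics` key are hypothetical, and omitting default values from the URL is a design choice made here, not something the PR states.

```python
_TRUTHY = {"1", "true", "yes", "on"}


def read_flag(params, key, default):
    """Read a boolean flag from a dict-like query-param mapping."""
    raw = params.get(key)
    if raw is None:
        return default
    return str(raw).lower() in _TRUTHY


def write_flag(params, key, value, default):
    """Persist the flag, omitting it when it equals the default."""
    if value == default:
        if key in params:
            del params[key]
    else:
        params[key] = "1" if value else "0"
```

Reading the flag on page load and writing it back whenever the toggle changes keeps the URL and the widget in sync, which is what makes the view shareable.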

Add 'Hide incomplete results' toggle (default on, URL-bound) that drops
rows missing any EVA-A or EVA-X metric from the table and scatter plot;
diagnostic and validation metrics are ignored.
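
The completeness rule could be sketched like this, with rows as plain dicts. The metric names in the usage note are invented for illustration, and the helpers are hypothetical; the real app works on its own dataframe columns.

```python
import math


def is_complete(row, required_metrics):
    """True when every required metric is present and not NaN."""
    for name in required_metrics:
        value = row.get(name)
        if value is None:
            return False
        if isinstance(value, float) and math.isnan(value):
            return False
    return True


def hide_incomplete(rows, required_metrics, enabled=True):
    """Drop rows missing any required metric; other metrics are ignored."""
    if not enabled:
        return list(rows)
    return [row for row in rows if is_complete(row, required_metrics)]
```

Here `required_metrics` would hold only the EVA-A and EVA-X metric names, so a row with a missing diagnostic or validation value still counts as complete.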

Renders an interactive Plotly heatmap below the results tables with
sample IDs on one axis, systems on the other, and cell color encoding
the selected metric. Includes a metric dropdown, RdYlGn/RdYlGn_r
colorscale auto-selected by directionality, and a Swap Axes toggle
that reverses sample order on the y-axis to preserve top-to-bottom
reading order.
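
A minimal sketch of how the heatmap inputs might be assembled, kept free of Plotly so it runs on its own. The function name, the `scores` layout, and the choice of samples on the x-axis by default are assumptions; only the colorscale names and the reversed sample order after swapping come from the description above.

```python
def heatmap_inputs(scores, systems, sample_ids, lower_is_better, swap_axes=False):
    """Build x, y, z and a colorscale name for a per-sample heatmap.

    scores: dict mapping (system, sample_id) -> float; missing pairs give None.
    """
    # Green should always mean "good", so reverse the scale when lower is better.
    colorscale = "RdYlGn_r" if lower_is_better else "RdYlGn"

    if not swap_axes:
        x, y = list(sample_ids), list(systems)
        z = [[scores.get((system, sid)) for sid in x] for system in y]
    else:
        # Samples move to the y-axis; reverse them so that, on an axis drawn
        # bottom-to-top, the first sample ends up at the top.
        x, y = list(systems), list(reversed(sample_ids))
        z = [[scores.get((system, sid)) for system in x] for sid in y]

    return {"x": x, "y": y, "z": z, "colorscale": colorscale}
```

The returned dict uses the same keyword names as `plotly.graph_objects.Heatmap` (`x`, `y`, `z`, `colorscale`), so it could be unpacked into that constructor.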

Replace substring-based keyword matching with an exact-name set containing only the two metrics that should use inverted color scales.
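
To illustrate why substring matching applies the inverted scale too broadly: the keyword tuple and the `answer_quality` metric below are made up for the example, while the two names in the exact set are the ones this PR lists.

```python
# The two metric names this PR treats as lower-is-better.
LOWER_IS_BETTER = frozenset({"response_speed", "stt_wer"})


def is_lower_better_substring(metric, keywords=("speed", "wer")):
    """Old-style check (keywords are hypothetical): any keyword as a substring."""
    return any(keyword in metric for keyword in keywords)


def is_lower_better_exact(metric):
    """Exact-name check: only listed metrics get the inverted colorscale."""
    return metric in LOWER_IS_BETTER
```

A metric named `answer_quality` contains "wer" inside "answer", so the substring check would invert its colorscale by accident, while the exact-name check leaves it alone.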
Comment thread apps/analysis.py Outdated
Comment on lines +83 to +85
# Metric names for which lower values are better
_LOWER_BETTER_METRICS = {"response_speed", "stt_wer"}

This is now detected automatically, since #65.

gabegma and others added 11 commits April 25, 2026 16:08
- Remove redundant _LOWER_BETTER_METRICS set and _is_lower_better(); use _is_lower_is_better() (registry-driven) for heatmap colorscale instead
- Drop manual # row counter column; show default dataframe index
- Replace st.columns + markdown padding with st.container(horizontal=True) + st.space() for heatmap controls
@gabegma gabegma marked this pull request as ready for review April 28, 2026 15:45
@gabegma gabegma added this pull request to the merge queue Apr 28, 2026
Merged via the queue into main with commit 58e8a30 Apr 28, 2026
1 check passed
@gabegma gabegma deleted the pr/fr/results_app branch April 28, 2026 15:47