Improvements to the results analysis app by fanny-riols · Pull Request #77 · ServiceNow/eva

fanny-riols · 2026-04-24T20:30:04Z

Summary

Cascading System filter: replaced the flat Model filter in the cross-run comparison with a System filter that shows a human-readable label and is constrained by the selected Provider, so irrelevant systems are hidden automatically.
Complete-runs-only & hide-incomplete toggles: two new sidebar toggles filter out partial runs from the comparison table and scatter plot; both are URL-bound so the view is shareable.
Per-sample heatmap: a new interactive heatmap below the comparison tables shows per-record metric scores across all runs, with a metric selector, swap-axes toggle, and correct RdYlGn / RdYlGn_r colorscale based on whether lower or higher is better.
Pinned columns & row numbers: the #, link, and System columns are now pinned in the dataframe so they stay visible when scrolling horizontally; the link column is labelled "Run".
URL-bound state: the latest-run toggle and show-sub-metrics toggle are now bound to query params, so deep-linking to a specific view works end-to-end.
Lower-is-better metric set: restricted the inverted colorscale to response_speed and stt_wer only (was applied too broadly).
Latest-run filter fix: timestamp-only folders now derive their system name from config.json so that different models sharing the same folder format are treated as distinct systems and deduplicated correctly.
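The cascading behaviour of the System filter can be sketched without any UI code. This is a minimal illustration, not the code in apps/analysis.py: the `provider` and `system` field names, the sample values, and both helper functions are hypothetical, and an empty Provider selection is assumed to mean "no constraint".

```python
from __future__ import annotations


def system_options(runs: list[dict], selected_providers: list[str]) -> list[str]:
    """Return the system labels to offer, given the current Provider selection.

    An empty provider selection is treated as "no constraint" (an assumption).
    """
    return sorted(
        {
            run["system"]
            for run in runs
            if not selected_providers or run["provider"] in selected_providers
        }
    )


def prune_selection(selected_systems: list[str], options: list[str]) -> list[str]:
    """Drop previously selected systems that the Provider filter now hides."""
    return [system for system in selected_systems if system in options]


# Hypothetical run metadata, for illustration only.
runs = [
    {"provider": "provider-a", "system": "system-1"},
    {"provider": "provider-a", "system": "system-2"},
    {"provider": "provider-b", "system": "system-3"},
]
options = system_options(runs, ["provider-a"])
kept = prune_selection(["system-2", "system-3"], options)
```

Pruning the previous selection against the new options is one way to keep a stale System choice from silently filtering the table down to nothing after the Provider changes.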

Two bugs in filter_latest_runs:
- Timestamp-only folders (no system suffix) used the full folder name as system identity, so multiple haiku runs would never deduplicate. Now loads config.json to derive the system name the same way the display does.
- When multiple output directories are entered, runs were concatenated per-directory without a global sort, so an older run from dir1 could win over a newer run from dir2. Now sorted globally by folder name before filtering.
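The global-sort fix can be illustrated with a small sketch. It is not the actual filter_latest_runs implementation: the `Run` type, the `latest_run_per_system` helper, and the folder-name format (a leading timestamp that sorts lexicographically) are assumptions made for the example.

```python
from __future__ import annotations

from typing import NamedTuple


class Run(NamedTuple):
    directory: str  # output directory the run was found in
    folder: str     # run folder name, assumed to start with a sortable timestamp
    system: str     # system identity used for deduplication


def latest_run_per_system(runs_by_directory: dict[str, list[Run]]) -> list[Run]:
    """Keep only the newest run of each system across all directories."""
    # Flatten first, then sort once globally, so that directory order
    # cannot decide which run counts as the latest.
    all_runs = [run for runs in runs_by_directory.values() for run in runs]
    all_runs.sort(key=lambda run: run.folder)

    latest: dict[str, Run] = {}
    for run in all_runs:
        latest[run.system] = run  # a later folder overwrites an earlier one
    return sorted(latest.values(), key=lambda run: run.folder)


# Hypothetical folders: dir2 is listed first but holds the newer haiku run.
older = Run("dir1", "2026-04-20_10-00-00_haiku", "haiku")
newer = Run("dir2", "2026-04-22_09-00-00_haiku", "haiku")
result = latest_run_per_system({"dir2": [newer], "dir1": [older]})
```

Without the global sort, iterating dir2 before dir1 would let the older dir1 run overwrite the newer one.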

The previous change made timestamp-only folders (no system suffix) derive their system name from config.json. As a result, a newer timestamp-only run could claim the same system name as an older suffixed run with a matching model config, silently suppressing the suffixed run in filter_latest_runs. Timestamp-only folders now keep their full folder name as the system name (which is unique), so they never suppress suffixed runs with matching model configs.

Timestamp-only folders (no system suffix) derive their system identity from config.json so that runs with different models (e.g. different LLMs with the same STT/TTS) are treated as distinct systems, and multiple runs of the same model are deduplicated to the latest one.

Bind 'Show sub-metrics' to query params so it persists in shared links. Add 'Hide incomplete results' toggle (default on, URL-bound) that drops rows missing any EVA-A or EVA-X metric from the table and scatter plot; diagnostic and validation metrics are ignored.
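A rough sketch of how a URL-bound, default-on toggle and the incomplete-row filter might fit together, using only the standard library. The query-parameter name `hide_incomplete`, the metric column names, and both helpers are hypothetical; the app itself presumably reads these values through Streamlit rather than parsing the query string by hand.

```python
from __future__ import annotations

import math
from urllib.parse import parse_qs

# Hypothetical stand-ins for the EVA-A and EVA-X metric columns.
REQUIRED_METRICS = ["eva_a_metric", "eva_x_metric"]


def toggle_from_query(query: str, key: str, default: bool) -> bool:
    """Read a boolean toggle from a URL query string, falling back to a default."""
    values = parse_qs(query).get(key)
    if not values:
        return default
    return values[-1].lower() in {"1", "true", "yes", "on"}


def is_missing(value: object) -> bool:
    """Treat None and NaN as missing values."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def hide_incomplete(rows: list[dict], required: list[str]) -> list[dict]:
    """Drop rows missing any required metric; other columns are ignored."""
    return [row for row in rows if not any(is_missing(row.get(m)) for m in required)]


rows = [
    {"run": "run-1", "eva_a_metric": 0.8, "eva_x_metric": 0.7, "diagnostic_metric": None},
    {"run": "run-2", "eva_a_metric": 0.9, "eva_x_metric": float("nan")},
    {"run": "run-3", "eva_a_metric": 0.6},
]
enabled = toggle_from_query("hide_incomplete=1", "hide_incomplete", default=True)
visible = hide_incomplete(rows, REQUIRED_METRICS) if enabled else rows
```

Here run-1 stays visible even though its diagnostic metric is missing, because only the required metrics are checked.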

Renders an interactive Plotly heatmap below the results tables with sample IDs on one axis, systems on the other, and cell color encoding the selected metric. Includes a metric dropdown, RdYlGn/RdYlGn_r colorscale auto-selected by directionality, and a Swap Axes toggle that reverses sample order on the y-axis to preserve top-to-bottom reading order.
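The axis handling described above can be sketched as a pure function that prepares the heatmap inputs. This is an illustration under assumptions, not the PR's code: the `heatmap_args` helper and the nested `scores` layout are hypothetical, and the default orientation (samples on x, systems on y) is a guess.

```python
from __future__ import annotations


def heatmap_args(
    scores: dict[str, dict[str, float]],
    samples: list[str],
    systems: list[str],
    lower_is_better: bool,
    swap_axes: bool = False,
) -> dict:
    """Build x, y, z and a colorscale name for a heatmap.

    scores maps system -> sample ID -> metric value; missing cells become None.
    """
    # Green should mean "good", so reverse the scale when lower is better.
    colorscale = "RdYlGn_r" if lower_is_better else "RdYlGn"

    if swap_axes:
        # Samples go on the y-axis. Heatmap rows are drawn bottom-up by
        # default, so reverse the samples to keep the first one on top.
        x = list(systems)
        y = list(reversed(samples))
        z = [[scores.get(system, {}).get(sample) for system in x] for sample in y]
    else:
        x = list(samples)
        y = list(systems)
        z = [[scores.get(system, {}).get(sample) for sample in x] for system in y]

    return {"x": x, "y": y, "z": z, "colorscale": colorscale}


# Hypothetical scores for two systems and two samples.
scores = {"A": {"s1": 1.0, "s2": 0.5}, "B": {"s1": 0.0}}
default_view = heatmap_args(scores, ["s1", "s2"], ["A", "B"], lower_is_better=False)
swapped_view = heatmap_args(scores, ["s1", "s2"], ["A", "B"], lower_is_better=True, swap_axes=True)
```

The returned dictionary uses the same keyword names that Plotly's `go.Heatmap` accepts (`x`, `y`, `z`, `colorscale`), so it could be unpacked into a trace directly.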

Replace substring-based keyword matching with an exact-name set containing only the two metrics that should use inverted color scales.
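To see why exact names are safer than substring matching, consider the sketch below. The `_LOWER_BETTER_METRICS` set matches the one in this commit, but the keyword list in the substring variant and the `answer_quality` metric name are hypothetical, chosen only to show how a false positive can arise.

```python
# Exact-name set, as introduced in this commit.
_LOWER_BETTER_METRICS = {"response_speed", "stt_wer"}

# Hypothetical keywords standing in for the earlier substring-based approach.
_LOWER_BETTER_KEYWORDS = ("speed", "wer")


def is_lower_better_substring(metric_name: str) -> bool:
    """Earlier style: any keyword appearing anywhere in the name matches."""
    return any(keyword in metric_name.lower() for keyword in _LOWER_BETTER_KEYWORDS)


def is_lower_better_exact(metric_name: str) -> bool:
    """Exact-name style: only the listed metrics are inverted."""
    return metric_name in _LOWER_BETTER_METRICS


# "answer_quality" contains "wer" (ans-WER), so substring matching
# would wrongly invert its colorscale.
false_positive = is_lower_better_substring("answer_quality")
exact_result = is_lower_better_exact("answer_quality")
```

With exact names, adding a new metric never changes the colorscale of an existing one by accident, at the cost of having to list each lower-is-better metric explicitly.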

gabegma · 2026-04-25T02:53:30Z

+# Metric names for which lower values are better
+_LOWER_BETTER_METRICS = {"response_speed", "stt_wer"}
+


I detect this automatically now, since #65.

- Remove redundant _LOWER_BETTER_METRICS set and _is_lower_better(); use _is_lower_is_better() (registry-driven) for heatmap colorscale instead
- Drop manual # row counter column; show default dataframe index
- Replace st.columns + markdown padding with st.container(horizontal=True) + st.space() for heatmap controls

fanny-riols added 13 commits April 17, 2026 15:27

Persist latest-run toggle in URL and propagate to run overview links

11727c5

Replace Model filter with cascading System filter in cross-run comparison

e94f9c3

Add 'Complete runs only' toggle to cross-run comparison

179b75e

Apply complete-runs-only filter to scatter plot

fa24303

Add pinned row IDs, link label, and pinned System column to cross-run comparison tables

079e210

Restrict lower-is-better to response_speed and stt_wer only

3e8b5d3

Replace substring-based keyword matching with an exact-name set containing only the two metrics that should use inverted color scales.

Merge branch 'main' into pr/fr/results_app

1d9855d

Hide app errors from UI, log file-load failures to stderr

9f7dfc3

gabegma reviewed Apr 25, 2026

JosephMarinier reviewed Apr 25, 2026

Comment thread apps/analysis.py Outdated

JosephMarinier reviewed Apr 25, 2026

Comment thread apps/analysis.py Outdated

JosephMarinier reviewed Apr 25, 2026

Comment thread apps/analysis.py Outdated

JosephMarinier approved these changes Apr 25, 2026

gabegma and others added 11 commits April 25, 2026 16:08

Merge branch 'main' into pr/fr/results_app

d648c86

Fix links with toggles

a6bec12

Fix total record when there is no metric file

10c9e99

Add domains

70ee505

Fix complete runs with multiple domains

0905356

Remove hide incomplete toggle

9d9a649

Add toggle to summarize across domains

f637651

Stop relying on the folder name to detect failures

ef267fa

Add domain on hover

055c182

Add table to see any metric per model/domain

cf8d7c2

gabegma approved these changes Apr 28, 2026

gabegma marked this pull request as ready for review April 28, 2026 15:45

gabegma added this pull request to the merge queue Apr 28, 2026

Merged via the queue into main with commit 58e8a30 Apr 28, 2026
1 check passed

gabegma deleted the pr/fr/results_app branch April 28, 2026 15:47

gabegma merged 24 commits into main from pr/fr/results_app

3 participants
