Fix recall computation for fewer than k groundtruth results #1069

Merged
magdalendobson merged 20 commits into main from users/magdalen/patch_recall_calculation on May 20, 2026

@magdalendobson
Contributor

Groundtruth for filtered datasets uses the .rangeres format, since a filtered query may have few or no results. However, recall computation for top-k search currently raises an error when a query has fewer than k groundtruth results. This PR fixes that error and computes recall for each query point as follows:

  1. If the point has 0 < k' <= k groundtruth results, recall is the number of correctly reported points divided by k'.
  2. If the point has no groundtruth results, it is completely excluded from the recall calculation.

The average recall is then the sum of the per-point recalls divided by the number of points that have at least one groundtruth result, so points with zero results do not affect recall. The reasoning is that in a dataset where most filtered queries have no results, we want recall to capture our performance on the queries that actually have results, rather than being artificially high because we always correctly report that the remaining points have no results.
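
The rule above can be sketched in a few lines of Rust. This is only an illustration of the averaging scheme described in this PR, not the code in diskann-benchmark-core/src/recall.rs: the function names `query_recall` and `average_recall`, the `u32` id type, and the nested `Vec` layout are all assumptions made for the example.

```rust
use std::collections::HashSet;

/// Recall for a single query (hypothetical helper, not the crate's API).
/// Uses k' = min(k, groundtruth length) as the denominator and returns
/// `None` when the query has no groundtruth results at all.
fn query_recall(results: &[u32], groundtruth: &[u32], k: usize) -> Option<f64> {
    let k_prime = groundtruth.len().min(k);
    if k_prime == 0 {
        return None; // excluded from the average entirely
    }
    let truth: HashSet<u32> = groundtruth[..k_prime].iter().copied().collect();
    // Only the first k reported ids count; the set removes duplicate ids.
    let reported: HashSet<u32> = results.iter().take(k).copied().collect();
    let hits = truth.intersection(&reported).count();
    Some(hits as f64 / k_prime as f64)
}

/// Average recall over the queries that have at least one groundtruth result.
/// Returns `None` if every query has empty groundtruth.
fn average_recall(results: &[Vec<u32>], groundtruth: &[Vec<u32>], k: usize) -> Option<f64> {
    let recalls: Vec<f64> = results
        .iter()
        .zip(groundtruth)
        .filter_map(|(r, g)| query_recall(r, g, k))
        .collect();
    if recalls.is_empty() {
        None
    } else {
        Some(recalls.iter().sum::<f64>() / recalls.len() as f64)
    }
}
```

For example, with k = 3, a query with groundtruth [1, 2, 3] and results [1, 2, 9] scores 2/3, a query with groundtruth [4] and results [4, 5, 6] scores 1/1, and a query with empty groundtruth is skipped, so the average is (2/3 + 1) / 2 = 5/6 rather than a figure diluted or inflated by the empty query.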

Since recall computation no longer raises this error, even for a regular top-k search, I have moved the check for insufficient groundtruth results to the point where groundtruth in the regular bin format is actually read. This is an improvement because the failure now triggers before the index build and search, which is a better experience for the user. I have left some instances of NotEnoughGroundTruth inside recall computation as an extra failsafe, but they can be removed if reviewers feel we should drop them completely.
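
A minimal sketch of what such a load-time check could look like follows. The error name NotEnoughGroundTruth comes from this PR, but its fields here are invented for the example, and `max_recall_k` and `validate_groundtruth_width` are hypothetical helpers, not the functions in diskann-benchmark/src/utils/datafiles.rs.

```rust
use std::fmt;

/// Stand-in for the NotEnoughGroundTruth error mentioned above.
/// The fields are assumptions made for this sketch.
#[derive(Debug, PartialEq, Eq)]
pub struct NotEnoughGroundTruth {
    pub available: usize,
    pub requested: usize,
}

impl fmt::Display for NotEnoughGroundTruth {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(
            f,
            "groundtruth has {} neighbors per query but recall_k = {} was requested",
            self.available, self.requested
        )
    }
}

impl std::error::Error for NotEnoughGroundTruth {}

/// Largest recall_k requested by any run, or `None` if there are no runs.
pub fn max_recall_k(recall_ks: &[usize]) -> Option<usize> {
    recall_ks.iter().copied().max()
}

/// Check a fixed-width groundtruth file right after its header is read,
/// before any index build or search work starts. `max_k = None` skips the check.
pub fn validate_groundtruth_width(
    neighbors_per_query: usize,
    max_k: Option<usize>,
) -> Result<(), NotEnoughGroundTruth> {
    match max_k {
        Some(requested) if neighbors_per_query < requested => Err(NotEnoughGroundTruth {
            available: neighbors_per_query,
            requested,
        }),
        _ => Ok(()),
    }
}
```

Taking the maximum recall_k over all runs means a single check at load time covers every run that will later use the same groundtruth.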

Along the way I removed the minimum and maximum values from the recall stats, since after some discussion with the team it appears that no one uses them.

@magdalendobson
Contributor Author

@microsoft-github-policy-service agree company="Microsoft"

@magdalendobson magdalendobson marked this pull request as ready for review May 14, 2026 20:00
@magdalendobson magdalendobson requested review from a team and Copilot May 14, 2026 20:00
Contributor

Copilot AI left a comment

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Fixes recall computation when filtered/variable-length groundtruth contains fewer than k results by computing per-query recall against the available groundtruth and excluding zero-result queries from the average.

Changes:

  • Update recall computation to handle per-row k' <= k groundtruth sizes and ignore empty groundtruth rows in the average.
  • Move “insufficient groundtruth for top-k” validation to groundtruth loading for fixed-size .bin groundtruth.
  • Remove minimum / maximum from reported recall metrics.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 9 comments.

| File | Description |
|---|---|
| diskann-benchmark/src/utils/recall.rs | Removes min/max fields from the benchmark-facing recall metrics wrapper. |
| diskann-benchmark/src/utils/datafiles.rs | Adds optional k validation when loading groundtruth from .bin. |
| diskann-benchmark/src/backend/index/benchmarks.rs | Computes max recall_k across runs and passes it into groundtruth loading. |
| diskann-benchmark-core/src/recall.rs | Implements variable-length groundtruth recall, removes min/max metrics, and updates tests accordingly. |

Comment thread diskann-benchmark/src/utils/datafiles.rs
Comment thread diskann-benchmark-core/src/recall.rs Outdated
Comment thread diskann-benchmark-core/src/recall.rs Outdated
Comment thread diskann-benchmark-core/src/recall.rs Outdated
Comment thread diskann-benchmark-core/src/recall.rs Outdated
Comment thread diskann-benchmark-core/src/recall.rs Outdated
Comment thread diskann-benchmark-core/src/recall.rs Outdated
Comment thread diskann-benchmark/src/backend/index/benchmarks.rs Outdated
Comment thread diskann-benchmark/src/utils/datafiles.rs
magdalendobson and others added 3 commits May 14, 2026 17:55
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
…more neighbors than needed

Agent-Logs-Url: https://github.com/microsoft/DiskANN/sessions/5f7cd8dc-8e4f-4c69-aea5-ad2c61baee52

Co-authored-by: magdalendobson <58752279+magdalendobson@users.noreply.github.com>
…oundtruth tests

Agent-Logs-Url: https://github.com/microsoft/DiskANN/sessions/5f7cd8dc-8e4f-4c69-aea5-ad2c61baee52

Co-authored-by: magdalendobson <58752279+magdalendobson@users.noreply.github.com>
Copilot AI and others added 2 commits May 14, 2026 22:09
…groundtruth mix

Agent-Logs-Url: https://github.com/microsoft/DiskANN/sessions/75b8623f-538e-4a4c-9762-12aac534c708

Co-authored-by: magdalendobson <58752279+magdalendobson@users.noreply.github.com>
Agent-Logs-Url: https://github.com/microsoft/DiskANN/sessions/75b8623f-538e-4a4c-9762-12aac534c708

Co-authored-by: magdalendobson <58752279+magdalendobson@users.noreply.github.com>
@codecov-commenter

codecov-commenter commented May 15, 2026

Codecov Report

❌ Patch coverage is 94.84127% with 13 lines in your changes missing coverage. Please review.
✅ Project coverage is 89.46%. Comparing base (895a2c0) to head (e3cc3f4).
⚠️ Report is 4 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| diskann-benchmark/src/backend/index/benchmarks.rs | 30.00% | 7 Missing ⚠️ |
| diskann-benchmark/src/backend/index/search/knn.rs | 50.00% | 5 Missing ⚠️ |
| diskann-benchmark/src/utils/datafiles.rs | 96.55% | 1 Missing ⚠️ |
Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1069      +/-   ##
==========================================
+ Coverage   89.45%   89.46%   +0.01%     
==========================================
  Files         458      458              
  Lines       85398    85530     +132     
==========================================
+ Hits        76392    76520     +128     
- Misses       9006     9010       +4     
| Flag | Coverage Δ | |
|---|---|---|
| miri | 89.46% <94.84%> (+0.01%) | ⬆️ |
| unittests | 89.09% <94.84%> (+0.01%) | ⬆️ |

| Files with missing lines | Coverage Δ | |
|---|---|---|
| diskann-benchmark-core/src/recall.rs | 98.50% <100.00%> (+0.70%) | ⬆️ |
| diskann-benchmark-core/src/search/graph/knn.rs | 95.08% <100.00%> (+0.16%) | ⬆️ |
| ...iskann-benchmark-core/src/search/graph/multihop.rs | 98.72% <100.00%> (+0.03%) | ⬆️ |
| diskann-benchmark/src/backend/exhaustive/minmax.rs | 100.00% <ø> (ø) | |
| ...iskann-benchmark/src/backend/exhaustive/product.rs | 100.00% <ø> (ø) | |
| ...kann-benchmark/src/backend/exhaustive/spherical.rs | 100.00% <ø> (ø) | |
| diskann-benchmark/src/backend/index/spherical.rs | 100.00% <ø> (ø) | |
| diskann-benchmark/src/inputs/graph_index.rs | 36.28% <100.00%> (+0.32%) | ⬆️ |
| diskann-benchmark/src/utils/recall.rs | 57.14% <ø> (-5.36%) | ⬇️ |
| diskann-benchmark/src/utils/datafiles.rs | 87.36% <96.55%> (+3.78%) | ⬆️ |

... and 2 more
Comment thread diskann-benchmark-core/src/recall.rs Outdated
Comment thread diskann-benchmark/src/utils/datafiles.rs
Comment thread diskann-benchmark-core/src/recall.rs Outdated
Comment thread diskann-benchmark-core/src/recall.rs Outdated
Comment thread diskann-benchmark-core/src/recall.rs Outdated
Comment thread diskann-benchmark-core/src/recall.rs Outdated
@magdalendobson
Contributor Author

@copilot resolve the merge conflicts in this pull request

Resolved conflict in diskann-benchmark/src/inputs/graph_index.rs:
- Kept max_k() method from feature branch (needed for recall computation fix)
- Adopted new validate() method pattern from main (replaces CheckDeserialization trait)
- Updated method calls from check_deserialization to resolve

Co-authored-by: magdalendobson <58752279+magdalendobson@users.noreply.github.com>
Contributor

Copilot AI commented May 19, 2026

@copilot resolve the merge conflicts in this pull request

Resolved in 934789c. The merge conflict in diskann-benchmark/src/inputs/graph_index.rs has been resolved by:

  • Keeping the max_k() method from the feature branch (needed for the recall computation fix)
  • Adopting the new validate() method pattern from main (replaces the CheckDeserialization trait)
  • Updating method calls from check_deserialization to resolve

All tests pass and clippy/fmt checks are clean.

Contributor

@hildebrandmw hildebrandmw left a comment

I had a thought - should this be behind a configuration option? (maybe ditching allow_insufficient_results since I think that should just always be the case).

My concern is that if we do a normal k-nearest-neighbors search and request recall at 100, but the groundtruth only has 10 nearest neighbors, we will silently get recall computed for 10 nearest neighbors instead of the expected 100. Basically, we need a way to opt-out of this behavior.

Here is my proposal:

  1. We ditch the allow_insufficient_results flag. Algorithms can't always ensure this and it feels like something that callers should enforce anyways.

  2. Add the following enum as an argument:

    pub enum GroundTruthMode {
        Fixed,
        Flexible,
    }

    In Fixed mode, we ensure we have at least recall_k entries, and Flexible mode allows what's happening in this PR.

    Using an enum makes updating the callsites easier and more explicit: we aren't changing one bool for another.

  3. Update diskann_benchmark::utils::compute_multiple_recalls to do one upfront check that the results have at least recall_n entries.

  4. Update the KNN searches to use GroundTruthMode::Fixed and the filtered KNN searches to use GroundTruthMode::Flexible
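
One way the proposal above could be realized is sketched here, under stated assumptions: the GroundTruthMode enum is taken from the proposal (with derives added for the example), while `effective_k` and its `String` error are hypothetical and are not the signature that was merged.

```rust
/// The mode proposed above; the derives are additions for this sketch.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum GroundTruthMode {
    /// Require at least `recall_k` groundtruth entries for every query.
    Fixed,
    /// Accept fewer than `recall_k` entries and skip empty rows.
    Flexible,
}

/// Hypothetical helper: the recall denominator to use for one query.
/// `Ok(None)` means "exclude this query from the average".
pub fn effective_k(
    mode: GroundTruthMode,
    groundtruth_len: usize,
    recall_k: usize,
) -> Result<Option<usize>, String> {
    match mode {
        GroundTruthMode::Fixed if groundtruth_len < recall_k => Err(format!(
            "groundtruth has {groundtruth_len} entries but recall_k is {recall_k}"
        )),
        GroundTruthMode::Fixed => Ok(Some(recall_k)),
        GroundTruthMode::Flexible if groundtruth_len == 0 => Ok(None),
        GroundTruthMode::Flexible => Ok(Some(groundtruth_len.min(recall_k))),
    }
}
```

With this shape, a plain KNN benchmark asking for recall at 100 against groundtruth that holds only 10 neighbors fails loudly in Fixed mode, while the same input in Flexible mode is scored against 10.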

@magdalendobson
Contributor Author

I believe this is resolved now

Contributor

@hildebrandmw hildebrandmw left a comment

Thanks Magdalen - two small comments then looks good to me!

Comment thread diskann-benchmark-core/src/recall.rs
Comment thread diskann-benchmark-core/src/recall.rs
@magdalendobson magdalendobson merged commit 3d09483 into main May 20, 2026
24 checks passed
@magdalendobson magdalendobson deleted the users/magdalen/patch_recall_calculation branch May 20, 2026 21:08
Development

Successfully merging this pull request may close these issues.

Fix recall computation when fewer than k results are present

6 participants