Fix recall computation for fewer than k groundtruth results by magdalendobson · Pull Request #1069 · microsoft/DiskANN

magdalendobson · 2026-05-14T18:13:05Z

Groundtruth for filtered datasets uses the .rangeres format, since a filtered query may have few or no results. However, at the time of recall computation for top-k search, having fewer than k groundtruth results causes an error. This PR patches that error and computes recall for each point using the following rule:

- If the point has 0 < k' <= k groundtruth results, recall is the number of correctly reported points divided by k'.
- If the point has no groundtruth results, it is completely excluded from the recall calculation.

The average recall is then the sum of the per-point recalls divided by the number of points that have at least one groundtruth result. Points with zero groundtruth results do not affect recall. The reasoning is that in a dataset where most filtered queries have null results, we want recall to capture our performance on the queries that actually have results, rather than being artificially high because we always correctly report that the remaining points have no results.
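To make the averaging rule concrete, here is a minimal sketch. It is not the code in diskann-benchmark-core/src/recall.rs: the function names `query_recall` and `average_recall` are hypothetical, and it assumes ids are `u32` and that only the first k entries of each row are scored.

```rust
use std::collections::HashSet;

/// Recall for one query, or `None` when the query has no groundtruth.
///
/// `groundtruth` holds the k' true neighbors available for this query and
/// `results` holds the ids the search returned. Only the first `k` entries
/// of each slice are considered (an assumption of this sketch).
fn query_recall(groundtruth: &[u32], results: &[u32], k: usize) -> Option<f64> {
    let gt: HashSet<u32> = groundtruth.iter().take(k).copied().collect();
    if gt.is_empty() {
        // No groundtruth: this query is excluded from the average entirely.
        return None;
    }
    let returned: HashSet<u32> = results.iter().take(k).copied().collect();
    let hits = returned.intersection(&gt).count();
    // Divide by k' (the number of distinct groundtruth ids), not by k.
    Some(hits as f64 / gt.len() as f64)
}

/// Average recall over the queries that have at least one groundtruth result.
/// Returns `None` when every query has empty groundtruth.
fn average_recall(groundtruth: &[Vec<u32>], results: &[Vec<u32>], k: usize) -> Option<f64> {
    let recalls: Vec<f64> = groundtruth
        .iter()
        .zip(results)
        .filter_map(|(gt, res)| query_recall(gt, res, k))
        .collect();
    if recalls.is_empty() {
        None
    } else {
        Some(recalls.iter().sum::<f64>() / recalls.len() as f64)
    }
}
```

With three queries whose groundtruth rows have 4, 0, and 2 entries, the second query is skipped and the average is taken over the other two, so a search that always returns garbage for empty-groundtruth queries is neither rewarded nor penalized.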

Since this error will no longer trigger inside recall computation, even for a regular top-k search, I have moved the check for insufficient groundtruth results to the point where groundtruth in the regular bin format is actually read. This is an improvement since a failure will now trigger before an index build and search, which is a better experience for the user. I leave some of the instances of NotEnoughGroundTruth inside recall computation as an extra failsafe, but they could also be removed if reviewers feel we should completely remove them.
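A rough sketch of what such a load-time check could look like follows. The names `GroundTruthTooNarrow`, `max_recall_k`, and `check_groundtruth_width` are hypothetical and not taken from the PR; the sketch only assumes that the loader knows the number of columns in the .bin groundtruth and is given the largest recall k that any run will request.

```rust
use std::fmt;

/// Hypothetical error type for this sketch (not the PR's actual error type).
#[derive(Debug, PartialEq, Eq)]
struct GroundTruthTooNarrow {
    available: usize,
    required: usize,
}

impl fmt::Display for GroundTruthTooNarrow {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(
            f,
            "groundtruth has {} neighbors per query but recall needs {}",
            self.available, self.required
        )
    }
}

impl std::error::Error for GroundTruthTooNarrow {}

/// Largest recall k requested across all runs, if any run requests one.
fn max_recall_k(recall_ks: &[usize]) -> Option<usize> {
    recall_ks.iter().copied().max()
}

/// Validate the column count of a fixed-width groundtruth file at load time.
///
/// The comparison is `<`, not `!=`, so files that store more neighbors
/// than needed are still accepted. `None` disables the check.
fn check_groundtruth_width(
    ncols: usize,
    expected_k: Option<usize>,
) -> Result<(), GroundTruthTooNarrow> {
    match expected_k {
        Some(k) if ncols < k => Err(GroundTruthTooNarrow {
            available: ncols,
            required: k,
        }),
        _ => Ok(()),
    }
}
```

Running the check once against the maximum k means the failure surfaces when the file is read, before any index build or search work is spent.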

Along the way, I removed the minimum and maximum values from the recall stats, since after some discussion with the team it appears that no one uses them.

magdalendobson · 2026-05-14T18:18:49Z

@microsoft-github-policy-service agree company="Microsoft"

Copilot

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Fixes recall computation when filtered/variable-length groundtruth contains fewer than k results by computing per-query recall against the available groundtruth and excluding zero-result queries from the average.

Changes:

- Update recall computation to handle per-row k' <= k groundtruth sizes and ignore empty groundtruth rows in the average.
- Move “insufficient groundtruth for top-k” validation to groundtruth loading for fixed-size .bin groundtruth.
- Remove minimum / maximum from reported recall metrics.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 9 comments.

| File | Description |
|---|---|
| diskann-benchmark/src/utils/recall.rs | Removes min/max fields from the benchmark-facing recall metrics wrapper. |
| diskann-benchmark/src/utils/datafiles.rs | Adds optional `k` validation when loading groundtruth from `.bin`. |
| diskann-benchmark/src/backend/index/benchmarks.rs | Computes max `recall_k` across runs and passes it into groundtruth loading. |
| diskann-benchmark-core/src/recall.rs | Implements variable-length groundtruth recall + removes min/max metrics and updates tests accordingly. |

codecov-commenter · 2026-05-15T18:52:26Z

Codecov Report

❌ Patch coverage is 94.84127% with 13 lines in your changes missing coverage. Please review.
✅ Project coverage is 89.46%. Comparing base (895a2c0) to head (e3cc3f4).
⚠️ Report is 4 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| diskann-benchmark/src/backend/index/benchmarks.rs | 30.00% | 7 Missing ⚠️ |
| diskann-benchmark/src/backend/index/search/knn.rs | 50.00% | 5 Missing ⚠️ |
| diskann-benchmark/src/utils/datafiles.rs | 96.55% | 1 Missing ⚠️ |

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1069      +/-   ##
==========================================
+ Coverage   89.45%   89.46%   +0.01%     
==========================================
  Files         458      458              
  Lines       85398    85530     +132     
==========================================
+ Hits        76392    76520     +128     
- Misses       9006     9010       +4

| Flag | Coverage Δ | |
|---|---|---|
| miri | `89.46% <94.84%> (+0.01%)` | ⬆️ |
| unittests | `89.09% <94.84%> (+0.01%)` | ⬆️ |

| Files with missing lines | Coverage Δ | |
|---|---|---|
| diskann-benchmark-core/src/recall.rs | `98.50% <100.00%> (+0.70%)` | ⬆️ |
| diskann-benchmark-core/src/search/graph/knn.rs | `95.08% <100.00%> (+0.16%)` | ⬆️ |
| ...iskann-benchmark-core/src/search/graph/multihop.rs | `98.72% <100.00%> (+0.03%)` | ⬆️ |
| diskann-benchmark/src/backend/exhaustive/minmax.rs | `100.00% <ø> (ø)` | |
| ...iskann-benchmark/src/backend/exhaustive/product.rs | `100.00% <ø> (ø)` | |
| ...kann-benchmark/src/backend/exhaustive/spherical.rs | `100.00% <ø> (ø)` | |
| diskann-benchmark/src/backend/index/spherical.rs | `100.00% <ø> (ø)` | |
| diskann-benchmark/src/inputs/graph_index.rs | `36.28% <100.00%> (+0.32%)` | ⬆️ |
| diskann-benchmark/src/utils/recall.rs | `57.14% <ø> (-5.36%)` | ⬇️ |
| diskann-benchmark/src/utils/datafiles.rs | `87.36% <96.55%> (+3.78%)` | ⬆️ |

... and 2 more

magdalendobson · 2026-05-19T21:24:10Z

@copilot resolve the merge conflicts in this pull request

Copilot · 2026-05-19T21:30:24Z

> @copilot resolve the merge conflicts in this pull request

Resolved in 934789c. The merge conflict in diskann-benchmark/src/inputs/graph_index.rs has been resolved by:

- Keeping the max_k() method from the feature branch (needed for the recall computation fix)
- Adopting the new validate() method pattern from main (replaces the CheckDeserialization trait)
- Updating method calls from check_deserialization to resolve

All tests pass and clippy/fmt checks are clean.

hildebrandmw

I had a thought - should this be behind a configuration option? (maybe ditching allow_insufficient_results since I think that should just always be the case).

My concern is that if we do a normal k-nearest-neighbors search and request recall at 100, but the groundtruth only has 10 nearest neighbors, we will silently get recall computed for 10 nearest neighbors instead of the expected 100. Basically, we need a way to opt-out of this behavior.

Here is my proposal:

- We ditch the allow_insufficient_results flag. Algorithms can't always ensure this, and it feels like something that callers should enforce anyway.
- Add the following enum as an argument:
  ```rust
  pub enum GroundTruthMode {
      Fixed,
      Flexible,
  }
  ```
  In Fixed mode, we ensure we have at least recall_k entries, and Flexible mode allows what's happening in this PR.
- Using an enum makes updating the callsites easier and more explicit: we aren't changing one bool for another.
- Update diskann_benchmark::utils::compute_multiple_recalls to do one upfront check that the results have at least recall_n entries.
- Update the KNN searches to use GroundTruthMode::Fixed and the filtered KNN searches to use GroundTruthMode::Flexible.
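As an illustration of how the two modes might differ at the point where a query is scored, here is a small sketch. Only the `GroundTruthMode` enum comes from the proposal above; the `TooFewGroundTruth` error, its fields, and the `effective_k` helper are hypothetical names made up for this example.

```rust
/// The enum from the proposal, with derives added for the sketch.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum GroundTruthMode {
    Fixed,
    Flexible,
}

/// Hypothetical error for this sketch: a query has fewer groundtruth
/// entries than `recall_k` while in `Fixed` mode.
#[derive(Debug, PartialEq, Eq)]
pub struct TooFewGroundTruth {
    pub query: usize,
    pub available: usize,
    pub recall_k: usize,
}

/// Number of groundtruth entries a query should be scored against.
///
/// - `Fixed`: every query must have at least `recall_k` entries, so a
///   short row is an error rather than a silently smaller denominator.
/// - `Flexible`: a short row is allowed and the denominator shrinks to
///   what is available; `Ok(0)` means the query is skipped.
pub fn effective_k(
    mode: GroundTruthMode,
    query: usize,
    available: usize,
    recall_k: usize,
) -> Result<usize, TooFewGroundTruth> {
    match mode {
        GroundTruthMode::Fixed if available < recall_k => Err(TooFewGroundTruth {
            query,
            available,
            recall_k,
        }),
        _ => Ok(available.min(recall_k)),
    }
}
```

Under this sketch, requesting recall at 100 against groundtruth with only 10 neighbors fails loudly in `Fixed` mode and is scored against 10 in `Flexible` mode, which is the opt-out the comment asks for.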

magdalendobson · 2026-05-20T20:14:08Z

I believe this is resolved now

hildebrandmw

Thanks Magdalen - two small comments then looks good to me!

finish up recall computation patch

97b36ef

Magdalen Manohar added 2 commits May 14, 2026 19:53

fix bug in recall computation

cc013bd

remove stray extra registered benchmarks

1b6c4aa

magdalendobson marked this pull request as ready for review May 14, 2026 20:00

magdalendobson requested review from a team and Copilot May 14, 2026 20:00

Copilot AI reviewed May 14, 2026

Copilot started reviewing on behalf of magdalendobson May 14, 2026 20:20

magdalendobson linked an issue May 14, 2026 that may be closed by this pull request

Fix recall computation when fewer than k results are present #1073

Closed

Copilot started work on behalf of magdalendobson May 14, 2026 21:55

magdalendobson and others added 3 commits May 14, 2026 17:55

Potential fix for pull request finding

d81f729

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

Fix groundtruth ncols check: use < instead of != to allow files with more neighbors than needed

f0b954d

Agent-Logs-Url: https://github.com/microsoft/DiskANN/sessions/5f7cd8dc-8e4f-4c69-aea5-ad2c61baee52 Co-authored-by: magdalendobson <58752279+magdalendobson@users.noreply.github.com>

Add meaningful recall value assertions for variable-length dynamic groundtruth tests

3de6550

Agent-Logs-Url: https://github.com/microsoft/DiskANN/sessions/5f7cd8dc-8e4f-4c69-aea5-ad2c61baee52 Co-authored-by: magdalendobson <58752279+magdalendobson@users.noreply.github.com>

Copilot finished work on behalf of magdalendobson May 14, 2026 22:03

Copilot started work on behalf of magdalendobson May 14, 2026 22:03

Copilot AI and others added 2 commits May 14, 2026 22:09

Update doc comment for average field and add tests for zero/non-zero groundtruth mix

fd8f772

Agent-Logs-Url: https://github.com/microsoft/DiskANN/sessions/75b8623f-538e-4a4c-9762-12aac534c708 Co-authored-by: magdalendobson <58752279+magdalendobson@users.noreply.github.com>

Remove trailing whitespace from test code

87efda0

Agent-Logs-Url: https://github.com/microsoft/DiskANN/sessions/75b8623f-538e-4a4c-9762-12aac534c708 Co-authored-by: magdalendobson <58752279+magdalendobson@users.noreply.github.com>

Copilot finished work on behalf of magdalendobson May 14, 2026 22:12

Magdalen Manohar added 4 commits May 15, 2026 14:31

Merge branch 'main' of github.com:microsoft/DiskANN into users/magdalen/patch_recall_calculation

62c2120

small changes to tests

7a45865

fmt

15398fa

fix clippy, add max_k() function to topk search

852fceb

add a test for loading groundtruth with expected k

bdee1db

harsha-simhadri approved these changes May 18, 2026

Comment thread diskann-benchmark-core/src/recall.rs Outdated

Comment thread diskann-benchmark/src/utils/datafiles.rs

hildebrandmw reviewed May 19, 2026

Comment thread diskann-benchmark-core/src/recall.rs Outdated

Comment thread diskann-benchmark-core/src/recall.rs Outdated

Comment thread diskann-benchmark-core/src/recall.rs Outdated

Comment thread diskann-benchmark-core/src/recall.rs Outdated

Magdalen Manohar added 3 commits May 19, 2026 15:49

apply suggestions from review

bb6ce11

rename error, add test for new error

2a72e3f

fmt + clippy

334c43f

Copilot started work on behalf of magdalendobson May 19, 2026 21:24

Copilot finished work on behalf of magdalendobson May 19, 2026 21:32

hildebrandmw reviewed May 19, 2026

improve handling when not enough groundtruth

0614770

hildebrandmw reviewed May 20, 2026

Comment thread diskann-benchmark-core/src/recall.rs

Comment thread diskann-benchmark-core/src/recall.rs

added Debug, update docstring

be57b94

hildebrandmw approved these changes May 20, 2026

fmt

e3cc3f4

hildebrandmw approved these changes May 20, 2026

magdalendobson merged commit 3d09483 into main May 20, 2026
24 checks passed

magdalendobson deleted the users/magdalen/patch_recall_calculation branch May 20, 2026 21:08
