Fix recall computation for fewer than k groundtruth results #1069
@microsoft-github-policy-service agree company="Microsoft"
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
Fixes recall computation when filtered/variable-length groundtruth contains fewer than k results by computing per-query recall against the available groundtruth and excluding zero-result queries from the average.
Changes:
- Update recall computation to handle per-row `k' <= k` groundtruth sizes and ignore empty groundtruth rows in the average.
- Move "insufficient groundtruth for top-k" validation to groundtruth loading for fixed-size `.bin` groundtruth.
- Remove `minimum`/`maximum` from reported recall metrics.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 9 comments.
| File | Description |
|---|---|
| diskann-benchmark/src/utils/recall.rs | Removes min/max fields from the benchmark-facing recall metrics wrapper. |
| diskann-benchmark/src/utils/datafiles.rs | Adds optional k validation when loading groundtruth from .bin. |
| diskann-benchmark/src/backend/index/benchmarks.rs | Computes max recall_k across runs and passes it into groundtruth loading. |
| diskann-benchmark-core/src/recall.rs | Implements variable-length groundtruth recall + removes min/max metrics and updates tests accordingly. |
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
…more neighbors than needed Agent-Logs-Url: https://github.com/microsoft/DiskANN/sessions/5f7cd8dc-8e4f-4c69-aea5-ad2c61baee52 Co-authored-by: magdalendobson <58752279+magdalendobson@users.noreply.github.com>
…oundtruth tests Agent-Logs-Url: https://github.com/microsoft/DiskANN/sessions/5f7cd8dc-8e4f-4c69-aea5-ad2c61baee52 Co-authored-by: magdalendobson <58752279+magdalendobson@users.noreply.github.com>
…groundtruth mix Agent-Logs-Url: https://github.com/microsoft/DiskANN/sessions/75b8623f-538e-4a4c-9762-12aac534c708 Co-authored-by: magdalendobson <58752279+magdalendobson@users.noreply.github.com>
Agent-Logs-Url: https://github.com/microsoft/DiskANN/sessions/75b8623f-538e-4a4c-9762-12aac534c708 Co-authored-by: magdalendobson <58752279+magdalendobson@users.noreply.github.com>
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## main #1069 +/- ##
==========================================
+ Coverage 89.45% 89.46% +0.01%
==========================================
Files 458 458
Lines 85398 85530 +132
==========================================
+ Hits 76392 76520 +128
- Misses 9006 9010 +4
@copilot resolve the merge conflicts in this pull request
Resolved conflict in diskann-benchmark/src/inputs/graph_index.rs:
- Kept `max_k()` method from feature branch (needed for recall computation fix)
- Adopted new `validate()` method pattern from main (replaces `CheckDeserialization` trait)
- Updated method calls from `check_deserialization` to `resolve`

Co-authored-by: magdalendobson <58752279+magdalendobson@users.noreply.github.com>
Resolved in 934789c. The merge conflict in
All tests pass and clippy/fmt checks are clean.
hildebrandmw left a comment
I had a thought: should this be behind a configuration option? (Maybe ditching `allow_insufficient_results`, since I think that should just always be the case.)
My concern is that if we do a normal k-nearest-neighbors search and request recall at 100, but the groundtruth only has 10 nearest neighbors, we will silently get recall computed for 10 nearest neighbors instead of the expected 100. Basically, we need a way to opt out of this behavior.
Here is my proposal:
- We ditch the `allow_insufficient_results` flag. Algorithms can't always ensure this and it feels like something that callers should enforce anyways.
- Add the following enum as an argument:

      pub enum GroundTruthMode {
          Fixed,
          Flexible,
      }

  In `Fixed` mode, we ensure we have at least `recall_k` entries, and `Flexible` mode allows what's happening in this PR.

  Using an `enum` makes updating the callsites easier and more explicit: we aren't changing one `bool` for another.
- Update `diskann_benchmark::utils::compute_multiple_recalls` to do one upfront check that the results have at least `recall_n` entries.
- Update the KNN searches to use `GroundTruthMode::Fixed` and the filtered KNN searches to use `GroundTruthMode::Flexible`.
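
To make the difference between the two modes concrete, here is a minimal sketch of how a per-query check could branch on the proposed enum. Only `GroundTruthMode` and its two variants come from the proposal; the `effective_k` helper, its signature, and its error value are hypothetical and are not taken from the DiskANN code.

```rust
/// Mirrors the enum from the proposal.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum GroundTruthMode {
    Fixed,
    Flexible,
}

/// Hypothetical helper (not part of DiskANN): returns how many groundtruth
/// entries a query is scored against.
///
/// * `available` is the number of groundtruth entries present for the query.
/// * `recall_k` is the requested recall depth.
///
/// On failure the error carries the number of entries that were available.
pub fn effective_k(
    mode: GroundTruthMode,
    available: usize,
    recall_k: usize,
) -> Result<usize, usize> {
    match mode {
        // Fixed: anything short of `recall_k` entries is an error.
        GroundTruthMode::Fixed if available < recall_k => Err(available),
        GroundTruthMode::Fixed => Ok(recall_k),
        // Flexible: score against whatever is available, capped at `recall_k`.
        GroundTruthMode::Flexible => Ok(available.min(recall_k)),
    }
}
```

With this shape, the recall-at-100 scenario against 10 groundtruth neighbors fails loudly in `Fixed` mode and is scored against 10 entries only when the caller explicitly chooses `Flexible`.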
I believe this is resolved now
hildebrandmw left a comment
Thanks Magdalen - two small comments then looks good to me!
Groundtruth for filtered datasets uses the `.rangeres` format, since a filtered query may have few or no results. However, at the time of recall computation for top-k search, fewer than `k` results causes an error. This PR patches that error and computes recall for each point using the following paradigm:

- For a point with `0 < k' <= k` groundtruth results, recall is the number of correctly reported points divided by `k'`.

The average recall is then the sum of all recalls divided by the number of points with a non-zero number of groundtruth results. The points with zero results do not affect recall. The reasoning for this is that in a dataset with majority-null results for filtered queries, we want recall to capture our performance on the queries that actually have results, rather than being artificially high when we always correctly report that these points have no results.
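
The paradigm above can be sketched in a few lines of Rust. This is an illustration only, not the code in `diskann-benchmark-core/src/recall.rs`: the function names, the `u32` id type, and the choice to compare the top `k` returned ids against the first `k'` groundtruth ids are all assumptions.

```rust
use std::collections::HashSet;

/// Recall for one query against a variable-length groundtruth row.
/// Returns `None` for an empty row so the caller can leave it out of the average.
fn query_recall(groundtruth: &[u32], results: &[u32], k: usize) -> Option<f64> {
    // k' = number of groundtruth entries available, capped at k.
    let k_prime = groundtruth.len().min(k);
    if k_prime == 0 {
        return None;
    }
    let truth: HashSet<u32> = groundtruth[..k_prime].iter().copied().collect();
    // Use sets on both sides so duplicate result ids are not double counted.
    let found: HashSet<u32> = results.iter().take(k).copied().collect();
    let hits = truth.intersection(&found).count();
    Some(hits as f64 / k_prime as f64)
}

/// Average recall over the queries that have at least one groundtruth result.
/// Returns `None` when every groundtruth row is empty.
fn average_recall(groundtruth: &[Vec<u32>], results: &[Vec<u32>], k: usize) -> Option<f64> {
    let recalls: Vec<f64> = groundtruth
        .iter()
        .zip(results)
        .filter_map(|(gt, res)| query_recall(gt, res, k))
        .collect();
    if recalls.is_empty() {
        None
    } else {
        Some(recalls.iter().sum::<f64>() / recalls.len() as f64)
    }
}
```

For example, with `k = 3`, a query with groundtruth `[1, 2, 3]` and results `[1, 2, 9]` scores 2/3, a query with groundtruth `[7]` and results `[7, 8, 9]` scores 1, and a query with empty groundtruth is skipped, so the average is taken over two queries rather than three.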
Since this error will no longer trigger even for a regular top-k search in recall computation, I have pushed the error for not enough groundtruth results to when the groundtruth in the regular bin format is actually read. This is an improvement since a failure will now trigger before an index build and search, making a better experience for the user. I leave some of the instances of `NotEnoughGroundTruth` inside recall computation as an extra layer of failsafe, but they could also be removed if reviewers feel we should completely remove them.

Along the way I get rid of the `minimum` and `maximum` values in the recall stats, since after some discussion with the team it appears that no one uses them.
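
As a rough illustration of the load-time check described above, the sketch below validates the per-query width of a fixed-size groundtruth file against the largest `recall_k` requested by any run. The error type and both function names are hypothetical; the actual validation lives in `diskann-benchmark/src/utils/datafiles.rs` and may be shaped differently.

```rust
/// Hypothetical error type for this sketch (not the DiskANN error type).
#[derive(Debug, PartialEq, Eq)]
pub struct InsufficientGroundTruth {
    /// Neighbors stored per query in the fixed-size groundtruth file.
    pub per_query: usize,
    /// Largest `recall_k` requested across all runs.
    pub required: usize,
}

/// Largest `recall_k` across runs, or `None` when there are no runs.
pub fn max_recall_k(recall_ks: &[usize]) -> Option<usize> {
    recall_ks.iter().copied().max()
}

/// Checks the groundtruth width right after the header is read, before any
/// index build or search starts. Passing `None` skips the check, which is
/// what a variable-length (filtered) groundtruth loader would do.
pub fn validate_groundtruth_width(
    per_query: usize,
    required: Option<usize>,
) -> Result<(), InsufficientGroundTruth> {
    match required {
        Some(required) if per_query < required => {
            Err(InsufficientGroundTruth { per_query, required })
        }
        _ => Ok(()),
    }
}
```

Failing here means a groundtruth file with 10 neighbors per query is rejected for a recall-at-100 run as soon as the file is opened, instead of after the index has been built and searched.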