Remove centroid from most PQ interfaces.#1010
Conversation
There was a problem hiding this comment.
Pull request overview
This PR removes explicit dataset-centroid handling from most PQ interfaces, shifts compatibility to the storage/load path by folding legacy centroids into pivots, and simplifies downstream consumers so they can treat PQ tables as centroid-free. In the broader codebase, this moves PQ closer to the diskann-quantization abstractions and removes L2-only centering behavior from general-purpose interfaces.
Changes:
- Removes centroid fields/parameters from
FixedChunkPQTable,GeneratePivotArguments, and the_membufPQ construction/compression APIs. - Updates PQ storage read/write paths to preserve the on-disk layout and recover legacy behavior by folding non-zero stored centroids into pivots on load.
- Simplifies disk PQ consumers to always use
TransposedTable, and updates construction/save call sites across providers, disk builder code, tools, and benchmarks.
Reviewed changes
Copilot reviewed 21 out of 21 changed files in this pull request and generated 11 comments.
Show a summary per file
| File | Description |
|---|---|
diskann-tools/src/utils/build_pq.rs |
Updates tool-side PQ pivot generation call to the new generate_pq_pivots signature. |
diskann-providers/src/storage/pq_storage.rs |
Adds optional-centroid write path, legacy-centroid folding on load, and storage round-trip tests. |
diskann-providers/src/model/pq/pq_construction.rs |
Removes centroid handling from in-memory PQ APIs and adds legacy-centering toggle to file-based pivot generation. |
diskann-providers/src/model/pq/generate_pivot_arguments.rs |
Removes centroid-centering state from PQ generation arguments. |
diskann-providers/src/model/pq/fixed_chunk_pq_table.rs |
Drops centroid storage/processing from the PQ table type and updates related tests/helpers. |
diskann-providers/src/model/pq/distance/test_utils.rs |
Updates PQ test helpers to construct centroid-free tables. |
diskann-providers/src/model/pq/distance/l2.rs |
Stops query preprocessing from subtracting centroids before lookup-table generation. |
diskann-providers/src/model/graph/provider/async_/memory_quant_vector_provider.rs |
Saves PQ pivots without a centroid payload and updates test construction. |
diskann-providers/src/model/graph/provider/async_/inmem/provider.rs |
Refreshes public doc examples for the new FixedChunkPQTable::new shape. |
diskann-providers/src/model/graph/provider/async_/fast_memory_quant_vector_provider.rs |
Switches async fast in-memory PQ provider save/test paths to centroid-free tables. |
diskann-providers/src/model/graph/provider/async_/experimental/multi_pq_async.rs |
Updates experimental multi-PQ compression call sites to the new membuf API. |
diskann-providers/src/model/graph/provider/async_/bf_tree/quant_vector_provider.rs |
Adjusts BF-tree quant provider tests to the centroid-free table constructor. |
diskann-providers/src/model/graph/provider/async_/bf_tree/provider.rs |
Saves BF-tree PQ pivots without centroids and updates related tests. |
diskann-providers/src/index/diskann_async.rs |
Removes centroid buffers from async index PQ training and test setup. |
diskann-disk/src/storage/quant/pq/pq_generation.rs |
Threads the new legacy-centering flag through disk PQ generation. |
diskann-disk/src/storage/quant/pq/pq_dataset.rs |
Replaces the PQTable enum with always-transposed storage in PQData. |
diskann-disk/src/storage/quant/pq/mod.rs |
Removes the now-obsolete PQTable re-export. |
diskann-disk/src/storage/quant/mod.rs |
Stops re-exporting PQTable from the public quant module. |
diskann-disk/src/search/pq/quantizer_preprocess.rs |
Simplifies PQ preprocessing to a single TransposedTable path. |
diskann-disk/src/build/builder/quantizer.rs |
Saves builder-generated PQ pivots without a centroid payload. |
diskann-benchmark/src/backend/exhaustive/product.rs |
Updates benchmark-side PQ table construction to the centroid-free constructor. |
Comments suppressed due to low confidence (4)
diskann-providers/src/model/pq/pq_construction.rs:166
- This public
_membufentry point no longer exposes the centroid output buffer, which breaks existing callers that still use the 0.50.x signature. The PR description calls out caller-side migration, but the crate versioning policy says these packages obey SemVer, so this signature change needs a shim/deprecation path or a major release.
pub fn generate_pq_pivots_from_membuf<T: Copy + Into<f32>>(
parameters: &GeneratePivotArguments,
train_data_slice: &[T],
offsets: &mut [usize],
full_pivot_data: &mut [f32],
rng: &mut (impl Rng + ?Sized),
cancellation_token: &mut bool,
pool: RayonThreadPoolRef<'_>,
diskann-providers/src/model/pq/pq_construction.rs:166
- The doc comment for this public function still describes the removed
make_zero_mean/centroidparameters, so it no longer matches the actual API. Anyone reading the generated docs will be told to pass buffers that this signature no longer accepts.
pub fn generate_pq_pivots_from_membuf<T: Copy + Into<f32>>(
parameters: &GeneratePivotArguments,
train_data_slice: &[T],
offsets: &mut [usize],
full_pivot_data: &mut [f32],
rng: &mut (impl Rng + ?Sized),
cancellation_token: &mut bool,
pool: RayonThreadPoolRef<'_>,
diskann-providers/src/model/pq/pq_construction.rs:80
- Adding the new
legacy_center_dataparameter changes the signature of a public, re-exported function. That is another SemVer break fordiskann-providersconsumers on 0.50.x, so this needs a compatibility wrapper or a major-version plan instead of replacing the existing API in place.
pub fn generate_pq_pivots<Storage, Random>(
parameters: GeneratePivotArguments,
legacy_center_data: bool,
train_data: &mut [f32],
pq_storage: &PQStorage,
storage_provider: &Storage,
random_provider: RandomProvider<Random>,
pool: RayonThreadPoolRef<'_>,
diskann-providers/src/storage/pq_storage.rs:98
- Changing
write_pivot_datafrom taking&[f32]toOption<&[f32]>is a source-breaking public API change fordiskann-providersconsumers. Because the workspace is still on a SemVer-governed 0.50.x line, this needs a compatibility layer or a coordinated major-version bump rather than replacing the signature in place.
pub fn write_pivot_data<Storage>(
&self,
full_pivot_data: &[f32],
centroid: Option<&[f32]>,
chunk_offsets: &[usize],
num_centers: usize,
dim: usize,
storage_provider: &Storage,
) -> ANNResult<()>
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #1010 +/- ##
==========================================
- Coverage 89.51% 89.50% -0.02%
==========================================
Files 461 459 -2
Lines 85920 85549 -371
==========================================
- Hits 76914 76567 -347
+ Misses 9006 8982 -24
Flags with carried forward coverage won't be shown. Click here to find out more.
🚀 New features to boost your workflow:
|
harsha-simhadri
left a comment
There was a problem hiding this comment.
ok with this. could you please run on a few medium size datasets and check recall before merging. Thanks.
arkrishn94
left a comment
There was a problem hiding this comment.
Thanks for doing this Mark, lgtm. The only comment I had was on supporting the backward compatibility of generate_pq_pivots for diskann-disk tests. If you can create a small issue/sub-issue on this, we can close it in a quick follow up.
Okay, I did a few runs on Sift-1M and OpenAI-100K (3k dim). Since the latter is already zero centered, I did kind a pathological thing and added 128 to each dimension and scaled each dimension by 256. Since the average magnitude of an open-ai vector is around 0.014, there's just not very much signal if left as-is. I figure this is probably a worst case for any kind of reasonable vector. Below is an (AI generated, human verified) summary of the results:
Search Configuration (all jobs)
JSON{
"search_directories": [
"<snip>"
],
"output_directory": null,
"jobs": [
{
"type": "graph-index-build-pq",
"content": {
"index_operation": {
"search_phase": {
"groundtruth": "sift-1m.gt.bin",
"num_threads": [
10
],
"queries": "sift-1m.queries.u8.bin",
"reps": 1,
"runs": [
{
"recall_k": 10,
"search_l": [
10,
20,
30,
40,
50,
60,
70,
80,
90,
100
],
"search_n": 10
}
],
"search-type": "topk"
},
"source": {
"alpha": 1.2000000476837158,
"backedge_ratio": 1.0,
"data": "sift-1m.u8.bin",
"data_type": "uint8",
"distance": "squared_l2",
"index-source": "Build",
"insert_retry": null,
"l_build": 50,
"max_degree": 32,
"multi_insert": null,
"num_threads": 10,
"save_path": null,
"start_point_strategy": "medoid"
}
},
"max_fp_vecs_per_prune": 0,
"num_pq_chunks": 32,
"seed": 13076402859301299683,
"use_fp_for_search": false
}
},
{
"type": "graph-index-build-pq",
"content": {
"index_operation": {
"search_phase": {
"groundtruth": "sift-1m.gt.bin",
"num_threads": [
10
],
"queries": "sift-1m.queries.u8.bin",
"reps": 1,
"runs": [
{
"recall_k": 10,
"search_l": [
10,
20,
30,
40,
50,
60,
70,
80,
90,
100
],
"search_n": 10
}
],
"search-type": "topk"
},
"source": {
"alpha": 1.2000000476837158,
"backedge_ratio": 1.0,
"data": "sift-1m.u8.bin",
"data_type": "uint8",
"distance": "squared_l2",
"index-source": "Build",
"insert_retry": null,
"l_build": 50,
"max_degree": 32,
"multi_insert": null,
"num_threads": 10,
"save_path": null,
"start_point_strategy": "medoid"
}
},
"max_fp_vecs_per_prune": 0,
"num_pq_chunks": 16,
"seed": 13076402859301299683,
"use_fp_for_search": false
}
},
{
"type": "graph-index-build-pq",
"content": {
"index_operation": {
"search_phase": {
"groundtruth": "gt-100k.bin",
"num_threads": [
10
],
"queries": "query.bin",
"reps": 1,
"runs": [
{
"recall_k": 10,
"search_l": [
10,
20,
30,
40,
50,
60,
70,
80,
90,
100
],
"search_n": 10
}
],
"search-type": "topk"
},
"source": {
"alpha": 1.2000000476837158,
"backedge_ratio": 1.0,
"data": "base-100k.bin",
"data_type": "float32",
"distance": "squared_l2",
"index-source": "Build",
"insert_retry": null,
"l_build": 50,
"max_degree": 32,
"multi_insert": null,
"num_threads": 10,
"save_path": null,
"start_point_strategy": "medoid"
}
},
"max_fp_vecs_per_prune": 0,
"num_pq_chunks": 384,
"seed": 13076402859301299683,
"use_fp_for_search": false
}
},
{
"type": "graph-index-build-pq",
"content": {
"index_operation": {
"search_phase": {
"groundtruth": "gt-100k.bin",
"num_threads": [
10
],
"queries": "query.bin",
"reps": 1,
"runs": [
{
"recall_k": 10,
"search_l": [
10,
20,
30,
40,
50,
60,
70,
80,
90,
100
],
"search_n": 10
}
],
"search-type": "topk"
},
"source": {
"alpha": 1.2000000476837158,
"backedge_ratio": 1.0,
"data": "base-100k.bin",
"data_type": "float32",
"distance": "squared_l2",
"index-source": "Build",
"insert_retry": null,
"l_build": 50,
"max_degree": 32,
"multi_insert": null,
"num_threads": 10,
"save_path": null,
"start_point_strategy": "medoid"
}
},
"max_fp_vecs_per_prune": 0,
"num_pq_chunks": 300,
"seed": 13076402859301299683,
"use_fp_for_search": false
}
},
{
"type": "graph-index-build-pq",
"content": {
"index_operation": {
"search_phase": {
"groundtruth": "gt-100k.bin",
"num_threads": [
10
],
"queries": "query-100k.shifted.bin",
"reps": 1,
"runs": [
{
"recall_k": 10,
"search_l": [
10,
20,
30,
40,
50,
60,
70,
80,
90,
100
],
"search_n": 10
}
],
"search-type": "topk"
},
"source": {
"alpha": 1.2000000476837158,
"backedge_ratio": 1.0,
"data": "base-100k.shifted.bin",
"data_type": "float32",
"distance": "squared_l2",
"index-source": "Build",
"insert_retry": null,
"l_build": 50,
"max_degree": 32,
"multi_insert": null,
"num_threads": 10,
"save_path": null,
"start_point_strategy": "medoid"
}
},
"max_fp_vecs_per_prune": 0,
"num_pq_chunks": 384,
"seed": 13076402859301299683,
"use_fp_for_search": false
}
},
{
"type": "graph-index-build-pq",
"content": {
"index_operation": {
"search_phase": {
"groundtruth": "gt-100k.bin",
"num_threads": [
10
],
"queries": "query-100k.shifted.bin",
"reps": 1,
"runs": [
{
"recall_k": 10,
"search_l": [
10,
20,
30,
40,
50,
60,
70,
80,
90,
100
],
"search_n": 10
}
],
"search-type": "topk"
},
"source": {
"alpha": 1.2000000476837158,
"backedge_ratio": 1.0,
"data": "base-100k.shifted.bin",
"data_type": "float32",
"distance": "squared_l2",
"index-source": "Build",
"insert_retry": null,
"l_build": 50,
"max_degree": 32,
"multi_insert": null,
"num_threads": 10,
"save_path": null,
"start_point_strategy": "medoid"
}
},
"max_fp_vecs_per_prune": 0,
"num_pq_chunks": 300,
"seed": 13076402859301299683,
"use_fp_for_search": false
}
}
]
} |
…m_row_inplace` (#976) .Couple of small independent cleanups for PQ. ## Move `calculate_chunk_offsets[_auto]` to `ChunkOffsets` as a constructor. These two functions are pure prefix-sum math over `dimensions` and `num_pq_chunks`. `ChunkOffsetsBase` / `ChunkOffsetsView` in `diskann-quantization::views` is the natural home for it. I've moved the logic into two constructors - `from_dim` and `from_dim_into` for `ChunkOffsets` and `ChunkOffsetsView` respectively to support both allocation patterns. All in-repo call sites have been updated. There might be some overlapping edits with #1010. ## Minor changes - `QueryComputer` and `DistanceComputer` had six trampoline impls forwarding `&Vec<u8>` and `&&[u8]` arguments to the canonical `&[u8]` impls. `ElementRef` in the accessor now allows us to get rid of these! This might be conflicting with #1008 - Inline `accum_row_inplace` from `diskann-providers\src\model\pq` at the two call-sites. - Move `get_chunk_from_training_data` in pq_construction.rs from public API into tests where it is used.
# DiskANN v0.53.0 Release Notes ## Breaking Changes An AI generated, human reviewed list of changes is summarized below. ### Paged search overhauled — channel-based API ([#1078](#1078)) `PagedSearchState` and its `'static`-bound pause/resume model have been replaced with an async, channel-based interface. The recommended way to drive paged search is now via a `tokio::sync::mpsc` channel, with the searcher embedded in an otherwise-`'static` future. See the [rendered RFC](https://github.com/microsoft/DiskANN/blob/main/rfcs/01078-paged-search.md) for the new shape. Callers wired against `PagedSearchState` must migrate to the channel API. Users of paged search via `wrapped_async::DiskANNIndex` that know their inner futures will never suspend can use the new `wrapped_async::DiskANNIndex::paged_search_no_await`; this will efficiently run paged searches with minimal synchronization overhead. ### `DiskANNIndex::flat_search` removed ([#1076](#1076)) `DiskANNIndex::flat_search` and the `IdIterator` trait have been removed from the `diskann` crate. Equivalent functionality lives on the new inherent method `DiskIndexSearcher::flat_search` in `diskann-disk`. This unblocks the experimental directions in #1067 and #983. ```rust // Before diskann_index.flat_search(query, ...)?; // After disk_index_searcher.flat_search(query, ...).await?; ``` ### `DiskIndexSearcher::flat_search` now batched ([#1097](#1097)) The new `DiskIndexSearcher::flat_search` uses the bulk `pq_distances` path instead of one-vector-at-a-time `Accessor::build_query_computer` + `evaluate_similarity`. Downstream behavior is equivalent but tighter resource bounds apply. ### `centroid` removed from PQ interfaces ([#1010](#1010)) The dataset-centroid argument has been removed from `FixedChunkPQTable` construction, `populate`, and most other PQ APIs. The shift only ever worked for L2 distance and was silently ignored for inner-product / cosine, so passing it was a footgun. When an L2 shift is required, fold it into the PQ pivots instead (the library now does this internally). ```rust // Before let table = FixedChunkPQTable::new(.., centroid, ..); // After — drop the centroid argument let table = FixedChunkPQTable::new(.., ..); ``` ### Flat search interface ([#983](#983)) A new `flat` module in `diskann` adds a provider-agnostic brute-force search surface, mirroring the shape of graph search. Backends implement a single trait, `DistancesUnordered<C>` (in `flat/strategy.rs`), which fuses iteration and distance computation, allowing any backend (in-memory, quantized, disk, remote) to plug into a shared algorithm. See the [rendered RFC](https://github.com/microsoft/DiskANN/blob/main/rfcs/00983-flat-search.md). This is additive but is the new canonical surface — direct ad-hoc flat-search call sites should migrate. ### `bf_tree` extracted into `diskann-bftree` crate ([#1020](#1020)) The bf_tree provider has been moved out of `diskann-providers` (previously at `diskann-providers/src/model/graph/provider/async_/bf_tree/`) into a new standalone `diskann-bftree` crate. Along with the move: - Switched from PQ to spherical quantization. - Dropped dependencies on `DeletionCheck`, `AsDeletionCheck`, and `RemoveDeletedIdsAndCopy`. - Simplified generics. Consumers must update their `Cargo.toml` to depend on `diskann-bftree` and update import paths. ### `direct_distance_impl` and `inner_product_raw` re-exposed ([#1081](#1081)) `direct_distance_impl` (free function) and `FixedChunkPQTable::inner_product_raw` are `pub` again after being privatized in #1044. Restored to unblock a downstream user. Not breaking in the typical direction — this restores previously available API surface. ### MinMax `recompress` takes a grid-scale parameter ([#1109](#1109)) The MinMax `recompress` API now accepts a grid-scale parameter. ## New Features - SIMD-optimized L2-squared norm ([#1107](#1107)) - Significantly faster bitmap computation ([#1099](#1099)) - Large speedup on the bitmap construction path used by filtered search. - LLVM IR bloat regression check in CI ([#1083](#1083)) - CI now flags regressions in generated LLVM IR size, helping catch unintended monomorphization blow-ups. - Recall computation fix for under-k groundtruth ([#1069](#1069)) ## Merged PRs * Revise README for DiskANN3 by @harsha-simhadri in #1046 * [CI] Try to fix publishing step by @hildebrandmw in #1057 * [benchmark] Remove `DispatchRule` by @hildebrandmw in #1064 * [benchmark] Automatic Input Registration by @hildebrandmw in #1066 * Remove centroid from most PQ interfaces by @hildebrandmw in #1010 * [diskann/disk] Remove `flat_search` from `DiskANNIndex` by @hildebrandmw in #1076 * macos build and miri check to nightly by @harsha-simhadri in #1058 * [API] Make some methods public again by @hildebrandmw in #1081 * [benchmark] Simply `Inputs` more by @hildebrandmw in #1077 * Turn on stack protection for the diskann-garnet NuGet build by @jackmoffitt in #1082 * Fix options for diskann-garnet nuget pipeline by @jackmoffitt in #1091 * [CI] add LLVM IR bloat regression check by @arazumov in #1083 * Bump openssl from 0.10.79 to 0.10.80 by @dependabot[bot] in #1093 * [Disk CI benchmarks] Use 1ES.Pool=diskann-github by @arazumov in #869 * Fix recall computation for fewer than k groundtruth results by @magdalendobson in #1069 * bf_tree migration away from diskann-providers by @JordanMaples in #1020 * [RFC/diskann] Overhaul paged search by @hildebrandmw in #1078 * Remove unsafe code from compute_vec_l2sq by @arazumov in #1094 * Remove direct accessor call in `diskann-garnet` by @hildebrandmw in #1098 * Refactor `DiskIndexSearcher::flat_search` to use batching by @hildebrandmw in #1097 * [flat index] Flat Search Interface by @arkrishn94 in #983 * migrating multi-hop tests from diskann-providers to diskann by @JordanMaples in #928 * Significantly speed up bitmap computation by @magdalendobson in #1099 * `compute_vecs_l2sq`: Replace scalar L2 Squared norm with SIMD-optimized FastL2NormSquared by @arazumov in #1107 * [minmax] Add grid scaling to recompress API by @arkrishn94 in #1109 **Full Changelog**: v0.52.0...v0.53.0
Remove the "centroid" from most PQ interfaces. This centroid is/was the dataset centroid and served to shift a vector before compressing or populating the distance lookup table. Here are the problems:
This kind of shifting only works when computing L2 distances. This kind of shift does not preserve inner products or cosine similarity. So it is a little bit of a foot-gun and was already not being applied when these distances were being computed, leading to a somewhat confusing scenario where a centroid can be provided to the
FixedChunkPQTableconstructor and then ignored.When L2 distances are being used - application of the centroid can be done implicitly by adding it to the PQ pivots rather than subtracting it from each vector that gets processed. In this way, it always gets applied.
It's possible this has a small effect on training by conditioning the data a little better, but when we've applied this empirically in the past, it has not made a difference.
Carrying around this centroid is making it awkward to cleanly move PQ to
diskann-quantization, so this PR takes the first steps towards removing it.Logical Flow/Review Order
diskann_providers/src/model/pq/generate_pivot_arguments.rs:translate_to_centeris removed fromGeneratePivotArguments.diskann_providers/src/model/pq/pq_construction.rs: The functiongenerate_pq_pivotsgains alegacy_center_databoolean argument to preserve the behavior ofdiskann-disk. This is necessary because tests indiskann-diskcheck against hard-coded pivots. Not centering the data results in different PQ codes due to slightly different floating point calculations, breaking these tests.Data translation is completely removed from the
_membufAPIs.diskann_providers/src/model/pq/fixed_chunk_pq_table.rs: Remove centroids altogether fromFixedChunkPQTable.diskann-providers/src/storage/pq_storage.rs: This is where backwards compatibility comes in. We cannot change the layout of the storage file and we also need to support previously generated files. To that end,load_pq_pivots_binis updated to check if the loaded centroid is non-zero. If so, it is accumulated into the pivots to restore the original numerical behavior. Several tests are added to ensure this behavior.For saving, when no centroid is available, it can be passed to
write_pivot_dataas anOption::Noneand will then be inserted as zeros into the saved file.diskann-disk/src/search/pq/pq_dataset.rsanddiskann-disk/src/search/pq/quantizer_preprocess.rs: Preprocessing can now be greatly simplified. Because theFixedChunkPQTableno longer has centroids, theTransposedTable(offering much faster pre-processing times) can always be used.The rest of the changes involve updating the construction sites of
FixedChunkPQTableand invoking the new saving interface.Backwards Compatibility
There exist PQ schemas in the wild that were generated with centroids and we still need to correctly process those.
diskann-disk: PQ schemas previously generated I believe have been saved withPQStorage. Thus, reloading into aFixedChunkPQTablewithPQStorage::load_pq_pivots_binwill merge the centroid with the PQ pivots._membufAPI: Backwards compatibility is handled by the caller (the only real user I know of uses this via FFI, and I will take responsibility for updating that). Again, this can be done by simply folding the centroid into the pivot data, or manually doing the centering. Both of which are quite easy to do.