[PQ Cleanup] Part 2: Consolidate calculate_chunk_offsets* and accum_row_inplace #976
arkrishn94 wants to merge 15 commits into main
Relocates the object pool module so that it is available to crates that depend on diskann-utils but not diskann (notably diskann-quantization, which will gain pool-aware distance-table allocation in a follow-up). diskann::utils::object_pool stays as a re-export for backwards compatibility. Direct importers in diskann-providers, diskann-disk, and diskann-garnet are switched to use diskann_utils::object_pool directly. Internal diskann users continue to use the re-export.
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
Relocates PQ-related chunk offset helpers into diskann-quantization, centralizes the cosine→L2 fallback rationale for disk PQ preprocessing, and removes redundant distance trampoline impls by leaning on accessor element refs.
Changes:
- Moved calculate_chunk_offsets[_auto] from diskann-providers into diskann-quantization::views and updated call sites.
- Migrated object_pool usage to diskann-utils::object_pool across crates and removed diskann::utils::object_pool.
- Deduplicated cosine/L2 fallback commentary and removed redundant distance impls.
Reviewed changes
Copilot reviewed 20 out of 21 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| diskann/src/utils/mod.rs | Removes utils::object_pool module exposure. |
| diskann/src/graph/search/scratch.rs | Switches AsPooled import to diskann_utils. |
| diskann/src/graph/index.rs | Switches ObjectPool/PooledRef import to diskann_utils. |
| diskann-utils/src/lib.rs | Exposes object_pool module publicly from diskann-utils. |
| diskann-quantization/src/views.rs | Adds calculate_chunk_offsets[_auto] helpers. |
| diskann-providers/src/model/pq/pq_construction.rs | Removes local chunk-offset helpers and imports from quantization crate. |
| diskann-providers/src/model/pq/mod.rs | Stops re-exporting chunk-offset helpers from providers PQ module. |
| diskann-providers/src/model/pq/distance/test_utils.rs | Updates import path for calculate_chunk_offsets_auto. |
| diskann-providers/src/model/pq/distance/l2.rs | Switches object pool import to diskann_utils. |
| diskann-providers/src/model/pq/distance/innerproduct.rs | Switches object pool import to diskann_utils. |
| diskann-providers/src/model/pq/distance/dynamic.rs | Switches object pool import and removes redundant trait impls. |
| diskann-providers/src/model/pq/distance/common.rs | Switches object pool import to diskann_utils. |
| diskann-providers/src/model/mod.rs | Removes re-export of calculate_chunk_offsets_auto. |
| diskann-providers/src/model/graph/provider/async_/memory_quant_vector_provider.rs | Switches object pool import to diskann_utils. |
| diskann-providers/src/model/graph/provider/async_/fast_memory_quant_vector_provider.rs | Switches object pool import and updates tests for new argument types. |
| diskann-providers/src/model/graph/provider/async_/bf_tree/quant_vector_provider.rs | Switches object pool import to diskann_utils. |
| diskann-garnet/src/provider.rs | Switches object pool imports to diskann_utils. |
| diskann-disk/src/search/provider/disk_provider.rs | Switches object pool imports to diskann_utils. |
| diskann-disk/src/search/pq/quantizer_preprocess.rs | Centralizes cosine→L2 fallback rationale into module docs. |
| diskann-benchmark/src/backend/exhaustive/product.rs | Updates chunk-offset helper call site to diskann-quantization. |
    pub fn calculate_chunk_offsets(dimensions: usize, num_pq_chunks: usize, offsets: &mut [usize]) {
        // Calculate each chunk's offset
        // If we have 8 dimension and 3 chunks then offsets would be [0,3,6,8]
        let mut chunk_offset: usize = 0;
        offsets[0] = chunk_offset;
        for chunk_index in 0..num_pq_chunks {
            chunk_offset += dimensions / num_pq_chunks;
            if chunk_index < (dimensions % num_pq_chunks) {
                chunk_offset += 1;
            }
            offsets[chunk_index + 1] = chunk_offset;
        }
    }

As a public helper, this can panic with an unhelpful message when num_pq_chunks == 0 (division/mod by zero) or when offsets.len() != num_pq_chunks + 1 (out-of-bounds on offsets[0] / offsets[chunk_index + 1]). Consider adding an explicit check (e.g., assert!(num_pq_chunks > 0, ...) and assert_eq!(offsets.len(), num_pq_chunks + 1, ...)) or changing the API to return a Result with a clear error.
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## main #976 +/- ##
==========================================
- Coverage 89.55% 89.50% -0.05%
==========================================
Files 459 460 +1
Lines 85006 85489 +483
==========================================
+ Hits 76131 76521 +390
- Misses 8875 8968 +93
…ng_data to test helper
…up-3 # Conflicts: # diskann-disk/src/storage/quant/pq/pq_generation.rs
hildebrandmw left a comment
Turns out you updated the code midway through the review 😄
That said, I think my comments still hold.
    pub fn from_dimensions(
        dimensions: usize,
        num_pq_chunks: usize,
    ) -> Result<Self, ChunkOffsetError> {

In the spirit of bike-shedding, what about something like

    pub fn partition(
        dim: NonZeroUsize,
        len: NonZeroUsize,
    ) -> Result<Self, ChunkOffsetError> {
    }

Which would be internally consistent with the dim()/len() methods on chunk_offsets.
Also - bonus points for erroring before allocating if at all possible. The final result can still go through Self::new to make sure all the invariants are upheld, but if we can tell right away that things are not going to work out, might as well skip some unnecessary work.
I'm not completely set on NonZeroUsize, but do consider using bespoke errors here. Right now, if you pass num_pq_chunks == 0, the error is "offsets must have a length of at least 2, found 1" which is a little confusing. Perhaps even more confusing is when we pass num_pq_chunks > dim, in which case the error is "offsets must be strictly increasing ...".
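A rough sketch of what "error before allocating" with a bespoke error could look like. This is not the PR's code: `PartitionError` and its `TooManyChunks` variant are made up for illustration, and the function returns a plain `Vec<usize>` where the real constructor would presumably pass the offsets on to `Self::new`. With `NonZeroUsize` arguments the zero-chunk case is excluded by the type, so the only upfront check left is `len > dim`.

```rust
use std::fmt;
use std::num::NonZeroUsize;

/// Hypothetical bespoke error; the PR's `ChunkOffsetError` may look different.
#[derive(Debug, PartialEq, Eq)]
pub enum PartitionError {
    /// More chunks were requested than there are dimensions.
    TooManyChunks { dim: usize, len: usize },
}

impl fmt::Display for PartitionError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            Self::TooManyChunks { dim, len } => write!(
                f,
                "cannot split {} dimensions into {} chunks: each chunk needs at least one dimension",
                dim, len
            ),
        }
    }
}

impl std::error::Error for PartitionError {}

/// Splits `dim` dimensions into `len` chunks and returns the `len + 1` offsets.
/// Stand-in for the constructor discussed above (assumption: a `Vec<usize>`
/// is returned instead of `Self`).
pub fn partition(dim: NonZeroUsize, len: NonZeroUsize) -> Result<Vec<usize>, PartitionError> {
    let (dim, len) = (dim.get(), len.get());

    // Reject the invalid configuration before any allocation happens.
    if len > dim {
        return Err(PartitionError::TooManyChunks { dim, len });
    }

    let (base, remainder) = (dim / len, dim % len);
    let mut offsets = Vec::with_capacity(len + 1);
    let mut offset = 0;
    offsets.push(offset);
    for chunk in 0..len {
        // The first `remainder` chunks take one extra dimension each.
        offset += base + usize::from(chunk < remainder);
        offsets.push(offset);
    }
    Ok(offsets)
}
```

Under these assumptions, passing more chunks than dimensions reports the actual problem instead of the "strictly increasing" invariant failure mentioned above.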
Thanks Mark. Made sure to catch the errors upfront and hopefully raise them correctly. Lmk if that looks good.
I'm less convinced about the naming, from_dim and from_dim_into seem reasonable? I'm not sure I'm following the naming of partition and partition_in :p
    pub fn from_dimensions_into(
        dimensions: usize,
        num_pq_chunks: usize,
        scratch: &'a mut [usize],

We can derive num_pq_chunks from scratch.len(), in which case this can be

    pub fn partition_in(
        dim: usize,
        offsets: &mut [usize],
    )

Which has a nice symmetry with the comment on suggesting partition for the constructing case. It would also be nice if we could avoid mutating offsets if we know there will be an invalid configuration.
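For illustration, a minimal sketch of the in-place variant that derives the chunk count from the buffer length and leaves `offsets` untouched when the configuration is invalid. The `PartitionInError` type and its variants are hypothetical, and the function returns `Result<(), _>` rather than a view type, since the real `ChunkOffsetsView` constructor is not shown here.

```rust
/// Hypothetical error type for this sketch.
#[derive(Debug, PartialEq, Eq)]
pub enum PartitionInError {
    /// `offsets` needs at least two entries to describe one chunk.
    BufferTooShort { len: usize },
    /// More chunks than dimensions.
    TooManyChunks { dim: usize, chunks: usize },
}

/// Fills `offsets` with chunk boundaries for `dim` dimensions.
/// The number of chunks is implied by the buffer: `n` chunks need `n + 1` offsets.
pub fn partition_in(dim: usize, offsets: &mut [usize]) -> Result<(), PartitionInError> {
    let chunks = match offsets.len().checked_sub(1) {
        Some(chunks) if chunks > 0 => chunks,
        _ => return Err(PartitionInError::BufferTooShort { len: offsets.len() }),
    };
    if chunks > dim {
        return Err(PartitionInError::TooManyChunks { dim, chunks });
    }

    // All validation is done; only now is the caller's buffer written.
    let (base, remainder) = (dim / chunks, dim % chunks);
    let mut offset = 0;
    offsets[0] = offset;
    for (chunk, slot) in offsets[1..].iter_mut().enumerate() {
        // The first `remainder` chunks take one extra dimension each.
        offset += base + usize::from(chunk < remainder);
        *slot = offset;
    }
    Ok(())
}
```

Because both checks run before the first write, a failed call leaves the scratch buffer exactly as the caller passed it in.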
            self.data.as_mut_slice().chunks_exact_mut(ncols)
        }

        /// Apply `op(row[j], &broadcast[j])` to every entry of every row, broadcasting

Here's the rabbit hole I went down:
First, I felt that this was far too specific targeting a very niche use case and would instead be better served with

    pub fn try_map_row_mut<F, R>(&mut self, f: F) -> R
    where
        F: FnMut(&mut [T::Elem]) -> Result<(), R>

Then I realized that this could be trivially implemented with .row_iter_mut().try_for_each(f)
At which point, this made me question whether or not this is even needed.
My suggestion is to remove it if possible since it's a very specific function that can already be done almost trivially through existing means.
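To make the `.row_iter_mut().try_for_each(f)` route concrete, here is a small sketch. The `Matrix` type below is a hypothetical stand-in for `MatrixBase` (row-major `Vec<f32>`, with `row_iter_mut` built on `chunks_exact_mut` as in the quoted diff), and `add_to_rows` is an invented example of a fallible per-row broadcast.

```rust
use std::slice::ChunksExactMut;

/// Hypothetical stand-in for the matrix type under review: row-major storage.
pub struct Matrix {
    data: Vec<f32>,
    ncols: usize,
}

impl Matrix {
    pub fn new(data: Vec<f32>, ncols: usize) -> Self {
        assert!(ncols > 0 && data.len() % ncols == 0, "data must fill whole rows");
        Self { data, ncols }
    }

    pub fn as_slice(&self) -> &[f32] {
        &self.data
    }

    /// Yields each row as a mutable slice of `ncols` elements.
    pub fn row_iter_mut(&mut self) -> ChunksExactMut<'_, f32> {
        self.data.as_mut_slice().chunks_exact_mut(self.ncols)
    }
}

/// Adds `broadcast[j]` to `row[j]` for every row, stopping at the first error.
pub fn add_to_rows(matrix: &mut Matrix, broadcast: &[f32]) -> Result<(), String> {
    matrix.row_iter_mut().try_for_each(|row| {
        if row.len() != broadcast.len() {
            return Err(format!(
                "row has {} columns, broadcast has {}",
                row.len(),
                broadcast.len()
            ));
        }
        row.iter_mut().zip(broadcast).for_each(|(x, b)| *x += *b);
        Ok(())
    })
}
```

The whole operation is a single `try_for_each` call on the existing row iterator, which is the argument for not adding a dedicated method.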
Inlined this operation at the two call-sites. Removed this from MatrixBase and the associated testing.
.Couple of small independent cleanups for PQ.
Move
calculate_chunk_offsets[_auto]toChunkOffsetsas a constructor.These two functions are pure prefix-sum math over
dimensionsandnum_pq_chunks.ChunkOffsetsBase/ChunkOffsetsViewindiskann-quantization::viewsis the natural home for it. I've moved the logic into two constructors -from_dimandfrom_dim_intoforChunkOffsetsandChunkOffsetsViewrespectively to support both allocation patterns.All in-repo call sites have been updated. There might be some overlapping edits with #1010.
Minor changes
QueryComputerandDistanceComputerhad six trampoline impls forwarding&Vec<u8>and&&[u8]arguments to the canonical&[u8]impls.ElementRefin the accessor now allows us to get rid of these! This might be conflicting with Remove unnecessary distance function implementations. #1008accum_row_inplacefromdiskann-providers\src\model\pqat the two call-sites.get_chunk_from_training_datain pq_construction.rs from public API into tests where it is used.