Skip to content

test: add tests for spill file sizes to verify View GC#21750

Merged
adriangb merged 3 commits intoapache:mainfrom
RatulDawar:test/spill-file-size-gc
Apr 20, 2026
Merged

test: add tests for spill file sizes to verify View GC#21750
adriangb merged 3 commits intoapache:mainfrom
RatulDawar:test/spill-file-size-gc

Conversation

@RatulDawar
Copy link
Copy Markdown
Contributor

Summary

This PR adds unit tests to verify that StringView and BinaryView arrays are correctly compacted (garbage collected) before being spilled to disk.

The tests address Issue #21683 by:

  1. Creating "bloated" batches with large underlying buffers.
  2. Slicing them to a small number of rows (1%).
  3. Spilling them to disk.
  4. Asserting that the resulting file size is small (proving that GC removed the unused "ghost" data).

Test plan

  • Run cargo test -p datafusion-physical-plan --lib spill::mod::tests::test_spill_file_size_gc_verification_string_view
  • Run cargo test -p datafusion-physical-plan --lib spill::mod::tests::test_spill_file_size_gc_verification_binary_view

Made with Cursor

@github-actions github-actions bot added the physical-plan Changes to the physical-plan crate label Apr 20, 2026
@RatulDawar
Copy link
Copy Markdown
Contributor Author

RatulDawar commented Apr 20, 2026

Currently the 10KB threshold to trigger GC makes it a requirement to replicate that in tests, wonder if we can make it parametermized so that we can controller the threshold to trigger gc.

PS : Got to learn about about loser tree while deep diving into spill and sort implementation and a nice talk about it. 😄

@RatulDawar RatulDawar force-pushed the test/spill-file-size-gc branch from 7037620 to d31c248 Compare April 20, 2026 18:33
@RatulDawar
Copy link
Copy Markdown
Contributor Author

@adriangb please take a look at it.

Copy link
Copy Markdown
Contributor

@adriangb adriangb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me 😄

Comment on lines +1432 to +1435
let strings: Vec<String> = (0..num_rows)
.map(|i| format!("this_is_a_long_string_to_ensure_it_is_not_inlined_and_causes_waste_{i}"))
.collect();
let string_array = StringViewArray::from(strings);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we don't need this later I wonder if we can avoid the allocation by feeding an iterator into the StringView or at least drop it immediately to free up memory.

Update StringView and BinaryView spill tests to use iterators instead of intermediate Vec allocations, reducing unnecessary memory overhead during test setup.

Made-with: Cursor
Made-with: Cursor
@adriangb adriangb added this pull request to the merge queue Apr 20, 2026
Merged via the queue into apache:main with commit 7acbe03 Apr 20, 2026
35 checks passed
Rich-T-kid pushed a commit to Rich-T-kid/datafusion that referenced this pull request Apr 21, 2026
## Summary
This PR adds unit tests to verify that `StringView` and `BinaryView`
arrays are correctly compacted (garbage collected) before being spilled
to disk.

The tests address [Issue
apache#21683](apache#21683) by:
1. Creating "bloated" batches with large underlying buffers.
2. Slicing them to a small number of rows (1%).
3. Spilling them to disk.
4. Asserting that the resulting file size is small (proving that GC
removed the unused "ghost" data).

## Test plan
- Run `cargo test -p datafusion-physical-plan --lib
spill::mod::tests::test_spill_file_size_gc_verification_string_view`
- Run `cargo test -p datafusion-physical-plan --lib
spill::mod::tests::test_spill_file_size_gc_verification_binary_view`

Made with [Cursor](https://cursor.com)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

physical-plan Changes to the physical-plan crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants