You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
We see a lot of bugs that boil down to "Command A works, command B works, but A + B doesn't work." Recent examples are #5482 and #5483. These show clear testing gaps with combining multiple commands in one query, some of which are easy to catch (e.g. prepending a where clause) but not currently tested.
Current State
We have a lot of integration correctness and performance tests that use specific test indices. These do somewhat thoroughly exercise all individual commands, but these tests rarely involve carefully checking multiple commands. We also don't have great general visibility on whether certain optimization rules are applied, so explain tests are locked in scope to only simple cases or hand-checked complex cases.
Long-Term Goals
Catch more bugs with common cross-command combinations. For example, I think all commands should work after simple-predicate where clauses, up to some pathological counterexample, since where is very easy to push down.
Keep it simple: easy to implement, largely reuse our current testing infrastructure without complex introductions like property-based tests.
Better model real-world use cases: complex commands are rarely used directly, usually there are some pre-filtering steps.
Proposal
I propose implementing two general migrations. Both can be incrementally done, and I believe they're mechanical enough to largely delegate to agents:
Introduce Views as additional options for test indices
Add optimization annotations to _explain output
Test Views
We currently implement all of our integ tests on top of test indices. The largest indices are TEST_INDEX_ACCOUNT (accounting for 19.4% of all E2E query tests), followed by TEST_INDEX_BANK (15.7%) and TEST_INDEX_STATE_COUNTRY (5.5%). This means that, in order to test any given command with another command, we need to manually work out a query that exercises the new command, as well as its output.
Taking the idea of a View in SQL, we can instead use an approach where we store and reuse query parts as test indices.
Suppose we have this index for tests, with many tests relying on it:
{"a": 1, "b": "cat"}
{"a": 2, "b": "dog"}
Now let's extend it with some fake additional fields and records:
And we create a view: source=extended_idx | where _is_real | fields - _is_real
Now, for all tests that use the original idx, we can parametrize these tests: for both idx and idx_is_real_view, we can run the same test and expect the same results.
Optimization Annotations
Test views work for correctness tests as-is, but not optimization tests. A lot of our current optimization tests are based on hardcoding _explain output. To support testing optimizations (e.g. pushdowns) in diverse contexts, we can improve the situation by adding context on applied optimizations in the explain output itself. i.e. current explain output resembles:
Now any tests that want to validate specific optimizations are applying can do something like assert(plan.contains("pushed: range on @timestamp")).
Approach
To do this incrementally, here's a quick sketch of an approach.
For views:
Extend our top 3 test indices with dummy values, courtesy of your friendly neighborhood data generator. Call these EXTENDED_{index_name} or something.
Create views which take EXTENDED_... and produce the original index, prioritizing the top indices (mostly accounts and bank). Some candidates for views:
The trivial view (source=index)
WHERE on temporal fields (incredibly common due to OSD's filter behavior)
EVAL -> WHERE chains (simple derived data)
STATS with WHERE
Something involving join
Something involving sort
Something involving rex
Create some utils to make it easy to run queries on views instead of indices.
Migrate some correctness tests to be parametrized over these views -- can largely be done in bulk by your favorite coding agent, most of these should boil down to changing a string formatting expression. We can also do this incrementally, just prioritizing commands that seem most error-prone (e.g. dedup).
To create these views, we can probably do this with creative use of fields like _is_real, _timestamp, _shard_key, etc. Easy cases can have these extra fields filtered out with fields - _*, harder cases might take more work (e.g. self-join probably takes random fields from both sides).
Incidentally, switching to views that have specific field orderings can help us avoid some historic cases with flaky tests due to field ordering.
For _explain:
Implement the core optimization rules annotation system
Onboard a couple rules to the annotation
Refactor existing explain tests to check for the rules' presence in the annotations (again, can be done in bulk and incrementally).
The _explain tests will be tricky due to all our hardcoding, while we're at it we should try and refactor out these string assertions directly and figure out how to only check specific elements we care about. Seems like good timing.
Alternatives
Explore something with dynamically generating test cases, building on the work from [RFC] Dynamic Correctness Testing Framework #3220. I do have some work-in-progress experimental code where I revisited the core ideas from that RFC with a new lens, might mess with it more some time.
Generally revisit and potentially improve our optimization system -- I have another RFC for this on the way, it just needs more research & revision.
Just add more chain tests manually.
I'm generally in favor of this approach because it's very easy to do and lets us reuse a lot of existing tests with only minimal rework. I do think with that last point we have opportunities to do better (including more queries that model realistic use cases with messy input data), but it's a lot less bang for buck in bulk.
Problem Statement
We see a lot of bugs that boil down to "Command A works, command B works, but A + B doesn't work." Recent examples are #5482 and #5483. These show clear testing gaps with combining multiple commands in one query, some of which are easy to catch (e.g. prepending a
whereclause) but not currently tested.Current State
We have a lot of integration correctness and performance tests that use specific test indices. These do somewhat thoroughly exercise all individual commands, but these tests rarely involve carefully checking multiple commands. We also don't have great general visibility on whether certain optimization rules are applied, so
explaintests are locked in scope to only simple cases or hand-checked complex cases.Long-Term Goals
whereclauses, up to some pathological counterexample, sincewhereis very easy to push down.Proposal
I propose implementing two general migrations. Both can be incrementally done, and I believe they're mechanical enough to largely delegate to agents:
_explainoutputTest Views
We currently implement all of our integ tests on top of test indices. The largest indices are
TEST_INDEX_ACCOUNT(accounting for 19.4% of all E2E query tests), followed byTEST_INDEX_BANK(15.7%) andTEST_INDEX_STATE_COUNTRY(5.5%). This means that, in order to test any given command with another command, we need to manually work out a query that exercises the new command, as well as its output.Taking the idea of a View in SQL, we can instead use an approach where we store and reuse query parts as test indices.
Suppose we have this index for tests, with many tests relying on it:
{"a": 1, "b": "cat"} {"a": 2, "b": "dog"}Now let's extend it with some fake additional fields and records:
{"a": 1, "b": "cat", "_is_real": true} {"a": 2, "b": "dog", "_is_real": true} {"a": 3, "b": "duck", "_is_real": false} {"a": 4, "b": "goose", "_is_real": false}And we create a view:
source=extended_idx | where _is_real | fields - _is_realNow, for all tests that use the original
idx, we can parametrize these tests: for bothidxandidx_is_real_view, we can run the same test and expect the same results.Optimization Annotations
Test views work for correctness tests as-is, but not optimization tests. A lot of our current optimization tests are based on hardcoding
_explainoutput. To support testing optimizations (e.g. pushdowns) in diverse contexts, we can improve the situation by adding context on applied optimizations in the explain output itself. i.e. current explain output resembles:We can add a new section for optimizations that includes basic metadata:
Now any tests that want to validate specific optimizations are applying can do something like
assert(plan.contains("pushed: range on @timestamp")).Approach
To do this incrementally, here's a quick sketch of an approach.
For views:
EXTENDED_{index_name}or something.EXTENDED_...and produce the original index, prioritizing the top indices (mostlyaccountsandbank). Some candidates for views:source=index)joinsortrexTo create these views, we can probably do this with creative use of fields like
_is_real,_timestamp,_shard_key, etc. Easy cases can have these extra fields filtered out withfields - _*, harder cases might take more work (e.g. self-join probably takes random fields from both sides).Incidentally, switching to views that have specific field orderings can help us avoid some historic cases with flaky tests due to field ordering.
For
_explain:The
_explaintests will be tricky due to all our hardcoding, while we're at it we should try and refactor out these string assertions directly and figure out how to only check specific elements we care about. Seems like good timing.Alternatives
I'm generally in favor of this approach because it's very easy to do and lets us reuse a lot of existing tests with only minimal rework. I do think with that last point we have opportunities to do better (including more queries that model realistic use cases with messy input data), but it's a lot less bang for buck in bulk.
Implementation Discussion
See "Approach."