[RFC] Better Command-Chain Testing with Index Views

## Problem Statement
We see a lot of bugs that boil down to "Command A works, command B works, but A + B doesn't work." Recent examples are #5482 and #5483. These show clear testing gaps with combining multiple commands in one query, some of which are easy to catch (e.g. prepending a `where` clause) but not currently tested.

## Current State
We have a lot of integration correctness and performance tests that use specific test indices. These do somewhat thoroughly exercise all individual commands, but these tests rarely involve carefully checking multiple commands. We also don't have great general visibility on whether certain optimization rules are applied, so `explain` tests are locked in scope to only simple cases or hand-checked complex cases.

## Long-Term Goals
* Catch more bugs with common cross-command combinations. For example, I think all commands should work after simple-predicate `where` clauses, up to some pathological counterexample, since `where` is very easy to push down.
* Keep it simple: easy to implement, largely reuse our current testing infrastructure without complex introductions like property-based tests.
* Better model real-world use cases: complex commands are rarely used directly, usually there are some pre-filtering steps.

## Proposal
I propose implementing two general migrations. Both can be incrementally done, and I believe they're mechanical enough to largely delegate to agents:
1. Introduce Views as additional options for test indices
2. Add optimization annotations to `_explain` output

### Test Views

We currently implement all of our integ tests on top of test indices. The largest indices are `TEST_INDEX_ACCOUNT` (accounting for 19.4% of all E2E query tests), followed by `TEST_INDEX_BANK` (15.7%) and `TEST_INDEX_STATE_COUNTRY` (5.5%). This means that, in order to test any given command with another command, we need to manually work out a query that exercises the new command, as well as its output.

Taking the idea of a View in SQL, we can instead use an approach where we store and reuse query parts as test indices.

Suppose we have this index for tests, with many tests relying on it:

```json
{"a": 1, "b": "cat"}
{"a": 2, "b": "dog"}
```

Now let's extend it with some fake additional fields and records:

```json
{"a": 1, "b": "cat", "_is_real": true}
{"a": 2, "b": "dog", "_is_real": true}
{"a": 3, "b": "duck", "_is_real": false}
{"a": 4, "b": "goose", "_is_real": false}
```

And we create a view: `source=extended_idx | where _is_real | fields - _is_real`

Now, for all tests that use the original `idx`, we can parametrize these tests: for both `idx` and `idx_is_real_view`, we can run the same test and expect the same results.
### Optimization Annotations

Test views work for _correctness_ tests as-is, but not _optimization_ tests. A lot of our current optimization tests are based on hardcoding `_explain` output. To support testing optimizations (e.g. pushdowns) in diverse contexts, we can improve the situation by adding context on applied optimizations in the explain output itself. i.e. current explain output resembles:
```yaml
sourceBuilder:
  query:
    bool:
      must:
        - range: { "@timestamp": { "gt": "..." } }
```

We can add a new section for optimizations that includes basic metadata:
```yaml
sourceBuilder:
  query:
    bool:
      must:
        - range: { "@timestamp": { "gt": "..." } }
_optimization_metadata:
  rules_applied:
    - FilterIndexScanRule (pushed: range on @timestamp)
    - DedupPushdownRule (pushed: composite aggregation)
  rules_attempted:
    - RareTopPushdownRule (failed: no rare/top pattern)
```

Now any tests that want to validate specific optimizations are applying can do something like `assert(plan.contains("pushed: range on @timestamp"))`.

## Approach
To do this incrementally, here's a quick sketch of an approach.

**For views:**
1. Extend our top 3 test indices with dummy values, courtesy of your friendly neighborhood data generator. Call these `EXTENDED_{index_name}` or something.
2. Create views which take `EXTENDED_...` and produce the original index, prioritizing the top indices (mostly `accounts` and `bank`). Some candidates for views:
    1. The trivial view (`source=index`)
	2. WHERE on temporal fields (incredibly common due to OSD's filter behavior)
	3. EVAL -> WHERE chains (simple derived data)
	4. STATS with WHERE
	5. Something involving `join`
	6. Something involving `sort`
    7. Something involving `rex`
3. Create some utils to make it easy to run queries on views instead of indices.
4. Migrate some correctness tests to be parametrized over these views -- can largely be done in bulk by your favorite coding agent, most of these should boil down to changing a string formatting expression. We can also do this incrementally, just prioritizing commands that seem most error-prone (e.g. dedup).

To create these views, we can probably do this with creative use of fields like `_is_real`, `_timestamp`, `_shard_key`, etc. Easy cases can have these extra fields filtered out with `fields - _*`, harder cases might take more work (e.g. self-join probably takes random fields from both sides).

Incidentally, switching to views that have specific field orderings can help us avoid some historic cases with flaky tests due to field ordering.

**For `_explain`:**
1. Implement the core optimization rules annotation system
2. Onboard a couple rules to the annotation
3. Refactor existing explain tests to check for the rules' presence in the annotations (again, can be done in bulk and incrementally).

The `_explain` tests will be tricky due to all our hardcoding, while we're at it we should try and refactor out these string assertions directly and figure out how to only check specific elements we care about. Seems like good timing.

## Alternatives
- Explore something with dynamically generating test cases, building on the work from #3220. I do have some work-in-progress experimental code where I revisited the core ideas from that RFC with a new lens, might mess with it more some time.
- Generally revisit and potentially improve our optimization system -- I have another RFC for this on the way, it just needs more research & revision.
- Just add more chain tests manually.

I'm generally in favor of this approach because it's very easy to do and lets us reuse a lot of existing tests with only minimal rework. I do think with that last point we have opportunities to do better (including more queries that model realistic use cases with messy input data), but it's a lot less bang for buck in bulk.

## Implementation Discussion
See "Approach."


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[RFC] Better Command-Chain Testing with Index Views #5505

Problem Statement

Current State

Long-Term Goals

Proposal

Test Views

Optimization Annotations

Approach

Alternatives

Implementation Discussion

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

[RFC] Better Command-Chain Testing with Index Views #5505

Description

Problem Statement

Current State

Long-Term Goals

Proposal

Test Views

Optimization Annotations

Approach

Alternatives

Implementation Discussion

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions