
Explorer FTS Track 5: GO/NO-GO decision gate #172

@rdhyee


Updated 2026-05-08 (rounds 1 + 2 per Codex review on #165). Round 1 added quality-gate cells. Round 2 sharpened: (a) "non-empty" is necessary but not sufficient — concept-only and stopword-heavy checks now require top-K relevance, not just any results; (b) added pathological-behavior hard-fails covering tokenizer parity, all-stopword queries, duplicate terms, edge-length tokens, missing display-join rows, filter composition; (c) NO-GO framing makes hosted-search a permanent contingency for either v1 failure or future v2+ quality requirements.

Sub-issue of #165. Depends on #171 (browser query prototype + benchmark data).

Goal

Mechanical decision gate: no budget renegotiation happens here. The budgets in SEARCH_INDEX_V1.md from #169 are the contract.

Decision criterion

Does the prototype meet ALL of the cells below — every latency/bytes target, every quality target, and every hard-fail check?

A fast-but-mediocre search that fails any quality cell or any hard-fail check is NOT a GO, regardless of how it performs on the latency/bytes table.
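The decision rule above is a strict conjunction over every cell, with no weighting between sections. A minimal Python sketch, assuming each table row is recorded as a hypothetical `Cell` whose tri-state `passed` field uses `None` for a still-unfilled "(fill)" value:

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Cell:
    """Hypothetical record for one row of the gate tables."""
    section: str            # "performance", "quality", or "hard-fail"
    name: str
    passed: Optional[bool]  # None means the cell is still "(fill)"


def decide(cells: Iterable[Cell]) -> str:
    """Return "GO" only if every cell was measured and passed."""
    cells = list(cells)
    if not cells:
        raise ValueError("no cells to evaluate")
    unfilled = [c.name for c in cells if c.passed is None]
    if unfilled:
        # An unmeasured cell blocks the decision instead of counting as a pass.
        raise ValueError(f"unfilled cells: {unfilled}")
    return "GO" if all(c.passed for c in cells) else "NO-GO"
```

A prototype that clears every performance cell but misses a single quality cell comes out as "NO-GO", which is exactly the fast-but-mediocre case described above.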

Performance gates (hard)

| metric | contract | prototype | pass? |
|---|---|---|---|
| cold first search (P50) | ≤ 2 s | (fill) | |
| warm repeat-same-query search | ≤ 500 ms | (fill) | |
| warm new-query-after-warm-up search | ≤ 500 ms | (fill) | |
| filter-composed cold search | ≤ 3 s | (fill) | |
| bytes transferred, cold | ≤ 5 MB | (fill) | |
| bytes transferred, warm | ≤ 1 MB | (fill) | |
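Checking the performance table is a set of threshold comparisons. The sketch below hardcodes the budgets from the table in milliseconds and bytes. The key names are hypothetical, "MB" is assumed to mean 10^6 bytes, and only the cold-first-search row is aggregated as P50 because it is the only row the table labels with a percentile; SEARCH_INDEX_V1.md (#169) stays authoritative on units and aggregation.

```python
import statistics

# Budgets from the performance table. Key names are hypothetical;
# "MB" is assumed to mean 10**6 bytes.
BUDGETS = {
    "cold_first_search_p50_ms": 2_000,
    "warm_repeat_query_ms": 500,
    "warm_new_query_ms": 500,
    "filter_composed_cold_ms": 3_000,
    "bytes_cold": 5 * 10**6,
    "bytes_warm": 1 * 10**6,
}


def p50(samples_ms):
    """Median of repeated cold-start timings, in milliseconds."""
    return statistics.median(samples_ms)


def check_performance(measured):
    """Map each budget key to True/False; a missing measurement raises KeyError."""
    return {key: measured[key] <= limit for key, limit in BUDGETS.items()}
```

Raising `KeyError` on a missing key is deliberate: an unmeasured row should block the gate rather than pass silently.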

Quality gates (hard, not advisory)

| metric | contract | prototype | pass? |
|---|---|---|---|
| top-3 overlap vs hand-labeled set | ≥ TBD% | (fill) | |
| top-10 overlap vs hand-labeled set | ≥ TBD% | (fill) | |
| top-10 overlap vs DuckDB FTS local oracle | ≥ TBD% | (fill) | |
| concept-only top-3 relevance: each of `ceramic`, `bone`, `mammal` (+ 1-2 more) returns ≥ 2 of 3 hand-labeled known-good PIDs | 2/3 each | (fill) | |
| stopword-heavy near-equivalence: top-K Jaccard between `pottery from Cyprus` and `pottery Cyprus` (stopword-stripped form) | ≥ 0.8 at top-10 | (fill) | |
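The issue does not pin down how "top-K overlap" is computed. One plausible reading, used here as an assumption, is the fraction of the reference's top-K PIDs that also appear in the prototype's top-K. The Jaccard and concept-only functions follow the wording of the last two rows directly. Results are assumed to be ordered lists of PID strings, and the function names are hypothetical.

```python
def overlap_at_k(results, reference, k):
    """Share of the reference top-k that the prototype's top-k also contains.

    Assumed definition: divides by the reference size so a hand-labeled list
    shorter than k is not penalized.
    """
    ref = set(reference[:k])
    if not ref:
        raise ValueError("reference top-k is empty")
    return len(set(results[:k]) & ref) / len(ref)


def jaccard_at_k(a, b, k=10):
    """Order-insensitive similarity of two top-k result sets."""
    sa, sb = set(a[:k]), set(b[:k])
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def concept_hit(results, known_good, k=3, need=2):
    """True if at least `need` known-good PIDs appear in the top-k."""
    return len(set(results[:k]) & set(known_good)) >= need
```

Jaccard ignores ranking order, so two lists holding the same PIDs in a different order still score 1.0; only membership differences pull the score toward the 0.8 bar.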

Numeric thresholds get filled in once #167 baseline + #171 prototype + DuckDB FTS oracle numbers land, so we know what "beats ILIKE" and "approaches BM25 oracle" mean on the canonical query set.

Hard-fail checks (any single failure = NO-GO)

Semantics

| check | pass? |
|---|---|
| Concept-only queries (`ceramic`, `bone`, `mammal`) all return non-empty results | |
| Concept-only queries hit the top-3 relevance bar in the quality table above | |
| Stopword-heavy queries (`pottery from Cyprus`) return non-empty results | |
| Stopword-heavy queries hit the near-equivalence bar in the quality table above | |
| Diacritic queries (`Çatalhöyük`) match the diacritic-stripped index | |
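Diacritic matching only works if queries and index tokens are folded the same way. A common approach, assumed here rather than taken from the #169 spec, is Unicode NFKD decomposition followed by dropping combining marks and case-folding; `fold_diacritics` is a hypothetical name.

```python
import unicodedata


def fold_diacritics(text: str) -> str:
    """Strip combining marks after NFKD decomposition, then case-fold.

    "Ç" decomposes to "C" + U+0327 and "ö" to "o" + U+0308; those combining
    marks have a nonzero combining class and are dropped.
    """
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()


# The diacritic query and the stripped index form fold to the same token.
assert fold_diacritics("Çatalhöyük") == fold_diacritics("catalhoyuk") == "catalhoyuk"
```

Letters with no canonical decomposition, such as "ø" or "ł", pass through unchanged. The JS side must apply exactly the same rule: a near-equivalent such as stripping every `\p{M}` character can diverge on scripts with spacing marks, which is the kind of drift the tokenizer-parity hard-fail check is meant to catch.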

Tokenizer + query parsing

| check | pass? |
|---|---|
| Tokenizer parity: Python and JS produce identical token sequences for every term in the curated benchmark (not just the regression set) | |
| All-stopword query (`a the of`) yields a controlled empty state with helpful copy, not an error or a full-corpus dump | |
| Duplicate terms (`pottery pottery cyprus`) produce the same top-K result identity as `pottery cyprus`, within ranking-order tolerance | |
| Empty / 1-char / very-long token queries: do not fetch broad shards; return an empty or error-with-copy state without long stalls | |
| Wildcard literals (`%`, `_`) tokenize without errors | |
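The tokenizer checks above can be exercised against a reference tokenizer. Everything in this sketch is hypothetical: the stopword list, the 2 to 40 character length bounds, and the empty-state copy are placeholders, not values from SEARCH_INDEX_V1.md. It folds diacritics, splits on every non-alphanumeric character (so `%` and `_` are separators rather than wildcards), drops stopwords and out-of-range tokens, and de-duplicates terms. `js_tokens` is assumed to be the JS tokenizer's output exported as `{term: [tokens]}`.

```python
import re
import unicodedata

# Placeholder values; the real list and bounds belong in SEARCH_INDEX_V1.md.
STOPWORDS = frozenset({"a", "an", "and", "from", "in", "of", "on", "the", "to"})
MIN_LEN, MAX_LEN = 2, 40
EMPTY_COPY = "Try a more specific term, such as a material, taxon, or place."


def tokenize(query: str) -> list[str]:
    """Fold, split on non-alphanumerics, filter, and de-duplicate in order."""
    decomposed = unicodedata.normalize("NFKD", query)
    folded = "".join(ch for ch in decomposed if not unicodedata.combining(ch)).casefold()
    tokens, seen = [], set()
    # [\W_]+ treats "%" and "_" as separators, never as SQL wildcards.
    for tok in re.split(r"[\W_]+", folded):
        if not tok or tok in STOPWORDS or not (MIN_LEN <= len(tok) <= MAX_LEN):
            continue
        if tok not in seen:  # "pottery pottery cyprus" -> ["pottery", "cyprus"]
            seen.add(tok)
            tokens.append(tok)
    return tokens


def plan_query(query: str):
    """Return ("empty", copy) instead of searching when no usable token survives."""
    tokens = tokenize(query)
    if not tokens:
        return ("empty", EMPTY_COPY)  # no search is issued, so no full-corpus scan
    return ("search", tokens)


def parity_mismatches(terms, js_tokens):
    """Terms whose Python tokens differ from the exported JS tokens."""
    mismatches = {}
    for term in terms:
        py = tokenize(term)
        if py != js_tokens.get(term):
            mismatches[term] = (py, js_tokens.get(term))
    return mismatches
```

De-duplicating at tokenize time makes the duplicate-terms check hold exactly rather than within a ranking tolerance; whether the real substrate should also do this is left to the v1 spec.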

Display + composition

| check | pass? |
|---|---|
| Missing display-join rows: a substrate hit whose pid has no row in samples_map_lite does not crash and does not silently drop a top hit (either show it with a placeholder or document it as a known limit) | |
| Filter composition matches a labeled expectation, using one of two modes per (query, filter) pair and tested on at least 3 distinct pairs. (a) The pair has a hand-labeled expected filtered top-K in tests/search_benchmark.json, and the substrate's filtered top-K must match it. (b) The filter is chosen so that ALL hand-labeled unfiltered top-K results satisfy it (e.g., a source filter whose set covers the source of every top-K result), and the filtered top-K must equal the unfiltered top-K. The invariant is tied to a labeled expectation because the earlier "implicitly satisfies" wording was too loose: a top result that does not satisfy the filter legitimately drops out, so a raw top-K change is not necessarily a bug. | |
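The two filter-composition modes and the display-join rule reduce to small comparisons. This sketch assumes results are lists of PID strings, the filter is a predicate over PIDs, and samples_map_lite rows are available as a dict keyed by pid. It treats "match" as equality of result identity, ignoring order; that is an assumption, and list equality is the drop-in replacement if ranking order should be part of the invariant. All function names and the placeholder label are hypothetical.

```python
from typing import Callable, Mapping, Sequence


def check_mode_a(filtered_top_k: Sequence[str], expected: Sequence[str]) -> bool:
    """Mode (a): the substrate's filtered top-K matches the hand-labeled expectation."""
    return set(filtered_top_k) == set(expected)


def check_mode_b(
    filtered_top_k: Sequence[str],
    unfiltered_top_k: Sequence[str],
    labeled_top_k: Sequence[str],
    satisfies: Callable[[str], bool],
) -> bool:
    """Mode (b): a filter covering every labeled result must leave top-K unchanged."""
    if not all(satisfies(pid) for pid in labeled_top_k):
        # Precondition on how the (query, filter) pair was chosen, not a substrate bug.
        raise ValueError("filter does not cover every hand-labeled top-K result")
    return set(filtered_top_k) == set(unfiltered_top_k)


def join_display(hit_pids: Sequence[str], rows: Mapping[str, dict]) -> list:
    """Keep every hit in rank order; use a placeholder when the pid has no display row."""
    return [rows.get(pid, {"pid": pid, "label": "(details unavailable)"}) for pid in hit_pids]
```

`join_display` takes the "show with placeholder" option from the first row, so a top hit can never vanish just because its display row is missing.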

Two outcomes

GO

  • All performance cells pass.
  • All quality cells pass.
  • All hard-fail checks pass.
  • Open a ship issue: remove the ?fts=v1 flag, route doSearch() permanently to the substrate path, and deprecate the ILIKE path.
  • Update query-spec.qmd:225 to describe the substrate-backed search.
  • Close #165 (Improve Interactive Explorer full-text search substrate) once the ship issue lands.

A v1 GO does not close the hosted-search-backend question. It defers it. See NO-GO framing below for why hosted search remains a permanent contingency for v2+ requirements (richer analyzers, phrase search, typo tolerance, v2 field growth).

NO-GO

  • At least one cell fails.
  • File an "Explorer FTS Track 6: Hosted-search backend" issue with:
    • the failed-cell data attached (which budgets, which quality cells, which hard-fail checks)
    • a starter requirements doc referencing the Solr searchText semantics in query-spec.qmd:213-221
    • the DuckDB FTS local-oracle numbers from #171 (Explorer FTS Track 4: Browser query prototype + benchmark) §5, as the relevance bar to clear
    • explicit framing: hosted-search is the answer if the static substrate is structurally limited; static-site constraint should not permanently cap search quality
  • Keep the ?fts=v1 flag in place as a measurement tool until the hosted backend lands.
  • Close #165 (Improve Interactive Explorer full-text search substrate) with a pointer to the hosted-search issue.
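Attaching the failed-cell data to the Track 6 issue can be mechanical too. A minimal sketch, assuming each measured cell is kept as a hypothetical `Result` record; it emits one plain-text line per failing cell, ordered as the tables appear (performance, quality, hard-fail):

```python
from dataclasses import dataclass

SECTION_ORDER = {"performance": 0, "quality": 1, "hard-fail": 2}


@dataclass(frozen=True)
class Result:
    """Hypothetical record for one measured cell."""
    section: str   # "performance", "quality", or "hard-fail"
    name: str
    contract: str
    measured: str
    passed: bool


def failed_cells_report(results) -> str:
    """One line per failing cell; sorted() is stable, so table order is kept within a section."""
    failed = sorted(
        (r for r in results if not r.passed),
        key=lambda r: SECTION_ORDER.get(r.section, len(SECTION_ORDER)),
    )
    return "\n".join(
        f"- [{r.section}] {r.name}: measured {r.measured}, contract {r.contract}"
        for r in failed
    )
```

Unknown section names sort last instead of being dropped, so no failure is lost from the report.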

Hosted-search backend as a permanent contingency

The Track 6 hosted-search-backend issue may be triggered by either:

  • (a) v1 GO/NO-GO failure — at least one cell fails the gate above.
  • (b) Post-ship v2+ requirements — even on a v1 GO, future quality requirements (phrase search, typo tolerance, richer analyzers, v2 field growth that exceeds the static substrate's byte budget) may exceed what a static-Parquet substrate can deliver. When that happens, Track 6 fires for the same reasons the budget data would have triggered it under (a).

Both triggers file the same downstream issue with the same starter requirements doc.

Refs

#165, #169, #171

Labels: enhancement (New feature or request), explorer (Interactive Explorer features)
