[feature](catalog) Unify ES catalog scan through FILE_SCAN path, eliminating PluginDrivenEsScanNode #62602

morningman merged 26 commits into apache:master from morningman:catalog-spi-2 on Apr 22, 2026

@morningman
Contributor

@morningman morningman commented Apr 19, 2026

What problem does this PR solve?

Issue Number: close #xxx

Related PR: #62183

Problem Summary:

Before this change, ES catalog queries went through a dedicated
PluginDrivenEsScanNode → ES_HTTP_SCAN_NODE → EsScanner code path,
while JDBC catalogs used PluginDrivenScanNode → FILE_SCAN_NODE →
FileScanner. The two paths had completely separate plan-generation logic,
Thrift structures, and BE execution flows, making the connector SPI harder
to maintain and extend.

This PR routes ES catalog scans through the same FILE_SCAN_NODE path that
JDBC already uses, achieving a single unified scan-node implementation for
all plugin-driven connectors. The key changes are:

Thrift layer

  • Add FORMAT_ES_HTTP = 19 to TFileFormatType
  • Add es_params (per-shard map) to TTableFormatFileDesc
  • Add es_properties, es_docvalue_context, es_fields_context to
    TFileScanRangeParams (shared across shards)

BE — EsHttpReader (new, native C++)

  • New GenericReader subclass in be/src/format/table/es_http_reader.{h,cpp}
  • Wraps existing ESScanReader (HTTP scroll) and ScrollParser (JSON→columns)
  • No JVM overhead — pure C++ HTTP reader, unlike JniReader used by some formats
  • Registered in FileScanner::_get_next_reader() for FORMAT_ES_HTTP
  • Implements get_columns(); fill_all_columns() returns true

FE — Connector SPI extension

  • Add ScanNodePropertiesResult class: typed wrapper carrying both scan
    properties and a notPushedConjunctIndices set with explicit
    hasConjunctTracking flag (distinguishes "no tracking" from "all pushed")
  • Add getScanNodePropertiesResult() to ConnectorScanPlanProvider
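
To make the "no tracking" versus "all pushed" distinction concrete, here is a minimal Java sketch of such a typed result together with a backward-compatible default method. The class and interface names carry a `Sketch` suffix because they are hypothetical stand-ins, not the actual Doris classes; the factory method names are likewise assumptions.

```java
import java.util.Collections;
import java.util.Map;
import java.util.Set;

/** Hypothetical stand-in for the typed result described above; not the Doris class. */
final class ScanNodePropertiesResultSketch {
    private final Map<String, String> properties;
    private final Set<Integer> notPushedConjunctIndices;
    private final boolean hasConjunctTracking;

    private ScanNodePropertiesResultSketch(Map<String, String> properties,
            Set<Integer> notPushedConjunctIndices, boolean hasConjunctTracking) {
        this.properties = Collections.unmodifiableMap(properties);
        this.notPushedConjunctIndices = Collections.unmodifiableSet(notPushedConjunctIndices);
        this.hasConjunctTracking = hasConjunctTracking;
    }

    /** The connector reports nothing about pushdown, so the engine keeps every conjunct. */
    static ScanNodePropertiesResultSketch withoutTracking(Map<String, String> properties) {
        return new ScanNodePropertiesResultSketch(
                properties, Collections.<Integer>emptySet(), false);
    }

    /** The connector tracks pushdown; an empty set here means every conjunct was pushed. */
    static ScanNodePropertiesResultSketch withTracking(Map<String, String> properties,
            Set<Integer> notPushedConjunctIndices) {
        return new ScanNodePropertiesResultSketch(properties, notPushedConjunctIndices, true);
    }

    Map<String, String> getProperties() {
        return properties;
    }

    Set<Integer> getNotPushedConjunctIndices() {
        return notPushedConjunctIndices;
    }

    boolean hasConjunctTracking() {
        return hasConjunctTracking;
    }
}

/** Hypothetical provider interface showing a default method that wraps the old one. */
interface ScanPlanProviderSketch {
    Map<String, String> getScanNodeProperties();

    /** Connectors that do not override this keep their old behavior (no tracking). */
    default ScanNodePropertiesResultSketch getScanNodePropertiesResult() {
        return ScanNodePropertiesResultSketch.withoutTracking(getScanNodeProperties());
    }
}
```

Without the explicit flag, an empty index set would be ambiguous: it could mean either that the connector pushed every filter or that it never reported anything.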

FE — ES connector adaptation

  • EsScanRange: changed to FILE_SCAN range type, added getPath(),
    getTableFormatType(), getFileFormat(), BE-compatible property keys
  • EsScanPlanProvider: builds scan node properties with query_dsl,
    doc_values_mode, auth, docvalue/fields context serialization
  • EsConnectorMetadata: added getColumnHandles() using NamedColumnHandle

FE — PluginDrivenScanNode enhancement

  • Added ES format mapping (es_http → FORMAT_ES_HTTP)
  • Added setEsParams() / setEsScanLevelParams() for ES Thrift fields
  • Added conjunct pruning via ScanNodePropertiesResult (removes pushed-down
    filters including esquery() fake function)
  • Added ES-specific EXPLAIN output

FE — Cleanup

  • Deleted PluginDrivenEsScanNode.java (390 lines)
  • Simplified PhysicalPlanTranslator: removed ES-specific branch, all
    plugin-driven connectors now go through PluginDrivenScanNode

Release note

ES catalog scans now use the unified FILE_SCAN execution path shared with
JDBC and other plugin-driven connectors. This is an internal architectural
change — query behavior and results are unchanged. The esquery() function
and doc-value optimization continue to work as before.

Check List (For Author)

  • Test: Regression test — all 7 ES test suites pass
    • test_es_query
    • test_es_query_nereids
    • test_es_flatten_type
    • test_es_keyword_array_type
    • test_es_query_no_http_url
    • test_es_query_predicate_correctness
    • test_es_catalog_http_open_api
  • Behavior changed: No
  • Does this need documentation: No

@morningman morningman changed the title from "Catalog spi 2" to "[feature](catalog) Unify ES catalog scan through FILE_SCAN path, eliminating PluginDrivenEsScanNode" on Apr 19, 2026
@morningman
Contributor Author

run buildall

@hello-stephen
Contributor

FE Regression Coverage Report

Increment line coverage 12.42% (19/153) 🎉

morningman added a commit to morningman/doris that referenced this pull request Apr 20, 2026
… member, delete ES_SCAN enum

### What problem does this PR solve?

Issue Number: close #xxx

Related PR: apache#62602

Problem Summary: Address P2 code quality issues identified in PR review:
- P2-A: Four inline `new ObjectMapper()` calls in PluginDrivenScanNode create
  unnecessary garbage; replaced with static OBJECT_MAPPER + MAP_TYPE_REF fields
- P2-B: Unused `_es_properties` member in EsHttpReader.h wastes memory
- P2-D: `ES_SCAN` enum value in ConnectorScanRangeType is unreferenced after
  ES was unified into FILE_SCAN path; deleted to avoid confusion
- P2-E: Thrift comment for es_params listed wrong key names (`es.index` etc.);
  corrected to actual keys (`index`, `type`, `shard_id`, `host_port`, `es_hosts`)

### Release note

None

### Check List (For Author)

- Test: Regression test (all 7 ES suites pass) / FE + BE build verified
- Behavior changed: No
- Does this need documentation: No

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@morningman morningman requested a review from CalvinKirs as a code owner April 20, 2026 01:12
@morningman
Contributor Author

run buildall

@hello-stephen
Contributor

Cloud UT Coverage Report

Increment line coverage 🎉


Category Coverage
Function Coverage 78.02% (1842/2361)
Line Coverage 64.60% (32921/50958)
Region Coverage 65.18% (16318/25034)
Branch Coverage 55.77% (8709/15616)

@hello-stephen
Contributor

FE Regression Coverage Report

Increment line coverage 29.55% (26/88) 🎉

@hello-stephen
Contributor

BE Regression && UT Coverage Report

Increment line coverage 89.92% (116/129) 🎉


Category Coverage
Function Coverage 73.79% (27609/37418)
Line Coverage 57.50% (298382/518904)
Region Coverage 54.69% (248243/453938)
Branch Coverage 56.28% (107439/190903)

@morningman
Contributor Author

run buildall

@hello-stephen
Contributor

FE UT Coverage Report

Increment line coverage 0.00% (0/66) 🎉

morningman added a commit to morningman/doris that referenced this pull request Apr 20, 2026
…DrivenScanNode

### What problem does this PR solve?

Issue Number: close #xxx

Related PR: apache#62602

Problem Summary: PR apache#62602 unified ES catalog scanning through FILE_SCAN path
(EsHttpReader) but lost the terminate_after limit pushdown optimization that
existed in the old EsScanOperatorX. Without this optimization, queries like
`SELECT * FROM es_table LIMIT 10` use scroll mode to fetch all data from ES
instead of a single _search request with terminate_after=10, causing a
significant performance regression.

### Release note

Restore ES terminate_after limit pushdown optimization: when a LIMIT clause
is present, all predicates are pushed down to ES, and the limit fits within
one batch, Doris now sends a single ES _search request with terminate_after
instead of scrolling all results. EXPLAIN output shows "ES terminate_after: N"
when this optimization is active.
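
The three conditions in this release note can be sketched as a small Java helper. This is an illustration under stated assumptions, not the code in PluginDrivenScanNode: the class name, the parameter names, and the `-1` "do not use terminate_after" sentinel are all hypothetical.

```java
/** Hypothetical sketch of the decision described in the release note above. */
final class TerminateAfterSketch {
    /** Sentinel (an assumption of this sketch) meaning "fall back to scroll mode". */
    static final long NO_TERMINATE_AFTER = -1;

    /**
     * Returns the terminate_after value to send with a single _search request,
     * or NO_TERMINATE_AFTER when the scan should scroll instead.
     *
     * @param limit               LIMIT from the query; values <= 0 mean "no LIMIT" here
     * @param allPredicatesPushed true when no filter is left for Doris to evaluate
     * @param batchSize           number of rows a single request can return
     */
    static long chooseTerminateAfter(long limit, boolean allPredicatesPushed, int batchSize) {
        boolean hasLimit = limit > 0;
        boolean fitsInOneBatch = limit <= batchSize;
        if (hasLimit && allPredicatesPushed && fitsInOneBatch) {
            return limit;
        }
        return NO_TERMINATE_AFTER;
    }
}
```

The "all predicates pushed" condition matters because a filter that Doris still has to apply after the scan could discard some of the N documents that ES returned, leaving fewer rows than the LIMIT asked for.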

### Check List (For Author)

- Test: Manual test / Regression test needed
- Behavior changed: No
- Does this need documentation: No

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@morningman
Contributor Author

run buildall

@hello-stephen
Contributor

Cloud UT Coverage Report

Increment line coverage 🎉


Category Coverage
Function Coverage 78.06% (1843/2361)
Line Coverage 64.70% (32965/50947)
Region Coverage 65.22% (16360/25085)
Branch Coverage 55.77% (8729/15652)

@suxiaogang223
Member

run external

@morningman
Contributor Author

run buildall

morningman and others added 5 commits April 21, 2026 20:37
… scan path

### What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary: Add Thrift definitions to support routing ES scans through the
unified FILE_SCAN_NODE path (same as JDBC). This is the foundation for
eliminating the separate PluginDrivenEsScanNode/ES_HTTP_SCAN_NODE code path.

Changes:
- Add FORMAT_ES_HTTP = 19 to TFileFormatType enum
- Add es_params (field 12) to TTableFormatFileDesc for per-shard ES parameters
- Add es_properties (field 31), es_docvalue_context (field 32), and
  es_fields_context (field 33) to TFileScanRangeParams for per-node ES parameters

### Release note

None

### Check List (For Author)

- Test: FE and BE compilation verified
- Behavior changed: No
- Does this need documentation: No

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…path

### What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary: Create EsHttpReader, a native C++ GenericReader subclass that
wraps existing ESScanReader and ScrollParser to support ES scanning through the
unified FILE_SCAN_NODE path (same path as JDBC, Parquet, etc.).

Key design points:
- Extends GenericReader directly (no JNI overhead)
- Reuses existing ESScanReader (HTTP scroll API) and ScrollParser (JSON parsing)
- Reads ES parameters from TFileScanRangeParams (es_properties, es_docvalue_context,
  es_fields_context) and TTableFormatFileDesc (es_params)
- Builds query DSL via ESScrollQueryBuilder::build() in init_reader()
- fill_all_columns() returns true (fills all columns from ES response)

### Release note

None

### Check List (For Author)

- Test: BE compilation verified
- Behavior changed: No
- Does this need documentation: No

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…patch

### What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary: Register the EsHttpReader in FileScanner::_get_next_reader()
dispatch so that FORMAT_ES_HTTP scan ranges are handled by the unified
FILE_SCAN path. Also fixes forward declarations and destructor visibility
for EsHttpReader.

Changes:
- Add FORMAT_ES_HTTP case in FileScanner::_get_next_reader() switch
- Add es_http_reader.h include to file_scanner.cpp
- Fix forward declaration tags (class vs struct) for TFileScanRangeParams
- Move EsHttpReader destructor to .cpp (unique_ptr needs complete type)

### Release note

None

### Check List (For Author)

- Test: BE compilation verified
- Behavior changed: No
- Does this need documentation: No

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…pushdown metadata

### What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary:
The existing ConnectorScanPlanProvider.getScanNodeProperties() returns only
a Map<String,String>, forcing connectors like ES to encode filter pushdown
metadata (not-pushed conjunct indices) as serialized strings in the properties
map. This is fragile and couples the engine to a magic key convention.

This commit introduces:
1. ScanNodePropertiesResult - a typed wrapper that carries both the properties
   map and a Set<Integer> of not-pushed conjunct indices
2. ConnectorScanPlanProvider.getScanNodePropertiesResult() - a new default
   method that wraps getScanNodeProperties() for backward compatibility

Connectors that perform fine-grained conjunct pushdown (ES query DSL) can
override getScanNodePropertiesResult() to return structured pushdown results.
The engine (PluginDrivenScanNode) will call this method instead of parsing
magic string keys.

### Release note

None

### Check List (For Author)

- Test: No need to test - SPI interface addition with default implementation
- Behavior changed: No
- Does this need documentation: No

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ilter pushdown

### What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary:
ES connector used ConnectorScanRangeType.ES_SCAN which required a separate
PluginDrivenEsScanNode and distinct Thrift path (TEsScanRange / ES_HTTP_SCAN_NODE).
This prevented unification with the JDBC/file-based scan path.

This commit changes EsScanRange and EsScanPlanProvider to route through FILE_SCAN:

1. EsScanRange: getRangeType() returns FILE_SCAN, adds getPath() ("es://index/shard"),
   getTableFormatType() ("es"), getFileFormat() ("es_http")
2. EsScanPlanProvider: getScanRangeType() returns FILE_SCAN, adds file_format_type=es_http
   to properties for PluginDrivenScanNode.getFileFormatType() mapping
3. Replaces string-serialized _not_pushed_conjunct_indices with structured
   getScanNodePropertiesResult() returning ScanNodePropertiesResult with Set<Integer>
4. Updates test to expect FILE_SCAN instead of ES_SCAN

### Release note

None

### Check List (For Author)

- Test: Regression test (unit test updated)
- Behavior changed: No (ES connector not yet wired through PluginDrivenScanNode)
- Does this need documentation: No

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
morningman and others added 16 commits April 21, 2026 20:37
…compatible keys

### What problem does this PR solve?

Issue Number: close #xxx

Problem Summary: After changing EsScanRange.getProperties() to use
BE-compatible keys (index, type, shard_id, host_port), the PROP_*
constants and unit tests still referenced the old es.* prefixed keys,
causing EsNodeInfoAndScanRangeTest to fail.

Updated PROP_INDEX/PROP_TYPE/PROP_SHARD_ID to match new key names,
replaced PROP_HOSTS with PROP_HOST_PORT, and fixed all test assertions.

### Release note

None

### Check List (For Author)

- Test: Unit Test (EsNodeInfoAndScanRangeTest passes)
- Behavior changed: No
- Does this need documentation: No

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
### What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary: Previously EsScanRange only sent the first ES host
(host_port) to BE, losing the full list of shard replica hosts. This
prevented BE from selecting a local ES node for data locality, and
removed failover capability when a single ES node is down.

The old BE code in es_scan_operator.cpp had a get_host_and_port()
function that preferred the local host via BackendOptions::get_localhost()
from a list of candidate hosts. The new unified scan path was missing
this locality-aware selection.

### Changes

FE side:
- Added PROP_ES_HOSTS constant to EsScanRange
- getProperties() now includes "es_hosts" key with comma-separated full
  host list alongside the existing "host_port" (first host as fallback)
- Updated unit test to assert PROP_ES_HOSTS value

BE side:
- Added _select_host() method to EsHttpReader that parses the es_hosts
  comma-separated list and prefers the host matching the local backend
  via BackendOptions::get_localhost()
- Added KEY_ES_HOSTS static constant
- init_reader() now calls _select_host() instead of directly using
  host_port, then writes the selected host back to properties
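
The locality-aware selection is implemented in C++ on the BE side; the following Java sketch only illustrates the idea under simplifying assumptions. The class and method names are hypothetical, the local host is passed in as a plain string rather than read from BackendOptions, and IPv6 literals are deliberately not handled here.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: prefer the candidate whose host equals the local backend host. */
final class EsHostSelectionSketch {
    /**
     * @param esHosts          comma-separated host:port list (the "es_hosts" property)
     * @param fallbackHostPort single host:port used when the list is missing or empty
     * @param localHost        address of the local backend
     */
    static String selectHost(String esHosts, String fallbackHostPort, String localHost) {
        List<String> candidates = new ArrayList<>();
        if (esHosts != null) {
            for (String part : esHosts.split(",")) {
                String trimmed = part.trim();
                if (!trimmed.isEmpty()) {
                    candidates.add(trimmed);
                }
            }
        }
        // Prefer a node running on the same machine as this backend.
        for (String candidate : candidates) {
            if (hostOf(candidate).equals(localHost)) {
                return candidate;
            }
        }
        // No local node: use the first candidate, or the fallback if there is none.
        return candidates.isEmpty() ? fallbackHostPort : candidates.get(0);
    }

    /** Strips an optional scheme and the trailing ":port" (IPv4 or hostname only). */
    static String hostOf(String hostPort) {
        String s = hostPort;
        int scheme = s.indexOf("://");
        if (scheme >= 0) {
            s = s.substring(scheme + 3);
        }
        int colon = s.lastIndexOf(':');
        return colon >= 0 ? s.substring(0, colon) : s;
    }
}
```

For example, `selectHost("10.0.0.1:9200,10.0.0.2:9200", "10.0.0.1:9200", "10.0.0.2")` picks the second entry because it matches the local host.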

### Release note

None

### Check List (For Author)

- Test: Unit Test (EsNodeInfoAndScanRangeTest) / Regression test (all 7 ES suites pass)
- Behavior changed: No
- Does this need documentation: No

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
### What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary: Two issues in PluginDrivenScanNode:

1. Double computation: pruneConjunctsFromNodeProperties() called
   getScanNodePropertiesResult() which recomputed properties from
   scratch, while getOrLoadScanNodeProperties() had its own separate
   cache via getScanNodeProperties(). Both methods built column handles
   and filters redundantly.

2. Conjunct index mismatch: buildRemainingFilter() filters out CAST
   expressions when the connector does not support CAST pushdown. The
   connector then returns not-pushed indices relative to the filtered
   list, but pruneConjunctsFromNodeProperties() used those indices
   against the original conjuncts list, which could cause wrong
   conjuncts to be pruned.

Changes:
- Added getOrLoadPropertiesResult() that caches a single
  ScanNodePropertiesResult, used by both getOrLoadScanNodeProperties()
  and pruneConjunctsFromNodeProperties()
- Added filteredToOriginalIndex mapping in buildRemainingFilter() that
  tracks which original conjunct indices correspond to the filtered
  (CAST-removed) list
- pruneConjunctsFromNodeProperties() now translates connector indices
  back to original indices via the mapping, and retains CAST conjuncts
  that were never sent to the connector
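
The index translation can be sketched in Java as follows. This is a simplified illustration, not the PluginDrivenScanNode code: conjuncts are modeled as plain strings, "contains CAST(" stands in for real expression inspection, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch of mapping connector-relative indices back to original ones. */
final class ConjunctPruningSketch {
    /** Entry i holds the original index of the i-th conjunct sent to the connector. */
    static List<Integer> buildFilteredToOriginalIndex(List<String> conjuncts) {
        List<Integer> mapping = new ArrayList<>();
        for (int i = 0; i < conjuncts.size(); i++) {
            // Assumption of this sketch: CAST conjuncts are never sent to the connector.
            if (!conjuncts.get(i).contains("CAST(")) {
                mapping.add(i);
            }
        }
        return mapping;
    }

    /** Returns the conjuncts the engine must still evaluate after pushdown. */
    static List<String> prune(List<String> conjuncts, List<Integer> filteredToOriginal,
            Set<Integer> notPushedFilteredIndices) {
        Set<Integer> sent = new HashSet<>(filteredToOriginal);
        Set<Integer> keep = new HashSet<>();
        // Conjuncts that were never sent to the connector are always retained.
        for (int i = 0; i < conjuncts.size(); i++) {
            if (!sent.contains(i)) {
                keep.add(i);
            }
        }
        // Translate the connector's indices (relative to the filtered list) back.
        for (int filteredIdx : notPushedFilteredIndices) {
            keep.add(filteredToOriginal.get(filteredIdx));
        }
        List<String> remaining = new ArrayList<>();
        for (int i = 0; i < conjuncts.size(); i++) {
            if (keep.contains(i)) {
                remaining.add(conjuncts.get(i));
            }
        }
        return remaining;
    }
}
```

With conjuncts `[CAST(a AS INT) = 1, b = 2, c LIKE 'x%']` the mapping is `[1, 2]`. If the connector reports filtered index 1 as not pushed, that refers to `c LIKE 'x%'` (original index 2). Applying index 1 directly to the original list would wrongly keep `b = 2` and drop the LIKE filter.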

### Release note

None

### Check List (For Author)

- Test: Regression test (all 7 ES suites pass)
- Behavior changed: No
- Does this need documentation: No

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
… member, delete ES_SCAN enum

### What problem does this PR solve?

Issue Number: close #xxx

Related PR: apache#62602

Problem Summary: Address P2 code quality issues identified in PR review:
- P2-A: Four inline `new ObjectMapper()` calls in PluginDrivenScanNode create
  unnecessary garbage; replaced with static OBJECT_MAPPER + MAP_TYPE_REF fields
- P2-B: Unused `_es_properties` member in EsHttpReader.h wastes memory
- P2-D: `ES_SCAN` enum value in ConnectorScanRangeType is unreferenced after
  ES was unified into FILE_SCAN path; deleted to avoid confusion
- P2-E: Thrift comment for es_params listed wrong key names (`es.index` etc.);
  corrected to actual keys (`index`, `type`, `shard_id`, `host_port`, `es_hosts`)

### Release note

None

### Check List (For Author)

- Test: Regression test (all 7 ES suites pass) / FE + BE build verified
- Behavior changed: No
- Does this need documentation: No

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ization

### What problem does this PR solve?

Problem Summary: PluginDrivenScanNode contains ~350 lines of format-specific
Thrift translation code dispatched via if-else chains. This violates OCP and
makes the class a "god class". To fix this, we add SPI interface methods that
let each connector handle its own Thrift param construction, EXPLAIN output,
and table serialization.

This commit adds 4 new default methods to the SPI interfaces:
- ConnectorScanRange.populateRangeParams() - per-range Thrift construction
- ConnectorScanPlanProvider.populateScanLevelParams() - scan-level Thrift params
- ConnectorScanPlanProvider.appendExplainInfo() - connector EXPLAIN output
- ConnectorScanPlanProvider.getSerializedTable() - serialized table for BE

All methods have default implementations preserving current behavior.
No existing logic is changed.
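
As a rough Java illustration of why default methods preserve current behavior, consider the sketch below. All names ending in `Sketch` are hypothetical, and a plain `Map<String, String>` stands in for the Thrift structures, which are not reproduced here.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical scan-range SPI with a no-op default hook. */
interface ConnectorScanRangeSketch {
    Map<String, String> getProperties();

    /** Default does nothing, so connectors that do not override it are unaffected. */
    default void populateRangeParams(Map<String, String> rangeParams) {
    }
}

/** Hypothetical ES range that fills its own per-shard parameters. */
final class EsScanRangeSketch implements ConnectorScanRangeSketch {
    private final Map<String, String> properties;

    EsScanRangeSketch(Map<String, String> properties) {
        this.properties = properties;
    }

    @Override
    public Map<String, String> getProperties() {
        return properties;
    }

    @Override
    public void populateRangeParams(Map<String, String> rangeParams) {
        // The connector, not the scan node, decides what goes into the params.
        rangeParams.putAll(properties);
    }
}

/** Hypothetical engine side: one call, no per-format if-else chain. */
final class ScanNodeSketch {
    static Map<String, String> toRangeParams(ConnectorScanRangeSketch range) {
        Map<String, String> params = new LinkedHashMap<>();
        range.populateRangeParams(params);
        return params;
    }
}
```

Adding a new connector in this shape means implementing the interface in the connector module; the engine-side call site stays the same.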

### Release note

None

### Check List (For Author)

- Test: No need to test - pure interface additions with default implementations
- Behavior changed: No
- Does this need documentation: No

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…o, Hive connectors

### What problem does this PR solve?

Problem Summary: As part of P2-F SPI anti-bloat refactoring, each connector
needs its own populateRangeParams() override so that format-specific Thrift
construction logic lives in the connector module rather than in
PluginDrivenScanNode (god class pattern).

This commit adds:
- EsScanRange.populateRangeParams(): sets es_params on TTableFormatFileDesc
- MaxComputeScanRange.populateRangeParams(): constructs TMaxComputeFileDesc
  and mutates rangeDesc path/offset/size
- TrinoScanRange.populateRangeParams(): constructs TTrinoConnectorFileDesc
  with JSON options parsing
- HiveScanRange.populateRangeParams(): no-op for plain hive, constructs
  TTransactionalHiveDesc for transactional_hive
- EsScanPlanProvider.populateScanLevelParams(): builds es_properties map,
  deserializes docvalue_context and fields_context JSON
- EsScanPlanProvider.appendExplainInfo(): ES-specific EXPLAIN output
- Added fe-thrift (provided scope) and fe-connector-api deps to Hive,
  MaxCompute, Trino connector pom.xml files

### Release note

None

### Check List (For Author)

- Test: FE compilation verified
- Behavior changed: No
- Does this need documentation: No

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ectors

### What problem does this PR solve?

Problem Summary: Continue P2-F SPI anti-bloat refactoring for the two most
complex connectors: Hudi (format downgrade logic) and Paimon (deletion file,
count pushdown, JNI/native branching).

This commit adds:
- HudiScanRange.populateRangeParams(): dynamic JNI→native format downgrade,
  full JNI metadata (instant_time, serde, delta_logs etc.), partition values
- PaimonScanRange.populateRangeParams(): JNI vs native branching, deletion
  file sub-struct, row count pushdown, partition values with null tracking
- PaimonScanPlanProvider.populateScanLevelParams(): predicate + options_json
- PaimonScanPlanProvider.getSerializedTable(): returns serialized_table prop
- Added fe-thrift (provided) and fe-connector-api deps to Hudi and Paimon
  connector pom.xml files

### Release note

None

### Check List (For Author)

- Test: FE compilation verified
- Behavior changed: No
- Does this need documentation: No

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…connector SPI

### What problem does this PR solve?

Problem Summary: PluginDrivenScanNode contained ~370 lines of format-specific
Thrift construction code (setMaxComputeParams, setTrinoConnectorParams,
setHiveParams, setTransactionalHiveParams, setHudiParams, setPaimonParams,
setEsParams, setPaimonScanLevelParams, setEsScanLevelParams, copyIfPresent)
plus ES-specific EXPLAIN logic and Paimon-specific getSerializedTable. This
"god class" pattern violated separation of concerns and made adding new
connectors require modifying fe-core.

### Release note

None

### Check List (For Author)

- Test: FE compilation verified (sh build.sh --fe)
- Behavior changed: No
- Does this need documentation: No

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
### What problem does this PR solve?

Problem Summary: After rebasing onto latest master, GenericReader was
refactored with Template Method pattern: get_next_block is now non-virtual
(calls _do_get_next_block), get_columns is now non-virtual (calls
_get_columns_impl), and fill_all_columns was removed. EsHttpReader still
used the old virtual method signatures causing compilation failure.
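
The Template Method shape described here is C++ in the BE; the Java sketch below shows the same structure, with `final` playing the role of a non-virtual method. The class names are hypothetical, and the row counter in the base class is an invented example of shared logic, not something the source describes.

```java
import java.util.Iterator;
import java.util.List;

/** Hypothetical base reader: the public entry point is fixed, the hook is overridable. */
abstract class GenericReaderSketch {
    private long rowsRead = 0;

    /** Non-overridable entry point, analogous to the non-virtual get_next_block. */
    public final int getNextBlock(List<String> block) {
        int rows = doGetNextBlock(block);
        rowsRead += rows; // shared bookkeeping lives in one place (assumed for this sketch)
        return rows;
    }

    /** Subclass hook, analogous to _do_get_next_block. */
    protected abstract int doGetNextBlock(List<String> block);

    public final long rowsRead() {
        return rowsRead;
    }
}

/** Hypothetical reader that serves rows from an in-memory list in batches. */
final class EsHttpReaderSketch extends GenericReaderSketch {
    private final Iterator<String> rows;
    private final int batchSize;

    EsHttpReaderSketch(List<String> rows, int batchSize) {
        this.rows = rows.iterator();
        this.batchSize = batchSize;
    }

    @Override
    protected int doGetNextBlock(List<String> block) {
        int n = 0;
        while (n < batchSize && rows.hasNext()) {
            block.add(rows.next());
            n++;
        }
        return n;
    }
}
```

A subclass that still tried to override the public entry point would fail to compile, which mirrors the build failure this commit fixes.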

### Release note

None

### Check List (For Author)

- Test: BE compilation verified, all 7 ES regression tests pass
- Behavior changed: No
- Does this need documentation: No

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…format/table/es/

### What problem does this PR solve?

Problem Summary: After unifying ES scans through FILE_SCAN_NODE, the old
ES_HTTP_SCAN_NODE scan path (EsScanner, EsScanOperatorX, pipeline
registration) became dead code. Additionally, ES-related reading code was
scattered across be/src/exec/es/ and be/src/format/table/.

Changes:
- Deleted old scan path: EsScanner, EsScanOperatorX (4 files, ~520 lines)
- Removed ES_SCAN_NODE/ES_HTTP_SCAN_NODE case from pipeline_fragment_context
- Removed EsScanLocalState template instantiations and operator declarations
- Moved all ES reading code (ESScanReader, ScrollParser, ESScrollQueryBuilder,
  EsHttpReader) from be/src/exec/es/ and be/src/format/table/ to
  be/src/format/table/es/
- Updated all include paths accordingly

### Release note

None

### Check List (For Author)

- Test: BE compilation verified, all 7 ES regression tests pass
- Behavior changed: No
- Does this need documentation: No

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…DrivenScanNode

### What problem does this PR solve?

Issue Number: close #xxx

Related PR: apache#62602

Problem Summary: PR apache#62602 unified ES catalog scanning through FILE_SCAN path
(EsHttpReader) but lost the terminate_after limit pushdown optimization that
existed in the old EsScanOperatorX. Without this optimization, queries like
`SELECT * FROM es_table LIMIT 10` use scroll mode to fetch all data from ES
instead of a single _search request with terminate_after=10, causing a
significant performance regression.

### Release note

Restore ES terminate_after limit pushdown optimization: when a LIMIT clause
is present, all predicates are pushed down to ES, and the limit fits within
one batch, Doris now sends a single ES _search request with terminate_after
instead of scrolling all results. EXPLAIN output shows "ES terminate_after: N"
when this optimization is active.

### Check List (For Author)

- Test: Manual test / Regression test needed
- Behavior changed: No
- Does this need documentation: No

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ate_after)

### What problem does this PR solve?

Problem Summary: Adds regression tests to verify the ES terminate_after
limit pushdown optimization works correctly after the fix in
PluginDrivenScanNode. Tests cover both ES7 and ES8 catalogs.

### Release note

None

### Check List (For Author)

- Test: Regression test (PASSED - all cases pass against live ES7/ES8)
- Behavior changed: No
- Does this need documentation: No

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
### What problem does this PR solve?

Issue Number: None

Related PR: None

Problem Summary: Restore missing RuntimeProfile counters for EsHttpReader so ES HTTP scans expose read, materialize, batch, and row metrics.

### Release note

None

### Check List (For Author)

- Test: BE build and unit test
    - Unit Test: ./run-be-ut.sh --run --filter=EsHttpReaderTest.* -j 8
- Behavior changed: No
- Does this need documentation: No

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

### What problem does this PR solve?

Issue Number: None

Related PR: None

Problem Summary: Fix Elasticsearch connector IPv6 host parsing and formatting across FE scan planning and BE host selection so IPv6 nodes are preserved correctly.

### Release note

Fix ES connector host parsing for IPv6 addresses in FE scan planning and BE host selection.
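
To illustrate the bracket convention mentioned here, the following is a minimal sketch of host:port parsing and formatting that keeps IPv6 literals intact. It is not the connector's actual code: the `HostPort` class, its fields, and the choice to treat an unbracketed multi-colon string as a bare IPv6 literal without a port are assumptions made for this example.

```java
// Hypothetical value type, for illustration only; not part of Doris.
final class HostPort {
    final String host; // stored without brackets
    final int port;    // -1 when no port was given

    HostPort(String host, int port) {
        this.host = host;
        this.port = port;
    }

    /** Parses "host", "host:port", "[v6]", "[v6]:port", or a bare IPv6 literal. */
    static HostPort parse(String s) {
        if (s.startsWith("[")) {
            int close = s.indexOf(']');
            if (close < 0) {
                throw new IllegalArgumentException("unterminated '[' in: " + s);
            }
            String host = s.substring(1, close);
            String rest = s.substring(close + 1);
            if (rest.isEmpty()) {
                return new HostPort(host, -1);
            }
            if (!rest.startsWith(":")) {
                throw new IllegalArgumentException("expected ':' after ']' in: " + s);
            }
            return new HostPort(host, Integer.parseInt(rest.substring(1)));
        }
        int first = s.indexOf(':');
        int last = s.lastIndexOf(':');
        if (first < 0 || first != last) {
            // No colon: plain host. Several colons: assumed bare IPv6 literal.
            return new HostPort(s, -1);
        }
        return new HostPort(s.substring(0, last), Integer.parseInt(s.substring(last + 1)));
    }

    /** Formats as host:port, wrapping IPv6 literals in brackets. */
    String format() {
        String h = host.indexOf(':') >= 0 ? "[" + host + "]" : host;
        return port < 0 ? h : h + ":" + port;
    }
}
```

Splitting on the last colon alone would turn `::1` into host `:` and port `1`, which is the kind of corruption the brackets avoid.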

### Check List (For Author)

- Test: BE build and FE build with targeted unit tests
    - Unit Test: ./run-be-ut.sh --run --filter=EsHttpReaderTest.* -j 8
    - Unit Test: cd fe/fe-connector/fe-connector-es && mvn -q -Dtest=EsNodeInfoAndScanRangeTest,EsScanPlanProviderTest test
    - Manual test: ./build.sh --fe
- Behavior changed: Yes (ES connector now preserves IPv6 literals and formats IPv6 host:port values with brackets.)
- Does this need documentation: No

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

### What problem does this PR solve?

Issue Number: None

Related PR: None

Problem Summary: Fix the ES connector explain property key mismatch so EXPLAIN output includes the Elasticsearch index name again.

### Release note

Fix EXPLAIN output for ES scans to display the index name.

### Check List (For Author)

- Test: FE build and targeted unit test
    - Unit Test: cd fe/fe-connector/fe-connector-es && mvn -q -Dtest=EsScanPlanProviderTest test
    - Manual test: ./build.sh --fe
- Behavior changed: Yes (EXPLAIN for ES scans now shows the ES index name.)
- Does this need documentation: No

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

### What problem does this PR solve?

Issue Number: None

Related PR: None

Problem Summary: Remove the short-lived provider-local ES metadata cache, which does not provide real reuse on the current call path and couples query-scoped field context to an index-only cache key.
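
The coupling described above can be shown with a small, purely hypothetical cache; it is not the removed Doris code, and the class and method names are invented for illustration. When query-scoped data is stored under a key that contains only the index name, a later query against the same index receives the earlier query's field context.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical cache, for illustration only; not part of Doris.
final class IndexKeyedCacheSketch {
    // Key is the index name only, although the value depends on the query's columns.
    private final Map<String, List<String>> fieldContextByIndex = new HashMap<>();

    /** Returns the cached field context, computing it from the first caller's columns. */
    List<String> fieldContext(String index, List<String> queryColumns) {
        return fieldContextByIndex.computeIfAbsent(index, k -> new ArrayList<>(queryColumns));
    }
}
```

With this shape, a first call for index `logs` with columns `a, b` followed by a second call with column `c` returns `a, b` both times.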

### Release note

None

### Check List (For Author)

- Test: FE build and targeted unit test
    - Unit Test: cd fe/fe-connector/fe-connector-es && mvn -q -Dtest=EsScanPlanProviderTest test
    - Manual test: ./build.sh --fe
- Behavior changed: No
- Does this need documentation: No

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@morningman
Contributor Author

run buildall

@morningman
Contributor Author

run buildall

@github-actions github-actions Bot added the approved Indicates a PR has been approved by one committer. label Apr 22, 2026
@github-actions
Contributor

PR approved by at least one committer and no changes requested.

@github-actions
Contributor

PR approved by anyone and no changes requested.

@hello-stephen
Contributor

Cloud UT Coverage Report

Increment line coverage 🎉

Increment coverage report
Complete coverage report

| Category | Coverage |
| --- | --- |
| Function Coverage | 78.06% (1843/2361) |
| Line Coverage | 64.78% (33006/50947) |
| Region Coverage | 65.27% (16374/25085) |
| Branch Coverage | 55.87% (8745/15652) |

@hello-stephen
Contributor

FE Regression Coverage Report

Increment line coverage 5.16% (24/465) 🎉
Increment coverage report
Complete coverage report

@hello-stephen
Contributor

BE UT Coverage Report

Increment line coverage 27.57% (51/185) 🎉

Increment coverage report
Complete coverage report

| Category | Coverage |
| --- | --- |
| Function Coverage | 53.31% (20360/38190) |
| Line Coverage | 36.85% (191800/520469) |
| Region Coverage | 33.18% (149186/449656) |
| Branch Coverage | 34.28% (65230/190313) |

@hello-stephen
Contributor

BE Regression && UT Coverage Report

Increment line coverage 90.81% (168/185) 🎉

Increment coverage report
Complete coverage report

| Category | Coverage |
| --- | --- |
| Function Coverage | 73.56% (27521/37415) |
| Line Coverage | 57.28% (297387/519174) |
| Region Coverage | 54.38% (246897/453982) |
| Branch Coverage | 56.00% (106993/191043) |

@morningman morningman merged commit c7ef91f into apache:master Apr 22, 2026
29 of 31 checks passed

Labels

approved Indicates a PR has been approved by one committer. reviewed


5 participants