feat: estimate cardinality for semi and anti-joins using distinct counts #20904
xudong963 merged 7 commits into apache:main
asolimando
left a comment
LGTM, a couple of minor points and a few tests to be added. The only change I'd like to see is bailing out when either side has no stats for a column pair.
Co-authored-by: Alessandro Solimando <alessandro.solimando@gmail.com>
asolimando
left a comment
LGTM, thanks for addressing all my comments fully!
Can we run some benchmarks (e.g. tpch, tpcds, ...) for PRs like these in the future? It would be good to see the impact of individual contributions and possibly catch regressions or other issues.
run benchmarks
Sure, running
run benchmarks tpcds tpch
🤖 Benchmark running (GKE): comparing use-ndv-for-semi-and-anti-join (5978b00) to 7acbe03 (merge-base) using clickbench_partitioned
🤖 Benchmark running (GKE): comparing use-ndv-for-semi-and-anti-join (5978b00) to 7acbe03 (merge-base) using tpcds
🤖 Benchmark running (GKE): comparing use-ndv-for-semi-and-anti-join (5978b00) to 7acbe03 (merge-base) using tpch
run benchmarks tpcds tpch
🤖 Benchmark completed (GKE): clickbench_partitioned, base (merge-base) vs. branch
🤖 Benchmark completed (GKE): tpcds, base (merge-base) vs. branch
Interesting!
The magic of stats!
I didn't find any semi or anti joins in the query. I guess it might be because count(*) gets its result directly from statistics.
Benchmark for this request hit the 7200s job deadline before finishing.
Which issue does this PR close?
Does not close an issue, but is part of #20766.
Rationale for this change
Details are in #20766. The main idea is to use existing distinct-count information to optimize joins, similar to what Spark and Trino do.
What changes are included in this PR?
This PR extends cardinality estimation for semi and anti joins using distinct counts.
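To illustrate the general idea, here is a minimal sketch of distinct-count-based estimation for semi and anti joins. It is not the code in this PR: the `KeyPairNdv` type and the `estimate_semi_rows` / `estimate_anti_rows` functions are hypothetical names, and the formula (per-pair selectivity of `min(outer_ndv, inner_ndv) / outer_ndv`, multiplied across column pairs under an independence assumption) is one common textbook choice. The sketch returns `None` when either side lacks a distinct count for any column pair, in line with the bail-out requested in review.

```rust
/// Hypothetical distinct-count (NDV) statistics for one join column pair.
/// `None` means the statistic is unknown for that side.
#[derive(Clone, Copy, Debug)]
pub struct KeyPairNdv {
    pub outer_ndv: Option<u64>,
    pub inner_ndv: Option<u64>,
}

/// Estimate how many outer rows survive a semi join.
///
/// Assumption: each outer key value is equally likely, and column pairs are
/// independent, so per-pair selectivities are multiplied together.
/// Returns `None` (no estimate) if there are no key pairs or if any pair is
/// missing a distinct count on either side.
pub fn estimate_semi_rows(outer_rows: u64, keys: &[KeyPairNdv]) -> Option<u64> {
    if keys.is_empty() {
        return None;
    }
    let mut selectivity = 1.0_f64;
    for key in keys {
        // Bail out as soon as one side has no statistics for this pair.
        let outer = key.outer_ndv?;
        let inner = key.inner_ndv?;
        let ratio = if outer == 0 {
            // No distinct values on the outer side: nothing can match.
            0.0
        } else {
            // At most min(outer, inner) distinct outer values find a match.
            outer.min(inner) as f64 / outer as f64
        };
        selectivity *= ratio;
    }
    Some((outer_rows as f64 * selectivity).round() as u64)
}

/// Estimate how many outer rows survive an anti join: the complement of the
/// semi-join estimate, clamped at zero.
pub fn estimate_anti_rows(outer_rows: u64, keys: &[KeyPairNdv]) -> Option<u64> {
    estimate_semi_rows(outer_rows, keys).map(|semi| outer_rows.saturating_sub(semi))
}
```

For example, with 1000 outer rows, 100 distinct outer keys, and 25 distinct inner keys, the sketch estimates 250 rows for the semi join and 750 for the anti join. Returning `None` rather than a guess lets the caller fall back to whatever estimate it used before.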
Are these changes tested?
I've added test cases, but I'm not sure whether I should also have added benchmarks for this.
Are there any user-facing changes?
No