Pick some pr to 21 #43010 #43030 #43785 #44779 #44786 #44857 #45129

Merged
7 commits merged into apache:branch-2.1 on Dec 9, 2024

Conversation

seawinde
Contributor

@seawinde seawinde commented Dec 6, 2024

What problem does this PR solve?

cherry pick

55fde45
#43010

6f87e35
#43030

9daa3b7
#43785

5ce7604
#44779

d350e3c
#44786

8ea48af
#44857

Issue Number: close #xxx

Related PR: #xxx

Problem Summary:

Release note

None

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

    • No.
    • Yes.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

seawinde and others added 6 commits December 6, 2024 20:08
…using sync mv (apache#43010)

Root Cause Analysis:
Currently, the statistics reported by BE (Backend) nodes have higher
priority than those from ANALYZE statements. During the first INSERT
INTO operation, the system waits for row count reports from all tablets
before updating the table statistics.
Subsequent INSERT INTO operations cannot obtain the status of all
tablets, so the system continues to use the statistical information from
the first INSERT INTO operation. This leads to a lower estimated cost
for the original table's query plan, resulting in the selection of the
original table's query plan instead of the materialized view.

Conclusion:
The test case should be modified to include a larger dataset in the
first INSERT INTO operation, which will increase the likelihood of
utilizing the materialized view. This is because the cost estimation
will better reflect the actual data distribution and size, leading to
more accurate plan selection.
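
The root cause above can be illustrated with a toy model. This is only a sketch under stated assumptions, not the Doris implementation: the function name `updated_row_count` and the rule "keep the previous value unless every tablet has reported" are hypothetical simplifications of the behavior described in the analysis.

```python
from typing import Dict, Optional


def updated_row_count(previous: int, tablet_reports: Dict[int, Optional[int]]) -> int:
    """Toy model: refresh the table row count only when every tablet reported.

    `tablet_reports` maps a tablet id to its reported row count, or None
    when no report is available for that tablet (hypothetical structure).
    """
    if any(count is None for count in tablet_reports.values()):
        # Incomplete reports: keep using the stale value.
        return previous
    return sum(count for count in tablet_reports.values() if count is not None)


# First INSERT INTO: all tablets report, so the count becomes 2.
first = updated_row_count(0, {1: 1, 2: 1})
# Later INSERT INTO: one tablet has not reported, so the count stays at 2
# even though the table now holds many more rows.
second = updated_row_count(first, {1: 500, 2: None})
```

With a small first load, the stale count stays small, which is why the conclusion recommends a larger dataset in the first INSERT INTO.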
…LESAMPLE or tablet and so on (apache#43030)

Related PR: apache#28064

The materialized view is as follows:

        CREATE MATERIALIZED VIEW mv1
        BUILD IMMEDIATE REFRESH AUTO ON MANUAL
        DISTRIBUTED BY RANDOM BUCKETS 2
        PROPERTIES ('replication_num' = '1')
        AS
        select * from orders

If any of the following queries is run, rewriting by the materialized
view above should fail, to ensure data correctness:

select * from orders TABLET(110);
select * from orders index query_index_test;
select * from orders TABLESAMPLE(20 percent);
select * from orders_partition PARTITION (day_2);

Previously, these queries were rewritten by the materialized view
successfully and returned wrong results. This PR fixes that.
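
The rule described above can be sketched as a simple guard. This is a minimal illustration, assuming a hypothetical `TableScan` structure and a hypothetical `can_rewrite_with_mv` check; the actual check in Doris is implemented in the optimizer and may look quite different.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class TableScan:
    """Hypothetical description of a scan on a base table."""
    table: str
    tablet_ids: List[int] = field(default_factory=list)   # TABLET(...)
    partitions: List[str] = field(default_factory=list)   # PARTITION (...)
    index_name: Optional[str] = None                       # index <name>
    sample_percent: Optional[float] = None                 # TABLESAMPLE(...)


def can_rewrite_with_mv(scan: TableScan) -> bool:
    """Allow MV rewrite only when the scan reads the whole table.

    A scan restricted to tablets, partitions, an index, or a sample reads
    a subset of the data, so answering it from an MV built on the full
    table could return wrong results.
    """
    restricted = (
        bool(scan.tablet_ids)
        or bool(scan.partitions)
        or scan.index_name is not None
        or scan.sample_percent is not None
    )
    return not restricted


plain = can_rewrite_with_mv(TableScan("orders"))
by_tablet = can_rewrite_with_mv(TableScan("orders", tablet_ids=[110]))
sampled = can_rewrite_with_mv(TableScan("orders", sample_percent=20.0))
```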
…make sure rewrite result stable (apache#43785)

Whether the CBO optimizer successfully rewrites a query depends on the
statistics.
The optimizer consumes statistics in the following descending order of
priority:
1. the injected statistics
2. the statistics reported by BE
3. the statistics collected by actively running ANALYZE

When the pipeline runs the case, the statistics reported by BE may not
arrive in time, so whether the CBO optimizer chooses the rewritten plan
is not deterministic. The test cases therefore inject the statistics
manually.
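
The priority order above can be sketched as a first-available lookup. This is an illustrative sketch only: `effective_row_count` and its parameters are hypothetical names, and the real optimizer works with richer statistics than a single row count.

```python
from typing import Optional


def effective_row_count(
    injected: Optional[int],
    be_reported: Optional[int],
    analyzed: Optional[int],
) -> Optional[int]:
    """Return the first available row count, in descending priority.

    Priority (as described in the commit message): injected statistics,
    then statistics reported by BE, then statistics from ANALYZE.
    """
    for candidate in (injected, be_reported, analyzed):
        if candidate is not None:
            return candidate
    return None


# Injected statistics win even when BE has already reported a value.
with_injection = effective_row_count(injected=1000, be_reported=2, analyzed=10)
# Without injection, the result depends on whether the BE report has arrived.
be_arrived = effective_row_count(injected=None, be_reported=2, analyzed=10)
be_late = effective_row_count(injected=None, be_reported=None, analyzed=10)
```

Because the second and third calls differ only in report timing, injecting statistics removes that source of instability from the test.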
… condition has alias (apache#44779)

Related PR: apache#27922

Problem Summary:
The query and the MV definition are as follows. `partsupp.public_col as
public_col` is an alias, which caused rewriting by the materialized view
to fail with the message: the graph logic between query and view is
different.

      select
      o_custkey,
      o_orderdate,
      o_shippriority,
      o_comment,
      o_orderkey,
      orders.public_col as col1,
      l_orderkey,
      l_partkey,
      l_suppkey,
      lineitem.public_col as col2,
      ps_partkey,
      ps_suppkey,
      partsupp.public_col as col3,
      partsupp.public_col * 2 as col4,
      o_orderkey + l_orderkey + ps_partkey * 2,
      sum(
        o_orderkey + l_orderkey + ps_partkey * 2
      ),
      count() as count_all
    from
      (
        select
          o_custkey,
          o_orderdate,
          o_shippriority,
          o_comment,
          o_orderkey,
          orders.public_col as public_col
        from
          orders
      ) orders
      left join (
        select
          l_orderkey,
          l_partkey,
          l_suppkey,
          lineitem.public_col as public_col
        from
          lineitem
        where
          lineitem.public_col is null
          or lineitem.public_col <> 1
      ) lineitem on l_orderkey = o_orderkey
      inner join (
        select
          ps_partkey,
          ps_suppkey,
          partsupp.public_col as public_col
        from
          partsupp
      ) partsupp on ps_partkey = o_orderkey
    where
      lineitem.public_col is null
      or lineitem.public_col <> 1
      and o_orderkey = 2
    group by
      1,
      2,
      3,
      4,
      5,
      6,
      7,
      8,
      9,
      10,
      11,
      12,
      13,
      14;

Fix the materialized view rewrite failure when a filter or join
condition contains an alias.
…e when collect table of mtmv (apache#44786)

Optimize plan generation when creating an MTMV, and use the MTMV cache
when collecting the tables of an MTMV:
1. Reuse plans when creating materialized views to minimize plan
generation overhead.
2. During recursive base table resolution for MTMVs, prioritize MTMV
cache lookup. Fall back to real-time generation only when a cache miss
occurs.
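
The cache-first resolution in point 2 can be sketched as follows. This is a hedged illustration, not the Doris code: `collect_base_tables`, `is_mtmv`, `cache`, and `generate` are hypothetical names, and the sketch deliberately does not model how or when the real cache is populated.

```python
from typing import Callable, Dict, List, Set


def collect_base_tables(
    name: str,
    is_mtmv: Callable[[str], bool],
    cache: Dict[str, Set[str]],
    generate: Callable[[str], Set[str]],
) -> Set[str]:
    """Recursively resolve the base tables behind an MTMV.

    `cache` maps an MTMV name to the tables it references directly.
    `generate` stands in for the expensive real-time plan generation and
    is called only on a cache miss.
    """
    referenced = cache.get(name)
    if referenced is None:
        referenced = generate(name)  # fallback: cache miss
    base: Set[str] = set()
    for table in referenced:
        if is_mtmv(table):
            base |= collect_base_tables(table, is_mtmv, cache, generate)
        else:
            base.add(table)
    return base


# Hypothetical catalog: mv2 is built on mv1 and orders; mv1 on lineitem.
definitions = {"mv2": {"mv1", "orders"}, "mv1": {"lineitem"}}
generated: List[str] = []


def generate_from_plan(name: str) -> Set[str]:
    generated.append(name)  # record each expensive generation
    return definitions[name]


cache = {"mv1": {"lineitem"}}  # only mv1 is cached
tables = collect_base_tables(
    "mv2", lambda t: t in definitions, cache, generate_from_plan
)
```

In this run only `mv2` triggers real-time generation; `mv1` is served from the cache.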
…te (apache#44857)

### What problem does this PR solve?

Related PR: apache#33988

Problem Summary:

If the MV definition contains a CTE and the MV is partitioned, such as the following:

    CREATE MATERIALIZED VIEW mv_name
        BUILD IMMEDIATE REFRESH COMPLETE ON MANUAL
        PARTITION BY (l_shipdate)
        DISTRIBUTED BY RANDOM BUCKETS 2
        PROPERTIES ('replication_num' = '1') 
        AS
    WITH scan_data_cte as (
        select t1.l_shipdate, t1.L_LINENUMBER, orders.O_CUSTKEY, l_suppkey
        from (select * from lineitem where L_LINENUMBER > 1) t1
        left join orders on t1.L_ORDERKEY = orders.O_ORDERKEY
    )
    SELECT *  FROM scan_data_cte; 

Running the following refresh command would fail with the exception `no
partition for this tuple`. This PR fixes that.

refresh materialized view mv_name partitions(p_20231210_20231211);

### Release note

Fix materialized view refresh failure when the MV definition contains a CTE
@doris-robot

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@seawinde
Contributor Author

seawinde commented Dec 6, 2024

run buildall

@seawinde
Contributor Author

seawinde commented Dec 6, 2024

run buildall

@yiguolei yiguolei merged commit 1662e47 into apache:branch-2.1 Dec 9, 2024
18 of 19 checks passed
3 participants