Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ParquetScan with filter takes too much time to process #13298

Open
Lordworms opened this issue Nov 7, 2024 · 2 comments
Open

ParquetScan with filter takes too much time to process #13298

Lordworms opened this issue Nov 7, 2024 · 2 comments
Labels
bug Something isn't working

Comments

@Lordworms
Copy link
Contributor

Describe the bug

I was doing #13054 and after rebase main I find out that the current parquet predicate pushdown may have some problem, it is 7 times slower than without a predicate
image

To Reproduce

the script to generate the parquet file

import pandas as pd
import numpy as np

part = pd.DataFrame({
    'p_partkey': np.random.randint(10, 21, size=1000),
    'p_brand': np.random.choice(['Brand#1', 'Brand#2', 'Brand#3'], 1000),
    'p_container': np.random.choice(['SM BOX', 'LG BOX', 'MED BOX'], 1000)
})

lineitem = pd.DataFrame({
    'l_partkey': np.random.randint(1, 1000001, size=100000000),
    'l_quantity': np.random.uniform(1, 50, size=100000000),
    'l_extendedprice': np.random.uniform(100, 10000, size=100000000)
})

part.to_parquet('/Users/yxiang1/Personal/code/datafusion/.vscode/part.parquet', index=False)
lineitem.to_parquet('/Users/yxiang1/Personal/code/datafusion/.vscode/lineitem.parquet', index=False)

The results

> CREATE EXTERNAL TABLE lineitem (
    l_partkey BIGINT,
    l_quantity DOUBLE,
    l_extendedprice DOUBLE
)
STORED AS PARQUET
LOCATION '/Users/yxiang1/Personal/code/datafusion/.vscode/lineitem.parquet';
0 row(s) fetched. 
Elapsed 0.016 seconds.

> set datafusion.execution.target_partitions = 2;
0 row(s) fetched. 
Elapsed 0.002 seconds.

> select count(*) from linitem where l_partkey >= 10 and l_partkey <= 20;
Error during planning: table 'datafusion.public.linitem' not found
> select count(*) from lineitem where l_partkey >= 10 and l_partkey <= 20;
+----------+
| count(*) |
+----------+
| 1093     |
+----------+
1 row(s) fetched. 
Elapsed 7.037 seconds.

> select count(*) from lineitem;
+-----------+
| count(*)  |
+-----------+
| 100000000 |
+-----------+
1 row(s) fetched. 
Elapsed 1.022 seconds.

> set datafusion.execution.parquet.pushdown_filters = true;
0 row(s) fetched. 
Elapsed 0.002 seconds.

> select count(*) from lineitem;
+-----------+
| count(*)  |
+-----------+
| 100000000 |
+-----------+
1 row(s) fetched. 
Elapsed 1.029 seconds.

> select count(*) from lineitem where l_partkey >= 10 and l_partkey <= 20;
+----------+
| count(*) |
+----------+
| 1093     |
+----------+
1 row(s) fetched. 
Elapsed 6.748 seconds.

Expected behavior

No response

Additional context

No response

@Dandandan
Copy link
Contributor

Dandandan commented Nov 7, 2024

It's a known issue that filter pushdown often slows down parquet scan at the moment, see for example:

apache/arrow-rs#5523
#3463

@alamb
Copy link
Contributor

alamb commented Jan 24, 2025

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

3 participants