Make blockSize configurable for Symlink Tables Code Path #23608

Open
agrawalreetika opened this issue Sep 9, 2024 · 0 comments · May be fixed by #23635

Expected Behavior or Use Case

Split generation for symlink tables is handled via the Hadoop library here.

Currently, the S3 default block size is 32MB (here), and it is not configurable. For the HDFS file system, the default block size is 128MB, as mentioned here.

In FileInputFormat, the split size is calculated from these values via computeSplitSize(goalSize, minSize, blockSize), so it would be better to make this split size configurable from Presto as well.
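For reference, the old mapred FileInputFormat computes the split size roughly as follows (a paraphrase of the Hadoop source, not a verbatim copy). With the defaults, the block size effectively caps the split size, which is why raising the minimum split size is the lever proposed here:

    // Paraphrase of org.apache.hadoop.mapred.FileInputFormat#computeSplitSize.
    // The split is capped at the block size unless minSize pushes it higher, so
    // raising the minimum split size (SPLIT_MINSIZE) is what lets splits grow
    // past the 32MB S3 / 128MB HDFS defaults.
    protected long computeSplitSize(long goalSize, long minSize, long blockSize)
    {
        return Math.max(minSize, Math.min(goalSize, blockSize));
    }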


Presto Component, Service, or Connector

presto-hive

Possible Implementation

Set org.apache.hadoop.mapreduce.lib.input.FileInputFormat.SPLIT_MINSIZE to the Presto property value getMaxSplitSize(session).toBytes() in the symlink table configuration block:

            // Propagate Presto's max split size so FileInputFormat can produce
            // splits larger than the filesystem's default block size.
            Configuration configuration = targetFilesystem.getConf();
            configuration.set(SPLIT_MINSIZE, Long.toString(getMaxSplitSize(session).toBytes()));
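As a rough illustration, the change could be wrapped as below, assuming the mapred InputFormat API that the symlink code path already uses; the helper name, its parameters, and the numSplits hint of 0 are illustrative and not the actual Presto code:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.mapred.InputFormat;
    import org.apache.hadoop.mapred.InputSplit;
    import org.apache.hadoop.mapred.JobConf;

    import java.io.IOException;

    import static org.apache.hadoop.mapreduce.lib.input.FileInputFormat.SPLIT_MINSIZE;

    final class SymlinkSplitSizeSketch
    {
        private SymlinkSplitSizeSketch() {}

        // Hypothetical helper: maxSplitSizeBytes would come from
        // getMaxSplitSize(session).toBytes() in the caller.
        static InputSplit[] getTargetSplits(
                FileSystem targetFilesystem,
                InputFormat<?, ?> targetInputFormat,
                long maxSplitSizeBytes)
                throws IOException
        {
            // Raise the minimum split size so computeSplitSize() is no longer
            // capped at the filesystem's default block size.
            Configuration configuration = new Configuration(targetFilesystem.getConf());
            configuration.set(SPLIT_MINSIZE, Long.toString(maxSplitSizeBytes));

            // Generate splits for the symlink target files with the updated
            // configuration; numSplits = 0 leaves the count to the InputFormat.
            JobConf targetJob = new JobConf(configuration);
            return targetInputFormat.getSplits(targetJob, 0);
        }
    }

Copying the Configuration before mutating it (rather than setting the property on targetFilesystem.getConf() directly, as in the snippet above) keeps the override from leaking into other callers that share the filesystem's configuration; either approach sets the same SPLIT_MINSIZE key.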

Example Screenshots (if appropriate):

Adding sample results from TPC-DS queries with sf1k data on S3. For S3, the default block size is 32MB (here).

Here the Base run uses the S3 default block size of 32MB.
The Target run uses a block size of 256MB.

(screenshot: TPC-DS query time comparison between the Base and Target runs)

