
Read Presto Data into Spark Dataframe #23673

Open
thirumalairajr opened this issue Sep 18, 2024 · 2 comments

@thirumalairajr

Hi Team,

I have a requirement to read data from a Presto query, load it into a Spark DataFrame, and do further processing on it in Spark.

  • The Presto JDBC driver might not work for me, because the amount of data read can sometimes be large.
  • I am thinking of using the Presto on Spark APIs to read the data into a DataFrame rather than just returning the results. Is this already supported?
@singcha
Contributor

singcha commented Sep 18, 2024

Can you share more on this? The part I am confused about is why Presto is needed to read the data, rather than just using a Spark DataFrame or Spark SQL.

This is not supported currently.

Depending on how large the data is and where the driver is running, you can take different approaches:

If the data fits on the driver node, you can call the IPrestoSparkQueryExecution.execute() method, which returns the result of the Presto query, and then load that result into a DataFrame for further processing.

If the data cannot fit on the driver:

  1. You can use INSERT queries to write the results to an intermediate location and load them back into an RDD, or
  2. You will need to modify the implementations of the IPrestoSparkQueryExecution.execute() method so that the query result is not fetched back onto the driver. Instead, load the RDD into a DataFrame using SparkSession.createDataFrame() when executing the final fragment. See the code in PrestoSparkAdaptiveQueryExecution.executeFinalFragment(), which would need to be modified.

@thirumalairajr
Author

@singcha Thanks for your quick response.

Can you share more on this? The part I am confused about is why Presto is needed to read the data, rather than just using a Spark DataFrame or Spark SQL.

We have many Presto views stored in the Hive Metastore, and there are requirements to build Spark pipelines that read data from these Presto views.

Some of the Presto views are large and might not fit in the driver memory.

If the data cannot fit on the driver:

  1. You can use INSERT queries to write the results to an intermediate location and load them back into an RDD, or
  2. You will need to modify the implementations of the IPrestoSparkQueryExecution.execute() method so that the query result is not fetched back onto the driver. Instead, load the RDD into a DataFrame using SparkSession.createDataFrame() when executing the final fragment. See the code in PrestoSparkAdaptiveQueryExecution.executeFinalFragment(), which would need to be modified.

I think option 2 makes more sense, because it avoids the overhead of an intermediate storage location and the reads can happen within the same Spark session.

We could write a new method, IPrestoSparkQueryExecution.read(), which would return a Spark DataFrame by following the approach you suggested.

I can contribute to this feature.
