Databricks profiling report while using ydata-profiling #1605

Open
3 tasks done
Fgoudarzi opened this issue Jun 13, 2024 · 3 comments
Labels
information requested ❔ Cannot reproduce, waiting for minimum reproduction details.

Comments

@Fgoudarzi

Current Behaviour

I'm creating a very simple Spark DataFrame with only one column. ProfileReport does not generate the report when I run it in a Databricks notebook.
Below is the code that I'm using:

from ydata_profiling import ProfileReport
from pyspark.sql import Row
df1 = spark.createDataFrame([
    Row(name='Ali'),
    Row(name='John'),
    Row(name='Sara'),
    Row(name='John')
])
p2 = ProfileReport(df1)
p2

S1

But if I convert the DataFrame to pandas, the report is generated:
S2
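
The pandas workaround described above can be sketched roughly as follows. The helper names `as_pandas` and `profile_via_pandas` are hypothetical and used only for illustration; `toPandas()` is the standard PySpark method, and it collects the whole DataFrame onto the driver, so this is only reasonable for small data such as the four-row example here.

```python
def as_pandas(df):
    """Return a pandas DataFrame, collecting a Spark DataFrame to the driver if needed."""
    # Spark DataFrames expose toPandas(); anything without it is passed through unchanged.
    to_pandas = getattr(df, "toPandas", None)
    return to_pandas() if callable(to_pandas) else df


def profile_via_pandas(df, title="Profiling Report"):
    """Build a ProfileReport from the pandas version of df (hypothetical helper)."""
    # Imported lazily so as_pandas can be used without ydata-profiling installed.
    from ydata_profiling import ProfileReport

    return ProfileReport(as_pandas(df), title=title)
```

In the notebook this would be called as `profile_via_pandas(df1)` in place of `ProfileReport(df1)`.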

Expected Behaviour

Generate the report for the Spark DataFrame, as it does when I convert the DataFrame to pandas.

Data Description

Generated in the code.

Code that reproduces the bug

from ydata_profiling import ProfileReport
from pyspark.sql import Row
df1 = spark.createDataFrame([
    Row(name='Ali'),
    Row(name='John'),
    Row(name='Sara'),
    Row(name='John')
])
p2 = ProfileReport(df1)
p2

pandas-profiling version

ydata_profiling = 4.8.3

Dependencies

ydata_profiling = 4.8.3
numpy = 1.24.4

OS

Windows 11 Enterprise

Checklist

  • There is not yet another bug report for this issue in the issue tracker
  • The problem is reproducible from this bug report. This guide can help to craft a minimal bug report.
  • The issue has not been resolved by the entries listed under Common Issues.
@shawn-eary

shawn-eary commented Jun 14, 2024

I'm also getting the behavior described above in Databricks, using numpy 1.23.5 and ydata_profiling 4.5.1.

I'm using a Personal Compute cluster with 15.2 ML Runtime, 28 GB Memory and 8 Active Cores at 1.5 DBU / h.

@shawn-eary

shawn-eary commented Jun 14, 2024

For thoroughness, I also ran a few tests on Azure Synapse Analytics (ASA), without Databricks.

If I run this code in ASA:

from ydata_profiling import ProfileReport
from pyspark.sql import Row
df1 = spark.createDataFrame([
    Row(c1='Ali',c2='Brown'),
    Row(c1='John',c2='Brown'),
    Row(c1='Sara',c2='Brown')
])
p2 = ProfileReport(df1)
p2

I get the error:
Py4JJavaError: An error occurred while calling z:org.apache.spark.ml.stat.Correlation.corr.
: java.lang.RuntimeException: Cannot determine the number of cols because it is not specified in the constructor and the rows RDD is empty.

But if I simply add a numeric column at the end (per a suggestion from the author of this anomaly report):

from ydata_profiling import ProfileReport
from pyspark.sql import Row
df1 = spark.createDataFrame([
    Row(c1='Ali',c2='Brown',c3=1),
    Row(c1='John',c2='Brown',c3=2),
    Row(c1='Sara',c2='Brown',c3=3)
])
p2 = ProfileReport(df1)
p2

It runs fine...
image

I talked to the author of this anomaly report and, as I understood her, ProfileReport will probably fail when all of the columns passed to spark.createDataFrame are strings.
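
If the all-strings hypothesis holds, a quick pre-check on the schema could flag the risky case before calling ProfileReport. This is a minimal sketch under that assumption: `numeric_columns` is a hypothetical helper, and it expects the list of `(name, type_string)` pairs that PySpark's `df.dtypes` returns (for example `('c3', 'bigint')` for a Python int column).

```python
# Spark SQL type strings treated as numeric; decimals carry precision, e.g. "decimal(10,2)".
NUMERIC_TYPES = {"tinyint", "smallint", "int", "bigint", "float", "double"}


def is_numeric(type_name):
    """True if a Spark type string (as found in df.dtypes) names a numeric type."""
    return type_name in NUMERIC_TYPES or type_name.startswith("decimal")


def numeric_columns(dtypes):
    """Return the names of numeric columns from a list of (name, type_string) pairs."""
    return [name for name, type_name in dtypes if is_numeric(type_name)]
```

For the failing example, `numeric_columns(df1.dtypes)` would be an empty list, while the version with `c3` added would give `['c3']`.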

This behavior seems to be happening in both Azure Databricks and ASA Spark.
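
Since the Py4JJavaError above is raised from Correlation.corr, one thing that may be worth trying is switching the correlation calculations off through the `correlations` argument of ProfileReport. This is a sketch only: the correlation names below are the ones I know from the ydata-profiling configuration, the helper names are hypothetical, and I have not verified that this avoids the error on Databricks or ASA.

```python
# Correlation measures configurable in ydata-profiling.
CORRELATION_NAMES = ("auto", "pearson", "spearman", "kendall", "phi_k", "cramers")


def correlations_disabled():
    """Build a correlations setting that turns every listed measure off."""
    return {name: {"calculate": False} for name in CORRELATION_NAMES}


def profile_without_correlations(df):
    """Create a ProfileReport with correlations switched off (untested assumption)."""
    # Imported lazily so the config helper works without ydata-profiling installed.
    from ydata_profiling import ProfileReport

    return ProfileReport(df, correlations=correlations_disabled())
```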

Spark Dependencies
ydata_profiling 4.8.3
numpy 1.23.5

Spark Pool Settings:

image

@fabclmnt
Contributor

fabclmnt commented Jul 9, 2024

Hi @Fgoudarzi ,

thank you for your request. Have you tried to generate the report while following this tutorial? https://www.databricks.com/blog/2023/04/03/pandas-profiling-now-supports-apache-spark.html
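
For reference, one common way to render a report in a Databricks notebook is to export it to HTML and hand that to the notebook's `displayHTML` function, rather than leaving the report object as the last expression in the cell. The sketch below assumes that pattern; `show_report` is a hypothetical helper, `to_html()` is the ProfileReport export method, and whether this resolves the behaviour reported here is not confirmed in this thread.

```python
def show_report(report, display):
    """Render a profiling report by passing its HTML to a notebook display function."""
    html = report.to_html()  # export the report as a single HTML string
    display(html)            # in Databricks, pass the built-in displayHTML here
    return html
```

In a Databricks cell this would be called as `show_report(ProfileReport(df1), displayHTML)`.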

@fabclmnt fabclmnt changed the title from "Bug Report" to "Databricks profiling report while using ydata-profiling" Jul 9, 2024
@fabclmnt fabclmnt added information requested ❔ Cannot reproduce, waiting for minimum reproduction details. and removed needs-triage labels Jul 9, 2024