Databricks profiling report while using ydata-profiling #1605

Fgoudarzi · 2024-06-13T18:43:54Z

Current Behaviour

I'm creating a very simple Spark DataFrame with only one column. ProfileReport does not generate the report when I run it in a Databricks notebook.
Below is the code that I'm using:

from ydata_profiling import ProfileReport
from pyspark.sql import Row
df1 = spark.createDataFrame([
    Row(name='Ali'),
    Row(name='John'),
    Row(name='Sara'),
    Row(name='John')
])
p2 = ProfileReport(df1)
p2

But if I convert the DataFrame to pandas, the report is generated.
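The pandas conversion mentioned here can be sketched roughly as follows. This is only an illustration: `profile_via_pandas` and its `max_rows` cap are hypothetical names introduced for this example, not part of ydata-profiling. It relies on the standard PySpark `DataFrame.limit` and `DataFrame.toPandas` methods, and it collects rows to the driver, so it only suits data that fits in memory.

```python
def profile_via_pandas(spark_df, make_report, max_rows=100_000):
    """Collect at most max_rows rows to the driver and profile them as pandas.

    Hypothetical helper for illustration only.
    spark_df    -- a PySpark DataFrame (anything with limit() and toPandas()).
    make_report -- a callable that builds the report; in a notebook this
                   would be ydata_profiling.ProfileReport.
    """
    # limit() caps the number of rows pulled to the driver.
    pandas_df = spark_df.limit(max_rows).toPandas()
    # The report is built from the pandas copy, not from the Spark DataFrame.
    return make_report(pandas_df)


# In a notebook (assumes `spark` and the df1 from the snippet above):
# from ydata_profiling import ProfileReport
# report = profile_via_pandas(df1, ProfileReport)
```

Passing the report constructor in as an argument keeps the helper free of any import-time dependency, which also makes it easy to try with a stand-in object.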

Expected Behaviour

Generate the report as it does when I convert the Spark DataFrame to pandas.

Data Description

Generated in the code.

Code that reproduces the bug

from ydata_profiling import ProfileReport
from pyspark.sql import Row
df1 = spark.createDataFrame([
    Row(name='Ali'),
    Row(name='John'),
    Row(name='Sara'),
    Row(name='John')
])
p2 = ProfileReport(df1)
p2

pandas-profiling version

ydata_profiling = 4.8.3

Dependencies

ydata_profiling = 4.8.3
numpy = 1.24.4

OS

Windows 11 Enterprise

Checklist

There is not yet another bug report for this issue in the issue tracker
The problem is reproducible from this bug report. This guide can help to craft a minimal bug report.
The issue has not been resolved by the entries listed under Common Issues.


shawn-eary · 2024-06-14T16:33:04Z

I'm also getting the behavior described above in Databricks, using numpy 1.23.5 and ydata_profiling 4.5.1.

I'm using a Personal Compute cluster with 15.2 ML Runtime, 28 GB Memory and 8 Active Cores at 1.5 DBU / h.

shawn-eary · 2024-06-14T20:39:24Z

For thoroughness, I also ran a few tests on Azure Synapse Analytics (ASA) [without Databricks].

If I run this code in ASA:

from ydata_profiling import ProfileReport
from pyspark.sql import Row
df1 = spark.createDataFrame([
    Row(c1='Ali',c2='Brown'),
    Row(c1='John',c2='Brown'),
    Row(c1='Sara',c2='Brown')
])
p2 = ProfileReport(df1)
p2

I get the error:
Py4JJavaError: An error occurred while calling z:org.apache.spark.ml.stat.Correlation.corr.
: java.lang.RuntimeException: Cannot determine the number of cols because it is not specified in the constructor and the rows RDD is empty.

But if I simply add a numeric column at the end (per a suggestion from the anomaly author):

from ydata_profiling import ProfileReport
from pyspark.sql import Row
df1 = spark.createDataFrame([
    Row(c1='Ali',c2='Brown',c3=1),
    Row(c1='John',c2='Brown',c3=2),
    Row(c1='Sara',c2='Brown',c3=3)
])
p2 = ProfileReport(df1)
p2

It runs fine...

I talked to the author of this anomaly report and understood her to say that ProfileReport will probably fail when all of the spark.createDataFrame columns are strings.

This behavior seems to be happening in both Azure Databricks and ASA Spark.
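One way to check for the all-string condition before profiling is to inspect the schema first. The sketch below is an assumption-laden illustration, not something ydata-profiling provides: `numeric_columns` and `has_numeric_column` are hypothetical helpers. They take a list of `(name, type_string)` pairs, which is the shape PySpark returns from `DataFrame.dtypes`.

```python
# Spark SQL type names treated as numeric in this sketch.
NUMERIC_TYPES = {"tinyint", "smallint", "int", "bigint", "float", "double", "decimal"}


def numeric_columns(dtypes):
    """Return the names of columns whose Spark type string is numeric.

    dtypes -- list of (column_name, type_string) pairs, e.g. df.dtypes.
    """
    result = []
    for name, type_string in dtypes:
        # Drop parameters such as the "(10,2)" in "decimal(10,2)".
        base_type = type_string.split("(")[0].strip().lower()
        if base_type in NUMERIC_TYPES:
            result.append(name)
    return result


def has_numeric_column(dtypes):
    """True if at least one column is numeric."""
    return bool(numeric_columns(dtypes))


# In a notebook (assumes the df1 from the snippets above):
# if not has_numeric_column(df1.dtypes):
#     print("No numeric columns; the Spark profile may fail as described above.")
```

Matching on the exact base type name, rather than on a prefix, avoids misclassifying types such as `interval`, which starts with `int`.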

Spark Dependencies
ydata_profiling 4.8.3
numpy 1.23.5

Spark Pool Settings:

fabclmnt · 2024-07-09T20:52:43Z

Hi @Fgoudarzi,

Thank you for your request. Have you tried generating the report by following this tutorial? https://www.databricks.com/blog/2023/04/03/pandas-profiling-now-supports-apache-spark.html
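As a rough sketch of what explicit rendering might look like in a Databricks notebook: a bare `p2` on the last line of a cell may not display anything there, so the report can be converted to HTML and handed to the notebook's display function instead. `show_report` is a hypothetical helper for illustration. `ProfileReport.to_html()` is part of ydata-profiling and `displayHTML` is a Databricks notebook built-in, but whether this resolves the behaviour reported in this issue is an assumption.

```python
def show_report(report, display_html):
    """Render a profiling report as HTML and pass it to a display function.

    Hypothetical helper for illustration only.
    report       -- an object with a to_html() method, such as a ProfileReport.
    display_html -- a callable taking an HTML string; in Databricks this
                    would be the built-in displayHTML.
    """
    html = report.to_html()
    display_html(html)
    return html


# In a Databricks notebook (assumes the df1 from the snippets above):
# from ydata_profiling import ProfileReport
# show_report(ProfileReport(df1), displayHTML)
```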

azory-ydata added the needs-triage label Jun 13, 2024

fabclmnt changed the title ~~Bug Report~~ Databricks profiling report while using ydata-profiling Jul 9, 2024

fabclmnt added information requested ❔ Cannot reproduce, waiting for minimum reproduction details. and removed needs-triage labels Jul 9, 2024
