
KeyError 'tinyint' during profiling on Apache Spark DataFrame #1374

Open
talgatomarov opened this issue Jun 30, 2023 · 2 comments
Labels
help wanted 🙋 (Contributions are welcome!) · spark ⚡ (PySpark features!)

Comments

@talgatomarov

Current Behaviour

I encountered an error while attempting to run profiling on an Apache Spark DataFrame. The Spark DataFrame contains data retrieved from parquet files. The specific error message I received is as follows:

Traceback (most recent call last):
  File "/tmp/profile.py", line 41, in <module>
    profile.to_html()
  File "/home/spark/.local/lib/python3.7/site-packages/typeguard/__init__.py", line 1033, in wrapper
    retval = func(*args, **kwargs)
  File "/home/spark/.local/lib/python3.7/site-packages/ydata_profiling/profile_report.py", line 468, in to_html
    return self.html
  File "/home/spark/.local/lib/python3.7/site-packages/typeguard/__init__.py", line 1033, in wrapper
    retval = func(*args, **kwargs)
  File "/home/spark/.local/lib/python3.7/site-packages/ydata_profiling/profile_report.py", line 275, in html
    self._html = self._render_html()
  File "/home/spark/.local/lib/python3.7/site-packages/typeguard/__init__.py", line 1033, in wrapper
    retval = func(*args, **kwargs)
  File "/home/spark/.local/lib/python3.7/site-packages/ydata_profiling/profile_report.py", line 383, in _render_html
    report = self.report
  File "/home/spark/.local/lib/python3.7/site-packages/typeguard/__init__.py", line 1033, in wrapper
    retval = func(*args, **kwargs)
  File "/home/spark/.local/lib/python3.7/site-packages/ydata_profiling/profile_report.py", line 269, in report
    self._report = get_report_structure(self.config, self.description_set)
  File "/home/spark/.local/lib/python3.7/site-packages/typeguard/__init__.py", line 1033, in wrapper
    retval = func(*args, **kwargs)
  File "/home/spark/.local/lib/python3.7/site-packages/ydata_profiling/profile_report.py", line 256, in description_set
    self._sample,
  File "/home/spark/.local/lib/python3.7/site-packages/ydata_profiling/model/describe.py", line 73, in describe
    config, df, summarizer, typeset, pbar
  File "/home/spark/.local/lib/python3.7/site-packages/multimethod/__init__.py", line 315, in __call__
    return func(*args, **kwargs)
  File "/home/spark/.local/lib/python3.7/site-packages/ydata_profiling/model/spark/summary_spark.py", line 93, in spark_get_series_descriptions
    executor.imap_unordered(multiprocess_1d, args)
  File "/usr/lib64/python3.7/multiprocessing/pool.py", line 748, in next
    raise value
  File "/usr/lib64/python3.7/multiprocessing/pool.py", line 121, in worker
    result = (True, func(*args, **kwds))
  File "/home/spark/.local/lib/python3.7/site-packages/ydata_profiling/model/spark/summary_spark.py", line 88, in multiprocess_1d
    return column, describe_1d(config, df.select(column), summarizer, typeset)
  File "/home/spark/.local/lib/python3.7/site-packages/multimethod/__init__.py", line 315, in __call__
    return func(*args, **kwargs)
  File "/home/spark/.local/lib/python3.7/site-packages/ydata_profiling/model/spark/summary_spark.py", line 62, in spark_describe_1d
    }[dtype]
KeyError: 'tinyint'

I believe the issue can be resolved by including data types such as "tinyint" and "smallint" in summary_spark.py.
Do you think this is the right approach? If so, I could try submitting a PR.

vtype = {
    "float": "Numeric",
    "int": "Numeric",
    "bigint": "Numeric",
    "double": "Numeric",
    "string": "Categorical",
    "ArrayType": "Categorical",
    "boolean": "Boolean",
    "date": "DateTime",
    "timestamp": "DateTime",
}[dtype]
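
For reference, a sketch of what the extended mapping could look like (the "tinyint" and "smallint" entries are the proposed additions; the exact set of Spark type strings to cover would need to be confirmed against what df.dtypes actually reports):

# Sketch of the proposed mapping in spark_describe_1d (summary_spark.py).
# The "tinyint" and "smallint" entries are the suggested additions.
vtype = {
    "tinyint": "Numeric",   # proposed: 8-bit integers (ByteType)
    "smallint": "Numeric",  # proposed: 16-bit integers (ShortType)
    "int": "Numeric",
    "bigint": "Numeric",
    "float": "Numeric",
    "double": "Numeric",
    "string": "Categorical",
    "ArrayType": "Categorical",
    "boolean": "Boolean",
    "date": "DateTime",
    "timestamp": "DateTime",
}[dtype]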

Expected Behaviour

Profiling runs

Data Description

Private dataset

Code that reproduces the bug

from ydata_profiling import ProfileReport

df = ...

profile = ProfileReport(
    df,
    title='Title',
    infer_dtypes=False,
    interactions=None,
    missing_diagrams=None,
    correlations={'auto': {'calculate': False},
                  'pearson': {'calculate': True},
                  'spearman': {'calculate': True}},
    )
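
The dataset itself is private; purely as an illustration (not part of the original report), a synthetic Spark DataFrame with a ByteType column is reported by Spark with the dtype string 'tinyint' and should hit the same lookup:

# Hypothetical stand-in for the private dataset: a ByteType column shows up
# in df.dtypes as 'tinyint', the key missing from the profiler's mapping.
from pyspark.sql import SparkSession
from pyspark.sql.types import ByteType, StructField, StructType

spark = SparkSession.builder.getOrCreate()
schema = StructType([StructField("small_value", ByteType(), True)])
df = spark.createDataFrame([(1,), (2,), (3,)], schema=schema)
print(df.dtypes)  # expected: [('small_value', 'tinyint')]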

pandas-profiling version

v4.3.1

Dependencies

...

OS

Spark cluster

Checklist

  • There is not yet another bug report for this issue in the issue tracker
  • The problem is reproducible from this bug report. This guide can help to craft a minimal bug report.
  • The issue has not been resolved by the entries listed under Common Issues.
@fabclmnt added the help wanted 🙋 and spark ⚡ labels and removed the needs-triage label on Jul 6, 2023
@oguzhangur96

@fabclmnt I have faced this bug while working on timestamp_ntz data types. If @talgatomarov is uninterested, I can attempt to resolve it for various data types.

@hb0313 (Contributor) commented Feb 13, 2024

Since no workaround has been mentioned yet, here is what worked for me. For PySpark, print the schema of the Spark table and cast columns with a 'short' dtype to 'int'. If you are converting the PySpark DataFrame to pandas, print the dtypes and change smallint or tinyint columns to int.
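
A rough PySpark sketch of that workaround (the column discovery below is illustrative, not taken from the comment): cast any ByteType/ShortType columns to int before handing the DataFrame to the profiler.

# Illustrative workaround: cast tinyint/smallint (ByteType/ShortType) columns
# to int so the profiler's dtype lookup only sees types it supports.
# 'df' is assumed to be an existing Spark DataFrame.
from pyspark.sql import functions as F
from pyspark.sql.types import ByteType, ShortType

narrow_int_cols = [
    field.name
    for field in df.schema.fields
    if isinstance(field.dataType, (ByteType, ShortType))
]
for col_name in narrow_int_cols:
    df = df.withColumn(col_name, F.col(col_name).cast("int"))

# df can now be passed to ProfileReport without hitting KeyError: 'tinyint'.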

Projects
Status: Selected for next release