
KeyError 'tinyint' during profiling on Apache Spark DataFrame #1374

Open
talgatomarov opened this issue Jun 30, 2023 · 2 comments
Labels
help wanted 🙋 (Contributions are welcome!) · spark ⚡ (PySpark features!)

Comments

@talgatomarov

Current Behaviour

I encountered an error while attempting to run profiling on an Apache Spark DataFrame. The Spark DataFrame contains data retrieved from parquet files. The specific error message I received is as follows:

Traceback (most recent call last):
  File "/tmp/profile.py", line 41, in <module>
    profile.to_html()
  File "/home/spark/.local/lib/python3.7/site-packages/typeguard/__init__.py", line 1033, in wrapper
    retval = func(*args, **kwargs)
  File "/home/spark/.local/lib/python3.7/site-packages/ydata_profiling/profile_report.py", line 468, in to_html
    return self.html
  File "/home/spark/.local/lib/python3.7/site-packages/typeguard/__init__.py", line 1033, in wrapper
    retval = func(*args, **kwargs)
  File "/home/spark/.local/lib/python3.7/site-packages/ydata_profiling/profile_report.py", line 275, in html
    self._html = self._render_html()
  File "/home/spark/.local/lib/python3.7/site-packages/typeguard/__init__.py", line 1033, in wrapper
    retval = func(*args, **kwargs)
  File "/home/spark/.local/lib/python3.7/site-packages/ydata_profiling/profile_report.py", line 383, in _render_html
    report = self.report
  File "/home/spark/.local/lib/python3.7/site-packages/typeguard/__init__.py", line 1033, in wrapper
    retval = func(*args, **kwargs)
  File "/home/spark/.local/lib/python3.7/site-packages/ydata_profiling/profile_report.py", line 269, in report
    self._report = get_report_structure(self.config, self.description_set)
  File "/home/spark/.local/lib/python3.7/site-packages/typeguard/__init__.py", line 1033, in wrapper
    retval = func(*args, **kwargs)
  File "/home/spark/.local/lib/python3.7/site-packages/ydata_profiling/profile_report.py", line 256, in description_set
    self._sample,
  File "/home/spark/.local/lib/python3.7/site-packages/ydata_profiling/model/describe.py", line 73, in describe
    config, df, summarizer, typeset, pbar
  File "/home/spark/.local/lib/python3.7/site-packages/multimethod/__init__.py", line 315, in __call__
    return func(*args, **kwargs)
  File "/home/spark/.local/lib/python3.7/site-packages/ydata_profiling/model/spark/summary_spark.py", line 93, in spark_get_series_descriptions
    executor.imap_unordered(multiprocess_1d, args)
  File "/usr/lib64/python3.7/multiprocessing/pool.py", line 748, in next
    raise value
  File "/usr/lib64/python3.7/multiprocessing/pool.py", line 121, in worker
    result = (True, func(*args, **kwds))
  File "/home/spark/.local/lib/python3.7/site-packages/ydata_profiling/model/spark/summary_spark.py", line 88, in multiprocess_1d
    return column, describe_1d(config, df.select(column), summarizer, typeset)
  File "/home/spark/.local/lib/python3.7/site-packages/multimethod/__init__.py", line 315, in __call__
    return func(*args, **kwargs)
  File "/home/spark/.local/lib/python3.7/site-packages/ydata_profiling/model/spark/summary_spark.py", line 62, in spark_describe_1d
    }[dtype]
KeyError: 'tinyint'

I believe the issue can be resolved by including data types such as "tinyint" and "smallint" in summary_spark.py.
Do you think this is the right approach? If so, I could try submitting a PR.

vtype = {
    "float": "Numeric",
    "int": "Numeric",
    "bigint": "Numeric",
    "double": "Numeric",
    "string": "Categorical",
    "ArrayType": "Categorical",
    "boolean": "Boolean",
    "date": "DateTime",
    "timestamp": "DateTime",
}[dtype]
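
For reference, a sketch of what the extended mapping could look like (the "tinyint" and "smallint" entries are the proposed additions; the exact set of Spark type strings to cover would need to be confirmed against what df.dtypes actually reports):

# Sketch of the proposed mapping in spark_describe_1d (summary_spark.py).
# The "tinyint" and "smallint" entries are the suggested additions.
vtype = {
    "tinyint": "Numeric",   # proposed: 8-bit integers (ByteType)
    "smallint": "Numeric",  # proposed: 16-bit integers (ShortType)
    "int": "Numeric",
    "bigint": "Numeric",
    "float": "Numeric",
    "double": "Numeric",
    "string": "Categorical",
    "ArrayType": "Categorical",
    "boolean": "Boolean",
    "date": "DateTime",
    "timestamp": "DateTime",
}[dtype]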

Expected Behaviour

Profiling runs

Data Description

Private dataset

Code that reproduces the bug

from ydata_profiling import ProfileReport

df = ...

profile = ProfileReport(
    df,
    title='Title',
    infer_dtypes=False,
    interactions=None,
    missing_diagrams=None,
    correlations={'auto': {'calculate': False},
                  'pearson': {'calculate': True},
                  'spearman': {'calculate': True}},
    )
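
The dataset itself is private; purely as an illustration (not part of the original report), a synthetic Spark DataFrame with a ByteType column is reported by Spark with the dtype string 'tinyint' and should hit the same lookup:

# Hypothetical stand-in for the private dataset: a ByteType column shows up
# in df.dtypes as 'tinyint', the key missing from the profiler's mapping.
from pyspark.sql import SparkSession
from pyspark.sql.types import ByteType, StructField, StructType

spark = SparkSession.builder.getOrCreate()
schema = StructType([StructField("small_value", ByteType(), True)])
df = spark.createDataFrame([(1,), (2,), (3,)], schema=schema)
print(df.dtypes)  # expected: [('small_value', 'tinyint')]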

pandas-profiling version

v4.3.1

Dependencies

...

OS

Spark cluster

Checklist

  • There is not yet another bug report for this issue in the issue tracker
  • The problem is reproducible from this bug report. This guide can help to craft a minimal bug report.
  • The issue has not been resolved by the entries listed under Common Issues.
@fabclmnt added the help wanted 🙋 and spark ⚡ labels and removed the needs-triage label on Jul 6, 2023
@oguzhangur96

@fabclmnt I have faced this bug while working on timestamp_ntz data types. If @talgatomarov is uninterested, I can attempt to resolve it for various data types.

@hb0313 (Contributor) commented Feb 13, 2024

Since no workaround has been mentioned yet, here is what worked for me. For PySpark, print the schema of the Spark table and cast columns with a 'short' dtype to 'int'. If you are converting the PySpark DataFrame to pandas, print the dtypes and change smallint or tinyint columns to int.
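
A rough PySpark sketch of that workaround (the column discovery below is illustrative, not taken from the comment): cast any ByteType/ShortType columns to int before handing the DataFrame to the profiler.

# Illustrative workaround: cast tinyint/smallint (ByteType/ShortType) columns
# to int so the profiler's dtype lookup only sees types it supports.
# 'df' is assumed to be an existing Spark DataFrame.
from pyspark.sql import functions as F
from pyspark.sql.types import ByteType, ShortType

narrow_int_cols = [
    field.name
    for field in df.schema.fields
    if isinstance(field.dataType, (ByteType, ShortType))
]
for col_name in narrow_int_cols:
    df = df.withColumn(col_name, F.col(col_name).cast("int"))

# df can now be passed to ProfileReport without hitting KeyError: 'tinyint'.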

Projects
Status: Selected for next release