I encountered an error while attempting to run profiling on an Apache Spark DataFrame. The Spark DataFrame contains data retrieved from parquet files. The specific error message I received is as follows:
Traceback (most recent call last):
  File "/tmp/profile.py", line 41, in <module>
    profile.to_html()
  File "/home/spark/.local/lib/python3.7/site-packages/typeguard/__init__.py", line 1033, in wrapper
    retval = func(*args, **kwargs)
  File "/home/spark/.local/lib/python3.7/site-packages/ydata_profiling/profile_report.py", line 468, in to_html
    return self.html
  File "/home/spark/.local/lib/python3.7/site-packages/typeguard/__init__.py", line 1033, in wrapper
    retval = func(*args, **kwargs)
  File "/home/spark/.local/lib/python3.7/site-packages/ydata_profiling/profile_report.py", line 275, in html
    self._html = self._render_html()
  File "/home/spark/.local/lib/python3.7/site-packages/typeguard/__init__.py", line 1033, in wrapper
    retval = func(*args, **kwargs)
  File "/home/spark/.local/lib/python3.7/site-packages/ydata_profiling/profile_report.py", line 383, in _render_html
    report = self.report
  File "/home/spark/.local/lib/python3.7/site-packages/typeguard/__init__.py", line 1033, in wrapper
    retval = func(*args, **kwargs)
  File "/home/spark/.local/lib/python3.7/site-packages/ydata_profiling/profile_report.py", line 269, in report
    self._report = get_report_structure(self.config, self.description_set)
  File "/home/spark/.local/lib/python3.7/site-packages/typeguard/__init__.py", line 1033, in wrapper
    retval = func(*args, **kwargs)
  File "/home/spark/.local/lib/python3.7/site-packages/ydata_profiling/profile_report.py", line 256, in description_set
    self._sample,
  File "/home/spark/.local/lib/python3.7/site-packages/ydata_profiling/model/describe.py", line 73, in describe
    config, df, summarizer, typeset, pbar
  File "/home/spark/.local/lib/python3.7/site-packages/multimethod/__init__.py", line 315, in __call__
    return func(*args, **kwargs)
  File "/home/spark/.local/lib/python3.7/site-packages/ydata_profiling/model/spark/summary_spark.py", line 93, in spark_get_series_descriptions
    executor.imap_unordered(multiprocess_1d, args)
  File "/usr/lib64/python3.7/multiprocessing/pool.py", line 748, in next
    raise value
  File "/usr/lib64/python3.7/multiprocessing/pool.py", line 121, in worker
    result = (True, func(*args, **kwds))
  File "/home/spark/.local/lib/python3.7/site-packages/ydata_profiling/model/spark/summary_spark.py", line 88, in multiprocess_1d
    return column, describe_1d(config, df.select(column), summarizer, typeset)
  File "/home/spark/.local/lib/python3.7/site-packages/multimethod/__init__.py", line 315, in __call__
    return func(*args, **kwargs)
  File "/home/spark/.local/lib/python3.7/site-packages/ydata_profiling/model/spark/summary_spark.py", line 62, in spark_describe_1d
    }[dtype]
KeyError: 'tinyint'
I believe the issue can be resolved by adding data types such as "tinyint" and "smallint" to the dtype mapping in summary_spark.py.
Do you think this is the right solution? If so, I could try submitting a PR.
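The failure mode can be illustrated without Spark: spark_describe_1d dispatches on the column's dtype string through a dict lookup, so any dtype string missing from that dict raises a bare KeyError. A minimal sketch of the pattern and of the proposed fix follows; the handler and dict names here are hypothetical stand-ins, not the actual ydata-profiling internals:

```python
# Hypothetical stand-in for a per-type summarizer function.
def describe_numeric_1d(column):
    return {"column": column, "kind": "numeric"}

# Dispatch table keyed by Spark dtype string. The proposed fix is to map
# the narrow integer types ('tinyint', 'smallint') to the same numeric
# handler as 'int'/'bigint'; without those keys, the lookup raises
# KeyError: 'tinyint' exactly as in the traceback above.
DTYPE_HANDLERS = {
    "tinyint": describe_numeric_1d,
    "smallint": describe_numeric_1d,
    "int": describe_numeric_1d,
    "bigint": describe_numeric_1d,
}

def describe_1d(column, dtype):
    try:
        handler = DTYPE_HANDLERS[dtype]
    except KeyError:
        # Surface a clearer error than a bare KeyError for unmapped types.
        raise ValueError(f"Unsupported Spark dtype: {dtype!r}") from None
    return handler(column)
```

A lookup for "tinyint" now dispatches to the numeric handler instead of crashing, and genuinely unsupported dtypes produce a readable error message.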
@fabclmnt I have faced this bug while working with timestamp_ntz data types. If @talgatomarov is not interested, I can attempt to resolve it for the various missing data types.
Since no workaround has been mentioned yet, here is something that worked for me. For PySpark, print the schema of the Spark table and cast any columns with the 'short' dtype to 'int'. If you are converting the PySpark DataFrame to pandas, print the dtypes and change smallint or tinyint columns to int.
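The pandas side of this workaround can be sketched as follows; the frame and column names are hypothetical, standing in for the result of df.toPandas(). int8 and int16 are the pandas equivalents of Spark's tinyint and smallint, and upcasting them to int64 before profiling sidesteps the unmapped dtypes:

```python
import pandas as pd

# Hypothetical frame standing in for the converted Spark DataFrame.
df = pd.DataFrame({"flag": [0, 1, 1], "count": [10, 20, 30]}).astype(
    {"flag": "int8", "count": "int16"}
)

# Upcast every narrow integer column to int64 so the profiler's dtype
# dispatch recognizes it.
narrow = df.select_dtypes(include=["int8", "int16"]).columns
df[narrow] = df[narrow].astype("int64")

# On the Spark side, the analogous cast per offending column would be:
#   df = df.withColumn(col_name, df[col_name].cast("int"))
```

Note that this widens the storage of each affected column; for profiling purposes the summary statistics are unaffected.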
Current Behaviour
ydata-profiling/src/ydata_profiling/model/spark/summary_spark.py, lines 52 to 62 at cfb020d
Expected Behaviour
Profiling runs successfully without errors.
Data Description
Private dataset
Code that reproduces the bug
pandas-profiling version
v4.3.1
Dependencies
OS
Spark cluster
Checklist