BUG: Unable to read parquet files using fastparquet #6778
Comments
Hi @seydar, thanks for the contribution! Could you provide data to reproduce the problem? If you can't share the dataset, perhaps you could generate synthetic data that reproduces it? |
@anmyachev What a fun bug! In trying to make my private data publicly accessible, I got to learn a little bit more about AWS and I narrowed down the bug to what the actual problem is. drumroll With
MRE:
import ray
import modin.pandas as pd
ray.init("localhost:6379")
manifest_files = ['s3://ari-public-test-data/test'] # 13 KB, fails
#manifest_files = ['s3://ari-public-test-data/test.parquet'] # 13 KB, succeeds
df = pd.read_parquet(manifest_files, engine='fastparquet')
Error:
|
Better MRE
Here's an even easier one that doesn't use S3!
import modin.pandas as pd
pd.read_parquet('test.parquet', engine='fastparquet') # success
pd.read_parquet('test', engine='fastparquet') # failure
You'll have to rename this data file to |
@seydar thanks for the detailed research! It looks like I found the problematic place:
It turns out that, at least for a single file, we should not apply this filter. I wonder what the logic should be if a folder is used as the path for |
Got it: edit: whoops, had this sitting but it turned out you found it in the meantime. |
It would be great! Much appreciate your help! |
@anmyachev I'm having some trouble adding a test. Do you have any suggestions on this? Is there something I'm obviously missing? Test inserted at line 1467:
# Tests issue #6778
def test_read_parquet_no_extension(self, engine, make_parquet_file):
    with ensure_clean(".parquet") as unique_filename:
        # Remove the .parquet extension
        no_ext_fname = unique_filename[: unique_filename.index(".parquet")]
        make_parquet_file(filename=no_ext_fname)
        eval_io(
            fn_name="read_parquet",
            # read_parquet kwargs
            engine=engine,
            path=no_ext_fname,
        )
Error:
|
@seydar could you try running it as follows: |
@anmyachev same error:
|
@seydar this looks very strange: pytest can't even start normally, so there are probably problems with the dependencies in the environment. You can open a draft pull request and see how the tests run in the CI environment. If you first want the tests to run normally locally, then let's look at how you created the environment and what dependencies it contains. |
…ing fastparquet In supporting fastparquet, modin takes the paths provided, globs them, and filters them to only look at files with the .parq or .parquet extension. This commit adds support so that if the path supplied is explicitly a file, it will be included. Signed-off by: Ari Brown <[email protected]>
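The behavior described in this commit message can be sketched roughly as follows. This is a minimal illustration using only the standard library, not Modin's actual code; the function name `list_parquet_files` and the constant `PARQUET_EXTENSIONS` are hypothetical.

```python
import os
import tempfile

# Hypothetical constant: the extensions the commit message says are kept.
PARQUET_EXTENSIONS = (".parq", ".parquet")


def list_parquet_files(path):
    """Return the data files to read for `path` (illustrative only)."""
    # The fix: a path that is explicitly a file is included, whatever its name.
    if os.path.isfile(path):
        return [path]
    # Otherwise treat the path as a directory and filter by extension.
    found = []
    for root, _dirs, names in os.walk(path):
        for name in sorted(names):
            if name.endswith(PARQUET_EXTENSIONS):
                found.append(os.path.join(root, name))
    return found


# Small demonstration on empty placeholder files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("test", "part.0.parquet", "_metadata"):
        open(os.path.join(tmp, name), "wb").close()
    explicit = [os.path.basename(p) for p in list_parquet_files(os.path.join(tmp, "test"))]
    from_dir = [os.path.basename(p) for p in list_parquet_files(tmp)]
```

Without the `isfile` branch, the extensionless file `test` would be filtered out and nothing would be read, which matches the failure in the MRE.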
Submitted! #6789 Thank you very much for your help and advice on this. I'd love to dig into the pytest stuff as well, since it clearly shouldn't be like that. I built everything using pip. Not sure if there is other information I can provide for it. python: 3.10.4 |
…ing fastparquet In supporting fastparquet, modin takes the paths provided, globs them, and filters them to only look at files with the .parq or .parquet extension. This commit adds support so that if the path supplied is explicitly a file, it will be included. This commit also fixes the formatting issues for the `black` linter. Signed-off by: Ari Brown <[email protected]>
I can't reproduce your problem, although I haven't tried your version of python (only |
… a file Avoids the use of `os.path.isfile()` to use `self.fs.isfile()`. Signed-off-by: Ari Brown <[email protected]>
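One reason to prefer a filesystem-level check is that `os.path.isfile()` only understands local paths, so a remote path such as the `s3://` one in the MRE would never be recognized as a file. The sketch below shows the idea under stated assumptions: `LocalFS` is a hypothetical stand-in for whatever object `self.fs` holds, assumed to expose `isfile` and `find` the way fsspec filesystems do, and `resolve_files` is an illustrative name, not Modin's implementation.

```python
import os
import tempfile


class LocalFS:
    """Hypothetical stand-in for a filesystem object with isfile() and find()."""

    def isfile(self, path):
        return os.path.isfile(path)

    def find(self, path):
        # Recursively list every file below `path`.
        return sorted(
            os.path.join(root, name)
            for root, _dirs, names in os.walk(path)
            for name in names
        )


def resolve_files(fs, path, extensions=(".parq", ".parquet")):
    # Ask the filesystem object rather than os.path, so the same check
    # can work for any backend that implements isfile().
    if fs.isfile(path):
        return [path]
    return [f for f in fs.find(path) if f.endswith(extensions)]


# Small demonstration on empty placeholder files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("test", "data.parq", "notes.txt"):
        open(os.path.join(tmp, name), "wb").close()
    fs = LocalFS()
    single = [os.path.basename(p) for p in resolve_files(fs, os.path.join(tmp, "test"))]
    listed = [os.path.basename(p) for p in resolve_files(fs, tmp)]
```

Swapping `LocalFS` for a remote filesystem object with the same two methods would leave `resolve_files` unchanged, which is the point of routing the check through the filesystem.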
…et (#6790) * FIX-#6778: Read parquet files without file extensions using fastparquet In supporting fastparquet, modin takes the paths provided, globs them, and filters them to only look at files with the .parq or .parquet extension. This commit adds support so that if the path supplied is explicitly a file, it will be included. Signed-off by: Ari Brown <[email protected]> * FIX-#6778: Generalizes the check of whether something is a file Avoids the use of `os.path.isfile()` to use `self.fs.isfile()`. Signed-off-by: Ari Brown <[email protected]> --------- Signed-off-by: Ari Brown <[email protected]>
Modin version checks
I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest released version of Modin.
I have confirmed this bug exists on the main branch of Modin. (In order to do this you can follow this guide.)
Reproducible Example
Issue Description
Ray + Modin are unable to read a parquet file (774 MB) using fastparquet. While reading the file, the submission crashes. Looking at the error, it appears that fastparquet is unable to correctly identify how many rows are in the dataframe. I believe this is happening when it's assessing the index, but I'm unable to verify that, as I'm not familiar enough with the modin codebase.
Expected Behavior
It should exit without error after successfully reading the file and distributing it to the cluster.
Error Logs
Installed Versions
INSTALLED VERSIONS
commit : b99cf06
python : 3.10.4.final.0
python-bits : 64
OS : Darwin
OS-release : 21.6.0
Version : Darwin Kernel Version 21.6.0: Thu Mar 9 20:10:19 PST 2023; root:xnu-8020.240.18.700.8~1/RELEASE_ARM64_T8101
machine : arm64
processor : arm
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
Modin dependencies
modin : 0.25.1
ray : 2.8.0
dask : 2023.11.0
distributed : 2023.11.0
hdk : None
pandas dependencies
pandas : 2.1.3
numpy : 1.26.0
pytz : 2023.3
dateutil : 2.8.2
setuptools : 67.8.0
pip : 23.3.1
Cython : 0.29.32
pytest : None
hypothesis : None
sphinx : 3.3.1
blosc : 1.11.1
feather : None
xlsxwriter : 3.0.3
lxml.etree : 4.9.1
html5lib : 1.1
pymysql : None
psycopg2 : 2.9.6
jinja2 : 3.1.2
IPython : 8.4.0
pandas_datareader : None
bs4 : 4.11.1
bottleneck : None
dataframe-api-compat: None
fastparquet : 2023.10.1
fsspec : 2023.6.0
gcsfs : None
matplotlib : 3.7.1
numba : None
numexpr : 2.8.0
odfpy : None
openpyxl : 3.0.9
pandas_gbq : None
pyarrow : 14.0.1
pyreadstat : None
pyxlsb : None
s3fs : 2023.6.0
scipy : 1.10.1
sqlalchemy : 2.0.9
tables : 3.7.0
tabulate : 0.8.9
xarray : 2022.10.0
xlrd : None
zstandard : 0.18.0
tzdata : 2023.3
qtpy : None
pyqt5 : None