Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

XLSX files are not detected correctly #87

Closed
jeverling opened this issue Jul 18, 2022 · 8 comments
Closed

XLSX files are not detected correctly #87

jeverling opened this issue Jul 18, 2022 · 8 comments

Comments

@jeverling
Copy link

Hi all, XLSX files are detected as Zip files when ckanext-s3filestore is active. The reason is, that the extension only looks at the first 512 bytes of the uploaded file: https://github.com/qld-gov-au/ckanext-s3filestore/blob/cf0c5bd/ckanext/s3filestore/uploader.py#L545

The part that differentiates a XLSX from a regular Zip comes later (when, depends on the file as well):

In [1]: import magic
In [2]: mime = magic.Magic(mime=True)
In [3]: f = open("empty.xlsx", "rb")
In [4]: mime.from_buffer(f.read(512))
Out[4]: 'application/zip'

In [5]: f.seek(0)
Out[5]: 0

In [6]: mime.from_buffer(f.read(1000))
Out[6]: 'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet'

Since the amount of bytes you have to read before python-magic determines it's a XLSX changes with XLSX size/complexity, you probably have to pass the whole file to be sure it works reliably.

CKAN by default tries to look at the file extension to determine the mimetype (config ckan.mimetype_guess = file_ext), or reads the whole file (ckan.mimetype_guess = file_contents): https://github.com/ckan/ckan/blob/0a596b8/ckan/lib/uploader.py#L274

What do you think is the best way to fix this? To behave the same way as CKAN does, or stick to magic.from_buffer but read the whole file? Performance wise that probably won't do any harm. Or use magic.from_file/from_descriptor?

@ThrawnCA
Copy link
Contributor

Ah, thanks for identifying this. We've had a few similar reports, mostly involving DOCX files. Do you find that re-uploading the same file resolves it?

Perhaps we could use magic.from_file first, since it's fast and should work best for recognised types, then type sniffing as a backup if the filename isn't recognised.

@jeverling
Copy link
Author

Hm no, re-uploading doesn't seem to change anything.

Perhaps we could use magic.from_file first, since it's fast and should work best for recognised types, then type sniffing as a backup if the filename isn't recognised.

Sounds good to me!

@ThrawnCA
Copy link
Contributor

I think that re-uploading works for us because we're also using https://github.com/qld-gov-au/ckanext-resource-type-validation/

This fix will likely be prioritised, as it's affecting our clients.

@ThrawnCA
Copy link
Contributor

@jeverling
Copy link
Author

Thanks a lot! I can confirm it works! In case this is useful for others: the XLS icon was still not displayed for me, and format was still set to application/zip, even though the mimetype was set correctly now.
Turns out the Docker image didn't have mime-types configured, so while python-magic recognized the mimetype correctly now, the code that sets the format (validators.py#L799) couldn't detect the mimetype. Installing the mime-support package from the debian repositories in the image solved that issue for me.

@jeverling
Copy link
Author

BTW, another issue for us is that XLSX files aren't pushed to the CKAN DataStore (anymore). This seems to be due to xlrd removing support for XLSX files: ckan/datapusher#232
xlrd is used by messytables, which is also not maintained anymore. We switched to https://github.com/ckan/ckanext-xloader from datapusher, but it has the same problem. There is a successor to messytables called frictionless, probably the best way would be to switch to that.

Are you using datapusher or xloader to submit tabular data to the DataStore?

@ThrawnCA
Copy link
Contributor

We use XLoader. I see you've raised this there, at ckan/ckanext-xloader#161

As far as I can see, it's not related to ckanext-s3filestore? But if you want to coordinate with us about it, we do have an XLoader fork, https://github.com/qld-gov-au/ckanext-xloader

@jeverling
Copy link
Author

It's not related to ckanext-s3filestore at all. Thanks, I will check out your fork of XLoader.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants