[ENG-1809] Improve .csv dialect sniffing #362
The CSV sniffer used to sniff a fixed 2048 bytes of any given file. However, this led to a wrongly guessed delimiter when the first row was larger than 2048 bytes. This is fixed, and accuracy is also improved, by letting the sniffer adaptively read the first full row as long as it stays within the max render file size.
* Updated tabular renderer tests
* Wrote tests for new utility methods added to `stdlib_tools`
Commits have been further cleaned up and unit tests expanded. The PR is ready for review again. 🎆
# Prepare the first row for sniffing
data = fp.read(INIT_SNIFF_SIZE)
data = _trim_or_append_data(fp, data, INIT_SNIFF_SIZE, 0)
The initial solution (see below) didn't fully work: for some files, sniffing the full file produced the wrong delimiter while sniffing only the first row worked as expected. This is why MFR provides the sniffer with only the first row, even when the sniffer could sniff the full file.
...
data = fp.read(TABULAR_INIT_SNIFF_SIZE)
if len(data) == TABULAR_INIT_SNIFF_SIZE:
    data = _trim_or_append_data(fp, data, TABULAR_INIT_SNIFF_SIZE, 0)
...
:param text: the current text chunk to check for the new line character
:param read_size: the last read size when `fp.read()` is called
:param size_to_sniff: the accumulated size of the text to sniff
:param max_render_size: the max file size for render
Instead of using `MAX_FILE_SIZE` directly, the `max_render_size=` parameter gives unit tests the option to use a much smaller size.
index = text.find('\r\n')
if index == -1:
    index = text.find('\n')
    if index == -1:
        index = text.find('\r')
I am not sure why `.rfind()` was used in the initial solution. One guess is that it read only a fixed size for sniffing, so it wanted to include as many rows as possible. However, this is no longer the case in my implementation, where only the first row is sniffed.
index = text.rfind('\r\n')
if index == -1:
    index = text.rfind('\n')
    if index == -1:
        index = text.rfind('\r')
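To illustrate the difference between the two lookups, here is a minimal sketch. The helper names `first_line_break` and `last_line_break` are hypothetical and not part of MFR. They only mirror the newline priority (`\r\n`, then `\n`, then `\r`) used in the snippets above.

```python
def first_line_break(text):
    """Return the index of the first occurrence of the highest-priority
    newline sequence found in text, or -1 if there is none."""
    for newline in ('\r\n', '\n', '\r'):
        index = text.find(newline)
        if index != -1:
            return index
    return -1


def last_line_break(text):
    """Same priority order, but searching from the end with rfind()."""
    for newline in ('\r\n', '\n', '\r'):
        index = text.rfind(newline)
        if index != -1:
            return index
    return -1


# A chunk that ends in the middle of the third row.
chunk = 'a,b,c\r\n1,2,3\r\n4,5'

# find() cuts at the end of the first row only.
first_row = chunk[:first_line_break(chunk)]

# rfind() keeps every complete row in the chunk and drops the partial one.
whole_rows = chunk[:last_line_break(chunk)]
```

With `.find()` the sniffer receives just the header row, while `.rfind()` hands it as many complete rows as fit in the chunk, which matches the guess above about why the initial solution used it.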
Ticket
https://openscience.atlassian.net/browse/ENG-1809
Purpose
Improve .csv dialect sniffing.
Changes
* Provide a full row of data to `csv.Sniffer().sniff()` so it can effectively detect the correct dialect. MFR reads a small amount of data from the file and recursively reads more until either the first full row has been read or the max render file size is reached, in which case a `TabularRenderError` is raised. However, this is more of a failsafe, since an oversized file should never reach the sniffer in the first place.
* Updated existing unit tests and added new ones.
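The adaptive read described in this section can be sketched roughly as follows. This is not MFR's actual `_trim_or_append_data` implementation. The names `read_first_row` and `sniff_first_row` are hypothetical, the chunk size of 2048 and the 10MB default are assumptions taken from the numbers mentioned in this PR, and a plain `ValueError` stands in for `TabularRenderError`. The sketch uses a loop rather than recursion and counts characters of a text-mode file object rather than bytes.

```python
import csv
import io

INIT_SNIFF_SIZE = 2048  # assumed initial chunk size (the old fixed sniff size)


def read_first_row(fp, chunk_size=INIT_SNIFF_SIZE,
                   max_render_size=10 * 1024 * 1024):
    """Read from a text-mode file object until the first full row is found."""
    data = ''
    while True:
        chunk = fp.read(chunk_size)
        data += chunk
        # Same newline priority as in the review snippets above.
        for newline in ('\r\n', '\n', '\r'):
            index = data.find(newline)
            if index != -1:
                return data[:index]
        if not chunk:
            # End of file without any line break: the file is a single row.
            return data
        if len(data) >= max_render_size:
            # Failsafe; stands in for MFR's TabularRenderError.
            raise ValueError('first row exceeds the max render size')


def sniff_first_row(fp, **kwargs):
    """Sniff the dialect from the first row only."""
    return csv.Sniffer().sniff(read_first_row(fp, **kwargs))


# Usage: a header row longer than 2048 characters, delimited by semicolons.
header = ';'.join('col%d' % i for i in range(600))
sample = io.StringIO(header + '\n' + '1;2;3\n')
dialect = sniff_first_row(sample)
print(repr(dialect.delimiter))
```

Passing a small `max_render_size` makes the failsafe branch easy to exercise in a unit test without creating a large fixture file.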
Side effects
Memory usage for sniffing depends on the size of the first row of the file. In the worst case, it could sniff at most 10MB. Please note that we don't have enough statistical data on how many bytes the first row of a `.csv` file takes on average. However, files larger than 10MB have already failed the renderer's size check before sniffing starts.

Given that for CSV we need to read the full file into memory for rendering anyway, and the partial sniff data is always deleted after use, I don't think it will be a problem.
QA Notes
TBD
Deployment Notes
N / A