
[ENG-1809] Improve .csv dialect sniffing #362

Open

@cslzchen wants to merge 2 commits into develop from feature/improve-csv-sniffing
Conversation

@cslzchen (Contributor) commented May 29, 2020

Ticket

https://openscience.atlassian.net/browse/ENG-1809

Purpose

Improve .csv dialect sniffing.

Changes

  • Provide a full row of data to csv.Sniffer().sniff() so it can effectively detect the correct dialect. MFR reads a small amount of data from the file and recursively reads more until one of the following happens (see the sketch after this list):

    • Nothing is available: return the empty string, since there are no newline characters in the file at all.
    • The to-sniff size exceeds the maximum allowed render size: return TabularRenderError. This is mostly a failsafe, since an oversized file should never reach the sniffer in the first place.
    • A newline character of any type is found: trim the data (if necessary) and send the full first row to the sniffer.
    • See the video captures in the ticket for a quick demo.
  • Updated existing unit tests and added new ones.
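
A minimal sketch of that loop, assuming a text-mode file handle. INIT_SNIFF_SIZE is the constant named in the diff below; MAX_RENDER_SIZE is a stand-in name, and a plain ValueError stands in for the TabularRenderError mentioned above:

    INIT_SNIFF_SIZE = 2048                # starting chunk size; assumed value
    MAX_RENDER_SIZE = 10 * 1024 * 1024    # the 10MB render limit discussed below

    def sniff_first_row(fp, max_render_size=MAX_RENDER_SIZE):
        """Read just enough of ``fp`` (text mode) to isolate one full row."""
        data = fp.read(INIT_SNIFF_SIZE)
        size_to_sniff = len(data)
        while True:
            # A newline of any style ends the first row: trim and return it
            for newline in ('\r\n', '\n', '\r'):
                index = data.find(newline)
                if index != -1:
                    return data[:index]
            # Failsafe: oversized files should never get this far
            if size_to_sniff > max_render_size:
                raise ValueError('first row exceeds the max render size')
            chunk = fp.read(INIT_SNIFF_SIZE)
            # EOF without ever seeing a newline: nothing to sniff
            if not chunk:
                return ''
            data += chunk
            size_to_sniff += len(chunk)

    # Usage: dialect = csv.Sniffer().sniff(sniff_first_row(fp))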

Side effects

Memory usage for sniffing depends on the size of the first row of the file; in the worst case, it could sniff at most 10MB. Note that we don't have statistical data on how many bytes the first row of a .csv file takes on average. However, files larger than 10MB have already failed the renderer's size check before sniffing starts.

Given that for CSV we need to read the full file into memory for rendering anyway, and that the partial sniff data is always deleted after use, I don't think this will be a problem.

  • For CR: maybe add a Sentry alert when the sniff size goes over 512KB (or some other threshold) so we know how often this happens and how effective this algorithm actually is. A sketch follows below.
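
A minimal sketch of that alert, assuming the service already initializes sentry_sdk at startup; the constant and helper names here are hypothetical:

    import sentry_sdk  # assumes Sentry is configured at app startup

    SNIFF_ALERT_THRESHOLD = 512 * 1024  # 512KB, the size floated above

    def maybe_alert_on_large_sniff(size_to_sniff):
        # Emit a warning-level event so we can measure how often the first
        # row grows past the threshold and how effective the algorithm is.
        if size_to_sniff > SNIFF_ALERT_THRESHOLD:
            sentry_sdk.capture_message(
                'CSV sniff read {} bytes (threshold: {})'.format(
                    size_to_sniff, SNIFF_ALERT_THRESHOLD),
                level='warning',
            )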

QA Notes

TBD

Deployment Notes

N/A

@cslzchen force-pushed the feature/improve-csv-sniffing branch from 3131b7e to 641367a on May 29, 2020 03:10
@cslzchen changed the title from "[WIP] [ForestFire] Improve .csv dialect sniffing" to "[WIP] [ForestFire] [ENG-1809] Improve .csv dialect sniffing" on May 29, 2020
@cslzchen force-pushed the feature/improve-csv-sniffing branch 3 times, most recently from 1de56f0 to 60521e9 on June 2, 2020 06:30
@cslzchen changed the title from "[WIP] [ForestFire] [ENG-1809] Improve .csv dialect sniffing" to "[ENG-1809] Improve .csv dialect sniffing" on June 2, 2020
cslzchen added 2 commits June 5, 2020 14:50
The CSV sniffer used to sniff a fixed 2048 bytes of any given
file. However, this led to a wrongly guessed delimiter when the
first row was larger than 2048 bytes. The issue is fixed, and
accuracy improved, by letting the sniffer adaptively sniff the
first full row as long as it stays within the max render file size.
* Updated tabular renderer tests
* Wrote tests for new utility methods added to `stdlib_tools`
@cslzchen force-pushed the feature/improve-csv-sniffing branch from 3ffbffa to 9e8b9fc on June 5, 2020 19:01
@cslzchen (Contributor, Author) left a comment

Commits have been further cleaned up, unit tests expanded, and the PR is ready for review again. 🎆

Comment on lines +22 to +24
# Prepare the first row for sniffing
data = fp.read(INIT_SNIFF_SIZE)
data = _trim_or_append_data(fp, data, INIT_SNIFF_SIZE, 0)
@cslzchen (Contributor, Author) commented:
The initial solution (see below) didn't work 100% of the time: for some files, sniffing the full file produced the wrong delimiter, while sniffing only the first row worked as expected. This is why MFR provides the sniffer with only the first row even when it could sniff the full file.

...
data = fp.read(TABULAR_INIT_SNIFF_SIZE)
if len(data) == TABULAR_INIT_SNIFF_SIZE:
    data = _trim_or_append_data(fp, data, TABULAR_INIT_SNIFF_SIZE, 0)
...
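
As a toy illustration (not MFR code) of the first-row approach with the stdlib sniffer:

    import csv, io

    sample = 'name;city;score\nAda;London;99.5\nGrace;NYC;98.0\n'
    first_row = sample.split('\n', 1)[0]       # what the new code hands over

    dialect = csv.Sniffer().sniff(first_row)   # sniff the first row only
    assert dialect.delimiter == ';'

    rows = list(csv.reader(io.StringIO(sample), dialect))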

:param text: the current text chunk to check the new line character
:param read_size: the last read size when `fp.read()` is called
:param size_to_sniff: the accumulated size of the text to sniff
:param max_render_size: the max file size for render
@cslzchen (Contributor, Author) commented:
Instead of using MAX_FILE_SIZE directly, the max_render_size keyword argument gives unit tests the option to use a much smaller size; see the sketch below.
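
For example, a unit test can hit the size guard with a tiny max_render_size instead of a 10MB fixture. The import path and the expected exception type here are assumptions, not the PR's exact names:

    import io
    import pytest
    # Assumed import path for the helper added in this PR
    from mfr.extensions.tabular.libs.stdlib_tools import _trim_or_append_data

    def test_oversized_first_row_fails_fast():
        fp = io.StringIO('a,b,c' * 100)    # one long "row" with no newline
        data = fp.read(16)
        with pytest.raises(Exception):     # stand-in for the tabular render error
            _trim_or_append_data(fp, data, 16, 0, max_render_size=64)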

Comment on lines +167 to +171
index = text.find('\r\n')
if index == -1:
    index = text.find('\n')
    if index == -1:
        index = text.find('\r')
@cslzchen (Contributor, Author) commented on Jun 5, 2020:
I am not sure why .rfind() was used in the initial solution (shown below). One guess: since the old code sniffed only a fixed-size chunk, it wanted to include as many complete rows as possible. That is no longer the case in my implementation, where only the first row is sniffed. A quick comparison of the two follows after the snippet.

    index = text.rfind('\r\n')
    if index == -1:
        index = text.rfind('\n')
        if index == -1:
            index = text.rfind('\r')
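
Comparing the two behaviors on a two-row chunk:

    text = 'id,name\r\n1,Ada\r\n2,Grace'

    # rfind keeps everything up to the LAST complete row -- useful when a
    # fixed-size chunk may have cut the final row in half.
    print(text[:text.rfind('\r\n')])   # id,name\r\n1,Ada

    # find keeps only the FIRST row -- all the new implementation needs.
    print(text[:text.find('\r\n')])    # id,name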
