
[ENG-1809] Improve .csv dialect sniffing #362

Open

@cslzchen wants to merge 2 commits into develop from feature/improve-csv-sniffing
Conversation

@cslzchen (Contributor) commented May 29, 2020

Ticket

https://openscience.atlassian.net/browse/ENG-1809

Purpose

Improve .csv dialect sniffing.

Changes

  • Provide a full row of data to csv.Sniffer().sniff() so it can effectively detect the correct dialect. MFR reads a small amount of data from the file and recursively reads more until one of the following happens (see the sketch after this list):

    • Nothing is available: return the empty string, since there are no newline characters in the file at all.
    • The to-sniff size exceeds the maximum allowed render size: return TabularRenderError. This is mostly a failsafe, since an oversized file should never reach the sniffer in the first place.
    • A newline character of any type is found: trim the data (if necessary) and send the full first row to the sniffer.
    • See the video captures in the ticket for a quick demo.
  • Updated existing unit tests and added new ones.
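
A minimal sketch of that loop, assuming a text-mode file handle. INIT_SNIFF_SIZE is the constant named in the diff below; MAX_RENDER_SIZE is a stand-in name, and a plain ValueError stands in for the TabularRenderError mentioned above:

    INIT_SNIFF_SIZE = 2048                # starting chunk size; assumed value
    MAX_RENDER_SIZE = 10 * 1024 * 1024    # the 10MB render limit discussed below

    def sniff_first_row(fp, max_render_size=MAX_RENDER_SIZE):
        """Read just enough of ``fp`` (text mode) to isolate one full row."""
        data = fp.read(INIT_SNIFF_SIZE)
        size_to_sniff = len(data)
        while True:
            # A newline of any style ends the first row: trim and return it
            for newline in ('\r\n', '\n', '\r'):
                index = data.find(newline)
                if index != -1:
                    return data[:index]
            # Failsafe: oversized files should never get this far
            if size_to_sniff > max_render_size:
                raise ValueError('first row exceeds the max render size')
            chunk = fp.read(INIT_SNIFF_SIZE)
            # EOF without ever seeing a newline: nothing to sniff
            if not chunk:
                return ''
            data += chunk
            size_to_sniff += len(chunk)

    # Usage: dialect = csv.Sniffer().sniff(sniff_first_row(fp))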

Side effects

Memory usage for sniffing depends on the size of the first row of the file; in the worst case, it could sniff at most 10MB. Note that we don't have statistical data on how many bytes the first row of a .csv file takes on average. However, files larger than 10MB have already failed the renderer's size check before sniffing starts.

Given that for CSV we need to read the full file into memory for rendering anyway, and that the partial sniff data is always deleted after use, I don't think this will be a problem.

  • For CR: maybe add a Sentry alert when the sniff size goes over 512KB (or some other threshold) so we know how often this happens and how effective this algorithm actually is. A sketch follows below.
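
A minimal sketch of that alert, assuming the service already initializes sentry_sdk at startup; the constant and helper names here are hypothetical:

    import sentry_sdk  # assumes Sentry is configured at app startup

    SNIFF_ALERT_THRESHOLD = 512 * 1024  # 512KB, the size floated above

    def maybe_alert_on_large_sniff(size_to_sniff):
        # Emit a warning-level event so we can measure how often the first
        # row grows past the threshold and how effective the algorithm is.
        if size_to_sniff > SNIFF_ALERT_THRESHOLD:
            sentry_sdk.capture_message(
                'CSV sniff read {} bytes (threshold: {})'.format(
                    size_to_sniff, SNIFF_ALERT_THRESHOLD),
                level='warning',
            )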

QA Notes

TBD

Deployment Notes

N/A

@cslzchen force-pushed the feature/improve-csv-sniffing branch from 3131b7e to 641367a on May 29, 2020 03:10
@cslzchen changed the title from "[WIP] [ForestFire] Improve .csv dialect sniffing" to "[WIP] [ForestFire] [ENG-1809] Improve .csv dialect sniffing" on May 29, 2020
@cslzchen force-pushed the feature/improve-csv-sniffing branch 3 times, most recently from 1de56f0 to 60521e9 on June 2, 2020 06:30
@cslzchen changed the title from "[WIP] [ForestFire] [ENG-1809] Improve .csv dialect sniffing" to "[ENG-1809] Improve .csv dialect sniffing" on June 2, 2020
cslzchen added 2 commits June 5, 2020 14:50
The CSV sniffer used to sniff a fixed 2048 bytes of any given
file. However, this led to a wrongly guessed delimiter when the
first row was larger than 2048 bytes. The issue is fixed, and
accuracy improved, by letting the sniffer adaptively sniff the
first full row as long as it stays within the max render file size.
* Updated tabular renderer tests
* Wrote tests for new utility methods added to `stdlib_tools`
@cslzchen force-pushed the feature/improve-csv-sniffing branch from 3ffbffa to 9e8b9fc on June 5, 2020 19:01
@cslzchen (Contributor, Author) left a comment

Commits have been further cleaned up, unit tests expanded, and the PR is ready for review again. 🎆

Comment on lines +22 to +24
# Prepare the first row for sniffing
data = fp.read(INIT_SNIFF_SIZE)
data = _trim_or_append_data(fp, data, INIT_SNIFF_SIZE, 0)
@cslzchen (Contributor, Author) commented:
The initial solution (see below) didn't work 100% of the time: for some files, sniffing the full file produced the wrong delimiter, while sniffing only the first row worked as expected. This is why MFR provides the sniffer with only the first row even when it could sniff the full file.

...
data = fp.read(TABULAR_INIT_SNIFF_SIZE)
if len(data) == TABULAR_INIT_SNIFF_SIZE:
    data = _trim_or_append_data(fp, data, TABULAR_INIT_SNIFF_SIZE, 0)
...
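
As a toy illustration (not MFR code) of the first-row approach with the stdlib sniffer:

    import csv, io

    sample = 'name;city;score\nAda;London;99.5\nGrace;NYC;98.0\n'
    first_row = sample.split('\n', 1)[0]       # what the new code hands over

    dialect = csv.Sniffer().sniff(first_row)   # sniff the first row only
    assert dialect.delimiter == ';'

    rows = list(csv.reader(io.StringIO(sample), dialect))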

:param text: the current text chunk to check the new line character
:param read_size: the last read size when `fp.read()` is called
:param size_to_sniff: the accumulated size of the text to sniff
:param max_render_size: the max file size for render
@cslzchen (Contributor, Author) commented:
Instead of using MAX_FILE_SIZE directly, the max_render_size keyword argument gives unit tests the option to use a much smaller size; see the sketch below.
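
For example, a unit test can hit the size guard with a tiny max_render_size instead of a 10MB fixture. The import path and the expected exception type here are assumptions, not the PR's exact names:

    import io
    import pytest
    # Assumed import path for the helper added in this PR
    from mfr.extensions.tabular.libs.stdlib_tools import _trim_or_append_data

    def test_oversized_first_row_fails_fast():
        fp = io.StringIO('a,b,c' * 100)    # one long "row" with no newline
        data = fp.read(16)
        with pytest.raises(Exception):     # stand-in for the tabular render error
            _trim_or_append_data(fp, data, 16, 0, max_render_size=64)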

Comment on lines +167 to +171
index = text.find('\r\n')
if index == -1:
    index = text.find('\n')
    if index == -1:
        index = text.find('\r')
@cslzchen (Contributor, Author) commented on Jun 5, 2020:
I am not sure why .rfind() was used in the initial solution (shown below). One guess: since the old code sniffed only a fixed-size chunk, it wanted to include as many complete rows as possible. That is no longer the case in my implementation, where only the first row is sniffed. A quick comparison of the two follows after the snippet.

    index = text.rfind('\r\n')
    if index == -1:
        index = text.rfind('\n')
        if index == -1:
            index = text.rfind('\r')
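
Comparing the two behaviors on a two-row chunk:

    text = 'id,name\r\n1,Ada\r\n2,Grace'

    # rfind keeps everything up to the LAST complete row -- useful when a
    # fixed-size chunk may have cut the final row in half.
    print(text[:text.rfind('\r\n')])   # id,name\r\n1,Ada

    # find keeps only the FIRST row -- all the new implementation needs.
    print(text[:text.find('\r\n')])    # id,name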
