The validator should require files to more strictly adhere to CSV #1924

Open
dancory-urbanfootprint opened this issue Nov 22, 2024 · 2 comments
Labels
bug: Something isn't working (crash, a rule has a problem)
status: Needs triage: Applied to all new issues

Comments

@dancory-urbanfootprint

Describe the bug

The validator accepts files that contain unescaped quotation marks in field values instead of rejecting them with the csv_parsing_failed error.

By default, the univocity parser treats a value that contains an unescaped quote as if the whole value were unquoted, and keeps reading until the next delimiter. Many other CSV libraries reject such input.

To make the univocity parser stricter about this, set its unescaped quote handling to UnescapedQuoteHandling.RAISE_ERROR.
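
A minimal sketch of that setting, assuming the univocity-parsers library is on the classpath. The class name `StrictQuoteDemo` and the sample row are hypothetical and are not taken from the validator's code or from the attached feeds. Which malformed patterns actually raise an error should be checked against the univocity version the validator uses.

```java
import com.univocity.parsers.common.TextParsingException;
import com.univocity.parsers.csv.CsvParser;
import com.univocity.parsers.csv.CsvParserSettings;
import com.univocity.parsers.csv.UnescapedQuoteHandling;
import java.util.Arrays;

// Hypothetical demo class, not part of the validator.
final class StrictQuoteDemo {
    public static void main(String[] args) {
        CsvParserSettings settings = new CsvParserSettings();
        // The setting proposed in this issue: fail on unescaped quotes
        // instead of trying to recover from them.
        settings.setUnescapedQuoteHandling(UnescapedQuoteHandling.RAISE_ERROR);
        CsvParser parser = new CsvParser(settings);

        // Made-up row: the quotes around St are not doubled
        // inside the quoted field.
        String row = "S1,\"Main \"St\" Station\",45.0,-73.0";
        try {
            String[] fields = parser.parseLine(row);
            System.out.println("accepted: " + Arrays.toString(fields));
        } catch (TextParsingException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

In the validator, a parsing exception like this would presumably be what gets reported as csv_parsing_failed, but that wiring is an assumption here.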

Steps/Code to Reproduce

Validate any of the feeds attached under "Files used".

Expected Results

The report should contain a csv_parsing_failed error for stops.txt (and probably for other files).

Actual Results

The report does not show a csv_parsing_failed error.

Files used

Here are some existing feeds that contain unescaped quotes:
mdb-2000-202411140002.zip
mdb-1271-202406071530.zip
mdb-1185-202406071652.zip
mdb-902-202402080014.zip

Validator version

6.0

Operating system

Windows 11

Java version

17.0.7

dancory-urbanfootprint added the "bug" and "status: Needs triage" labels on Nov 22, 2024

welcome bot commented Nov 22, 2024

Thanks for opening your first issue in this project! If you haven't already, you can join our Slack and the #gtfs-validators channel to meet our awesome community. Come say hi 👋!

Welcome to the community and thank you for your engagement in open source! 🎉

github-project-automation bot moved this to "Requires investigation" in Bug triage on Nov 22, 2024
@dancory-urbanfootprint
Author

The GTFS Schedule specification says:
Field values that contain quotation marks or commas must be enclosed within quotation marks. In addition, each quotation mark in the field value must be preceded with a quotation mark. This is consistent with the manner in which Microsoft Excel outputs comma-delimited (CSV) files. For more information on the CSV file format, see http://tools.ietf.org/html/rfc4180. The following example demonstrates how a field value would appear in a comma-delimited file:
Original field value: Contains "quotes", commas and text
Field value in CSV file: "Contains ""quotes"", commas and text"

These files should fail according to that statement.
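
The quoting rule quoted above can be sketched as a small dependency-free check. This is an illustration only, not the validator's or univocity's implementation: the class `CsvQuoteCheck` and the method `isWellFormed` are hypothetical names, and the sketch assumes a comma delimiter and a record that fits on a single line (no line breaks inside quoted fields).

```java
// Hypothetical helper illustrating the quoting rule; single-line records only.
final class CsvQuoteCheck {

    /** Returns true if every field of the record follows the quoting rule. */
    static boolean isWellFormed(String line) {
        int i = 0;
        int n = line.length();
        while (true) {
            if (i < n && line.charAt(i) == '"') {
                // Quoted field: inner quotes must be doubled.
                i++;
                boolean closed = false;
                while (i < n) {
                    if (line.charAt(i) != '"') {
                        i++;
                    } else if (i + 1 < n && line.charAt(i + 1) == '"') {
                        i += 2; // escaped quote
                    } else {
                        i++;    // closing quote
                        closed = true;
                        break;
                    }
                }
                if (!closed) {
                    return false; // field was never closed
                }
                if (i < n && line.charAt(i) != ',') {
                    return false; // text after the closing quote
                }
            } else {
                // Unquoted field: must not contain any quote.
                while (i < n && line.charAt(i) != ',') {
                    if (line.charAt(i) == '"') {
                        return false;
                    }
                    i++;
                }
            }
            if (i >= n) {
                return true; // end of record
            }
            i++; // skip the delimiter
        }
    }

    public static void main(String[] args) {
        // The two forms from the specification example.
        System.out.println(isWellFormed("\"Contains \"\"quotes\"\", commas and text\""));
        System.out.println(isWellFormed("Contains \"quotes\", commas and text"));
    }
}
```

Under these assumptions the escaped form from the specification passes the check, while the original value written without enclosing and doubled quotes does not.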
