Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: check overflow numbers while inferring type for csv files #6481

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

CookiePieWw
Copy link

@CookiePieWw CookiePieWw commented Sep 30, 2024

Which issue does this PR close?

Related to #2580. Also related to apache/datafusion#3174

Rationale for this change

Currently we use regex to infer types in .csv files. The regex for Int64 is (^-?(\d+)$) which accepts all numbers even overflow (this caused apache/datafusion#3174). Initially I think we can use a regex that match the numbers in range, but the regex will be too long (more than 300 chars as I tried.

We can turn to a function trying to parse the string to i64, which is simple and flexible. The original regex could be kept or changed to more effective funtions if needed.

What changes are included in this PR?

Change the regex mentioned above to funtions.

I only changed the boolean and i64 to functions since it's obvious. The regex of decimal is extended to accept overflowing numbers. Other regex is kept. I also add a TODO for further improvements. (I'd like to try to change it later if following questions are addressed)

Some questions here:

  1. The regex of timestamp is ^\d{4}-\d\d-\d\d[T ]\d\d:\d\d:\d\d(?:[^\d\.].*)?$ which accept some illegal timestamps like 1000-00-00T11:11:11(adewoifas). I wonder if it's alright to use chrono::NaiveDateTime::parse_from_str(s, "%Y-%m-%d %H:%M:%S").is_ok() to replace it.
  2. The tests are performed over the file uk_cities.csv which has meaningful content. I wonder if it's alright to add some meaningless strings to it for testing.

Are there any user-facing changes?

The numbers that overflow now will be inferred as decimal type instead of int64.

@github-actions github-actions bot added the arrow Changes to the arrow crate label Sep 30, 2024
@tustvold
Copy link
Contributor

tustvold commented Sep 30, 2024

I worry this will significantly regress performance as well as regress functionality. The change to use NaiveDatetime will break timestamp inference for timestamps with timezones, for example. Moving away from RegexSet to serially checking Regex expressions will significantly slow down performance.

Perhaps we could take a step back and determine what it is we're trying to solve here? Inferring large integers as decimals is not really a like for like change

Edit: one less disruptive way to achieve this might be to on detecting Int64 from the RegexSet then try to parse the i64 if the string has enough characters that it could overflow. This would avoid regressing the majority of current workloads

@CookiePieWw
Copy link
Author

Edit: one less disruptive way to achieve this might be to on detecting Int64 from the RegexSet then try to parse the i64 if the string has enough characters that it could overflow. This would avoid regressing the majority of current workloads

Got you. I've changed it to a check for overflow now :)

Perhaps we could take a step back and determine what it is we're trying to solve here? Inferring large integers as decimals is not really a like for like change

We may have to find a type for large integers however, maybe utf8's better?

@CookiePieWw CookiePieWw changed the title refactor: use functions to infer types in csv fix: check overflow numbers while inferring type for csv files Sep 30, 2024
@tustvold
Copy link
Contributor

We may have to find a type for large integers however, maybe utf8's better?

Yeah I think utf8 is probably the safest option, float runs the risk of silent truncation, whereas decimal comes with its own complexities and can itself overflow.

I think all this needs now is some tests.

I don't think this is a breaking change, as the fact it inferred Int64 for values larger than fit in an Int64 could reasonably be considered a bug.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants