fix: check overflow numbers while inferring type for csv files #6481
Which issue does this PR close?
Related to #2580. Also related to apache/datafusion#3174
Rationale for this change
Currently we use regexes to infer types in .csv files. The regex for `Int64` is `(^-?(\d+)$)`, which accepts all integers, even ones that overflow `i64` (this caused apache/datafusion#3174). Initially I thought we could use a regex that matches only numbers in range, but that regex would be too long (more than 300 chars when I tried). Instead, we can use a function that tries to parse the string to `i64`, which is simple and flexible. The original regexes could be kept or changed to more efficient functions if needed.
What changes are included in this PR?
Change the regex mentioned above to functions.
I only changed the boolean and i64 checks to functions, since those cases are straightforward. The decimal regex is extended to accept overflowing numbers. The other regexes are kept. I also added a TODO for further improvements. (I'd like to change them later if the following questions are addressed.)
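The idea described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in this PR: `InferredType`, `is_decimal_shaped`, and `infer_type` are hypothetical names, and the decimal check is a simplified hand-written stand-in for the extended decimal regex.

```rust
/// Hypothetical, simplified stand-in for the inferred column type.
#[derive(Debug, PartialEq)]
enum InferredType {
    Boolean,
    Int64,
    Decimal,
    Utf8,
}

/// Hypothetical helper: optional '-', one or more ASCII digits, and an
/// optional fractional part. It puts no limit on the number of digits,
/// so integers that overflow i64 are accepted here.
fn is_decimal_shaped(s: &str) -> bool {
    let unsigned = s.strip_prefix('-').unwrap_or(s);
    let (int_part, frac_part) = match unsigned.split_once('.') {
        Some((i, f)) => (i, Some(f)),
        None => (unsigned, None),
    };
    let all_digits = |p: &str| !p.is_empty() && p.bytes().all(|b| b.is_ascii_digit());
    all_digits(int_part) && frac_part.map_or(true, all_digits)
}

/// Hypothetical inference order: boolean, then i64, then decimal, then string.
fn infer_type(s: &str) -> InferredType {
    if s.eq_ignore_ascii_case("true") || s.eq_ignore_ascii_case("false") {
        InferredType::Boolean
    } else if s.parse::<i64>().is_ok() {
        // `parse::<i64>` fails on overflow, which a digit-only regex cannot detect.
        InferredType::Int64
    } else if is_decimal_shaped(s) {
        // Integer-looking values outside the i64 range end up here.
        InferredType::Decimal
    } else {
        InferredType::Utf8
    }
}
```

One behavioral difference worth keeping in mind: `str::parse::<i64>` also accepts a leading `+` (for example `"+5"`), which the regex `^-?(\d+)$` rejects.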
Some questions here:
- The regex `^\d{4}-\d\d-\d\d[T ]\d\d:\d\d:\d\d(?:[^\d\.].*)?$` accepts some illegal timestamps like `1000-00-00T11:11:11(adewoifas)`. I wonder if it's alright to use `chrono::NaiveDateTime::parse_from_str(s, "%Y-%m-%d %H:%M:%S").is_ok()` to replace it.
- `uk_cities.csv` has meaningful content. I wonder if it's alright to add some meaningless strings to it for testing.
Are there any user-facing changes?
Numbers that overflow `i64` will now be inferred as decimal type instead of `Int64`.
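To illustrate the timestamp question raised earlier, here is a std-only sketch of why a shape-only check is too permissive. Both functions are hypothetical: `matches_timestamp_shape` is a hand-written approximation of the regex for ASCII input, and `is_valid_timestamp` is a simplified stand-in for a real date-time parser such as the chrono call proposed above (it does not reproduce chrono's exact rules).

```rust
/// Hypothetical ASCII approximation of the regex
/// ^\d{4}-\d\d-\d\d[T ]\d\d:\d\d:\d\d(?:[^\d\.].*)?$
/// It only checks the shape of the string, not the field values.
fn matches_timestamp_shape(s: &str) -> bool {
    let b = s.as_bytes();
    if b.len() < 19 {
        return false;
    }
    let digit_positions = [0, 1, 2, 3, 5, 6, 8, 9, 11, 12, 14, 15, 17, 18];
    let shape_ok = digit_positions.iter().all(|&i| b[i].is_ascii_digit())
        && b[4] == b'-'
        && b[7] == b'-'
        && (b[10] == b'T' || b[10] == b' ')
        && b[13] == b':'
        && b[16] == b':';
    // An optional tail is allowed if it starts with neither a digit nor '.'.
    let tail_ok = match b.get(19) {
        None => true,
        Some(&c) => !c.is_ascii_digit() && c != b'.',
    };
    shape_ok && tail_ok
}

/// Hypothetical stricter check: exact length, no trailing text, and
/// every field within its calendar or clock range.
fn is_valid_timestamp(s: &str) -> bool {
    if s.len() != 19 || !matches_timestamp_shape(s) {
        return false;
    }
    // The shape check guarantees these byte ranges contain ASCII digits.
    let field = |r: std::ops::Range<usize>| s[r].parse::<u32>().unwrap_or(u32::MAX);
    let (year, month, day) = (field(0..4), field(5..7), field(8..10));
    let (hour, minute, second) = (field(11..13), field(14..16), field(17..19));
    let leap = (year % 4 == 0 && year % 100 != 0) || year % 400 == 0;
    let days_in_month = match month {
        1 | 3 | 5 | 7 | 8 | 10 | 12 => 31,
        4 | 6 | 9 | 11 => 30,
        2 => if leap { 29 } else { 28 },
        _ => return false,
    };
    (1..=days_in_month).contains(&day) && hour < 24 && minute < 60 && second < 60
}
```

With this sketch, the example from the question, `1000-00-00T11:11:11(adewoifas)`, passes the shape check but fails the stricter validation, which is the gap a parser-based check would close.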