When a field contains both integers and floats, frictionless describes the field as a number type. This works well when the data file is small, but we run into a problem when the data file is big. For example, we have a data file of about 2GB in which one field can be either 0 or a float. Most rows have a value of 0 for this field; only a few have a float value. When frictionless describes the table, it describes this field as an integer type instead of a number type. It fails to see the float values in this column. Can this bug be fixed? Thanks!
This is the output of describing a small data file (test.tsv) and a big data file (TSTFI46007602.tsv) with the same fields. You can see ref_score identified as a number type in the small file but as an integer type in the big file:
Diving into the code, it looks like the sample that is analysed to "guess" the type of the column is hardcoded to 100 rows here.
I can reproduce this with a CSV file with 1 column: 100 rows of zeros followed by a decimal value.
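To illustrate why a fixed-size sample produces this result, here is a minimal sketch in plain Python. It does not use frictionless and is not its actual detection code; `guess_type` and `SAMPLE_SIZE` are hypothetical names, and the 100-row figure is taken from the comment above.

```python
# Hypothetical stand-in for sample-based type guessing (not frictionless code).
SAMPLE_SIZE = 100  # assumed sample size, as described in the comment above


def guess_type(values):
    """Guess "integer", "number", or "string" from a list of cell strings."""

    def all_parse(cast):
        # True only if every value can be converted with `cast`.
        try:
            for value in values:
                cast(value)
        except ValueError:
            return False
        return True

    if all_parse(int):
        return "integer"
    if all_parse(float):
        return "number"
    return "string"


# One column: 100 rows of zeros followed by a single decimal value.
column = ["0"] * 100 + ["0.5"]

sampled_guess = guess_type(column[:SAMPLE_SIZE])  # only sees the zeros
full_guess = guess_type(column)                   # also sees the decimal

print(sampled_guess, full_guess)
```

The guess made from the first 100 rows never sees the decimal value, so it settles on the integer type, while a pass over the whole column would pick the number type.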
Can you confirm that your data starts with at least 100 lines of zeros?
Unfortunately, I can't think of a workaround right now... Can I ask what your use case is? Is it for validation?
Yes, the first several hundred rows are 0s. We use frictionless to validate big TSV files. Right now I skip the type error if no schema is provided. Let me know if there is a better way. Thank you!
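One possible direction, sketched here in plain Python rather than with any frictionless API, is to stream the whole file once and decide the type of a numeric field from every row instead of a sample. The function name `infer_numeric_type` is hypothetical, and the in-memory data below only stands in for a real TSV file; the result could then be written into an explicitly provided schema.

```python
import csv
import io


def infer_numeric_type(lines, field, delimiter="\t"):
    """Scan every row and return "integer", "number", or "string" for `field`.

    `lines` is any iterable of text lines (for example an open file), so the
    file is streamed row by row and never loaded into memory at once.
    """
    inferred = "integer"
    for row in csv.DictReader(lines, delimiter=delimiter):
        value = row[field]
        try:
            int(value)
            continue  # still consistent with an integer column
        except ValueError:
            pass
        try:
            float(value)
        except ValueError:
            return "string"  # not numeric at all, stop early
        inferred = "number"  # at least one non-integer numeric value seen
    return inferred


# Illustrative data: 300 rows of 0 followed by a single float value.
rows = ["id\tref_score"] + [f"{i}\t0" for i in range(300)] + ["300\t0.25"]
sample_file = io.StringIO("\n".join(rows) + "\n")

result = infer_numeric_type(sample_file, "ref_score")
print(result)
```

For a real file, the `io.StringIO` object would be replaced by something like `open(path, newline="")`. A full pass over a 2GB file is slower than sampling, which is the trade-off for not missing late float values.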