Frictionless fails to describe the table with the correct field type when the data file is big #1689

Open
mingjiecn opened this issue Sep 23, 2024 · 4 comments
Labels
bug Something isn't working

Comments

@mingjiecn

mingjiecn commented Sep 23, 2024

Overview

When a field contains both integers and floats, frictionless describes the field as a number type. This works well when the data file is small, but we have an issue when the data file is big. For example, we have a big data file of about 2 GB where one of the fields can hold either 0 or a float. For this field, most rows have a value of 0 and only a few have a float value. When frictionless describes the table, it describes this field as an integer type instead of a number type: it fails to see the float values in this field. Can this bug be fixed? Thanks!

This is the output of describing a small data file (test.tsv) and a big data file (TSTFI46007602.tsv) with the same fields. You can see ref_score is identified as a number type in the small file but as an integer type in the big file:

(checkfiles_venv) (base) mingjie@Mingjies-MacBook-Pro checkfiles % frictionless describe  src/tests/data/test.tsv
# --------
# metadata: src/tests/data/test.tsv
# --------

name: test
type: table
path: src/tests/data/test.tsv
scheme: file
format: tsv
encoding: utf-8
mediatype: text/tsv
dialect:
  csv:
    delimiter: "\t"
schema:
  fields:
    - name: '#chrom'
      type: string
    - name: start
      type: integer
    - name: end
      type: integer
    - name: spdi
      type: string
    - name: ref
      type: string
    - name: alt
      type: string
    - name: kmer_coord
      type: string
    - name: ref_score
      type: number
    - name: alt_score
      type: number
    - name: relative_binding_affinity
      type: number
    - name: effect_on_binding
      type: string

(checkfiles_venv) (base) mingjie@Mingjies-MacBook-Pro checkfiles % frictionless describe  src/tests/data/TSTFI46007602.tsv
# --------
# metadata: src/tests/data/TSTFI46007602.tsv
# --------

name: tstfi46007602
type: table
path: src/tests/data/TSTFI46007602.tsv
scheme: file
format: tsv
encoding: utf-8
mediatype: text/tsv
dialect:
  csv:
    delimiter: "\t"
schema:
  fields:
    - name: '#chrom'
      type: string
    - name: start
      type: integer
    - name: end
      type: integer
    - name: spdi
      type: string
    - name: ref
      type: string
    - name: alt
      type: string
    - name: kmer_coord
      type: string
    - name: ref_score
      type: integer
    - name: alt_score
      type: integer
    - name: relative_binding_affinity
      type: integer
    - name: effect_on_binding
      type: string
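
For reference, the same inference can be reproduced from Python rather than the CLI (a minimal sketch, assuming frictionless v5 and the same file paths as above):

from frictionless import describe

for path in ["src/tests/data/test.tsv", "src/tests/data/TSTFI46007602.tsv"]:
    resource = describe(path)
    print(path)
    print(resource.schema)  # ref_score: "number" for the small file, "integer" for the big one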
@pierrecamilleri
Copy link
Collaborator

Thx for the report.

Diving into the code, it looks like the sample that is analysed to "guess" the type of a column is hardcoded to 100 rows here.
I can reproduce with a csv file with 1 column: 100 rows of zeros followed by a decimal value.
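
Minimal reproduction sketch (assuming frictionless v5; repro.csv is just an arbitrary file name):

from frictionless import describe

# Single column: 100 rows of zeros followed by one decimal value.
with open("repro.csv", "w") as f:
    f.write("a\n")
    f.write("0\n" * 100)
    f.write("1.2\n")

print(describe("repro.csv").schema)  # field "a" is inferred as "integer", not "number"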

Can you confirm that your data starts with at least 100 lines of zeros?

Unfortunately I can't think of a workaround right now... Can I ask what your use case is? Is it for validation?

@mingjiecn
Author

Yes, the first several hundred rows are 0s. We use frictionless to validate big tsv files. Right now what I do is skip the type error if no schema is provided, roughly as in the sketch below. Let me know if there is a better way. Thank you!
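
(A sketch only, assuming frictionless v5 and its skip_errors option; "type-error" is the cell type error code:)

from frictionless import validate

# Skip cell type errors when no schema is provided.
report = validate("src/tests/data/TSTFI46007602.tsv", skip_errors=["type-error"])
print(report.valid)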

@pierrecamilleri
Collaborator

Thx for your feedback. The only way I see is correcting the output of describe inside a schema (sketched below), but of course your answer shows you already thought of that.
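
Something along these lines (a sketch, assuming frictionless v5; only the problematic field is spelled out, the remaining fields from the describe output above would need to be listed too):

from frictionless import Resource, Schema

schema = Schema.from_descriptor({
    "fields": [
        {"name": "ref_score", "type": "number"},
        # ... plus the other fields from the describe output
    ]
})
report = Resource("src/tests/data/TSTFI46007602.tsv", schema=schema).validate()
print(report.valid)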

Actually, the hard-coded SAMPLE_SIZE does not seem to be the culprit.

The following csv already fails despite being fewer than 100 rows:

a,b
0,0
0,0
0,0
0,0
0,0
0,0
0,0
0,0
0,0
0,0
1.2,3.4

I tried the following command, which fails as well:

frictionless describe --sample-size=11 --field-confidence=1 test.csv

So there is something wrong here; I need to investigate further.
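
For the record, the Python equivalent of that command (a sketch, assuming frictionless v5, where the CLI flags map to Detector options and describe forwards detector= to the resource):

from frictionless import Detector, describe

detector = Detector(sample_size=11, field_confidence=1)
print(describe("test.csv", detector=detector).schema)  # both fields still come out as "integer"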

@pierrecamilleri added the bug label Oct 11, 2024
@mingjiecn
Author

Please keep me updated. Thank you so much!
