Improve integration tests between implementations #51

Open
asfimport opened this issue Feb 17, 2021 · 3 comments

asfimport commented Feb 17, 2021

We lack proper integration tests between the different implementations. Fortunately, we already have a git repository for uploading test data: https://github.com/apache/parquet-testing.

The idea is the following: create a directory structure for the different versions of the implementations, containing parquet files with defined data. The structure shall be self-descriptive so that we can write integration tests that read the whole structure automatically and keep working with files added later.

The following directory structure is an example that satisfies these requirements:


test-data/
├── impala
│   ├── 3.2.0
│   │   └── basic-data.parquet
│   ├── 3.3.0
│   │   └── basic-data.parquet
│   └── 3.4.0
│       ├── basic-data.lz4.parquet
│       ├── basic-data.snappy.parquet
│       ├── some-specific-issue-2.parquet
│       ├── some-specific-issue-3.csv
│       ├── some-specific-issue-3_mode1.parquet
│       ├── some-specific-issue-3_mode2.parquet
│       └── some-specific-issue-3.schema
├── parquet-cpp
│   ├── 1.5.0
│   │   ├── basic-data.lz4.parquet
│   │   └── basic-data.parquet
│   └── 1.6.0
│       ├── basic-data.lz4.parquet
│       └── some-specific-issue-2.parquet
├── parquet-mr
│   ├── 1.10.2
│   │   └── basic-data.parquet
│   ├── 1.11.1
│   │   ├── basic-data.parquet
│   │   └── some-specific-issue-1.parquet
│   ├── 1.12.0
│   │   ├── basic-data.br.parquet
│   │   ├── basic-data.lz4.parquet
│   │   ├── basic-data.snappy.parquet
│   │   ├── basic-data.zstd.parquet
│   │   ├── some-specific-issue-1.parquet
│   │   └── some-specific-issue-2.parquet
│   ├── some-specific-issue-1.csv
│   └── some-specific-issue-1.schema
├── basic-data.csv
├── basic-data.schema
├── some-specific-issue-2.csv
└── some-specific-issue-2.schema

Parquet files are created at the leaf level. The expected data is saved in CSV format (details still to be specified: separators, how to encode binary values, etc.), and the expected schema (which specifies the data types independently of the parquet files) is saved in .schema files. The CSV and schema files can be saved at the same level as the parquet files, or at upper levels if they are common to several parquet files.
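
To make the "self-descriptive" requirement concrete, a consumer test could walk the tree and, for every parquet file, look for a reference with the same base name next to it and then in the parent directories up to the test-data root. The following is only a rough sketch in Python with pyarrow, not part of the proposal; the comparison is reduced to a row-count check, and details such as the `_mode` suffixes and codec extensions are handled naively.

```python
# Rough sketch only: walk test-data/ and check each parquet file against the
# nearest reference CSV. The comparison and name handling are simplified.
import csv
from pathlib import Path

import pyarrow.parquet as pq


def find_reference(parquet_path, root, suffix):
    # "basic-data.lz4.parquet" -> case name "basic-data"
    case = parquet_path.name.split(".")[0]
    for directory in parquet_path.parents:
        candidate = directory / (case + suffix)
        if candidate.exists():
            return candidate
        if directory == root:
            break
    return None


def check_file(parquet_path, root):
    reference = find_reference(parquet_path, root, ".csv")
    if reference is None:
        return  # no reference data for this case
    table = pq.read_table(parquet_path)
    with reference.open(newline="") as f:
        expected_rows = list(csv.reader(f))
    # Placeholder check; a real test would compare values cell by cell
    # against the types declared in the matching .schema file.
    assert table.num_rows == len(expected_rows), parquet_path


if __name__ == "__main__":
    root = Path("test-data")
    for parquet_file in sorted(root.rglob("*.parquet")):
        check_file(parquet_file, root)
```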

Any comments about the idea are welcome.


Reporter: Gabor Szadovszky / @gszadovszky

Note: This issue was originally created as PARQUET-1985. Please see the migration documentation for further details.


Micah Kornfield / @emkornfield:
I think trying to shoehorn structured data into CSV might not be worthwhile. Instead I would propose either JSON, Protobuf (text representation), or Avro (probably its JSON representation).

In Arrow, at least, we've used both "gold files" like this issue suggests and a test harness that runs commands from different language bindings with temporary data. Both have been useful.
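
For reference, the harness approach could look roughly like the sketch below. The `parquet-mr-cli` and `parquet-cpp-cli` command names are placeholders (no such tools are implied); the point is just that one implementation writes into a temporary directory and another reads the result back.

```python
# Rough sketch of a cross-implementation round trip over temporary data.
# The writer/reader commands below are hypothetical placeholders.
import subprocess
import tempfile
from pathlib import Path

WRITERS = {"parquet-mr": ["parquet-mr-cli", "write"]}    # placeholder command
READERS = {"parquet-cpp": ["parquet-cpp-cli", "read"]}   # placeholder command


def round_trip(reference_json: Path):
    with tempfile.TemporaryDirectory() as tmp:
        for writer_name, write_cmd in WRITERS.items():
            parquet_file = Path(tmp) / (writer_name + ".parquet")
            subprocess.run([*write_cmd, str(reference_json), str(parquet_file)],
                           check=True)
            for reader_name, read_cmd in READERS.items():
                # Each reader dumps the file back (here: to stdout as JSON)
                # so the result can be compared with the reference data.
                result = subprocess.run([*read_cmd, str(parquet_file)],
                                        capture_output=True, text=True,
                                        check=True)
                assert result.stdout.strip() == reference_json.read_text().strip(), (
                    writer_name, reader_name)
```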


Gabor Szadovszky / @gszadovszky:
@emkornfield, I agree CSV is not the best approach; I did not think about nested types. I think JSON is more widespread than Protobuf or Avro, so there is a higher chance of having an easy-to-use library for each language. In addition, JSON is human readable, which makes debugging easier (for non-binary types). On the other hand, JSON files would be much larger than Protobuf/Avro files.

Whichever format we use to store the "gold data", we still need to specify it properly. Do we want to test logical types as well? From the Arrow/Impala point of view it makes sense, as they have the related types (e.g. timestamp, decimal). To validate these types we need the data in its rich form (e.g. as a timestamp/decimal and not as binary). Meanwhile, parquet-mr does not support these types, so when we convert the binary values to these types we are not testing parquet-mr but the test itself. But maybe that is a parquet-mr specific issue and we should provide the widest set of data available.
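
As an illustration of what validating the "rich form" could mean, here is a sketch only; the column names and the convention of storing timestamps/decimals as strings in the JSON reference are assumptions, not part of any agreed format.

```python
# Sketch: compare timestamp/decimal columns in their logical (rich) form.
# Column names ("ts", "price") and the JSON reference conventions are assumed.
import json
from datetime import datetime
from decimal import Decimal
from pathlib import Path

import pyarrow.parquet as pq


def check_logical_types(parquet_path, json_path):
    table = pq.read_table(parquet_path)
    expected = json.loads(Path(json_path).read_text())
    for row, ref in zip(table.to_pylist(), expected):
        # pyarrow already yields datetime.datetime and decimal.Decimal objects
        # for timestamp and decimal logical types.
        assert row["ts"] == datetime.fromisoformat(ref["ts"])
        assert row["price"] == Decimal(ref["price"])
```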


Antoine Pitrou / @pitrou:
Indeed, JSON sounds much better than CSV.

We probably want to test logical types as well, IMHO. Of course, implementations which don't support them will have to skip those tests, which means that optional features such as logical types should probably use separate reference files.
