Improve integration tests between implementations #51

Open
asfimport opened this issue Feb 17, 2021 · 3 comments

asfimport commented Feb 17, 2021

We lack proper integration tests between the different implementations. Fortunately, we already have a git repository for uploading test data: https://github.com/apache/parquet-testing.

The idea is the following: create a directory structure for the different versions of the implementations, containing parquet files with defined data. The structure shall be self-descriptive so that we can write integration tests that read the whole structure automatically and keep working with files added later.

The following directory structure is an example that satisfies these requirements:


test-data/
├── impala
│   ├── 3.2.0
│   │   └── basic-data.parquet
│   ├── 3.3.0
│   │   └── basic-data.parquet
│   └── 3.4.0
│       ├── basic-data.lz4.parquet
│       ├── basic-data.snappy.parquet
│       ├── some-specific-issue-2.parquet
│       ├── some-specific-issue-3.csv
│       ├── some-specific-issue-3_mode1.parquet
│       ├── some-specific-issue-3_mode2.parquet
│       └── some-specific-issue-3.schema
├── parquet-cpp
│   ├── 1.5.0
│   │   ├── basic-data.lz4.parquet
│   │   └── basic-data.parquet
│   └── 1.6.0
│       ├── basic-data.lz4.parquet
│       └── some-specific-issue-2.parquet
├── parquet-mr
│   ├── 1.10.2
│   │   └── basic-data.parquet
│   ├── 1.11.1
│   │   ├── basic-data.parquet
│   │   └── some-specific-issue-1.parquet
│   ├── 1.12.0
│   │   ├── basic-data.br.parquet
│   │   ├── basic-data.lz4.parquet
│   │   ├── basic-data.snappy.parquet
│   │   ├── basic-data.zstd.parquet
│   │   ├── some-specific-issue-1.parquet
│   │   └── some-specific-issue-2.parquet
│   ├── some-specific-issue-1.csv
│   └── some-specific-issue-1.schema
├── basic-data.csv
├── basic-data.schema
├── some-specific-issue-2.csv
└── some-specific-issue-2.schema

Parquet files are created at the leaf level. The expected data is saved in CSV format (details still to be specified: separators, how to encode binary values, etc.), and the expected schema (which specifies the data types independently of the parquet files) is saved in .schema files. The CSV and schema files can be saved at the same level as the parquet files, or at upper levels if they are common to several parquet files.
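
To make the "self-descriptive" requirement concrete, a consumer test could walk the tree and, for every parquet file, look for a reference with the same base name next to it and then in the parent directories up to the test-data root. The following is only a rough sketch in Python with pyarrow, not part of the proposal; the comparison is reduced to a row-count check, and details such as the `_mode` suffixes and codec extensions are handled naively.

```python
# Rough sketch only: walk test-data/ and check each parquet file against the
# nearest reference CSV. The comparison and name handling are simplified.
import csv
from pathlib import Path

import pyarrow.parquet as pq


def find_reference(parquet_path, root, suffix):
    # "basic-data.lz4.parquet" -> case name "basic-data"
    case = parquet_path.name.split(".")[0]
    for directory in parquet_path.parents:
        candidate = directory / (case + suffix)
        if candidate.exists():
            return candidate
        if directory == root:
            break
    return None


def check_file(parquet_path, root):
    reference = find_reference(parquet_path, root, ".csv")
    if reference is None:
        return  # no reference data for this case
    table = pq.read_table(parquet_path)
    with reference.open(newline="") as f:
        expected_rows = list(csv.reader(f))
    # Placeholder check; a real test would compare values cell by cell
    # against the types declared in the matching .schema file.
    assert table.num_rows == len(expected_rows), parquet_path


if __name__ == "__main__":
    root = Path("test-data")
    for parquet_file in sorted(root.rglob("*.parquet")):
        check_file(parquet_file, root)
```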

Any comments about the idea are welcome.


Reporter: Gabor Szadovszky / @gszadovszky

Note: This issue was originally created as PARQUET-1985. Please see the migration documentation for further details.


Micah Kornfield / @emkornfield:
I think trying to shoehorn structured data into CSV might not be worthwhile. Instead I would propose either JSON, Protobuf (text representation), or Avro (probably its JSON representation).

In Arrow, at least, we've used both "gold files" like this issue suggests and a test harness that runs commands from different language bindings with temporary data. Both have been useful.
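
For reference, the harness approach could look roughly like the sketch below. The `parquet-mr-cli` and `parquet-cpp-cli` command names are placeholders (no such tools are implied); the point is just that one implementation writes into a temporary directory and another reads the result back.

```python
# Rough sketch of a cross-implementation round trip over temporary data.
# The writer/reader commands below are hypothetical placeholders.
import subprocess
import tempfile
from pathlib import Path

WRITERS = {"parquet-mr": ["parquet-mr-cli", "write"]}    # placeholder command
READERS = {"parquet-cpp": ["parquet-cpp-cli", "read"]}   # placeholder command


def round_trip(reference_json: Path):
    with tempfile.TemporaryDirectory() as tmp:
        for writer_name, write_cmd in WRITERS.items():
            parquet_file = Path(tmp) / (writer_name + ".parquet")
            subprocess.run([*write_cmd, str(reference_json), str(parquet_file)],
                           check=True)
            for reader_name, read_cmd in READERS.items():
                # Each reader dumps the file back (here: to stdout as JSON)
                # so the result can be compared with the reference data.
                result = subprocess.run([*read_cmd, str(parquet_file)],
                                        capture_output=True, text=True,
                                        check=True)
                assert result.stdout.strip() == reference_json.read_text().strip(), (
                    writer_name, reader_name)
```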


Gabor Szadovszky / @gszadovszky:
@emkornfield, I agree CSV is not the best approach; I did not think about nested types. I think JSON is more widespread than Protobuf or Avro, so there is a higher chance of having an easy-to-use library for each language. In addition, JSON is human readable, which makes debugging easier (for non-binary types). On the other hand, JSON files would be much larger than Protobuf/Avro files.

Whichever format we use to store the "gold data", we still need to specify it properly. Do we want to test logical types as well? From the Arrow/Impala point of view it makes sense, as they have the related types (e.g. timestamp, decimal). To validate these types we need the data in its rich form (e.g. as a timestamp/decimal and not as binary). Meanwhile, parquet-mr does not support these types, so when we convert the binary values to these types we are not testing parquet-mr but the test itself. But maybe that is a parquet-mr specific issue and we should provide the widest set of data available.
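
As an illustration of what validating the "rich form" could mean, here is a sketch only; the column names and the convention of storing timestamps/decimals as strings in the JSON reference are assumptions, not part of any agreed format.

```python
# Sketch: compare timestamp/decimal columns in their logical (rich) form.
# Column names ("ts", "price") and the JSON reference conventions are assumed.
import json
from datetime import datetime
from decimal import Decimal
from pathlib import Path

import pyarrow.parquet as pq


def check_logical_types(parquet_path, json_path):
    table = pq.read_table(parquet_path)
    expected = json.loads(Path(json_path).read_text())
    for row, ref in zip(table.to_pylist(), expected):
        # pyarrow already yields datetime.datetime and decimal.Decimal objects
        # for timestamp and decimal logical types.
        assert row["ts"] == datetime.fromisoformat(ref["ts"])
        assert row["price"] == Decimal(ref["price"])
```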


Antoine Pitrou / @pitrou:
Indeed, JSON sounds much better than CSV.

We probably want to test logical types as well, IMHO. Of course, implementations which don't support them will have to skip those tests, which means that optional features such as logical types should probably use separate reference files.
