Allow importing parquet fields containing repeated elements. #8749

Merged
merged 3 commits into from
Jan 16, 2025

nicktobey
Contributor

NOTE: This still needs tests. I'm looking for a good tool for generating parquet. We can't use dolt table export to generate the parquet because we can't generate composite types that way.

This PR adds support for importing specific composite parquet types into Dolt. Specifically, we're now able to import a composite parquet field if:

  • There is exactly one leaf column in the field.
  • There is at most one repeated tag in the field.

We flatten these composite values into a single primitive value (if there are no repeated tags) or an array of primitive values (if there's exactly one repeated tag).
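
The two conditions and the flattening rule above can be sketched as a small schema walk. This is an illustrative model only, not the code in this PR: the `Field` class, the `classify` function, and the repetition strings are hypothetical names chosen for the example.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Field:
    """Hypothetical stand-in for a parquet schema node (not Dolt's types)."""
    name: str
    repetition: str  # "required", "optional", or "repeated"
    children: List["Field"] = field(default_factory=list)


def count_leaves_and_repeats(node: Field) -> Tuple[int, int]:
    """Return (leaf columns, repeated tags) in the subtree rooted at node."""
    repeats = 1 if node.repetition == "repeated" else 0
    if not node.children:
        return 1, repeats
    leaves = 0
    for child in node.children:
        child_leaves, child_repeats = count_leaves_and_repeats(child)
        leaves += child_leaves
        repeats += child_repeats
    return leaves, repeats


def classify(node: Field) -> str:
    """Decide how a composite field would be flattened, or reject it."""
    leaves, repeats = count_leaves_and_repeats(node)
    if leaves != 1:
        raise ValueError(f"{node.name}: expected exactly one leaf column, got {leaves}")
    if repeats > 1:
        raise ValueError(f"{node.name}: expected at most one repeated tag, got {repeats}")
    return "array" if repeats == 1 else "scalar"


# An optional group wrapping one repeated group wrapping one optional leaf.
embedding = Field("embedding", "optional", [
    Field("list", "repeated", [Field("element", "optional")]),
])
print(classify(embedding))  # array
```

A field with two repeated tags (a multidimensional array) or more than one leaf (an object) raises `ValueError` in this sketch, mirroring the cases the PR leaves for later work.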

There's more work to be done here (multidimensional arrays, objects, etc.), but this allows us to import vector embeddings stored in parquet files.

Why do we flatten the type?

We want to be able to import parquet files from HuggingFace, and store embedding sequences as arrays. Embedding sequences in HuggingFace exports are an optional field containing a single repeated child field, which itself contains a single optional field containing the sequence element. Flattening this into a single array is more usable and doesn't lose any data.
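
As a rough illustration of why flattening loses nothing, here is a sketch that collapses one record of that three-level shape into a plain list. The nested-dict representation and the key names `list` and `element` are assumptions made for the example (writers differ on these names), and `flatten_embedding` is a hypothetical helper, not a function from this PR.

```python
from typing import Any, Dict, List, Optional


def flatten_embedding(
    value: Optional[Dict[str, List[Dict[str, Any]]]],
    list_key: str = "list",        # assumed name of the repeated child
    element_key: str = "element",  # assumed name of the optional leaf
) -> Optional[List[Any]]:
    """Collapse {list: [{element: v}, ...]} into [v, ...]."""
    if value is None:
        # The outer field is optional: a missing sequence stays missing.
        return None
    # One entry per repetition; a null element is preserved as None.
    return [entry[element_key] for entry in value[list_key]]


record = {"list": [{"element": 0.1}, {"element": 0.2}, {"element": None}]}
print(flatten_embedding(record))  # [0.1, 0.2, None]
```

A missing sequence, an empty sequence, and a null element each remain distinguishable after flattening in this sketch (`None`, `[]`, and a `None` entry respectively).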

@coffeegoddd
Contributor

@nicktobey DOLT

comparing_percentages
100.000000 to 100.000000
version result total
75dd73f ok 5937457
version total_tests
75dd73f 5937457
correctness_percentage
100.0

@coffeegoddd
Contributor

@coffeegoddd DOLT

comparing_percentages
100.000000 to 100.000000
version result total
d1eba8f ok 5937457
version total_tests
d1eba8f 5937457
correctness_percentage
100.0

Contributor

@jennifersp jennifersp left a comment

LGTM!

@coffeegoddd
Contributor

@nicktobey DOLT

comparing_percentages
100.000000 to 100.000000
version result total
63eddeb ok 5937457
version total_tests
63eddeb 5937457
correctness_percentage
100.0

@nicktobey nicktobey merged commit c5a7a11 into main Jan 16, 2025
21 checks passed
@nicktobey nicktobey deleted the nicktobey/parquet branch January 16, 2025 00:52
@coffeegoddd
Contributor

@coffeegoddd DOLT

comparing_percentages
100.000000 to 100.000000
version result total
09d42a4 ok 5937457
version total_tests
09d42a4 5937457
correctness_percentage
100.0

3 participants