File skipping according to documentation #2427

t1g0rz · 2024-04-20T17:01:21Z

Environment

Delta-rs version: 0.16.4

Binding: python

Description

Here documentation says:

Ensure the transaction log stores metadata stats for all the columns that benefit from file skipping.

And here guides us to explicitly specify columns for file skipping:

It takes some time to compute column statistics when writing files, and it isn’t worth the effort if you cannot use the column for file skipping.
Suppose you have a table column containing a long string of arbitrary text. It’s unlikely that this column would ever provide any data-skipping benefits. So, you can just avoid the overhead of collecting the statistics for this particular column.

I couldn't find a way to exclude some columns from file skipping purposes. In my case, I have quite wide tables (> 200 columns), and out of these, 195 will never be used for file skipping.
An unintended consequence of including all these columns in statistics calculation is the explosive growth of the delta log size because it writes very long strings of min/max/nulls. This indirectly creates all conditions for issue #2301 and this from delta slack.

I found the delta.dataSkippingNumIndexedCols setting here and I wonder if it's possible to explicitly specify column names. Or should I reorder the table to have skipping columns first and then set delta.dataSkippingNumIndexedCols?

The text was updated successfully, but these errors were encountered:

ion-elgreco · 2024-04-20T17:15:49Z

@t1g0rz I'm not sure if we are properly respecting these configurations yet.

To select only columns this should be the one: delta.dataSkippingStatsColumns.

But again I'm not sure if we use this yet correctly, I'll take a deeper look into this tomorrow

ion-elgreco · 2024-04-21T09:56:38Z

@t1g0rz I checked, but we don't respect it yet 😄

Buttt I am working on a fix 🕺

…2428) # Description All of the Rust and Python write actions will now properly adhere to the configuration regarding the amount of columns stats have to be collected for. Either by dataSkippingNumIndexedCols or dataSkippingStatsColumns. # Related Issue(s) - closes #2427 --------- Co-authored-by: R. Tyler Croy <[email protected]>

t1g0rz added the bug Something isn't working label Apr 20, 2024

ion-elgreco mentioned this issue Apr 21, 2024

feat(python, rust): respect column stats collection configurations #2428

Merged

ion-elgreco closed this as completed in #2428 May 11, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

File skipping according to documentation #2427

File skipping according to documentation #2427

t1g0rz commented Apr 20, 2024

ion-elgreco commented Apr 20, 2024

ion-elgreco commented Apr 21, 2024

File skipping according to documentation #2427

File skipping according to documentation #2427

Comments

t1g0rz commented Apr 20, 2024

Environment

Description

ion-elgreco commented Apr 20, 2024

ion-elgreco commented Apr 21, 2024