The documentation says: "Ensure the transaction log stores metadata stats for all the columns that benefit from file skipping."
And here it guides us to explicitly specify columns for file skipping:
"It takes some time to compute column statistics when writing files, and it isn’t worth the effort if you cannot use the column for file skipping. Suppose you have a table column containing a long string of arbitrary text. It’s unlikely that this column would ever provide any data-skipping benefits. So, you can just avoid the overhead of collecting the statistics for this particular column."
I couldn't find a way to exclude columns from statistics collection. In my case, I have quite wide tables (> 200 columns), and 195 of those columns will never be used for file skipping.
An unintended consequence of including all these columns in the statistics calculation is explosive growth of the Delta log, because every file entry carries very long min/max/null-count strings. This indirectly creates the conditions for issue #2301 and for the one reported on the Delta Slack.
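To give a feel for the log growth described above, here is a rough sketch that builds a per-file statistics payload shaped like the `stats` JSON in a Delta log entry (`numRecords`, `minValues`, `maxValues`, `nullCount`) and compares its size for 200 columns against 5. The column names, the 32-character string values, and the helper names are made up for illustration; real sizes depend on the actual data types and values.

```python
import json


def build_stats(columns, num_records=1000):
    """Build a stats payload shaped like the per-file `stats` JSON in a Delta log."""
    return {
        "numRecords": num_records,
        # Hypothetical 32-character string values stand in for real min/max values.
        "minValues": {c: "a" * 32 for c in columns},
        "maxValues": {c: "z" * 32 for c in columns},
        "nullCount": {c: 0 for c in columns},
    }


def stats_size_bytes(columns):
    """Size in bytes of the serialized stats for one data file."""
    return len(json.dumps(build_stats(columns)).encode("utf-8"))


all_columns = [f"col_{i}" for i in range(200)]  # hypothetical wide table
skipping_columns = all_columns[:5]              # the few columns used in filters

wide = stats_size_bytes(all_columns)
narrow = stats_size_bytes(skipping_columns)
print(f"stats for 200 columns: {wide} bytes per file")
print(f"stats for 5 columns:   {narrow} bytes per file")
```

Since this payload is written once per data file, the difference multiplies with the number of files added in each commit.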
I found the delta.dataSkippingNumIndexedCols setting and I wonder whether it is possible to specify column names explicitly instead. Or should I reorder the table so the skipping columns come first and then set delta.dataSkippingNumIndexedCols to that count?
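The reordering workaround asked about above could look roughly like this for a flat schema. This is only a sketch: `reorder_for_skipping` and the column names are hypothetical, and it assumes the writer collects statistics for the first N top-level columns when delta.dataSkippingNumIndexedCols is N.

```python
def reorder_for_skipping(columns, skipping_columns):
    """Move the skipping columns to the front and return (new_order, count)."""
    missing = [c for c in skipping_columns if c not in columns]
    if missing:
        raise ValueError(f"unknown columns: {missing}")
    # dict.fromkeys drops duplicates while keeping the requested order.
    first = list(dict.fromkeys(skipping_columns))
    rest = [c for c in columns if c not in set(first)]
    return first + rest, len(first)


# Hypothetical table: two filter columns buried among free-text columns.
columns = ["payload", "event_date", "notes", "customer_id"]
ordered, num_indexed = reorder_for_skipping(columns, ["event_date", "customer_id"])

# Table property to set once the data is written in the new column order.
configuration = {"delta.dataSkippingNumIndexedCols": str(num_indexed)}
print(ordered)
print(configuration)
```

The drawback of this approach is that the physical column order becomes part of the tuning, so adding a new skipping column later means reordering the table again.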
…2428)
# Description
All of the Rust and Python write actions will now properly adhere to the
configuration that controls which columns stats are collected for,
set through either dataSkippingNumIndexedCols or dataSkippingStatsColumns.
# Related Issue(s)
- closes #2427
---------
Co-authored-by: R. Tyler Croy <[email protected]>
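How the two properties named in the commit description might interact can be sketched as follows. This is a simplified model, not the delta-rs implementation: `resolve_stats_columns` is a hypothetical helper, and the rules it encodes (an explicit column list takes precedence, the count defaults to 32, and -1 means all columns) are assumptions based on the Delta Lake table-property documentation. Nested columns and quoted column names are ignored.

```python
DEFAULT_NUM_INDEXED_COLS = 32  # assumed default, per Delta Lake table-property docs


def resolve_stats_columns(schema_columns, configuration):
    """Return the columns a writer would collect statistics for (simplified model)."""
    explicit = configuration.get("delta.dataSkippingStatsColumns")
    if explicit:
        # Assumption: an explicit comma-separated list wins over the count.
        wanted = {name.strip() for name in explicit.split(",") if name.strip()}
        return [c for c in schema_columns if c in wanted]

    n = int(configuration.get("delta.dataSkippingNumIndexedCols",
                              DEFAULT_NUM_INDEXED_COLS))
    if n < 0:
        # Assumption: -1 means statistics for every column.
        return list(schema_columns)
    return list(schema_columns[:n])


schema = [f"c{i}" for i in range(40)]  # hypothetical flat schema
print(resolve_stats_columns(schema, {"delta.dataSkippingStatsColumns": "c5,c1"}))
print(len(resolve_stats_columns(schema, {})))
```

With a setup like the one in the issue, naming the handful of filter columns explicitly avoids having to reorder the table.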
Environment
Delta-rs version: 0.16.4
Binding: python