Delta Lake: unhandled column statistics for one column break all column statistics #24487

Open
Pluies opened this issue Dec 16, 2024 · 0 comments · May be fixed by #24489
Labels
delta-lake Delta Lake connector

Pluies commented Dec 16, 2024

Delta Lake entries contain column statistics that can be used by the Trino optimiser to speed up queries.
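For context, each `add` entry in the Delta transaction log carries a per-file `stats` JSON blob with record counts and per-column min/max/null-count values; the column names and values below are purely illustrative, not taken from a real table:

```json
{
  "numRecords": 1000,
  "minValues": { "id": 1, "name": "aardvark" },
  "maxValues": { "id": 1000, "name": "zebra" },
  "nullCount": { "id": 0, "name": 3 }
}
```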

Only a subset of data types is supported in stats:

try {
    if (type.equals(BOOLEAN)) {
        if (valueString.equalsIgnoreCase("true")) {
            return true;
        }
        if (valueString.equalsIgnoreCase("false")) {
            return false;
        }
    }
    if (type.equals(INTEGER)) {
        return (long) parseInt(valueString);
    }
    if (type.equals(SMALLINT)) {
        return (long) parseInt(valueString);
    }
    if (type.equals(TINYINT)) {
        return (long) parseInt(valueString);
    }
    if (type.equals(BIGINT)) {
        return parseLong(valueString);
    }
    if (type instanceof DecimalType decimalType) {
        return parseDecimal(decimalType, valueString);
    }
    if (type.equals(REAL)) {
        return (long) floatToRawIntBits(parseFloat(valueString));
    }
    if (type.equals(DOUBLE)) {
        return parseDouble(valueString);
    }
    if (type.equals(DATE)) {
        // date values are represented as yyyy-MM-dd
        return LocalDate.parse(valueString).toEpochDay();
    }
    if (type.equals(TIMESTAMP_MICROS)) {
        return timestampReader.apply(valueString);
    }
    if (type.equals(TIMESTAMP_TZ_MILLIS)) {
        return timestampWithZoneReader.apply(valueString);
    }
    if (VARCHAR.equals(type)) {
        return utf8Slice(valueString);
    }
}
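As a minimal, self-contained illustration of this structure (not the actual Trino code; plain strings stand in for Trino's `Type` objects, and only a handful of types are shown), a parser written this way handles the known types and throws for anything else:

```java
import java.time.LocalDate;

// Simplified stand-in for TransactionLogParser.deserializeColumnValue:
// known types are parsed, anything else falls through to an exception.
public class StatsParseDemo {
    static Object deserialize(String type, String valueString) {
        switch (type) {
            case "boolean": return Boolean.parseBoolean(valueString);
            case "bigint":  return Long.parseLong(valueString);
            case "double":  return Double.parseDouble(valueString);
            case "date":    return LocalDate.parse(valueString).toEpochDay();
            case "varchar": return valueString;
            default:
                // mirrors the TrinoException thrown for unsupported types
                throw new RuntimeException(String.format(
                        "Unable to parse value [%s] from column with type %s",
                        valueString, type));
        }
    }

    public static void main(String[] args) {
        System.out.println(deserialize("bigint", "42"));       // 42
        System.out.println(deserialize("date", "2024-12-16")); // epoch day
        try {
            deserialize("varbinary", "abc");                   // unhandled type
        } catch (RuntimeException e) {
            System.out.println("threw: " + e.getMessage());
        }
    }
}
```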

When a stat's type does not match one of these, we throw an exception:

// Anything else is not a supported DeltaLake column
throw new TrinoException(
        GENERIC_INTERNAL_ERROR,
        format("Unable to parse value [%s] from column %s with type %s", valueString, column.baseColumnName(), column.baseType()));

Which gives this stack trace in practice:

2024-12-16T14:54:24.199Z	ERROR	Query-20241216_145422_23066_5cgep-43942	io.trino.cost.CachingStatsProvider	Error occurred when computing stats for query 20241216_145422_23066_5cgep
io.trino.spi.TrinoException: Unable to parse value [\x00\x03\x01\x0f\x932\xd3\xf5\xe8\x11\x9cY\xd0\xd6\x92v\xa7c,\xd8] from column hash with type varbinary
	at io.trino.plugin.deltalake.transactionlog.TransactionLogParser.deserializeColumnValue(TransactionLogParser.java:246)
	at io.trino.plugin.deltalake.transactionlog.statistics.DeltaLakeJsonFileStatistics.deserializeStatisticsValue(DeltaLakeJsonFileStatistics.java:144)
	at io.trino.plugin.deltalake.transactionlog.statistics.DeltaLakeJsonFileStatistics.lambda$getMinColumnValue$4(DeltaLakeJsonFileStatistics.java:136)
	at java.base/java.util.Optional.flatMap(Optional.java:289)
	at io.trino.plugin.deltalake.transactionlog.statistics.DeltaLakeJsonFileStatistics.getMinColumnValue(DeltaLakeJsonFileStatistics.java:136)
	at io.trino.plugin.deltalake.statistics.FileBasedTableStatisticsProvider.getTableStatistics(FileBasedTableStatisticsProvider.java:174)
	at io.trino.plugin.deltalake.DeltaLakeMetadata.getTableStatistics(DeltaLakeMetadata.java:899)
	at io.trino.plugin.base.classloader.ClassLoaderSafeConnectorMetadata.getTableStatistics(ClassLoaderSafeConnectorMetadata.java:354)
	at com.dune.trino.metastore.AbstractDelegatingConnectorMetadata.getTableStatistics(AbstractDelegatingConnectorMetadata.kt:262)
	at io.trino.tracing.TracingConnectorMetadata.getTableStatistics(TracingConnectorMetadata.java:331)
	at io.trino.metadata.MetadataManager.getTableStatistics(MetadataManager.java:477)
	at io.trino.tracing.TracingMetadata.getTableStatistics(TracingMetadata.java:311)
	at io.trino.cost.CachingTableStatsProvider.getTableStatistics(CachingTableStatsProvider.java:46)
	at io.trino.cost.TableScanStatsRule.doCalculate(TableScanStatsRule.java:60)
	at io.trino.cost.TableScanStatsRule.doCalculate(TableScanStatsRule.java:36)
	at io.trino.cost.SimpleStatsRule.calculate(SimpleStatsRule.java:37)
	at io.trino.cost.ComposableStatsCalculator.calculateStats(ComposableStatsCalculator.java:83)
	at io.trino.cost.ComposableStatsCalculator.calculateStats(ComposableStatsCalculator.java:71)
	at io.trino.cost.CachingStatsProvider.getStats(CachingStatsProvider.java:87)
	at io.trino.sql.planner.optimizations.DeterminePartitionCount.lambda$getPartitionCountBasedOnRows$3(DeterminePartitionCount.java:227)
	at java.base/java.util.stream.ReferencePipeline$6$1.accept(ReferencePipeline.java:263)
	at java.base/java.util.Collections$2.tryAdvance(Collections.java:5074)
	at java.base/java.util.Collections$2.forEachRemaining(Collections.java:5082)
	at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:556)
	at java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:546)
	at java.base/java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:921)
	at java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:265)
	at java.base/java.util.stream.DoublePipeline.collect(DoublePipeline.java:541)
	at java.base/java.util.stream.DoublePipeline.sum(DoublePipeline.java:450)
	at io.trino.sql.planner.optimizations.DeterminePartitionCount.getSourceNodesOutputStats(DeterminePartitionCount.java:299)
	at io.trino.sql.planner.optimizations.DeterminePartitionCount.getPartitionCountBasedOnRows(DeterminePartitionCount.java:227)
	at io.trino.sql.planner.optimizations.DeterminePartitionCount.determinePartitionCount(DeterminePartitionCount.java:174)
	at io.trino.sql.planner.optimizations.DeterminePartitionCount.optimize(DeterminePartitionCount.java:120)
	at io.trino.sql.planner.optimizations.StatsRecordingPlanOptimizer.optimize(StatsRecordingPlanOptimizer.java:41)
	at io.trino.sql.planner.LogicalPlanner.runOptimizer(LogicalPlanner.java:303)
	at io.trino.sql.planner.LogicalPlanner.plan(LogicalPlanner.java:266)
	at io.trino.sql.planner.LogicalPlanner.plan(LogicalPlanner.java:238)
	at io.trino.sql.planner.LogicalPlanner.plan(LogicalPlanner.java:233)
	at io.trino.execution.SqlQueryExecution.doPlanQuery(SqlQueryExecution.java:498)
	at io.trino.execution.SqlQueryExecution.planQuery(SqlQueryExecution.java:478)
	at io.trino.execution.SqlQueryExecution.start(SqlQueryExecution.java:416)
	at io.trino.execution.SqlQueryManager.createQuery(SqlQueryManager.java:272)
	at io.trino.dispatcher.LocalDispatchQuery.startExecution(LocalDispatchQuery.java:150)
	at io.trino.dispatcher.LocalDispatchQuery.lambda$waitForMinimumWorkers$2(LocalDispatchQuery.java:134)
	at io.airlift.concurrent.MoreFutures.lambda$addSuccessCallback$12(MoreFutures.java:570)
	at io.airlift.concurrent.MoreFutures$3.onSuccess(MoreFutures.java:545)
	at com.google.common.util.concurrent.Futures$CallbackListener.run(Futures.java:1137)
	at io.trino.$gen.Trino_435_3370_ga3323af____20241216_070309_2.run(Unknown Source)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
	at java.base/java.lang.Thread.run(Thread.java:1570)

When this happens, all stats for the query are ignored:

catch (RuntimeException e) {
    if (isIgnoreStatsCalculatorFailures(session)) {
        log.error(e, "Error occurred when computing stats for query %s", session.getQueryId());
        return PlanNodeStatsEstimate.unknown();
    }
    throw e;
}

This makes it impossible for Trino to optimise based on stats at all, even when stats for other columns are available.

We've noticed that some implementations (namely delta_rs) write stats for more types than Trino supports; for example, the stack trace above was triggered by a stat on a VARBINARY column.

Until Trino adds official support for using these stats, could we consider stats unavailable for those specific columns, rather than blowing up and discarding stats entirely?
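A minimal sketch of that proposed behavior (not the actual patch in the linked PR; again, plain strings stand in for Trino's `Type` objects): have deserialization return an `Optional` and map unsupported types to `Optional.empty()`, so stats for the remaining columns survive:

```java
import java.util.Optional;

// Sketch: treat stats for unsupported types as absent instead of throwing,
// so one exotic column no longer discards stats for the whole table.
public class LenientStatsDemo {
    static Optional<Object> tryDeserialize(String type, String valueString) {
        switch (type) {
            case "bigint":  return Optional.of(Long.parseLong(valueString));
            case "varchar": return Optional.of(valueString);
            default:
                // e.g. a varbinary stat written by delta_rs: no stats for this column
                return Optional.empty();
        }
    }

    public static void main(String[] args) {
        System.out.println(tryDeserialize("bigint", "42"));     // Optional[42]
        System.out.println(tryDeserialize("varbinary", "xyz")); // Optional.empty
    }
}
```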

@Pluies Pluies added the delta-lake Delta Lake connector label Dec 16, 2024
@Pluies Pluies linked a pull request Dec 16, 2024 that will close this issue