Exclude reading _pos column if it's not in the scan list #11390
base: main
Conversation
also cc @flyrain

@huaxingao: I'm not an expert in the Spark codebase, but I think having a test which fails before the change and succeeds after the change would be nice. Otherwise we risk future PRs changing this behaviour without the reviewers noticing it.
@huaxingao it's a good find. I'm just wondering, where do we add _pos to the schema? Can we just not do it there? Just curious if it's possible.
I think it might be from here: `iceberg/data/src/main/java/org/apache/iceberg/data/DeleteFilter.java`, lines 260 to 263 at commit 684f7a7.
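For readers without the repo handy, those lines roughly amount to the following (a hedged paraphrase wrapped in a standalone sketch, not the exact source):

```java
import java.util.List;
import java.util.Set;
import org.apache.iceberg.DeleteFile;
import org.apache.iceberg.MetadataColumns;

class FileProjectionSketch {
  // Paraphrase of the referenced DeleteFilter logic: when the task carries position
  // deletes, the _pos metadata column id is forced into the required column ids,
  // which is how _pos later shows up in the read schema.
  static void addRequiredPosColumn(List<DeleteFile> posDeletes, Set<Integer> requiredIds) {
    if (!posDeletes.isEmpty()) {
      requiredIds.add(MetadataColumns.ROW_POSITION.fieldId());
    }
  }
}
```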
@szehon-ho I think we still need the …
@pvary Thank you for your suggestion! You're correct that adding such a test would help prevent future changes from inadvertently affecting this behavior without notice. Currently, Spark doesn't check the schema when processing batch data, which is why an extra Arrow vector in the `ColumnarBatch` goes unnoticed.
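One hypothetical shape for such a regression test (the batch-grabbing helper is invented for illustration; nothing below is from the PR):

```java
// Illustrative only: after this PR, a vectorized read of `SELECT id FROM t` over a
// table with position deletes should yield batches with exactly one column vector.
org.apache.spark.sql.vectorized.ColumnarBatch batch = readFirstBatch(table); // hypothetical helper
org.junit.jupiter.api.Assertions.assertEquals(1, batch.numCols()); // only id, no _pos vector
```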
    // vectorization reader. Before removing _pos, we need to make sure _pos is not explicitly
    // selected in the query.
    if (deleteFilter != null) {
      if (deleteFilter.hasPosDeletes() && expectedSchema().findType("_pos") == null) {
Do we have a const for `_pos`?
Yes, we do. Changed to `MetadataColumns.ROW_POSITION.name()`.
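Concretely, the check from the quoted diff would then read roughly as follows (context names as in the diff above, so this is a fragment, not standalone code; it assumes `MetadataColumns` is imported):

```java
if (deleteFilter != null
    && deleteFilter.hasPosDeletes()
    && expectedSchema().findType(MetadataColumns.ROW_POSITION.name()) == null) {
  // _pos is not explicitly selected, so it can be dropped from the batch read schema
}
```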
    return Parquet.read(inputFile)
        .project(requiredSchema)
        .split(start, length)
        .createBatchedReaderFunc(
            fileSchema ->
                VectorizedSparkParquetReaders.buildReader(
    -               requiredSchema, fileSchema, idToConstant, deleteFilter))
    +               vectorizationSchema, fileSchema, idToConstant, deleteFilter))
Is it just `expectedSchema`?
Not exactly. If there are no deletes, it's `expectedSchema`. If it's an equality delete, it's `deleteFilter.requiredSchema()`, because it could be `expectedSchema` + the equality filter column. For example, for `SELECT id FROM table`, suppose the equality delete has `data == 'aaa'`; then we do need to read the `data` column too, so it's `deleteFilter.requiredSchema()`, which is `id` + `data`.
In the case of pos delete, I can't use `expectedSchema` either, because it could be both pos delete and equality delete.
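Putting the two answers together, a minimal sketch of the selection logic (the `DeleteFilterLike` interface and the `selectNot` trimming step are assumptions for illustration, not the PR's exact code):

```java
import java.util.Set;
import org.apache.iceberg.MetadataColumns;
import org.apache.iceberg.Schema;
import org.apache.iceberg.types.TypeUtil;

class VectorizationSchemaSketch {
  // Stand-in for the relevant bits of SparkDeleteFilter.
  interface DeleteFilterLike {
    Schema requiredSchema();
    boolean hasPosDeletes();
  }

  // Choose the schema used to build the vectorized reader, per the discussion above.
  static Schema vectorizationSchema(Schema expectedSchema, DeleteFilterLike filter) {
    if (filter == null) {
      return expectedSchema; // no deletes: read exactly the requested columns
    }
    Schema required = filter.requiredSchema(); // may include equality-filter columns
    if (filter.hasPosDeletes()
        && expectedSchema.findType(MetadataColumns.ROW_POSITION.name()) == null) {
      // _pos was added only for delete bookkeeping; drop it from the read schema
      return TypeUtil.selectNot(required, Set.of(MetadataColumns.ROW_POSITION.fieldId()));
    }
    return required;
  }
}
```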
Sorry, I still wanted to see if it can be done earlier. What do you think about https://github.com/apache/iceberg/blob/main/spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/source/BatchDataReader.java#L99? This is for the vectorized path, right? Can we pass in a flag to SparkDeleteFilter to not add the column? It seems a bit wasteful to remove the column after adding it, so I just wanted to explore it. Also, the current fix is specific to Parquet; how about ORC?
@szehon-ho Thanks for the comment. We actually also use the `requiredSchema`, that's the schema with the `_pos` column. We can pass in a flag to `SparkDeleteFilter` to not add the column. ORC uses `expectedSchema()`, the schema without the `_pos` column, to build vectorized readers.
Hm, then can we just add _pos to requiredSchema (with a comment) at https://github.com/apache/iceberg/blob/main/spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/source/BaseBatchReader.java#L83? To me, adding it earlier and then removing it later is worse for code understanding than just adding it when needed. Probably cleaner with a flag to ReadConf to trigger the generateOffsetToStartPos, but not sure if it's feasible. fyi @aokolnychyi
Thanks @huaxingao for working on it. Sorry for the delay. Left some comments.
    SparkDeleteFilter sparkDeleteFilter =
        new SparkDeleteFilter(filePath, task.deletes(), counter(), true);

    SparkDeleteFilter deleteFilter = task.deletes().isEmpty() ? null : sparkDeleteFilter;
We don't need a delete filter if `task.deletes().isEmpty()`, but the new code always creates a filter object. How about this?

    SparkDeleteFilter deleteFilter = null;
    if (!task.deletes().isEmpty()) {
      deleteFilter = new SparkDeleteFilter(filePath, task.deletes(), counter(), true);
    }
    @@ -69,7 +69,8 @@ protected DeleteFilter(
         List<DeleteFile> deletes,
         Schema tableSchema,
         Schema requestedSchema,
    -    DeleteCounter counter) {
    +    DeleteCounter counter,
    +    boolean isBatchReading) {
Does batch reading necessarily mean no `_pos`? I believe not, as you can explicitly project it, like `select _pos from t`. We should give it an accurate name, something like `needRowPosition` or `needRowPosCol`.
Thinking about it a bit more, I'm considering whether only the non-vectorized readers actually need an implicit `_pos` column. If that's the case, would it make more sense to adjust this within RowReader by adding `_pos` there? This approach could simplify things by eliminating the need to check whether a reader is vectorized, especially since vectorization isn't necessarily strongly correlated with the requirement for `_pos`.

Here is pseudo code to add it in class RowDataReader:

    LOG.debug("Opening data file {}", filePath);
    expectedSchema().add("_pos"); // <-- add it here
    SparkDeleteFilter deleteFilter =
        new SparkDeleteFilter(filePath, task.deletes(), counter(), false);
    @@ -93,7 +94,8 @@ protected DeleteFilter(
         this.posDeletes = posDeleteBuilder.build();
         this.eqDeletes = eqDeleteBuilder.build();
    -    this.requiredSchema = fileProjection(tableSchema, requestedSchema, posDeletes, eqDeletes);
    +    this.requiredSchema =
    +        fileProjection(tableSchema, requestedSchema, posDeletes, eqDeletes, isBatchReading);
One question I asked myself is whether it impacts the metadata column read. It seems not, but the method `DeleteFilter::fileProjection` seems a bit hard to read; we can refactor it later. It makes more sense to make it a static utility method instead of an instance method. Plus, it's a bit weird to pass the schema to the delete filter and then get it back from the filter. This seems like something we can improve as a follow-up.
    -  private Map<Long, Long> generateOffsetToStartPos(Schema schema) {
    -    if (schema.findField(MetadataColumns.ROW_POSITION.fieldId()) == null) {
    +  private Map<Long, Long> generateOffsetToStartPos() {
    +    if (hasPositionDelete) {
I'd recommend doing it in the caller, like this:

    Map<Long, Long> offsetToStartPos = hasPositionDelete ? generateOffsetToStartPos() : null;
    @@ -98,6 +98,7 @@ private CloseableIterable<InternalRow> newParquetIterable(
         .filter(residual)
         .caseSensitive(caseSensitive())
         .withNameMapping(nameMapping())
    +    .hasPositionDelete(readSchema.findField(MetadataColumns.ROW_POSITION.fieldId()) == null)
Do we need it for row reader? Can we use a default value here?
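If not, one option is a builder default, sketched below under the assumption that the flag lives on the Parquet read builder (the default value and names are assumptions, not the PR's code):

```java
// Hypothetical builder state: default to false so the non-vectorized row reader
// never needs to set the flag explicitly.
private boolean hasPositionDelete = false;

public ReadBuilder hasPositionDelete(boolean hasPositionDelete) {
  this.hasPositionDelete = hasPositionDelete;
  return this;
}
```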
    @@ -97,6 +98,7 @@ private CloseableIterable<ColumnarBatch> newParquetIterable(
         // read performance as every batch read doesn't have to pay the cost of allocating memory.
         .reuseContainers()
         .withNameMapping(nameMapping())
    +    .hasPositionDelete(hasPositionDelete)
Wondering if `withPositionDelete` is more suitable.
The key here is determining whether we want to compute row offsets specifically for the filtered row groups. This decision doesn't have to be directly tied to the presence of position deletes. Perhaps a name like `needRowGroupOffset` would better capture this intent and improve clarity.
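With that rename, the builder call above might read something like this (names hypothetical):

```java
.needRowGroupOffset(deleteFilter != null && deleteFilter.hasPosDeletes())
```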
Thanks a lot @flyrain and @szehon-ho for your review! I've thought this over: I feel the original change is much simpler. If the original one looks OK to you, I will revert to it. Thanks a lot!
In Spark batch reading, Iceberg reads additional columns when there are delete files. For instance, if we have a table `test (int id, string data)` and a query `SELECT id FROM test`, the requested schema only contains the column `id`. However, to determine which rows are deleted (there is a `rowIdMapping` for this purpose), Iceberg appends `_pos` to the requested schema for position deletes, and appends the equality filter column for equality deletes (suppose the equality delete is on column `data`). As a result, Iceberg will have `ColumnarBatchReader`s for these extra columns. In the case of position deletes, we actually don't need to read `_pos` to compute the `rowIdMapping`, so this PR excludes the `_pos` column when building the `ColumnarBatchReader`. For equality deletes, while we need to read the equality filter column to compute the `rowIdMapping`, once we have the `rowIdMapping`, we should exclude the values of these extra columns from the `ColumnarBatch`. I will have a separate PR to fix equality deletes.

In summary:

For position deletes, the vectorized reader currently returns a `ColumnarBatch` that contains Arrow vectors for both `id` and `_pos`. This PR makes Iceberg not read the `_pos` column, so the returned `ColumnarBatch` contains an Arrow vector for `id` only.

For equality deletes (suppose the filter is on the `data` column), the vectorized reader currently returns a `ColumnarBatch` that contains Arrow vectors for both `id` and `data`. The goal is to return a `ColumnarBatch` that contains an Arrow vector for `id` only. I will have a separate PR for this.
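To see why `_pos` never needs to be materialized for position deletes, note that row positions are arithmetic: once the reader knows the absolute position of the first row in a batch (the row-group start positions from `generateOffsetToStartPos` give exactly that), every row's position follows by counting. A minimal, self-contained sketch of that idea (all names below are illustrative, not Iceberg's internals):

```java
import java.util.Arrays;
import java.util.Set;

class RowIdMappingSketch {
  // Compute which rows of a batch survive position deletes without reading a _pos
  // vector: each row's absolute position is derived from the batch's starting position.
  static int[] rowIdMapping(long batchStartPos, int numRows, Set<Long> deletedPositions) {
    int[] mapping = new int[numRows];
    int live = 0;
    for (int i = 0; i < numRows; i++) {
      if (!deletedPositions.contains(batchStartPos + i)) {
        mapping[live++] = i; // keep this row
      }
    }
    return Arrays.copyOf(mapping, live); // indices of surviving rows
  }
}
```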