Exclude reading _pos column if it's not in the scan list #11390

Open · wants to merge 6 commits into main

Conversation

huaxingao
Contributor

In Spark batch reading, Iceberg reads additional columns when there are delete files. For instance, given a table
test (int id, string data) and a query SELECT id FROM test, the requested schema contains only the column id. However, to determine which rows are deleted (there is a rowIdMapping for this purpose), Iceberg appends _pos to the requested schema for position deletes, and appends the equality filter column for equality deletes (suppose the equality delete is on column data). As a result, Iceberg builds ColumnarBatchReaders for these extra columns. In the case of position deletes, we don't actually need to read _pos to compute the rowIdMapping, so this PR excludes the _pos column when building the ColumnarBatchReader. For equality deletes, we do need to read the equality filter column to compute the rowIdMapping, but once we have the rowIdMapping, we should exclude the values of these extra columns from the ColumnarBatch. I will have a separate PR to fix equality deletes.

In summary:

SELECT id FROM test

For position deletes, the vectorized reader currently returns a ColumnarBatch that contains Arrow vectors for both id and _pos. This PR makes Iceberg skip reading the _pos column, so the returned ColumnarBatch contains an Arrow vector for id only.

For equality deletes (suppose the filter is on the data column), the vectorized reader currently returns a ColumnarBatch that contains Arrow vectors for both id and data. The goal is to return a ColumnarBatch that contains an Arrow vector for id only. I will have a separate PR for this. A minimal sketch of the schema widening follows.
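
To make the schema widening concrete, here is a minimal, self-contained sketch of the behavior described above. Everything in it is illustrative: the class name, the string-based column lists, and the flags are assumptions; the real logic lives in DeleteFilter#fileProjection and operates on Iceberg Schema objects and field IDs.

import java.util.ArrayList;
import java.util.List;

// Illustrative model of how the read schema is widened when delete files exist.
public class FileProjectionSketch {
  static final String ROW_POSITION = "_pos"; // i.e. MetadataColumns.ROW_POSITION.name()

  static List<String> fileProjection(
      List<String> requestedColumns, boolean hasPosDeletes, List<String> eqDeleteColumns) {
    List<String> required = new ArrayList<>(requestedColumns);
    // Position deletes: _pos is appended so deleted rows can be matched by file position.
    if (hasPosDeletes && !required.contains(ROW_POSITION)) {
      required.add(ROW_POSITION);
    }
    // Equality deletes: the filter columns are appended so row values can be compared.
    for (String col : eqDeleteColumns) {
      if (!required.contains(col)) {
        required.add(col);
      }
    }
    return required;
  }

  public static void main(String[] args) {
    // SELECT id FROM test with a position delete and an equality delete on data:
    System.out.println(fileProjection(List.of("id"), true, List.of("data")));
    // prints [id, _pos, data]; this PR stops materializing _pos in the vectorized reader
  }
}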

github-actions bot added the spark label Oct 24, 2024
@huaxingao
Contributor Author

cc @szehon-ho @pvary @viirya

@huaxingao
Contributor Author

also cc @flyrain

@pvary
Contributor

pvary commented Oct 25, 2024

@huaxingao: I'm not an expert in the Spark codebase, but I think having a test which fails before the change and succeeds after the change would be nice. Otherwise we risk future PRs changing this behaviour without the reviewers noticing it.

@szehon-ho
Collaborator

@huaxingao it's a good find. I'm just wondering: where do we add _pos to the schema? Can we just not do it there? Just curious if it's possible.

@dramaticlly
Contributor

@huaxingao it's a good find. I'm just wondering: where do we add _pos to the schema? Can we just not do it there? Just curious if it's possible.

I think it might be from here

if (!posDeletes.isEmpty()) {
  requiredIds.add(MetadataColumns.ROW_POSITION.fieldId());
}

@huaxingao
Contributor Author

@szehon-ho I think we still need _pos in the requiredSchema to build the posAccessor.

@huaxingao
Contributor Author

@pvary Thank you for your suggestion! You're correct that adding such a test would help prevent future changes from inadvertently affecting this behavior without notice. Currently, Spark doesn't check the schema when processing batch data, which is why an extra Arrow vector in the ColumnarBatch doesn't cause an error. However, Comet allocates arrays in a pre-allocated list and relies on the requested schema to determine how many columns are in the batch, so it will fail if extra columns are returned. While we currently don't have a test that fails due to extra columns, the integration of Comet will change this: once Comet is integrated, the tests involving the Comet reader will fail if extra columns are present. I believe those Comet reader tests will serve as the tests you've suggested we add.
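
As a rough illustration of the failure mode described above (hypothetical code, not Comet's actual implementation): a reader that sizes its output array from the requested schema has nowhere to put an unexpected extra vector.

// Hypothetical sketch: output array sized from the requested schema cannot
// absorb the extra _pos vector that the current batch reader returns.
public class ExtraColumnSketch {
  public static void main(String[] args) {
    String[] requestedSchema = {"id"};                  // SELECT id FROM test
    String[] batchVectors = {"idVector", "_posVector"}; // batch with the extra _pos column

    Object[] preallocated = new Object[requestedSchema.length];
    for (int i = 0; i < batchVectors.length; i++) {
      preallocated[i] = batchVectors[i]; // throws ArrayIndexOutOfBoundsException at i == 1
    }
  }
}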

// vectorization reader. Before removing _pos, we need to make sure _pos is not explicitly
// selected in the query.
if (deleteFilter != null) {
if (deleteFilter.hasPosDeletes() && expectedSchema().findType("_pos") == null) {
Member

Do we have a const for _pos?

Contributor Author

Yes, we do. Changed to MetadataColumns.ROW_POSITION.name()
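
Presumably the revised check then reads along these lines (a sketch combining the excerpt above with the stated change; not copied verbatim from the PR):

// Sketch: the condition above with the string literal replaced by the constant
// (MetadataColumns.ROW_POSITION.name() resolves to "_pos").
boolean dropRowPosition =
    deleteFilter != null
        && deleteFilter.hasPosDeletes()
        && expectedSchema().findType(MetadataColumns.ROW_POSITION.name()) == null;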


  return Parquet.read(inputFile)
      .project(requiredSchema)
      .split(start, length)
      .createBatchedReaderFunc(
          fileSchema ->
              VectorizedSparkParquetReaders.buildReader(
-                 requiredSchema, fileSchema, idToConstant, deleteFilter))
+                 vectorizationSchema, fileSchema, idToConstant, deleteFilter))
Member

Is it just expectedSchema?

Contributor Author

Not exactly.
If there are no deletes, it's expectedSchema.
If it's an equality delete, it's deleteFilter.requiredSchema(), because it could be expectedSchema + the equality filter column. For example:

SELECT id FROM table

Suppose the equality delete has data == 'aaa'; then we do need to read the data column too, so it's deleteFilter.requiredSchema(), which is id + data.

Contributor Author

In the case of pos deletes, I can't use expectedSchema either, because a file could have both pos deletes and equality deletes.
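
Piecing these comments together, the schema handed to the vectorized reader is presumably chosen along these lines (an assumed sketch, not the PR's exact code; TypeUtil.selectNot is Iceberg's utility for removing fields by ID):

// Assumed sketch of deriving vectorizationSchema from the discussion above.
Schema vectorizationSchema = expectedSchema();
if (deleteFilter != null) {
  // requiredSchema() = expected columns + equality filter columns + _pos for pos deletes.
  vectorizationSchema = deleteFilter.requiredSchema();
  if (deleteFilter.hasPosDeletes()
      && expectedSchema().findType(MetadataColumns.ROW_POSITION.name()) == null) {
    // _pos was only needed to build the rowIdMapping; drop it from the read schema
    // unless the query explicitly selected it.
    vectorizationSchema =
        TypeUtil.selectNot(
            vectorizationSchema, Sets.newHashSet(MetadataColumns.ROW_POSITION.fieldId()));
  }
}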

@huaxingao changed the title from "Exclude reading pos_ column if it's not in the scan list" to "Exclude reading _pos column if it's not in the scan list" Oct 26, 2024
@szehon-ho
Collaborator

szehon-ho commented Nov 1, 2024

Sorry, I still wanted to see if it can be done earlier. What do you think about https://github.com/apache/iceberg/blob/main/spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/source/BatchDataReader.java#L99? This is for the vectorized path, right? Can we pass in a flag to SparkDeleteFilter to not add the column?

It seems a bit wasteful to remove the column after adding it, so just wanted to explore it.

Also, the current fix is specific to Parquet; what about ORC?

@huaxingao
Contributor Author

@szehon-ho Thanks for the comment.

We actually also use the requiredSchema (the schema with the _pos column): in ReadConf#generateOffsetToStartPos, we need to know whether pos deletes exist.

We could pass a flag to SparkDeleteFilter to not add the _pos column, but then I think we'd need another flag to carry the hasPosDelete info to the Parquet ReaderBuilder, and from there to ReadConf.

ORC uses expectedSchema(), the schema without the _pos column, to build its vectorized readers.
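
For what it's worth, a rough sketch of the flag plumbing being discussed, reusing the hasPositionDelete builder method that appears in the diffs later in this thread (assumed wiring, not the final code):

// Assumed wiring: compute the flag once in the batch reader and hand it to the
// Parquet builder, which forwards it to ReadConf, instead of re-deriving it
// from the presence of _pos in the projected schema.
boolean hasPositionDelete = deleteFilter != null && deleteFilter.hasPosDeletes();

return Parquet.read(inputFile)
    .project(vectorizationSchema)
    .split(start, length)
    .hasPositionDelete(hasPositionDelete) // new flag, forwarded to ReadConf
    .createBatchedReaderFunc(
        fileSchema ->
            VectorizedSparkParquetReaders.buildReader(
                vectorizationSchema, fileSchema, idToConstant, deleteFilter));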

@szehon-ho
Collaborator

szehon-ho commented Nov 1, 2024

Hm, then can we just add _pos to requiredSchema (with a comment) at https://github.com/apache/iceberg/blob/main/spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/source/BaseBatchReader.java#L83?

To me, adding it earlier and then removing it later is worse for code understanding than just adding it when needed.

Probably cleaner with a flag to ReadConf to trigger generateOffsetToStartPos, but not sure if it's feasible. FYI @aokolnychyi

Contributor

@flyrain flyrain left a comment

Thanks @huaxingao for working on it. Sorry for the delay. Left some comments.

Comment on lines 96 to 99
SparkDeleteFilter sparkDeleteFilter =
    new SparkDeleteFilter(filePath, task.deletes(), counter(), true);

SparkDeleteFilter deleteFilter = task.deletes().isEmpty() ? null : sparkDeleteFilter;
Contributor

@flyrain flyrain Nov 2, 2024

We don't need a delete filter if task.deletes().isEmpty(), but the new code always creates the filter object. How about this?

SparkDeleteFilter deleteFilter = null;
if (!task.deletes().isEmpty()) {
  deleteFilter = new SparkDeleteFilter(filePath, task.deletes(), counter(), true);
}

@@ -69,7 +69,8 @@ protected DeleteFilter(
     List<DeleteFile> deletes,
     Schema tableSchema,
     Schema requestedSchema,
-    DeleteCounter counter) {
+    DeleteCounter counter,
+    boolean isBatchReading) {
Contributor

Does batch reading necessarily mean no _pos? I believe not, as you can explicitly project it, like SELECT _pos FROM t. We should give it an accurate name, something like needRowPosition or needRowPosCol.

Contributor

Thinking about it a bit more, I wonder whether only the non-vectorized readers actually need an implicit _pos column. If that's the case, would it make more sense to adjust this within RowReader by adding _pos there? This approach could simplify things by eliminating the need to check whether a reader is vectorized, especially since vectorization isn't necessarily strongly correlated with the requirement for _pos.
Here is pseudo code to add it in RowDataReader:

LOG.debug("Opening data file {}", filePath);
expectedSchema().add("_pos");    // <-- add it here
SparkDeleteFilter deleteFilter =
    new SparkDeleteFilter(filePath, task.deletes(), counter(), false);

@@ -93,7 +94,8 @@ protected DeleteFilter(

     this.posDeletes = posDeleteBuilder.build();
     this.eqDeletes = eqDeleteBuilder.build();
-    this.requiredSchema = fileProjection(tableSchema, requestedSchema, posDeletes, eqDeletes);
+    this.requiredSchema =
+        fileProjection(tableSchema, requestedSchema, posDeletes, eqDeletes, isBatchReading);
Contributor

@flyrain flyrain Nov 2, 2024

One question I asked myself is whether this impacts metadata column reads. It seems not, but the method DeleteFilter::fileProjection is a bit hard to read; we can refactor it later. It would make more sense as a static utility method instead of an instance method. Plus, it's a bit weird to pass the schema to the delete filter and then get it back from the filter. This seems like something we can improve in a follow-up.

-  private Map<Long, Long> generateOffsetToStartPos(Schema schema) {
-    if (schema.findField(MetadataColumns.ROW_POSITION.fieldId()) == null) {
+  private Map<Long, Long> generateOffsetToStartPos() {
+    if (hasPositionDelete) {
Contributor

I'd recommend doing it in the caller, like this:

Map<Long, Long> offsetToStartPos = hasPositionDelete ? generateOffsetToStartPos() : null;

Contributor

@flyrain flyrain Nov 3, 2024

BTW, we don't actually need this if PR #10107 is in, which removes the related complexity. The Parquet side has been ready for a while; we may want to get #10107 in.

@@ -98,6 +98,7 @@ private CloseableIterable<InternalRow> newParquetIterable(
     .filter(residual)
     .caseSensitive(caseSensitive())
     .withNameMapping(nameMapping())
+    .hasPositionDelete(readSchema.findField(MetadataColumns.ROW_POSITION.fieldId()) == null)
Contributor

Do we need it for the row reader? Can we use a default value here?

@@ -97,6 +98,7 @@ private CloseableIterable<ColumnarBatch> newParquetIterable(
     // read performance as every batch read doesn't have to pay the cost of allocating memory.
     .reuseContainers()
     .withNameMapping(nameMapping())
+    .hasPositionDelete(hasPositionDelete)
Contributor

Wondering if withPositionDelete would be more suitable.

Contributor

The key here is determining whether we want to compute row offsets specifically for the filtered row groups. This decision doesn’t have to be directly tied to the presence of position deletes. Perhaps a name like needRowGroupOffset would better capture this intent and improve clarity.

@huaxingao
Contributor Author

Thanks a lot @flyrain and @szehon-ho for your reviews! I've thought this over, and I feel the original change is much simpler. If the original one looks OK to you, I will revert to it. Thanks again!
