
Delete Files in Table Scans #630

Open
sdd opened this issue Sep 13, 2024 · 5 comments

Comments

@sdd
Contributor

sdd commented Sep 13, 2024

I'm looking to start work on proper handling of delete files in table scans and so I'd like to open an issue to discuss some of the design decisions.

A core tenet of our approach so far has been to ensure that the tasks produced by the file plan are small, independent and self-contained, so that they can be easily distributed in architectures where the service that generates the file plan could be on a different machine to the service(s) that perform the file reads.

The FileScanTask struct currently represents these individual units of work. Its shape is focused on data files, though, and it does not cater for including information on the delete files that are produced by the scan. Here's how it looks now, for reference:

/// A task to scan part of file.
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct FileScanTask {
    /// The start offset of the file to scan.
    pub start: u64,
    /// The length of the file to scan.
    pub length: u64,
    /// The number of records in the file to scan.
    ///
    /// This is an optional field, and only available if we are
    /// reading the entire data file.
    pub record_count: Option<u64>,
    /// The data file path corresponding to the task.
    pub data_file_path: String,
    /// The content type of the file to scan.
    pub data_file_content: DataContentType,
    /// The format of the file to scan.
    pub data_file_format: DataFileFormat,
    /// The schema of the file to scan.
    pub schema: SchemaRef,
    /// The field ids to project.
    pub project_field_ids: Vec<i32>,
    /// The predicate to filter.
    #[serde(skip_serializing_if = "Option::is_none")]
    pub predicate: Option<BoundPredicate>,
}

In order to properly process delete files as part of executing a scan task, executors will now need to load any applicable delete files along with the data file that they are processing. I'll outline what happens now, and follow that with my proposed approach.

Current TableScan Synopsis

The current structure pushes all manifest file entries from the manifest list into a stream which we then process concurrently in order to retrieve their associated manifests. Once retrieved, each manifest then has each of its manifest entries extracted and pushed onto a channel so that they can be processed in parallel. Each is embedded inside a context object that contains the relevant information needed for processing the manifest entry. Tokio tasks listening to the channel then execute TableScan::process_manifest_entry on these objects, where we filter out any entries that do not match the scan filter predicate.
At this point, a FileScanTask is created for each of those entries that match the scan predicate. The FileScanTasks are then pushed into a channel that produces the stream of FileScanTasks that is returned to the original caller of plan_files.
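For illustration only, the fan-out pattern described above looks roughly like the following sketch. The types here are stripped-down placeholders, not the actual iceberg-rust implementation:

// Illustrative sketch of the channel fan-out pattern, with placeholder types.
use tokio::sync::mpsc;
use tokio_stream::wrappers::ReceiverStream;

#[derive(Debug, Clone)]
struct ManifestEntryContext {
    data_file_path: String,
    // ... schema, bound predicate, partition data, etc.
}

#[derive(Debug)]
struct FileScanTask {
    data_file_path: String,
}

/// Fan manifest entry contexts out onto a channel, filter them concurrently,
/// and return the surviving entries as a stream of FileScanTasks.
fn plan_files_sketch(entries: Vec<ManifestEntryContext>) -> ReceiverStream<FileScanTask> {
    let (tx, rx) = mpsc::channel(16);

    for ctx in entries {
        let tx = tx.clone();
        tokio::spawn(async move {
            // Stand-in for TableScan::process_manifest_entry: apply the scan's
            // filter predicate to the manifest entry here.
            let matches_predicate = true;
            if matches_predicate {
                let _ = tx
                    .send(FileScanTask {
                        data_file_path: ctx.data_file_path,
                    })
                    .await;
            }
        });
    }

    ReceiverStream::new(rx)
}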

Changes to TableScan

FileScanTask

Each FileScanTask represents a scan to be performed on a single data file. However, multiple delete files may need to be applied to any one data file. Additionally, the scope of applicability of a delete file is any data file within the same partition as the delete file - i.e. the same delete file may need to be applied to multiple data files. Thus an executor needs to know not just the data file that it is processing, but all of the delete files that are applicable to that data file.

The first part of the set of changes that I'm proposing is to refactor FileScanTask so that it represents a single data file and zero or more delete files.

  • The data_file_content property would be removed - each task is implicitly about a file of type Data.
  • A new struct, DeleteFileEntry, would be added. It would look something like this:
    struct DeleteFileEntry {
        path: String,
        format: DataFileFormat
    }
  • A delete_files property of type Vec<DeleteFileEntry> would be added to FileScanTask to represent the delete files that are applicable to its data file (a sketch of the resulting struct follows below).
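Putting those pieces together, the refactored task might look something like this. The field set is carried over from the current struct, minus data_file_content, plus the new delete_files; DeleteFileEntry would also need Serialize/Deserialize derives. This is a sketch of the proposal, not a finalised design:

/// A task to scan part of a data file, together with the delete files that
/// apply to it.
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct FileScanTask {
    /// The start offset of the file to scan.
    pub start: u64,
    /// The length of the file to scan.
    pub length: u64,
    /// The number of records in the file to scan.
    pub record_count: Option<u64>,
    /// The data file path corresponding to the task.
    pub data_file_path: String,
    /// The format of the file to scan.
    pub data_file_format: DataFileFormat,
    /// The schema of the file to scan.
    pub schema: SchemaRef,
    /// The field ids to project.
    pub project_field_ids: Vec<i32>,
    /// The predicate to filter.
    #[serde(skip_serializing_if = "Option::is_none")]
    pub predicate: Option<BoundPredicate>,
    /// The delete files applicable to this task's data file.
    pub delete_files: Vec<DeleteFileEntry>,
}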

TableScan::plan_files and associated methods

We need to update this logic in order to ensure that we can properly populate this new delete_files property. Each ManifestEntryContext will need the list of delete files so that, if the manifest entry it encapsulates passes the filtering steps, it can populate the new delete_files property when it constructs the FileScanTask.

A naive approach may be to simply build a list of all of the delete files referred to by the top-level manifest list and give references to this list to all ManifestEntryContexts so that, if any delete files are present, all of them are included in every FileScanTask. This would be a good first step - code that works inefficiently is better than code that does not work at all! It would also permit work to proceed on the execution side.
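For illustration, a minimal sketch of that naive step, assuming ManifestEntryContext grows a shared delete_files field (the names here are hypothetical, not the existing API):

use std::sync::Arc;

// Hypothetical sketch of the naive approach: every ManifestEntryContext gets a
// clone of the same Arc'd list of all delete files found in the manifest list,
// so every FileScanTask it later builds carries the full delete file set.
fn attach_all_delete_files(
    contexts: &mut [ManifestEntryContext],
    all_delete_files: Vec<DeleteFileEntry>,
) {
    let shared = Arc::new(all_delete_files);
    for ctx in contexts.iter_mut() {
        // Assumed new field on ManifestEntryContext; cloning the Arc is cheap.
        ctx.delete_files = shared.clone();
    }
}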

Improvements could then be made to refine this approach so that inapplicable delete files are filtered out of each FileScanTask's delete_files property.
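For reference, that eventual applicability check could look something like the following, based on the Iceberg spec's partition and sequence-number rules. The types and field names are placeholders, and this glosses over unpartitioned/global equality deletes:

// Placeholder types standing in for whatever metadata the planner has on hand.
enum DeleteContent {
    PositionDeletes,
    EqualityDeletes,
}

struct DataFileInfo {
    partition: String,
    sequence_number: i64,
}

struct DeleteFileInfo {
    partition: String,
    sequence_number: i64,
    content: DeleteContent,
}

/// Rough applicability check: a delete file applies to a data file only if
/// they share a partition and the delete file is "new enough" relative to the
/// data file's sequence number (>= for position deletes, > for equality
/// deletes, per the Iceberg spec).
fn delete_file_applies(data: &DataFileInfo, delete: &DeleteFileInfo) -> bool {
    if data.partition != delete.partition {
        return false;
    }
    match delete.content {
        DeleteContent::PositionDeletes => delete.sequence_number >= data.sequence_number,
        DeleteContent::EqualityDeletes => delete.sequence_number > data.sequence_number,
    }
}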

How does this sound so far, @liurenjie1024, @Xuanwo, @ZENOTME, @Fokko?

@xxhZs

xxhZs commented Sep 14, 2024

Hi, I've recently implemented merge-on-read in my library using iceberg-rust and submitted a working, simplified version of the code, which looks somewhat similar to the naive approach you proposed! (I have to read the same delete file on my different nodes.)

That PR is #625.

About this issue, I have some doubts regarding FileScanTask and DeleteFileEntry. As you said, the relationship between delete files and data files is many-to-many, so even if the list of delete files is saved in the file task, in the optimal case the caller still needs some special handling to make sure that all related data files and delete files are dispatched to the same node, and that a delete file is not read repeatedly. Most likely, this scheduling result is consistent with the partitioning result.
In this case, I would prefer to expose the partitioning result directly in the file task. Please correct me if there is any misunderstanding.
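As a purely illustrative sketch: if the task exposed its partition, a caller could bucket tasks so that all tasks sharing a partition (and therefore sharing delete files) are dispatched to one node. The partition_key accessor here is hypothetical:

use std::collections::HashMap;

// Illustrative sketch: bucket scan tasks by partition so each bucket can be
// dispatched to a single node and its delete files are read only there.
fn group_tasks_by_partition<T>(
    tasks: Vec<T>,
    partition_key: impl Fn(&T) -> String,
) -> HashMap<String, Vec<T>> {
    let mut groups: HashMap<String, Vec<T>> = HashMap::new();
    for task in tasks {
        groups.entry(partition_key(&task)).or_default().push(task);
    }
    groups
}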

@sdd
Contributor Author

sdd commented Sep 18, 2024

I'm happy to add the partitioning result to the task. This is useful when deciding how to distribute tasks to executor nodes, as it enables a few different strategies, the choice of which can be left to the implementer.

It is not necessarily the case that the delete file is read repeatedly if the delete file list is added to the file scan task, since we can store the parsed delete files in the object cache, preventing them from being re-read on the same node because they'd already be in memory. If the scheduler ensures that all tasks with the same partition get sent to the same executor, then the files would only be read once.
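For illustration, the caching idea could look roughly like this. ParsedDeletes and load_and_parse_delete_file are hypothetical stand-ins; the real implementation would sit behind the crate's object cache rather than a bare HashMap:

use std::collections::HashMap;
use std::sync::Arc;
use tokio::sync::Mutex;

/// Placeholder for whatever form the parsed delete data takes (e.g. a set of
/// deleted row positions); purely illustrative.
type ParsedDeletes = Vec<u64>;

/// Hypothetical stand-in for reading and parsing a delete file from storage.
async fn load_and_parse_delete_file(_path: &str) -> ParsedDeletes {
    Vec::new()
}

/// Sketch of the caching idea: parsed delete files are keyed by path so that
/// an executor handling many tasks referencing the same delete file only
/// loads and parses it once.
#[derive(Default)]
struct DeleteFileCache {
    parsed: Mutex<HashMap<String, Arc<ParsedDeletes>>>,
}

impl DeleteFileCache {
    async fn get_or_load(&self, path: &str) -> Arc<ParsedDeletes> {
        let mut guard = self.parsed.lock().await;
        if let Some(hit) = guard.get(path) {
            return hit.clone();
        }
        let loaded = Arc::new(load_and_parse_delete_file(path).await);
        guard.insert(path.to_string(), loaded.clone());
        loaded
    }
}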

@liurenjie1024
Contributor

Thanks @sdd for raising this. The general approach looks good to me. The challenging part of delete file processing is filtering out unnecessary delete files for each task, which we can introduce as an optimization later.

@sdd
Contributor Author

sdd commented Sep 26, 2024

Thanks - I have some skeleton code for the required changes to reader.rs that I'm going to share over the next few days as well.

@sdd
Contributor Author

sdd commented Sep 27, 2024

Thanks for taking a look at the above, @liurenjie1024. I've just submitted a draft PR which outlines the second part of the approach - how we extend the filtering in the arrow reader to handle delete files. #652

@liurenjie1024, @Xuanwo, @ZENOTME, @xxhZs: if you could also take a look at that PR when you get a chance and let me know whether the approach seems sensible, that would be great!
