-
Notifications
You must be signed in to change notification settings - Fork 2.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Core, Puffin: Add DV file writer #11476
Conversation
|
||
private final OutputFileFactory fileFactory; | ||
private final Function<CharSequence, PositionDeleteIndex> loadPreviousDeletes; | ||
private final CharSequenceMap<Deletes> deletesByPath = CharSequenceMap.create(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to discuss whether we should accept String
or CharSequence
in DV writers. We migrated to String
for our new DataFile
and DeleteFile
fields but it is a different use case. We don't have to necessarily serialize these values. If an engine supports custom CharSequence
objects, it may benefit from this API and skip the conversion.
It has no benefits for Spark, though. We have to convert to String
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that String
makes more sense for now because:
- to my knowledge, most of query engines support
String
- if a query engine needs
CharSequence
, the conversion is not very painful
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not aware of any engine that would take advantage of it here, but I'm not sure it really causes any significant complexity to maintain as CharSequence. I'm fine either way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Avro Utf8
is the only example I remember.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree. It's probably a good idea to use String. We can always move to CharSequence later with minimal changes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 on using String, which I've also mentioned in https://github.com/apache/iceberg/pull/11302/files#r1824137648
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good. Let's use String
. It will have less overhead for engines like Spark that use String
.
@@ -21,8 +21,8 @@ | |||
import org.apache.iceberg.StructLike; | |||
|
|||
/** Copy the StructLike's values into a new one. It does not handle list or map values now. */ | |||
class StructCopy implements StructLike { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I need to move this into a utility. Just not sure which at this point. Ideas are welcome.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about a "generic" IoUtil
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this even belong in io
? This feels like it should be in types.TypeUtil
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd say it belongs to StructLikeUtil
or something like this. Doesn't seem to belong to io
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 for StructLikeUtil
or similar.
@jbonofre, the io
classes typically refer to FileIO
and buffer management utils in Iceberg. I'm not quite sure why this is currently in the io
package. I think it's probably better to not continue using io
so that it's easier to find later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. I like the StructLikeUtil
name.
+1 for that. Thanks !
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moved the logic to the util and deprecated this class not to break anyone.
} | ||
|
||
@TestTemplate | ||
public void testBasicDVs() throws IOException { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll add more once we support DV reads.
return new Blob( | ||
StandardBlobTypes.DV_V1, | ||
ImmutableList.of(MetadataColumns.ROW_POSITION.fieldId()), | ||
-1 /* snapshot ID is inherited */, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As per V1 Puffin spec.
* @param spec the data file partition spec | ||
* @param partition the data file partition | ||
*/ | ||
void write(CharSequence path, long pos, PartitionSpec spec, StructLike partition); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Little bit of a nit here, but does write
make sense? We're actually marking for delete. FileAppender
and most "Writers" use add
, should this just be delete
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree. What about DVFileWriter
as the name? That seems OK but if there are alternatives, let me know.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@danielcweeks from my standpoint, delete()
would be confusing. As the class is named DeleteVectorFileWriter
(yes, I would prefer a full name instead of DV
), I think it makes sense to use write()
there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In my view, delete
makes sense now as previously we passed PositionDelete
object and used the common writer abstraction. That said, I am OK either way.
I do want to keep the shorter class name. I use DV
naming consistently throughout the code and it allows me to stay on one line most of the time. Our 100 character line limit is no joke!
52c0f37
to
e95cf2c
Compare
return StructCopy.copy(struct); | ||
} | ||
|
||
private static class StructCopy implements StructLike { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Copied the existing class but made private. The old one is deprecated and will be removed in 1.9.
this.tableDir = Files.createTempDirectory(temp, "junit").toFile(); | ||
assertThat(tableDir.delete()).isTrue(); // created during table creation | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this.tableDir = Files.createTempDirectory(temp, "junit").toFile(); | |
assertThat(tableDir.delete()).isTrue(); // created during table creation |
FYI this typically isn't needed, because tableDir
is already initialized in the super class. See also #11460 where existing tests have been refactored to remove this unnecessary re-initialization
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, good call. Missed that change!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also removed the metadata dir and empty line.
Thanks for reviewing, @nastra @danielcweeks @jbonofre @rdblue! |
This PR adds a base DV file writer with basic tests. More tests to come once we support DV reads.
This work is part of #11122.