feat: update lance file format to support per-page encoding #1857
Conversation
ACTION NEEDED: Lance follows the Conventional Commits specification for release automation. The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification. For details on the error please inspect the "PR Title Check" action.
protos/file.proto
Outdated
// This message is optional. If a file is not self-describing then this must
// point to the start of the metadata block (e.g. representing a manifest with
// size 0)
What would we think of making the `FileDescriptor` not optional for Lance v2? IMO it would make the format easier to debug if individual files were readable independently.
That's fine with me. It shouldn't be too large and readers don't have to read it so I don't see any downside to making it mandatory.
I've changed the language to reinforce that this is required in v2 files.
protos/file.proto
Outdated
// The file offset to the start of the page data
uint64 offset = 3;
// The size (in bytes) of the page data
uint64 size = 4;
Is this redundant? Could we just say the `offset` is the first `buffer_offsets` value? And the `size` is the sum of `buffer_sizes`?
Good point, this is redundant.
I've removed these fields.
protos/file.proto
Outdated
// | (optional padding)            |
// | Page 1, Buffer 0              |
// | ...                           |
// | Page N2, Buffer N3            |
The labels `N2` and `N3` are confusing. Perhaps they should be different labels, described in text above. For example: `M` columns, `N[m]` pages for column `m` in `0..M`.
I'll think this over. I started with variables but since there were 5 of them I had `N`, `M`, `X`, `Y`, and `Z`, and so I wasn't sure that was very readable either. I wanted to avoid implying that pages have the same number of buffers or that there are the same number of metadata pages as there are column pages, etc.
I added a header explaining the lettering a bit and changed to something kind of like your suggestion. What do you think?
// These offsets might point to the column's data section or they
// might point to the column's metadata section.
//
What goes in the data vs metadata section?
Data section is data that is unique to a page. Metadata section is data that is common across all pages in a column. E.g. if you wanted to have a single dictionary for all dictionary encoded pages in a column you could put it in the metadata section.
// arrays is equal to the number of pages for the column. The number of
// statistics pages may be smaller than the number of column pages (in most
// cases there will only be one page of statistics)
message ColumnStatistics {
Where do these go?
The buffers will be in the metadata section.
I realized I forgot an actual reference to this message. I've added that in `ColumnMetadata`:
repeated Page pages = 1;
ColumnStatistics statistics = 2;
protos/file.proto
Outdated
// | Column 0 Metadata Position  |
// | Column 0 Metadata Size      |
// | ...                         |
// | Column N5 Metadata Position |
// | Column N5 Metadata Size     |
This is the "column metadata offsets array"?
Yes
protos/file.proto
Outdated
// +------------------------------+
// | Page 0, Buffer 0             |
// | ...                          |
Might be nice to label some sections:
(https://asciiflow.com/ is useful for this)
// ┌───────────────────────────────┐
// │Data Pages │
// │ Page 0, Buffer 0 │
// │ ... │
// │ Page 0, Buffer N │
// │ (optional padding) │
// │ Page 1, Buffer 0 │
// │ ... │
// │ Page N2, Buffer N3 │
// │ (optional padding) │
// ├───────────────────────────────┤
// │Column Metadata │
// │ Column 0 Metadata │
// │ Column 0 Metadata Buffer 0 │
// │ ... │
// │ Column 0 Metadata Buffer N4 │
// │ (optional padding) │
// │ Column 1 Metadata │
// │ ... │
// │ Column N5 Metadata Buffer N6 │
// │ (optional padding) │
// ├───────────────────────────────┤
// │Column Metadata Offset Table │
// │ Column 0 Metadata Position │
// │ Column 0 Metadata Size │
// │ ... │
// │ Column N5 Metadata Position │
// │ Column N5 Metadata Size │
// ├───────────────────────────────┤
// │FileDescriptor │
// ├───────────────────────────────┤
// │Metadata │
// ├───────────────────────────────┤
// │Footer │
// │ i64: Metadata position │
// │ u16: Major version │
// │ u16: Minor version │
// │ "LANC" │
// └───────────────────────────────┘
Added labels
bd80190 to 0397756
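The 16-byte footer at the bottom of the suggested diagram can be parsed with a few fixed-width reads. A minimal sketch, assuming little-endian byte order (an assumption; the proposal text does not state byte order here) and the field layout shown in the diagram:

```rust
// Hypothetical sketch: parse the footer laid out in the diagram above.
// Assumed layout: i64 metadata position, u16 major version, u16 minor
// version, then the 4-byte magic "LANC". Little-endian is an assumption.
fn parse_footer(tail: &[u8; 16]) -> Result<(i64, u16, u16), &'static str> {
    if &tail[12..16] != b"LANC" {
        return Err("missing LANC magic");
    }
    let metadata_position = i64::from_le_bytes(tail[0..8].try_into().unwrap());
    let major = u16::from_le_bytes(tail[8..10].try_into().unwrap());
    let minor = u16::from_le_bytes(tail[10..12].try_into().unwrap());
    Ok((metadata_position, major, minor))
}

fn main() {
    // Build a fake footer: metadata at offset 1024, version 2.0
    let mut tail = [0u8; 16];
    tail[0..8].copy_from_slice(&1024i64.to_le_bytes());
    tail[8..10].copy_from_slice(&2u16.to_le_bytes());
    tail[12..16].copy_from_slice(b"LANC");
    println!("{:?}", parse_footer(&tail));
}
```

A reader only needs to fetch the last 16 bytes of the file to locate the metadata block, which is what makes the trailing-footer design cheap to bootstrap.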
// The start of the column metadata section
//
// If column projection is not needed, or if the goal is to cache all column
// metadata, then this field can be used to quickly load the entire column metadata
// section in a single read without referring to the column metadata offsets array
//
// This field is ignored in Lance version 1 files
uint64 column_metadata_start = 7;
This gives us the start, how do we know where the end is?
I suppose the intention was to have it end at `manifest_position - (16 * num_columns)`
// The number of columns in the file (including inner columns when there
// are nested fields)
//
// This can be used to access the column metadata offsets array which is
// stored immediately before manifest_position.
//
// Given N columns the column metadata positions and sizes are stored in a
// contiguous buffer of 2*N uint64 values immediately preceding the file
// descriptor (or the metadata if the file is not self describing).
//
// If we let `column_offsets_pos` be:
//   manifest_position - (16 * num_columns)
//
// Then the metadata for column x starts at the uint64:
//   file[column_offsets_pos + (16 * x)]
// The size of the metadata for column x is given by the uint64:
//   file[column_offsets_pos + (16 * x) + 8]
//
// This field is ignored in Lance version 1 files (the page table is used instead)
uint32 num_columns = 6;
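The arithmetic in that comment can be sketched directly. A minimal illustration (the helper name is hypothetical; it only computes file offsets and reads nothing):

```rust
// Hypothetical helper mirroring the proto comment above: given the
// manifest position and column count, locate the (position, size) u64
// pair for column `x` inside the column metadata offsets array.
fn column_metadata_entry(manifest_position: u64, num_columns: u32, x: u32) -> (u64, u64) {
    assert!(x < num_columns, "column index out of range");
    // Each column contributes two u64 values (16 bytes) to the array.
    let column_offsets_pos = manifest_position - 16 * num_columns as u64;
    let position_offset = column_offsets_pos + 16 * x as u64;
    let size_offset = position_offset + 8;
    (position_offset, size_offset)
}

fn main() {
    // With the manifest at byte 1000 and 3 columns, the offsets array
    // starts at byte 952; column 1's entry lives at bytes 968 and 976.
    println!("{:?}", column_metadata_entry(1000, 3, 1));
}
```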
I think this solves problem 1 described in #1809, but I'm not sure if it solves problem 2. Could you describe how `x` is derived? Do we skip fields with children that have no buffers of their own (struct, fsl in the future)?
I think there are two parts:
- Gaps in field ids
I'm not sure yet why we needed to fill these gaps in the old approach. It's very possible I'm missing something. I was thinking, when we read the file, the column metadata at index X corresponds to the field at index X (in dfs-pre-order) of the schema in the file descriptor. Field IDs are a table concern and not something that the file format bothers with (other than storing and loading them faithfully)
- Transient columns (structs / lists)
It's probably true that lists will never have a validity buffer (since we can always use negative values as sentinels) but we still need a flag in the metadata telling us if we need to post-process the sentinels or not. Also, (non-fixed-size) lists do have an offsets buffer, and there are many different ways that buffer could be encoded.
Struct might have a nullability bitmap or they might not. If the page doesn't have any nulls we can skip the buffer. If the page has nulls but doesn't have any all-null-structs then we can use a sentinel. If the page has nulls and all-null-structs then we need a bitmap.
We could also encode structs using a sparse bitmap like they do in procella. In this case there is a one validity bitmap buffer AND an offsets buffer per child.
The same goes with fixed-size-list. There may be a need for a validity buffer or we may choose to do some kind of sparse encoding of the FSL (basically turning it into a variable size list where zero-size lists represent null) if most of the data is null.
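The three struct-validity cases described above reduce to a small decision table. A sketch (the enum and selection function are illustrative, not part of the proposed format):

```rust
// Illustrative only: the per-page validity choices for struct columns
// discussed above. Names are hypothetical.
#[derive(Debug, PartialEq)]
enum StructValidity {
    NoNulls,  // no nulls in the page: skip the validity buffer entirely
    Sentinel, // nulls, but no all-null structs: a sentinel value suffices
    Bitmap,   // nulls including all-null structs: a full bitmap is needed
}

fn choose_validity(has_nulls: bool, has_all_null_structs: bool) -> StructValidity {
    match (has_nulls, has_all_null_structs) {
        (false, _) => StructValidity::NoNulls,
        (true, false) => StructValidity::Sentinel,
        (true, true) => StructValidity::Bitmap,
    }
}

fn main() {
    println!("{:?}", choose_validity(true, false)); // Sentinel
}
```

Because the choice is per-page, the column metadata needs a flag telling the reader which case each page used, which is the point made above about lists needing a sentinel-post-processing flag too.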
I'm not sure yet why we needed to fill these gaps in the old approach.
In the old way, the page table is represented in memory as a `BTreeMap<FieldId, (Offset, Length)>`. That `BTreeMap` is written to disk as an array, where the `i`th element has field id `field_ids_offset + i`. So as-is, the page table is coupled to the field ids.
I was thinking, when we read the file, the column metadata at index X corresponds to the field at index X (in dfs-pre-order) of the schema in the file descriptor.
This is fine, we just need to be really specific about what the dfs-pre-order is. Right now FSL doesn't have child fields, and thus the current page table ignores them, even though it does have entries for structs. However, later we want to change schemas to include child fields of FSL, so they are aligned with Arrow schemas. We can either do that now, in our definition of the dfs-pre-order, or find some compatible way to make that change in the future.
I think my main concern at this point is the transition plan for FSL children. The status quo that we have column metadata for lists and structs makes sense.
I think I follow your concern better now.
The pragmatic answer to your question is that the only thing that matters for nesting is struct columns. FSL, regular lists, and maps do not have "depth" in the proposal (the protobuf encoding is recursive and does have depth but this is true even for primitive columns).
The outside-the-box new-age answer is that the file format does not have any depth at all. There is just a flat list of columns. "nesting" is an in-memory / arrow concept and we have to map that concept onto the file format in the reader /writer. For example, we could choose to map arrow struct arrays into a single "packed" column. We could choose to map arrow struct arrays into a single "shredded" column (the list of children, instead of being at the "column" level, would be instead nested into the encoding). Both of these approaches are probably bad ideas since they prevent any kind of nested projection pushdown (though the "packed struct" may still be useful in certain cases).
We could even nest other things. We could create a special kind of FSL, let's call it a tuple, where we encode each index into its own "lance column". This would allow us to do nested tuple projection. Although, again, being practical, it would be less work for everyone else (e.g. arrow, etc.) if we implemented such a thing as a struct.
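To make the dfs-pre-order indexing concrete, here is a toy sketch (the `Field` type is illustrative; whether list and FSL children get entries of their own is exactly the open question in this thread, so this example only nests struct children):

```rust
// Illustrative toy schema: assign column indices in depth-first
// pre-order, one entry per field including struct children.
struct Field {
    name: &'static str,
    children: Vec<Field>,
}

fn dfs_pre_order(fields: &[Field], out: &mut Vec<&'static str>) {
    for f in fields {
        out.push(f.name);
        dfs_pre_order(&f.children, out);
    }
}

fn main() {
    // Schema: a: int32, b: struct<x, y>, c: list<int32>
    let schema = vec![
        Field { name: "a", children: vec![] },
        Field { name: "b", children: vec![
            Field { name: "x", children: vec![] },
            Field { name: "y", children: vec![] },
        ]},
        Field { name: "c", children: vec![] },
    ];
    let mut columns = Vec::new();
    dfs_pre_order(&schema, &mut columns);
    // num_columns would be columns.len(); column i is columns[i]
    println!("{:?}", columns); // ["a", "b", "x", "y", "c"]
}
```

Under this sketch `num_columns` is 5 and column 3 is the struct child `y`; field ids never enter into it, matching the point above that they are a table-level concern.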
// If we let `column_offsets_pos` be:
//   manifest_position - (16 * num_columns)
I wonder if we should just have a direct offset here. That would make it possible to have another optional section later.
I'm not sure I understand.
As in, `column_offsets_pos` should be an actual field, and not just an implicit computed one. Then we could add a new section between the column metadata and the file metadata.
More generally, just feeling a bit squeamish about having to make these kind of computations to get the offset and size of a section.
Ok, I get it now. I'll add that in.
Would like to see one example of a schema (with struct, list, fsl) and how that translates to what `num_columns` means and what each value in `0..num_columns` references.
// Null values are present but the bytes are garbage and should be
// ignored.
// * Validity - A boolean array representing whether each value is null or not
message MaskedEncoding {
BTW are you going to mention Sentinel encoding?
Closing in favor of #1965
This is a draft proposal for a format that supports:
There is no rush to merge this in. I am beginning to create a file reader and writer that will support this format as part of adding support for nullability. Until the reader/writer are ready these format changes are useless on their own. However, I am creating this PR as a proposal for initial discussion.