Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: update lance file format to support per-page encoding #1857

Closed
wants to merge 3 commits into from

Conversation

westonpace
Copy link
Contributor

This is a draft proposal for a format that supports:

  • Per-page encodings
  • More flexible encoding descriptions
  • Files without row groups
  • Wide schemas

There is no rush to merge this in. I am beginning to add create a file reader and writer that will support this format as part of adding support for nullability. Until the reader/writer are ready these format changes are useless on their own. However, I am creating this PR as a proposal for initial discussion.

Copy link

ACTION NEEDED

Lance follows the Conventional Commits specification for release automation.

The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification.

For details on the error please inspect the "PR Title Check" action.

Comment on lines 225 to 227
// This message is optional. If a file is not self-describing then this must
// point to the start of the metadata block (e.g. representing a manifest with
// size 0)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What would we think of making the FileDescriptor not optional for Lance v2? IMO it would make the format easier to debug if individual files were readable independently.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's fine with me. It shouldn't be too large and readers don't have to read it so I don't see any downside to making it mandatory.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've changed the language to reinforce that this is required in v2 files.

Comment on lines 183 to 186
// The file offset to the start of the page data
uint64 offset = 3;
// The size (in bytes) of the page data
uint64 size = 4;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this redundant? Could we just say the offset is the first buffer_offsets? And the size is the sum of buffer_sizes?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, this is redundant.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've removed these fields.

// | (optional padding) |
// | Page 1, Buffer 0 |
// | ... |
// | Page N2, Buffer N3 |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The labels N2 and N3 are confusing. Perhaps they should be different labels, described in text above. For example:

M columns, N[m] pages for column m in 0..M.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll think this over. I started with variables but since there were 5 of them I had N, M, X, Y, and Z and so I wasn't sure that was very readable either. I wanted to avoid implying that pages have the same # of buffers or that there are the same number of metadata pages as there are column pages, etc.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a header explaining the lettering a bit and changed to something kind of like your suggestion. What do you think?

Comment on lines +171 to +185
// These offsets might point to the column's data section or they
// might point to the column's metadata section.
//
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does in the data vs metadata section?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Data section is data that is unique to a page. Metadata section is data that is common across all pages in a column. E.g. if you wanted to have a single dictionary for all dictionary encoded pages in a column you could put it in the metadata section.

// arrays is equal the the number of pages for the column. The number of
// statistics pages may be smaller than the number of column pages (in most
// cases there will only be one page of statistics)
message ColumnStatistics {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where do these go?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The buffers will be in the metadata section.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I realized I forgot an actual reference to this message. I've added that in ColumnMetadata:

  repeated Page pages = 1;
  ColumnStatistics statistics = 2;

Comment on lines 42 to 46
// | Column 0 Metadata Position |
// | Column 0 Metadata Size |
// | ... |
// | Column N5 Metadata Position |
// | Column N5 Metadata Size |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the "column metadata offsets array"?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes

Comment on lines 25 to 27
// +------------------------------+
// | Page 0, Buffer 0 |
// | ... |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be nice to label some sections:

(https://asciiflow.com/ is useful for this)

 // ┌───────────────────────────────┐
 // │Data Pages                     │
 // │  Page 0, Buffer 0             │
 // │  ...                          │
 // │  Page 0, Buffer N             │
 // │  (optional padding)           │
 // │  Page 1, Buffer 0             │
 // │  ...                          │
 // │  Page N2, Buffer N3           │
 // │  (optional padding)           │
 // ├───────────────────────────────┤
 // │Column Metadata                │
 // │  Column 0 Metadata            │
 // │  Column 0 Metadata Buffer 0   │
 // │  ...                          │
 // │  Column 0 Metadata Buffer N4  │
 // │  (optional padding)           │
 // │  Column 1 Metadata            │
 // │  ...                          │
 // │  Column N5 Metadata Buffer N6 │
 // │  (optional padding)           │
 // ├───────────────────────────────┤
 // │Column Metadata Offset Table   │
 // │  Column 0 Metadata Position   │
 // │  Column 0 Metadata Size       │
 // │  ...                          │
 // │  Column N5 Metadata Position  │
 // │  Column N5 Metadata Size      │
 // ├───────────────────────────────┤
 // │FileDescriptor                 │
 // ├───────────────────────────────┤
 // │Metadata                       │
 // ├───────────────────────────────┤
 // │Footer                         │
 // │  i64: Metadata position       │
 // │  u16: Major version           │
 // │  u16: Minor version           │
 // │  "LANC"                       │
 // └───────────────────────────────┘

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added labels

@westonpace westonpace changed the title feat: Lance file format v2 feat: update lance file format to support per-page encoding Jan 30, 2024
@westonpace westonpace marked this pull request as ready for review January 30, 2024 18:59
@westonpace westonpace requested a review from wjones127 January 30, 2024 18:59
Comment on lines +324 to +331
// The start of the column metadata section
//
// If column projection is not needed, or if the goal is to cache all column
// metadata, then this field can be used to quickly load the entire column metadata
// section in a single read without referring to the column metadata offsets array
//
// This field is ignored in Lance version 1 files
uint64 column_metadata_start = 7;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This gives us the start, how do we know where the end is?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose the intention was to have it end at manifest_position - (16 * num_columns)

Comment on lines +304 to +323
// The number of columns in the file (including inner columns when there
// are nested fields)
//
// This can be used to access the column metadata offsets array which is
// stored immediately before manifest_position.
//
// Given N columns the column metadata positions and sizes are stored in a
// contiguous buffer of 2*N uint64 values immediately preceding the file
// descriptor (or the metadata if the file is not self describing).
//
// If we let `column_offsets_pos` be:
// manifest_position - (16 * num_columns)
//
// Then the metadata for column x starts at the uint64:
// file[column_offsets_pos + (16 * x)]
// The size of the metadata for column x is given by the uint64:
// file[column_offsets_pos + (16 * x) + 8]
//
// This field is ignored in Lance version 1 files (the page table is used instead)
uint32 num_columns = 6;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this solves problem 1 described in #1809, but I'm not sure if it solves problem 2. Could you describe how x is derived? Do we skip fields with children that have no buffers of their own (struct, fsl in the future)?

Copy link
Contributor Author

@westonpace westonpace Jan 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there's two parts:

  • Gaps in field ids

I'm not sure yet why we needed to fill these gaps in the old approach. It's very possible I'm missing something. I was thinking, when we read the file, the column metadata at index X corresponds to the field at index X (in dfs-pre-order) of the schema in the file descriptor. Field IDs are a table concern and not something that the file format bothers with (other than storing and loading them faithfully)

  • Transient columns (structs / lists)

It's probably true that lists will never have a validity buffer (since we can always use negative values as sentinels) but we still need a flag in the metadata telling us if we need to post-process the sentinels or not. Also, (non-fixed-size) lists do have an offsets buffer, and there are many different ways that buffer could be encoded.

Struct might have a nullability bitmap or they might not. If the page doesn't have any nulls we can skip the buffer. If the page has nulls but doesn't have any all-null-structs then we can use a sentinel. If the page has nulls and all-null-structs then we need a bitmap.

We could also encode structs using a sparse bitmap like they do in procella. In this case there is a one validity bitmap buffer AND an offsets buffer per child.

The same goes with fixed-size-list. There may be a need for a validity buffer or we may choose to do some kind of sparse encoding of the FSL (basically turning it into a variable size list where zero-size lists represent null) if most of the data is null.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure yet why we needed to fill these gaps in the old approach.

In the old way, the page table is represented in memory as a BTreeMap<FieldId, (Offset, Length)>. That BTreeMap is written to disk as an array, where the ith element has field id field_ids_offset + i. So as-is, the page table is coupled to the field ids.

I was thinking, when we read the file, the column metadata at index X corresponds to the field at index X (in dfs-pre-order) of the schema in the file descriptor.

This is fine, we just need to be really specific about what the dfs-pre-order is. Right now FSL doesn't have child field, and thus the current page table ignores them, even though it does have entries for structs. However, later we want to change schemas to include child fields of FSL, so they are aligned with Arrow schemas. We can either do that now, in our definition of the dfs-pre-order, or find some compatible way to make that change in the future.

I think my main concern at this point is the transition plan for FSL children. The status quo that we have column metadata for lists and structs makes sense.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I follow your concern better now.

The pragmatic answer to your question is that the only thing that matters for nesting is struct columns. FSL, regular lists, and maps do not have "depth" in the proposal (the protobuf encoding is recursive and does have depth but this is true even for primitive columns).

The outside-the-box new-age answer is that the file format does not have any depth at all. There is just a flat list of columns. "nesting" is an in-memory / arrow concept and we have to map that concept onto the file format in the reader /writer. For example, we could choose to map arrow struct arrays into a single "packed" column. We could choose to map arrow struct arrays into a single "shredded" column (the list of children, instead of being at the "column" level, would be instead nested into the encoding). Both of these approaches are probably bad ideas since they prevent any kind of nested projection pushdown (though the "packed struct" may still be useful in certain cases).

We could even nest other things. We could create a special kind of FSL, let's call it a tuple, where we encode each index into its own "lance column". This would allow us to do nested tuple projection. Although, again, being practical, it would be less work for everyone else (e.g. arrow, etc.) if we implemented such a thing as a struct.

Comment on lines +314 to +315
// If we let `column_offsets_pos` be:
// manifest_position - (16 * num_columns)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we should just have a direct offset here. That would make it possible to have another optional section later.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I understand.

Copy link
Contributor

@wjones127 wjones127 Jan 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As in column_offsets_pos should be an actual field, and not just an implicit computed one. Then we could add a new section between the column metadata and the file metadata.

More generally, just feeling a bit squeamish about having to make these kind of computations to get the offset and size of a section.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I get it now. I'll add that in.

Copy link
Contributor

@wjones127 wjones127 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would like to see one example of a schema (with struct, list, fsl) and how that translates to what num_columns means and what each value 0..num_columns references.

// Null values are present but the bytes are garbage and should be
// ignored.
// * Validity - A boolean array representing whether each value is null or not
message MaskedEncoding {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW are you going to mention Sentinel encoding?

@westonpace
Copy link
Contributor Author

Closing in favor of #1965

@westonpace westonpace closed this Feb 16, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants