
[Feat] Update Arrow layers to support both RecordBatch and Table input #145

Open · kylebarron opened this issue Sep 25, 2024 · 7 comments

@kylebarron (Collaborator)

Target Use Case

Simplify the implementation of Arrow layers by requiring RecordBatch input, not Table input.

Why:

  • This maps to the existing data structures supported by the deck.gl binary attributes API.

  • It's more predictable for the end user: they know that a single Arrow layer will always create exactly one underlying deck.gl layer.

  • It removes the need for internal rechunking code.

    Multiple arrow Vectors that have the same overall length can have different chunking structures. E.g. despite column A and column B both having length 30, column A could have two chunks (Data in Arrow JS terminology) of 15 rows each, while column B has three chunks of 10 rows each. If deck.gl's Arrow support allowed Vector input, deck.gl would have to manage rechunking the data across input objects. (A sketch of this mismatch follows this list.)

    In Lonboard, I don't currently hit this issue because I pre-process the data in Python, but for JS-facing APIs, I think it would significantly simplify the deck.gl implementation to accept only RecordBatch input, which enforces contiguous buffers. This pushes the responsibility of rechunking onto the user, if necessary. There can be multiple options for rechunking Arrow data, including pure-JS and Wasm compiled options, and the end user can choose the best option for their use case.
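As a concrete illustration of that mismatch, here is a minimal sketch using apache-arrow's makeData and Vector (column contents are placeholders):

```ts
import { makeData, Int32, Vector } from "apache-arrow";

// One contiguous Data chunk of n Int32 rows (placeholder contents).
const chunk = (n: number) =>
  makeData({ type: new Int32(), data: new Int32Array(n) });

const columnA = new Vector([chunk(15), chunk(15)]);            // 30 rows in 2 chunks
const columnB = new Vector([chunk(10), chunk(10), chunk(10)]); // 30 rows in 3 chunks

console.log(columnA.length === columnB.length);        // true: same overall length
console.log(columnA.data.length, columnB.data.length); // 2 3: chunk boundaries differ
```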

Proposal

Right now the Arrow layers accept a Table for the main data prop and arrow Vector objects for any accessors. This would change these layers to accept a RecordBatch for the main data prop and contiguous Arrow arrays (called Data in the Arrow JS implementation) for any accessors.


@kylebarron (Collaborator, Author)

cc @felixpalmer as well

@ibgreen (Contributor) commented Sep 25, 2024

No objections, just engaging in the conversation:

  • It seems to me that it would be nice to also accept a Table in addition to RecordBatch, and in this case just extract the entire data from the table (i.e. from all batches) into the required attributes.

  • That would give the application a choice of using one layer per batch or one layer per table. I think for most initial use cases I would just want to load a table and pass it to a layer, and not worry about the batch structure.

  • I must admit that I am not quite sure why the accessors should accept Data (I seem to recall that Data contains the binary columns and is not just a descriptor; if so, such objects could contain columns that need not even be part of the same table?)

  • I would have expected that the job of the accessors would simply be to specify which columns in a table or batch to map to the various deck.gl attributes?

  • Alternatively the basic layer could just accept "a random assortment of Data objects", and the table/record batch input is a higher level abstraction that feeds Data from a table or batch to the more primitive construct?

@kylebarron (Collaborator, Author) commented Sep 25, 2024

For anyone else looking at this issue, it's probably good to define some terminology:

  • Data: a collection of rows in contiguous Arrow memory. This is called "Array" in most Arrow implementations but is called Data in Arrow JS to avoid shadowing the JS Array type. A Data can have one or more underlying buffers, but those buffers together represent the same logical data. E.g. integer storage like a Data of type Uint8 has two buffers: one for the raw data (directly viewable by a Uint8Array) and another for the nullability bitmask, with one bit per row indicating whether that row is null. Nested types can have more buffers. E.g. points can be represented as a Data of struct type, where there's a buffer for the x coordinates and another buffer for the y coordinates (see the sketch after this list).
  • Vector: a collection of rows in batches. This is essentially a list of Data.
  • Field: metadata that describes an individual Data or Vector. This contains name: string, data type, nullable: bool, and metadata: Map<string, string>.
  • Schema: metadata that describes a named collection of Data or Vector. This is essentially List<Field>, but it can also store optional associated metadata: Map<string, string>.
  • RecordBatch: an ordered and named collection of Data instances. This is essentially a List<Data> plus a Schema.
  • Table: an ordered and named collection of Vector instances. This is essentially a List<Vector> plus a Schema.
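To make the Data definition concrete, here's a minimal sketch (using apache-arrow's makeData) of the struct-of-points example, where each coordinate lives in its own contiguous buffer:

```ts
import { makeData, Float64, Struct, Field } from "apache-arrow";

// One contiguous Float64 buffer per coordinate.
const xs = makeData({ type: new Float64(), data: new Float64Array([0, 1, 2]) });
const ys = makeData({ type: new Float64(), data: new Float64Array([3, 4, 5]) });

// A Struct<{x, y}> Data: one logical column of points backed by two child buffers.
const points = makeData({
  type: new Struct([new Field("x", new Float64()), new Field("y", new Float64())]),
  children: [xs, ys],
  length: 3,
});

console.log(points.length);          // 3 rows
console.log(points.children.length); // 2 child Data, one per coordinate
```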
> It seems to me that it would be nice to also accept a Table in addition to RecordBatch, and in this case just extract the entire data from the table (i.e. from all batches) into the required attributes.

Extract how, and would this involve CPU copies? If we have a Vector representing point data, would this extraction allocate a new buffer large enough to store the entire Vector in contiguous memory and copy each Data into that buffer? Essentially a concat operation?

Only allowing RecordBatch input would be simpler for deck.gl layers to implement, as they would never need to handle this concat operation across multiple Data instances. (If the user wants to concat, they can do that in external code)

Only allowing RecordBatch input also means that deck.gl would virtually never need to allocate new data on the CPU (except for some cases like polygon triangulation).

(It may be that I'm constrained a bit in my thinking because I'm used to the existing binary attributes API. Perhaps there would be a way to accept Vector input without doing a concat on CPU?)

> That would give the application a choice of using one layer per batch or one layer per table. I think for most initial use cases I would just want to load a table and pass it to a layer, and not worry about the batch structure.

That's fair, and I originally agreed. But it's very easy to loop over the batches of a table and create a deck.gl layer for each one. That also shows the user that it's rendering one deck.gl layer per batch. For example:
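A minimal sketch of that loop, assuming a layer that accepts a RecordBatch in its data prop (the proposed API; the id scheme is illustrative):

```ts
// table.batches is apache-arrow's list of RecordBatches.
const layers = table.batches.map(
  (batch, i) =>
    new ScatterplotLayer({
      id: `points-batch-${i}`, // stable id per batch
      data: batch,             // one deck.gl layer per RecordBatch
    })
);
```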

> I must admit that I am not quite sure why the accessors should accept Data (I seem to recall that Data contains the binary columns and is not just a descriptor; if so, such objects could contain columns that need not even be part of the same table?)

Correct, Data contains the actual contiguous binary data, and is not "just" a reference to a column in the table. And correct, such objects do not need to be part of the same table. The only restriction is that they have the same number of rows as the primary data argument.

The existing implementation already allows this; it just expects Vector input instead of Data input.

Lonboard uses this feature a lot because it allows Lonboard to move accessors to JS separately from the main data table. So when an attribute like get_fill_color is regenerated on the Python side, the main table doesn't change at all.

The complexity here is that there's no guarantee that the Vector passed to an accessor has the same internal chunking as the main table. Until recently (developmentseed/lonboard#644) Lonboard silently failed on this, because looping over ("zipping") the main table's chunks and the separate Vector's chunks would pair up objects with different numbers of rows.

It's easier for the JS implementation to accept only RecordBatch and Data objects, because all it has to do is assert that they have the same number of rows.
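Under the proposal, that validation collapses to a single length check. A minimal sketch (the function name is illustrative):

```ts
import * as arrow from "apache-arrow";

function validateAccessor(data: arrow.RecordBatch, accessor: arrow.Data): void {
  // Contiguous inputs: equality of row counts is the entire check.
  if (accessor.length !== data.numRows) {
    throw new Error(
      `Accessor has ${accessor.length} rows but data has ${data.numRows}`
    );
  }
}
```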

> I would have expected that the job of the accessors would simply be to specify which columns in a table or batch to map to the various deck.gl attributes?

It's not usually enough (for JS-native applications) to specify an existing column within a table, because some operation may be required to derive the accessor data from the original data in the table.

In the case that an existing column should be passed directly as an accessor, that's also very easy in the existing API:

```js
new ScatterplotLayer({
  data: recordBatch,
  getFillColor: recordBatch.getChild("fillColor"), // accesses the relevant `Data` by name
});
```
> Alternatively the basic layer could just accept "a random assortment of Data objects", and the table/record batch input is a higher level abstraction that feeds Data from a table or batch to the more primitive construct?

Essentially yes. There's no theoretical requirement for a primary data object here: you could pass all the accessors in directly via Data objects and never pass anything into the layer's data parameter. What the data object is useful for is the tooltip: you might pass a table/record batch with several columns into the data prop, and then when picking, deck gets back a row index, which is looked up in the batch to find the attributes to show in the tooltip.

@ibgreen (Contributor) commented Sep 26, 2024

> For anyone else looking at this issue, it's probably good to define some terminology:

Nice, I am copying this to the loaders.gl ArrowJS docs which we can link to.

> Extract how, and would this involve CPU copies? If we have a Vector representing point data, would this extraction allocate a new buffer large enough to store the entire Vector in contiguous memory and copy each Data into that buffer? Essentially a concat operation?

Yes, the trivial (initial) implementation would be to "concat" all the data into a CPU array and then pass that to deck.gl. However, making deck.gl accept an array of arrays for each attribute would not be hard; it could then allocate a GPU buffer of the required size and just do a bunch of async GPU buffer writes. This would be part of the "deck.gl v10" overhaul (but could happen sooner, of course, if the direction is set).
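A minimal WebGPU sketch of that idea (illustrative only, not deck.gl code): allocate one GPU buffer sized for the whole column, then issue one write per Arrow chunk, with no CPU-side concat:

```ts
function uploadChunks(device: GPUDevice, chunks: Float32Array[]): GPUBuffer {
  const totalBytes = chunks.reduce((sum, c) => sum + c.byteLength, 0);
  const buffer = device.createBuffer({
    size: totalBytes,
    usage: GPUBufferUsage.VERTEX | GPUBufferUsage.COPY_DST,
  });
  // One queued copy per chunk, each targeting its own region of the buffer.
  let offset = 0;
  for (const chunk of chunks) {
    device.queue.writeBuffer(buffer, offset, chunk);
    offset += chunk.byteLength;
  }
  return buffer;
}
```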

I understand that you are keen to stay true to the spirit of Arrow and avoid any CPU copies; however, from my point of view it would be fine to offer Table support now and just document that it currently involves CPU memory copies. People with zero-copy priorities would be able to supply RecordBatches.

> But it's very easy to loop over the batches of a table and create a deck.gl layer for each one. That also shows the user that it's rendering one deck.gl layer per batch.

Yes, this seems to be our biggest philosophical difference that we keep coming back to. To avoid a never-ending thread, perhaps best to discuss in person.

  • One layer per batch simply doesn't make sense to me as a general solution. When test-streaming multi-gigabyte ArrowJS files to the browser, I have been using hundreds or even a thousand batches to make sure data updates frequently, and having one layer per batch is not appealing.
  • Also I'd like deck.gl layers to work on tables, not pieces of tables. That is not how I would like to present our API.
  • To me supporting both Tables and RecordBatches seems like a perfect compromise. We both get what we want.

> Correct, Data contains the actual contiguous binary data, and is not "just" a reference to a column in the table. And correct, such objects do not need to be part of the same table. The only restriction is that they have the same number of rows as the primary data argument.

My concern is mainly that I don't like mixing abstractions. Either a layer takes a map of Data, or it takes a Table.

  • Also, isn't a Table just a cheap JavaScript structure that organizes a list of columns, each of which holds Data objects?
  • And adding or replacing a column in a table is a cheap O(1) operation.
  • Couldn't the user just use the Arrow API to create a new Table, replacing the "accessor" columns with the new Data? (A sketch follows this list.)
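A sketch of that column swap (assuming apache-arrow's Table.setChild and vectorFromArray; the column name and contents are placeholders):

```ts
import { vectorFromArray } from "apache-arrow";

// Build a replacement column with the same row count (placeholder values),
// then derive a new Table; every other column's Data is reused as-is.
const newColors = vectorFromArray(
  Array.from({ length: table.numRows }, () => 255)
);
const updated = table.setChild("fillColor", newColors);
```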

@kylebarron (Collaborator, Author)

> Yes, the trivial (initial) implementation would be to "concat" all the data into a CPU array and then pass that to deck.gl. However, making deck.gl accept an array of arrays for each attribute would not be hard; it could then allocate a GPU buffer of the required size and just do a bunch of async GPU buffer writes.

Note that implementing concat for arbitrary input is not entirely trivial, as for some data types (especially bitmasks) you can't just concatenate the underlying buffers.
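Validity bitmaps illustrate why: they're bit-packed, 8 rows per byte, so a chunk whose length isn't a multiple of 8 ends with padding bits, and appending the next chunk's bytes would shift all of its bits out of position. A correct concat has to re-pack bit by bit (a minimal sketch, using Arrow's least-significant-bit numbering):

```ts
// Concatenate two validity bitmaps of aLen and bLen rows into one.
function concatValidity(a: Uint8Array, aLen: number, b: Uint8Array, bLen: number): Uint8Array {
  const out = new Uint8Array(Math.ceil((aLen + bLen) / 8));
  const getBit = (buf: Uint8Array, i: number) => (buf[i >> 3] >> (i & 7)) & 1;
  const setBit = (buf: Uint8Array, i: number) => { buf[i >> 3] |= 1 << (i & 7); };
  for (let i = 0; i < aLen; i++) if (getBit(a, i)) setBit(out, i);
  // b's first row lands at bit offset aLen, which is byte-aligned only if aLen % 8 === 0.
  for (let i = 0; i < bLen; i++) if (getBit(b, i)) setBit(out, aLen + i);
  return out;
}
```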

It is very cool that you could allocate one GPU buffer and copy multiple regions of source data into that one target buffer. That does make Table input more attractive in the long run.

> I understand that you are keen to stay true to the spirit of Arrow and avoid any CPU copies; however, from my point of view it would be fine to offer Table support now and just document that it currently involves CPU memory copies. People with zero-copy priorities would be able to supply RecordBatches.

That's fair.

> Yes, this seems to be our biggest philosophical difference that we keep coming back to. To avoid a never-ending thread, perhaps best to discuss in person.

I think at some level I don't know the pros/cons of having many layers in deck.gl. It seems the primary drawback is in picking, where you'd overflow the (current) max of 255 layers?

> Also I'd like deck.gl layers to work on tables, not pieces of tables. That is not how I would like to present our API.

FWIW I do agree that presenting a Table is a nicer API. The existing code expects Table input; it's only recently, as I've been struggling more with rechunking (since I don't want to always concat), that RecordBatch input has seemed more attractive.

I suppose it's not too hard to support both, especially if the table input is just concatted into a single record batch:

```js
if (input instanceof arrow.Table) {
  // concatTable: a helper (to be written) that concatenates all of the
  // table's batches into one contiguous RecordBatch.
  batch = concatTable(input);
} else if (input instanceof arrow.RecordBatch) {
  batch = input;
} else {
  throw new Error("unknown input");
}
```

> My concern is mainly that I don't like mixing abstractions. Either a layer takes a map of Data, or it takes a Table.

On the contrary, I think using Data/Vector objects is a clean mapping to the existing object-based deck API.

Just as the existing deck API accepts an array of JS objects into the data prop, I think the data prop should accept a Table: essentially those same JS objects but in columnar table form.

The existing deck API doesn't require that those JS objects already contain the accessor information. Accessors are defined either as function callbacks or as new buffers passed directly as attributes. This maps directly to the case of passing Data as an accessor prop.
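Side by side, the mapping looks like this (a sketch; the columnar form is the proposed API, not something that exists today):

```ts
// Existing object-based API: data is an array of JS objects,
// and the accessor callback derives new per-row data.
new ScatterplotLayer({
  data: rows,
  getFillColor: (d) => d.color,
});

// Columnar analogue: the same rows as an arrow.Table, with a
// precomputed arrow.Vector passed directly as the accessor.
new ScatterplotLayer({
  data: table,
  getFillColor: colorVector,
});
```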

I forget this because I never use it myself, but the existing Arrow implementation actually allows a function callback here as well, in which case the callback receives an arrow Row object:

```ts
props[propName] = <In>(object: any, objectInfo: AccessorContext<In>) => {
  // Special case that doesn't have the same parameters
  if (propName === 'getPolygonOffset') {
    return propInput(object, objectInfo);
  }
  return wrapAccessorFunction(objectInfo, propInput);
};
```

> And adding or replacing a column in a table is a cheap O(1) operation.

This is true, but keeping the accessors outside of the table makes it easier for deck to see when the geometry/main table has updated versus when only a single accessor has. So in Lonboard I never update the main table after the initial render of each layer. This means that the geometries never need to be rendered again from scratch. Passing the buffers in as separate objects shows deck that only those accessors need to be recomputed.

@kylebarron changed the title from "[Feat] Update Arrow layers to accept RecordBatch, not Table" to "[Feat] Update Arrow layers to support both RecordBatch and Table input" on Sep 26, 2024
@ibgreen (Contributor) commented Sep 27, 2024

Good discussion. I think we are aligned enough to proceed. Just continuing the conversation on some of the interesting topics.

> The existing deck API doesn't require that those JS objects already contain the accessor information. Accessors are defined either as function callbacks or as new buffers passed directly as attributes. This maps directly to the case of passing Data as an accessor prop.

True. The intention of the JS accessor functions wasn't really to support using columns from different tables, but I suppose they can be used that way. The binary accessor API certainly wasn't designed or audited in a thoughtful way; we just tried to quickly expose a way to allow binary data to be passed in.

The nascent mental model in my mind is that a deck.gl layer would accept a GPUTable-type object (an "Arrow table style" class where columns are GPU Buffers).

We'd build a layer-independent system that maps Arrow Tables, RecordBatches, and Data objects into GPUTables and perhaps GPUColumns, etc.

Then the layer can accept either a GPUTable or an Arrow Table, in which case it will convert to a GPUTable under the hood.

> So in Lonboard I never update the main table after the initial render of each layer. This means that the geometries never need to be rendered again from scratch. Passing the buffers in as separate objects shows deck that only those accessors need to be recomputed.

My position is that, as a functional programming API, "the business of deck.gl is diffing". So if we treat Arrow Tables as first-class citizens, we should implement diffing that understands the internal structure of Arrow Tables and ignores any columns whose Data objects we have already uploaded to the GPU, so that the user doesn't have to manage such separate Data objects (unless he or she wants to; i.e. I don't want to disallow it, but would love to provide a "simpler path" as well).

As a side note, it would also be neat if there were a "declarative" way to specify simple accessors into an Arrow table: maybe strings with column names, requiring no JS code. Then we could support Arrow layers (for simple Arrow tables) in deck.gl/json, deck.gl playground, traditional pydeck, etc.
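Something like this, perhaps (purely hypothetical; no such string form exists today):

```ts
new ScatterplotLayer({
  data: table,
  getFillColor: "fill_color", // resolved internally, e.g. via table.getChild("fill_color")
});
```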

> I think at some level I don't know the pros/cons of having many layers in deck.gl. It seems the primary drawback is in picking, where you'd overflow the (current) max of 255 layers?

I happen to have a new PoC picking manager in luma.gl that uses WebGPU / WebGL2 techniques to remove the picking limit.

However there is still a performance "limit" to how many layers deck.gl can handle. Hundreds of layers performs well, but thousands will start to tax the diffing engine and generate a lot of draw calls etc. The decline will be gradual but layers aren't completely "free".

One use case I didn't think of mentioning is streaming loads via loaders.gl from non-Arrow formats into Arrow. There I want to emit RecordBatches as data comes in, and since the data size is often not known a priori, it is impossible to limit the number of batches being generated (other than to stop emitting batches after some limit and finish off with a "monster" batch, or perhaps emit exponentially larger batches towards the end...).

@kylebarron (Collaborator, Author)

> The intention of the JS accessor functions wasn't really to support using columns from different tables, but I suppose they can be used that way.

From my point of view, those accessor functions are creating new data. The new data is usually derived from the primary table (data prop) but there's no actual restriction here. The only restriction is that the output buffers are compatible; essentially that they have the same length.

So likewise with the Arrow API it's possible to enforce that the attributes are part of the original table, but that's unnecessary rigidity.

By accepting a Vector/Data object we get a much more flexible API.

If the attribute data is already in the table:

```js
new ScatterplotLayer({
  data: table,
  getColor: table.getChild("colors"),
});
```

If the attribute data is generated separately:

```js
const colorVector = new Arrow.Vector(...);
// This assertion would be moved internally somewhere
assert(table.numRows === colorVector.length);
new ScatterplotLayer({
  data: table,
  getColor: colorVector,
});
```

The GPU concepts should align: a GPUTable could be constructed as a collection of GPUVectors, and passing a Vector would copy it to a GPUVector. GPUTable would just need a getChild method to access a GPUVector, for symmetry with Arrow JS.
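A sketch of that symmetry (all of these GPU-side types are hypothetical names from this discussion):

```ts
interface GPUVector {
  readonly length: number;
  readonly buffer: GPUBuffer; // WebGPU buffer holding the column's contiguous data
}

interface GPUTable {
  readonly numRows: number;
  getChild(name: string): GPUVector | null; // mirrors arrow.Table.getChild
}
```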
