-
Notifications
You must be signed in to change notification settings - Fork 794
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Need a mechanism to handle schema changes due to dictionary hydration in FlightSQL server implementations #6672
Comments
With flight_info requests the returned schema needs to exactly match the encoded flight data. The FlightDataEncoder will change the schema in order to hydrate dictionaries. This means the schema reported from flight_info requests did not match the encoded data. This change copies the logic from arrow-flight that can transform a schema for hydrated dictionaries. See apache/arrow-rs#6672 where the copied logic may be exposed directly.
With flight_info requests the returned schema needs to exactly match the encoded flight data. The FlightDataEncoder will change the schema in order to hydrate dictionaries. This means the schema reported from flight_info requests did not match the encoded data. This change copies the logic from arrow-flight that can transform a schema for hydrated dictionaries. See apache/arrow-rs#6672 where the copied logic may be exposed directly.
How about creating a FlightDataEncoder to encode an empty stream and then read the schema off the stream let empty_stream = FlightDataEncoderBuilder::new()
.with_schema(pre_encoded_schema)
.build(streams::iter(vec![]));
let schema = empty_stream.schema(); If that works, perhaps we can add an example to the documentation I would be hesitant to just make
FWIW the logical type idea will likely remain in DataFusion as there is no concept of LogicalType in the Arrow type system (for better / worse) |
@alamb Agreed, exposing the API is a fragile solution. I like your proposed approach however the FlightDataEncoder type does not expose a method to access the schema. However that would be a small addition to its API. Should we add a the function pub fn schema(&self) -> Option<SchemaRef> {
self.schema.clone()
} In cases where the schema is known upfront it will have been hydrated and in cases where its not known upfront a None is returned. Thoughts? Maybe we call the function |
Makes sense to me! |
With this change is now possible to inspect the encoded schema of the encoded Flight data, which may differ from the input schema based on dictionary hydration handling. Fixes apache#6672
|
|
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
I am implementing a flight sql server using datafusion. See this logic that simply reports the flight_info schema as the result of the query schema.
The
FlightDataEncoder
has two modes for dictionary handling. In one mode it hydrates dictionaries thus changing the schema of the data during transport. The flight sql server needs to reflect the hydrated schema otherwise clients will be confused as the data received will not match the reported schema.Describe the solution you'd like
A simple solution would be to make this function public API so it can be reused.
Describe alternatives you've considered
I have seen but not followed closely work to create logical types separate from physical types. Possibly there is room for flight_info requests to report logical schemas and for the server to use any valid physical encoding of the data. This however requires much more coordination between clients and servers. Additionally its not clear that flight_info requests should actually deal in logical instead of physical schemas.
Additional context
My solution for now is to copy to the logic into the server implementation.
I'd be happy to submit a PR to make the function public if that is what we think is a good solution.
The text was updated successfully, but these errors were encountered: