Need a mechanism to handle schema changes due to dictionary hydration in FlightSQL server implementations #6672

nathanielc · 2024-11-01T19:34:25Z

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

I am implementing a flight sql server using datafusion. See this logic that simply reports the flight_info schema as the result of the query schema.

The FlightDataEncoder has two modes for dictionary handling. In one mode it hydrates dictionaries thus changing the schema of the data during transport. The flight sql server needs to reflect the hydrated schema otherwise clients will be confused as the data received will not match the reported schema.

Describe the solution you'd like

A simple solution would be to make this function public API so it can be reused.
Describe alternatives you've considered

I have seen but not followed closely work to create logical types separate from physical types. Possibly there is room for flight_info requests to report logical schemas and for the server to use any valid physical encoding of the data. This however requires much more coordination between clients and servers. Additionally its not clear that flight_info requests should actually deal in logical instead of physical schemas.

Additional context

My solution for now is to copy to the logic into the server implementation.
I'd be happy to submit a PR to make the function public if that is what we think is a good solution.

The text was updated successfully, but these errors were encountered:

With flight_info requests the returned schema needs to exactly match the encoded flight data. The FlightDataEncoder will change the schema in order to hydrate dictionaries. This means the schema reported from flight_info requests did not match the encoded data. This change copies the logic from arrow-flight that can transform a schema for hydrated dictionaries. See apache/arrow-rs#6672 where the copied logic may be exposed directly.

alamb · 2024-11-04T20:09:56Z

How about creating a FlightDataEncoder to encode an empty stream and then read the schema off the stream

let empty_stream = FlightDataEncoderBuilder::new()
  .with_schema(pre_encoded_schema)
  .build(streams::iter(vec![]));
let schema = empty_stream.schema();

If that works, perhaps we can add an example to the documentation

I would be hesitant to just make prepare_schema_for_flight public as it seems somewhat brittle as the arguments need to remain in sync with however the FlightDataEncoder is constructed, but it uses different types

I have seen but not followed closely work to create logical types separate from physical types. Possibly there is room for flight_info requests to report logical schemas and for the server to use any valid physical encoding of the data. This however requires much more coordination between clients and servers. Additionally its not clear that flight_info requests should actually deal in logical instead of physical schemas.

FWIW the logical type idea will likely remain in DataFusion as there is no concept of LogicalType in the Arrow type system (for better / worse)

nathanielc · 2024-11-04T22:03:02Z

@alamb Agreed, exposing the API is a fragile solution.

I like your proposed approach however the FlightDataEncoder type does not expose a method to access the schema. However that would be a small addition to its API. Should we add a the function

pub fn schema(&self) -> Option<SchemaRef> {
    self.schema.clone()
}

In cases where the schema is known upfront it will have been hydrated and in cases where its not known upfront a None is returned. Thoughts? Maybe we call the function known_schema to make it clear its only available when the schema is known upfront?

alamb · 2024-11-05T12:41:32Z

Makes sense to me!

With this change is now possible to inspect the encoded schema of the encoded Flight data, which may differ from the input schema based on dictionary hydration handling. Fixes apache#6672

alamb · 2024-11-16T14:32:24Z

label_issue.py automatically added labels {'arrow'} from #6688

alamb · 2024-11-16T14:32:27Z

label_issue.py automatically added labels {'arrow-flight'} from #6688

nathanielc added the enhancement Any new improvement worthy of a entry in the changelog label Nov 1, 2024

nathanielc mentioned this issue Nov 1, 2024

fix: report hydrated schemas from flight_info requests datafusion-contrib/datafusion-federation#74

Merged

nathanielc mentioned this issue Nov 5, 2024

feat: expose known_schema from FlightDataEncoder #6688

Merged

alamb closed this as completed in #6688 Nov 8, 2024

alamb closed this as completed in 9471bfb Nov 8, 2024

alamb added the arrow Changes to the arrow crate label Nov 16, 2024

alamb added the arrow-flight Changes to the arrow-flight crate label Nov 16, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Need a mechanism to handle schema changes due to dictionary hydration in FlightSQL server implementations #6672

Need a mechanism to handle schema changes due to dictionary hydration in FlightSQL server implementations #6672

nathanielc commented Nov 1, 2024

alamb commented Nov 4, 2024

nathanielc commented Nov 4, 2024

alamb commented Nov 5, 2024

alamb commented Nov 16, 2024

alamb commented Nov 16, 2024

Need a mechanism to handle schema changes due to dictionary hydration in FlightSQL server implementations #6672

Need a mechanism to handle schema changes due to dictionary hydration in FlightSQL server implementations #6672

Comments

nathanielc commented Nov 1, 2024

alamb commented Nov 4, 2024

nathanielc commented Nov 4, 2024

alamb commented Nov 5, 2024

alamb commented Nov 16, 2024

alamb commented Nov 16, 2024