Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Need a mechanism to handle schema changes due to dictionary hydration in FlightSQL server implementations #6672

Closed
nathanielc opened this issue Nov 1, 2024 · 5 comments · Fixed by #6688
Labels
arrow Changes to the arrow crate arrow-flight Changes to the arrow-flight crate enhancement Any new improvement worthy of a entry in the changelog

Comments

@nathanielc
Copy link
Contributor

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

I am implementing a flight sql server using datafusion. See this logic that simply reports the flight_info schema as the result of the query schema.

The FlightDataEncoder has two modes for dictionary handling. In one mode it hydrates dictionaries thus changing the schema of the data during transport. The flight sql server needs to reflect the hydrated schema otherwise clients will be confused as the data received will not match the reported schema.

Describe the solution you'd like

A simple solution would be to make this function public API so it can be reused.
Describe alternatives you've considered

I have seen but not followed closely work to create logical types separate from physical types. Possibly there is room for flight_info requests to report logical schemas and for the server to use any valid physical encoding of the data. This however requires much more coordination between clients and servers. Additionally its not clear that flight_info requests should actually deal in logical instead of physical schemas.

Additional context

My solution for now is to copy to the logic into the server implementation.
I'd be happy to submit a PR to make the function public if that is what we think is a good solution.

@nathanielc nathanielc added the enhancement Any new improvement worthy of a entry in the changelog label Nov 1, 2024
nathanielc added a commit to nathanielc/datafusion-federation that referenced this issue Nov 1, 2024
With flight_info requests the returned schema needs to exactly match the
encoded flight data. The FlightDataEncoder will change the schema in
order to hydrate dictionaries. This means the schema reported from
flight_info requests did not match the encoded data.

This change copies the logic from arrow-flight that can transform a schema
for hydrated dictionaries. See
apache/arrow-rs#6672 where the copied logic
may be exposed directly.
backkem pushed a commit to datafusion-contrib/datafusion-federation that referenced this issue Nov 4, 2024
With flight_info requests the returned schema needs to exactly match the
encoded flight data. The FlightDataEncoder will change the schema in
order to hydrate dictionaries. This means the schema reported from
flight_info requests did not match the encoded data.

This change copies the logic from arrow-flight that can transform a schema
for hydrated dictionaries. See
apache/arrow-rs#6672 where the copied logic
may be exposed directly.
@alamb
Copy link
Contributor

alamb commented Nov 4, 2024

How about creating a FlightDataEncoder to encode an empty stream and then read the schema off the stream

let empty_stream = FlightDataEncoderBuilder::new()
  .with_schema(pre_encoded_schema)
  .build(streams::iter(vec![]));
let schema = empty_stream.schema();

If that works, perhaps we can add an example to the documentation

I would be hesitant to just make prepare_schema_for_flight public as it seems somewhat brittle as the arguments need to remain in sync with however the FlightDataEncoder is constructed, but it uses different types

I have seen but not followed closely work to create logical types separate from physical types. Possibly there is room for flight_info requests to report logical schemas and for the server to use any valid physical encoding of the data. This however requires much more coordination between clients and servers. Additionally its not clear that flight_info requests should actually deal in logical instead of physical schemas.

FWIW the logical type idea will likely remain in DataFusion as there is no concept of LogicalType in the Arrow type system (for better / worse)

@nathanielc
Copy link
Contributor Author

@alamb Agreed, exposing the API is a fragile solution.

I like your proposed approach however the FlightDataEncoder type does not expose a method to access the schema. However that would be a small addition to its API. Should we add a the function

pub fn schema(&self) -> Option<SchemaRef> {
    self.schema.clone()
}

In cases where the schema is known upfront it will have been hydrated and in cases where its not known upfront a None is returned. Thoughts? Maybe we call the function known_schema to make it clear its only available when the schema is known upfront?

@alamb
Copy link
Contributor

alamb commented Nov 5, 2024

Makes sense to me!

nathanielc added a commit to nathanielc/arrow-rs that referenced this issue Nov 5, 2024
With this change is now possible to inspect the encoded schema of the
encoded Flight data, which may differ from the input schema based on
dictionary hydration handling.

Fixes apache#6672
@alamb alamb closed this as completed in 9471bfb Nov 8, 2024
@alamb alamb added the arrow Changes to the arrow crate label Nov 16, 2024
@alamb
Copy link
Contributor

alamb commented Nov 16, 2024

label_issue.py automatically added labels {'arrow'} from #6688

@alamb alamb added the arrow-flight Changes to the arrow-flight crate label Nov 16, 2024
@alamb
Copy link
Contributor

alamb commented Nov 16, 2024

label_issue.py automatically added labels {'arrow-flight'} from #6688

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate arrow-flight Changes to the arrow-flight crate enhancement Any new improvement worthy of a entry in the changelog
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants