Update Arrow/Parquet to 51.0.0, tonic to 0.11 (#9613)
Changes from 6 commits: 6ee1fe0, 26c4df3, 6d506b5, 3267465, e2d39a6, 6dd8b0e, 0c2c918, 59332ee, a8bc49b, 456e2fe, f550b64
```diff
@@ -15,61 +15,61 @@
 // specific language governing permissions and limitations
 // under the License.

+use arrow::array::AsArray;
+use arrow::datatypes::{Float64Type, Int32Type};
 use datafusion::error::Result;
 use datafusion::prelude::*;
-use serde::Deserialize;
+use futures::StreamExt;

 /// This example shows that it is possible to convert query results into Rust structs.
-/// It will collect the query results into RecordBatch, then convert it to serde_json::Value.
-/// Then, serde_json::Value is turned into Rust's struct.
-/// Any datatype with `Deserialize` implemeneted works.
 #[tokio::main]
 async fn main() -> Result<()> {
     let data_list = Data::new().await?;
     println!("{data_list:#?}");
     Ok(())
 }

-#[derive(Deserialize, Debug)]
+#[derive(Debug)]
 struct Data {
     #[allow(dead_code)]
-    int_col: i64,
+    int_col: i32,
     #[allow(dead_code)]
     double_col: f64,
 }

 impl Data {
     pub async fn new() -> Result<Vec<Self>> {
-        // this group is almost the same as the one you find it in parquet_sql.rs
-        let batches = {
-            let ctx = SessionContext::new();
+        let ctx = SessionContext::new();

-            let testdata = datafusion::test_util::parquet_test_data();
+        let testdata = datafusion::test_util::parquet_test_data();

-            ctx.register_parquet(
-                "alltypes_plain",
-                &format!("{testdata}/alltypes_plain.parquet"),
-                ParquetReadOptions::default(),
-            )
-            .await?;
+        ctx.register_parquet(
+            "alltypes_plain",
+            &format!("{testdata}/alltypes_plain.parquet"),
+            ParquetReadOptions::default(),
+        )
+        .await?;

-            let df = ctx
-                .sql("SELECT int_col, double_col FROM alltypes_plain")
-                .await?;
+        let df = ctx
+            .sql("SELECT int_col, double_col FROM alltypes_plain")
+            .await?;

-            df.clone().show().await?;
+        df.clone().show().await?;

-            df.collect().await?
-        };
-        let batches: Vec<_> = batches.iter().collect();
+        let mut stream = df.execute_stream().await?;
+        let mut list = vec![];
+        while let Some(b) = stream.next().await.transpose()? {
+            let int_col = b.column(0).as_primitive::<Int32Type>();
+            let float_col = b.column(1).as_primitive::<Float64Type>();

-        // converts it to serde_json type and then convert that into Rust type
-        let list = arrow::json::writer::record_batches_to_json_rows(&batches[..])?
-            .into_iter()
-            .map(|val| serde_json::from_value(serde_json::Value::Object(val)))
-            .take_while(|val| val.is_ok())
-            .map(|val| val.unwrap())
-            .collect();
+            for (i, f) in int_col.values().iter().zip(float_col.values()) {
+                list.push(Data {
+                    int_col: *i,
+                    double_col: *f,
+                })
+            }
+        }

         Ok(list)
     }
```

Review comments on this file:

- On the doc comment: "apache/arrow-rs#5318 deprecated the serde_json based APIs."
- On the conversion loop: "I do think showing how to use serde to convert arrow --> rust structs is important. While I am well aware its performance is not good, the serde concept is widely understood and supported in the rust ecosystem. Is there any API that can do serde into Rust structs in the core arrow crates anymore? If not, perhaps we can point in comments at a crate like https://github.com/chmp/serde_arrow (or bring an example that parses the JSON back to …). We/I can do this as a follow on PR."
- Reply: "You can serialize to JSON and parse it, but I would rather encourage people towards the performant way of doing things. I'd dispute that we ever really had a way to do this; going via serde_json::Value is more of a hack than anything else. Serializing to a JSON string and back will likely be faster."
- Reply: "The key thing in my mind is to make it easy / quick for new users to get something working quickly. I am well aware that custom array -> struct will be the fastest performance, but I think it takes non-trivial expertise in manipulating the arrow-rs API (especially when it comes to StructArray and ListArray), so offering them a fast way to get started with a slower API is important, I think."
- Reply: "I think since this is an example, we can always update / improve it as a follow on PR."
```diff
@@ -156,6 +156,7 @@ pub(crate) fn parse_encoding_string(
         "plain" => Ok(parquet::basic::Encoding::PLAIN),
         "plain_dictionary" => Ok(parquet::basic::Encoding::PLAIN_DICTIONARY),
         "rle" => Ok(parquet::basic::Encoding::RLE),
+        #[allow(deprecated)]
         "bit_packed" => Ok(parquet::basic::Encoding::BIT_PACKED),
         "delta_binary_packed" => Ok(parquet::basic::Encoding::DELTA_BINARY_PACKED),
         "delta_length_byte_array" => {
```

Review comments on this file:

- "I don't understand the reference (to the JSON writer) when this is for parquet encoding. Is there some other encoding/compression scheme that was deprecated too?"
- Reply: "This is a copypasta meant to link to apache/arrow-rs#5348."
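The added `#[allow(deprecated)]` line works because Rust accepts attributes on individual match arms, so only the `BIT_PACKED` case opts out of the deprecation lint while the other arms stay checked. A minimal self-contained sketch of the same pattern, using a hypothetical `Encoding` enum in place of `parquet::basic::Encoding`:

```rust
// Hypothetical enum standing in for parquet::basic::Encoding;
// BitPacked plays the role of the deprecated BIT_PACKED variant.
enum Encoding {
    Plain,
    Rle,
    #[deprecated(note = "use delta_binary_packed instead")]
    BitPacked,
}

fn parse_encoding(s: &str) -> Result<Encoding, String> {
    match s {
        "plain" => Ok(Encoding::Plain),
        "rle" => Ok(Encoding::Rle),
        // The attribute scopes the lint exception to this single arm.
        #[allow(deprecated)]
        "bit_packed" => Ok(Encoding::BitPacked),
        other => Err(format!("unknown encoding: {other}")),
    }
}

fn main() {
    assert!(parse_encoding("plain").is_ok());
    assert!(parse_encoding("bit_packed").is_ok());
    assert!(parse_encoding("zstd").is_err());
}
```

Scoping the `allow` to one arm keeps the deprecation warning active for any other use of the variant that might creep in later.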
Review comments:

- "No longer need this comment."
- Reply: "Removed in 456e2fe."