
Add json_encode to pl.List #14029

Open
david-waterworth opened this issue Jan 26, 2024 · 6 comments · May be fixed by #19407
Labels
enhancement New feature or an improvement of an existing feature

Comments

@david-waterworth

Description

pl.Struct has json_encode / json_decode, which among other things is useful when you want to dump a table to CSV but it contains nested fields.

It would be useful if pl.List also had at least json_encode (json_decode might be trickier, as it would have to check that the JSON is a list).

At the moment my workaround is:

import json

import polars as pl

df = pl.DataFrame(
    {
        "a": [1, 2, 3, 1],
        "b": [["a"], ["b"], ["c"], ["c"]],
    }
)

df.with_columns(pl.col("b").map_elements(lambda e: json.dumps(list(e))))

I considered using to_struct but that doesn't preserve the list structure.

Also, the error from my original attempt:

df.with_columns(pl.col("b").map_elements(lambda e: json.dumps(e)))

ComputeError: TypeError: Object of type Series is not JSON serializable

I guess internally a list is a Series, but that was surprising. Ideally the engine would cast to a Python list internally before invoking the function passed to map_elements?

@david-waterworth david-waterworth added the enhancement New feature or an improvement of an existing feature label Jan 26, 2024
@cmdlineluser
Contributor

Yeah, I think I mentioned it at the time, but it may have gotten lost in translation.

Essentially the equivalent of:

import datetime

import duckdb
import polars as pl

df = pl.DataFrame({
    "a": [["1", "2"], ["3", "4"]],
    "b": [[dict(B=1)], [dict(B=2)]],
    "c": [dict(C=3), dict(C=4)],
    "d": [datetime.date.today(), None],
    "e": [5, 6]
})

# shape: (2, 5)
# ┌────────────┬─────────────────┬───────────┬────────────┬─────┐
# │ a          ┆ b               ┆ c         ┆ d          ┆ e   │
# │ ---        ┆ ---             ┆ ---       ┆ ---        ┆ --- │
# │ list[str]  ┆ list[struct[1]] ┆ struct[1] ┆ date       ┆ i64 │
# ╞════════════╪═════════════════╪═══════════╪════════════╪═════╡
# │ ["1", "2"] ┆ [{1}]           ┆ {3}       ┆ 2024-01-27 ┆ 5   │
# │ ["3", "4"] ┆ [{2}]           ┆ {4}       ┆ null       ┆ 6   │
# └────────────┴─────────────────┴───────────┴────────────┴─────┘
duckdb.sql("""
from df 
select a::json, b::json, c::json, d::json, e::json
""").pl()

# shape: (2, 5)
# ┌───────────────────┬───────────────────┬───────────────────┬───────────────────┬───────────────────┐
# │ CAST(a AS "json") ┆ CAST(b AS "json") ┆ CAST(c AS "json") ┆ CAST(d AS "json") ┆ CAST(e AS "json") │
# │ ---               ┆ ---               ┆ ---               ┆ ---               ┆ ---               │
# │ str               ┆ str               ┆ str               ┆ str               ┆ str               │
# ╞═══════════════════╪═══════════════════╪═══════════════════╪═══════════════════╪═══════════════════╡
# │ ["1","2"]         ┆ [{"B":1}]         ┆ {"C":3}           ┆ "2024-01-27"      ┆ 5                 │
# │ ["3","4"]         ┆ [{"B":2}]         ┆ {"C":4}           ┆ null              ┆ 6                 │
# └───────────────────┴───────────────────┴───────────────────┴───────────────────┴───────────────────┘

(Decoding is .str.json_decode(); it's not struct-specific.)

(I'm assuming .map_elements requires explicit conversion to list by the user for performance reasons.)

@david-waterworth
Author

Yeah, after I posted the example I realised I also need to cast/encode a list[struct[*]] as well.

@deanm0000
Collaborator

map_elements doesn't work because to turn the column into JSON it needs the whole column at once, and map_elements works on one item at a time.

Instead, use map_batches like this:

df.select(pl.col('b').map_batches(lambda x: json.dumps(x.to_list())))

@david-waterworth
Author

@deanm0000 thanks for the suggestion, but I actually want it to process one item at a time. I don't want a single valid JSON string for the entire column; I specifically want to convert each element to a JSON fragment. Both are valid use cases, as I often work with jsonlines, and in this case I just want to dump the entire frame to CSV.

@cmdlineluser
Contributor

As far as I can tell, all that is needed is to add this to list.rs:

#[cfg(feature = "json")]
pub(super) fn to_json(s: &Series) -> PolarsResult<Series> {
    let ca = s.struct_()?;
    let dtype = ca.dtype().to_arrow(true);
    let iter = ca.chunks().iter().map(|arr| {
        let arr = arrow::compute::cast::cast_unchecked(arr.as_ref(), &dtype).unwrap();
        polars_json::json::write::serialize_to_utf8(arr.as_ref())
    });
    Ok(StringChunked::from_chunk_iter(ca.name(), iter).into_series())
}

and change let ca = s.struct_()?; to let ca = s.list()?;

But perhaps someone can answer whether this should be implemented as Expr.json_encode() instead?

And not limited to lists/structs?

@DeflateAwning
Contributor

DeflateAwning commented May 15, 2024

Seems to be a duplicate of #8482.

Seems like a pretty easy solution. Would be awesome to be able to use pl.col('some_list_col').list.json_encode() (and the same for the arr namespace). Can this please be implemented?
