
Add json_encode to pl.List #14029

Open
david-waterworth opened this issue Jan 26, 2024 · 6 comments · May be fixed by #19407
Labels
enhancement New feature or an improvement of an existing feature

Comments

@david-waterworth

Description

pl.Struct has json_encode / json_decode, which among other things is useful when you want to dump a table to CSV but it contains nested fields.

It would be useful if pl.List also had at least json_encode (json_decode might be trickier, as it would have to check that the JSON is a list).

At the moment my workaround is:

import json

import polars as pl

df = pl.DataFrame(
    {
        "a": [1, 2, 3, 1],
        "b": [["a"], ["b"], ["c"], ["c"]],
    }
)

df.with_columns(pl.col("b").map_elements(lambda e: json.dumps(list(e))))

I considered using to_struct but that doesn't preserve the list structure.

Also, the error from my original attempt:

df.with_columns(pl.col("b").map_elements(lambda e: json.dumps(e)))

ComputeError: TypeError: Object of type Series is not JSON serializable

I guess internally a list is a Series, but that was surprising. Ideally the engine would cast to a Python list internally before invoking the function passed to map_elements?

@david-waterworth david-waterworth added the enhancement New feature or an improvement of an existing feature label Jan 26, 2024
@cmdlineluser
Contributor

Yeah, I think I mentioned it at the time, but it may have gotten lost in translation.

Essentially the equivalent of:

import datetime

import duckdb
import polars as pl

df = pl.DataFrame({
    "a": [["1", "2"], ["3", "4"]],
    "b": [[dict(B=1)], [dict(B=2)]],
    "c": [dict(C=3), dict(C=4)],
    "d": [datetime.date.today(), None],
    "e": [5, 6]
})

# shape: (2, 5)
# ┌────────────┬─────────────────┬───────────┬────────────┬─────┐
# │ a          ┆ b               ┆ c         ┆ d          ┆ e   │
# │ ---        ┆ ---             ┆ ---       ┆ ---        ┆ --- │
# │ list[str]  ┆ list[struct[1]] ┆ struct[1] ┆ date       ┆ i64 │
# ╞════════════╪═════════════════╪═══════════╪════════════╪═════╡
# │ ["1", "2"] ┆ [{1}]           ┆ {3}       ┆ 2024-01-27 ┆ 5   │
# │ ["3", "4"] ┆ [{2}]           ┆ {4}       ┆ null       ┆ 6   │
# └────────────┴─────────────────┴───────────┴────────────┴─────┘
duckdb.sql("""
from df 
select a::json, b::json, c::json, d::json, e::json
""").pl()

# shape: (2, 5)
# ┌───────────────────┬───────────────────┬───────────────────┬───────────────────┬───────────────────┐
# │ CAST(a AS "json") ┆ CAST(b AS "json") ┆ CAST(c AS "json") ┆ CAST(d AS "json") ┆ CAST(e AS "json") │
# │ ---               ┆ ---               ┆ ---               ┆ ---               ┆ ---               │
# │ str               ┆ str               ┆ str               ┆ str               ┆ str               │
# ╞═══════════════════╪═══════════════════╪═══════════════════╪═══════════════════╪═══════════════════╡
# │ ["1","2"]         ┆ [{"B":1}]         ┆ {"C":3}           ┆ "2024-01-27"      ┆ 5                 │
# │ ["3","4"]         ┆ [{"B":2}]         ┆ {"C":4}           ┆ null              ┆ 6                 │
# └───────────────────┴───────────────────┴───────────────────┴───────────────────┴───────────────────┘

(Decoding is .str.json_decode(); it's not struct-specific.)

(I'm assuming .map_elements requires explicit conversion to list by the user for performance reasons.)

@david-waterworth
Author

Yeah, after I posted the example I realised I also need to cast/encode a list[struct[*]] as well.

@deanm0000
Collaborator

map_elements doesn't work because to turn the column into JSON it needs the whole column at once, and map_elements works on one item at a time.

Instead, use map_batches like this:

df.select(pl.col('b').map_batches(lambda x: json.dumps(x.to_list())))

@david-waterworth
Author

@deanm0000 thanks for the suggestion, but I actually want it to process one item at a time. I don't want a single valid JSON string for the entire column; I specifically want to convert each element to a JSON fragment. Both are valid use cases, as I often work with jsonlines, and in this case I just want to dump the entire frame to CSV.

@cmdlineluser
Contributor

As far as I can tell, all that is needed is to add this to list.rs:

#[cfg(feature = "json")]
pub(super) fn to_json(s: &Series) -> PolarsResult<Series> {
    let ca = s.struct_()?;
    let dtype = ca.dtype().to_arrow(true);
    let iter = ca.chunks().iter().map(|arr| {
        let arr = arrow::compute::cast::cast_unchecked(arr.as_ref(), &dtype).unwrap();
        polars_json::json::write::serialize_to_utf8(arr.as_ref())
    });
    Ok(StringChunked::from_chunk_iter(ca.name(), iter).into_series())
}

and change let ca = s.struct_()?; to let ca = s.list()?;

But perhaps someone can answer whether this should be implemented as Expr.json_encode() instead?

And not limited to lists/structs?

@DeflateAwning
Contributor

DeflateAwning commented May 15, 2024

Seems to be a duplicate of #8482.

Seems like a pretty easy solution. Would be awesome to be able to use pl.col('some_list_col').list.json_encode() (and the same for the arr namespace). Can this please be implemented?
