Updated dataframe interactions #29
Just opened an issue that is similar to this one (#42). Closing out issue #42 and adding the comment to this issue to consolidate the discussion:

Subgrounds should offer dataframe utility support for multiple libraries, not just pandas. Currently the only dataframe utility functions are pandas-based, found in `dataframe_utils.py`. The current direction of Subgrounds is towards a multi-client world. One alternative to the base client would be to use polars instead of pandas dataframes. However, `dataframe_utils.py` currently only offers pandas helpers, which effectively precludes using polars with Subgrounds. To use Subgrounds with polars, two functions that constantly have to be redefined are `fmt_dict_cols` and `fmt_arr_cols`.

`fmt_dict_cols` is required to convert GraphQL JSON data into polars dataframe columns. Example code:

```python
import polars as pl


def fmt_dict_cols(df: pl.DataFrame) -> pl.DataFrame:
    """
    Formats dictionary columns, which are structs in a polars DataFrame,
    into separate columns and renames them accordingly.
    """
    for column in df.columns:
        if isinstance(df[column][0], dict):
            col_names = df[column][0].keys()
            # Rename the struct fields so the unnested columns are
            # prefixed with the original column name.
            struct_df = df.select(
                pl.col(column).struct.rename_fields(
                    [f"{column}_{c}" for c in col_names]
                )
            )
            struct_df = struct_df.unnest(column)
            # Add the unnested columns to df and drop the original column.
            df = df.with_columns(struct_df)
            df = df.drop(column)
    return df


def fmt_arr_cols(df: pl.DataFrame) -> pl.DataFrame:
    """
    Formats list columns, which are arrays in a polars DataFrame, into
    separate columns and renames them accordingly. Since there isn't a
    direct way to convert an array into new columns, we convert the array
    to a struct and then unnest the struct into new columns.
    """
    for column in df.columns:
        # Rows of a list column show up as pl.Series values.
        if isinstance(df[column][0], pl.Series):
            # Convert the array to a struct.
            struct_df = df.select([pl.col(column).arr.to_struct()])
            # Rename the struct fields, one per element of the array.
            struct_df = struct_df.select(
                pl.col(column).struct.rename_fields(
                    [f"{column}_{i}" for i in range(len(df[column][0]))]
                )
            )
            # Unnest the struct fields into their own columns.
            struct_df = struct_df.unnest(column)
            # Add the unnested columns to df and drop the original column.
            df = df.with_columns(struct_df)
            df = df.drop(column)
    return df
```
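For readers without polars at hand, the transformation `fmt_dict_cols` performs (flattening a column of dicts into scalar columns prefixed with the parent column name) can be illustrated with plain Python. This is a library-free sketch of the idea, not Subgrounds code, and `flatten_dict_column` is a hypothetical helper name:

```python
def flatten_dict_column(rows, column):
    """Flatten rows[i][column] (a dict) into prefixed scalar keys,
    mirroring what fmt_dict_cols does for a polars struct column.
    Hypothetical illustration, not part of Subgrounds."""
    out = []
    for row in rows:
        # Keep every other key as-is.
        flat = {k: v for k, v in row.items() if k != column}
        # Expand the nested dict into "<column>_<key>" entries.
        for key, value in row[column].items():
            flat[f"{column}_{key}"] = value
        out.append(flat)
    return out


rows = [{"id": 1, "token": {"symbol": "ETH", "decimals": 18}}]
print(flatten_dict_column(rows, "token"))
# → [{'id': 1, 'token_symbol': 'ETH', 'token_decimals': 18}]
```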
**Is your feature request related to a problem? Please describe.**
I cannot use modern dataframe techniques effectively with Subgrounds.

**Describe the solution you'd like**
I would like to leverage modern pandas (2.0) and polars, alongside the Arrow data format and DuckDB, directly when using Subgrounds. For example, a `query_arrow` method, or even `query(format="pandas")`; perhaps a generic query interface similar to how `PaginationStrategy` is decided. Theoretically, we could add a new argument to `query` and codify the existing `query` function with a default `legacy_query` callable or interface.
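The generic, format-dispatching `query` interface described above could be sketched as follows. All names here (`legacy_query`, the `format` argument, the formatter registry) are hypothetical and not part of the Subgrounds API:

```python
from typing import Any, Callable


def legacy_query(fields) -> list[dict[str, Any]]:
    """Hypothetical stand-in for the existing JSON-returning query."""
    return [{"field": f, "value": i} for i, f in enumerate(fields)]


# Registry mapping a format name to a converter over the raw JSON rows.
FORMATTERS: dict[str, Callable[[list[dict[str, Any]]], Any]] = {
    "json": lambda rows: rows,
    # "pandas": lambda rows: pandas.DataFrame(rows),   # if pandas is installed
    # "polars": lambda rows: polars.from_dicts(rows),  # if polars is installed
}


def query(fields, format: str = "json") -> Any:
    """Dispatch on `format`, defaulting to the legacy JSON behaviour."""
    rows = legacy_query(fields)
    try:
        return FORMATTERS[format](rows)
    except KeyError:
        raise ValueError(f"unknown format: {format!r}") from None


print(query(["a", "b"]))
# → [{'field': 'a', 'value': 0}, {'field': 'b', 'value': 1}]
```

A registry like this keeps the legacy behaviour as the default while letting optional dataframe backends register themselves only when their library is installed.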
**Describe alternatives you've considered**
You can use `query_json` to implement a custom `query`, but it's undocumented and quite obtuse. It's also quite awkward to navigate the python-ification of data types, which often has to be undone with polars, for example.
**Additional context**
`query_df` as we currently do. Theoretically, `query_df` could be switched to this new `query` interface as a shorthand to maintain backwards compatibility. `query_arrow` could easily be converted to a `pandas>=2.0` or `polars` dataframe without any conversion loss, etc.

**Implementation checklist**