Updated dataframe interactions #29
Just opened an issue that is similar to this one (#42). Closing out issue #42 and adding the comment to this issue to consolidate the discussion:

Subgrounds should offer dataframe utility support for multiple libraries, not just pandas. Currently the only dataframe utility functions are pandas-based, found in `dataframe_utils.py`. The current direction of Subgrounds is towards a multi-client world. One alternative to the base client would be to use polars instead of pandas dataframes. However, `dataframe_utils.py` currently only offers pandas helpers, which effectively precludes using polars with Subgrounds. To use Subgrounds with polars, two functions that constantly have to be redefined are `fmt_dict_cols` and `fmt_arr_cols`.

`fmt_dict_cols` is required to convert GraphQL JSON data into polars dataframe columns. Example code:

```python
import polars as pl


def fmt_dict_cols(df: pl.DataFrame) -> pl.DataFrame:
    """
    Formats dictionary columns, which are structs in a polars DataFrame,
    into separate columns and renames them accordingly.
    """
    for column in df.columns:
        if isinstance(df[column][0], dict):
            col_names = df[column][0].keys()
            # Rename the struct fields so the unnested columns are
            # prefixed with the original column name.
            struct_df = df.select(
                pl.col(column).struct.rename_fields(
                    [f"{column}_{c}" for c in col_names]
                )
            )
            struct_df = struct_df.unnest(column)
            # Add the unnested columns to df and drop the original column.
            df = df.with_columns(struct_df)
            df = df.drop(column)
    return df


def fmt_arr_cols(df: pl.DataFrame) -> pl.DataFrame:
    """
    Formats list columns, which are arrays in a polars DataFrame, into
    separate columns and renames them accordingly. Since there isn't a
    direct way to convert an array into new columns, we convert the array
    to a struct and then unnest the struct into new columns.
    """
    for column in df.columns:
        # Rows of a list column show up as pl.Series values.
        if isinstance(df[column][0], pl.Series):
            # Convert the array to a struct.
            struct_df = df.select([pl.col(column).arr.to_struct()])
            # Rename the struct fields, one per element of the array.
            struct_df = struct_df.select(
                pl.col(column).struct.rename_fields(
                    [f"{column}_{i}" for i in range(len(df[column][0]))]
                )
            )
            # Unnest the struct fields into their own columns.
            struct_df = struct_df.unnest(column)
            # Add the unnested columns to df and drop the original column.
            df = df.with_columns(struct_df)
            df = df.drop(column)
    return df
```
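For readers without polars at hand, the transformation `fmt_dict_cols` performs (flattening a column of dicts into scalar columns prefixed with the parent column name) can be illustrated with plain Python. This is a library-free sketch of the idea, not Subgrounds code, and `flatten_dict_column` is a hypothetical helper name:

```python
def flatten_dict_column(rows, column):
    """Flatten rows[i][column] (a dict) into prefixed scalar keys,
    mirroring what fmt_dict_cols does for a polars struct column.
    Hypothetical illustration, not part of Subgrounds."""
    out = []
    for row in rows:
        # Keep every other key as-is.
        flat = {k: v for k, v in row.items() if k != column}
        # Expand the nested dict into "<column>_<key>" entries.
        for key, value in row[column].items():
            flat[f"{column}_{key}"] = value
        out.append(flat)
    return out


rows = [{"id": 1, "token": {"symbol": "ETH", "decimals": 18}}]
print(flatten_dict_column(rows, "token"))
# → [{'id': 1, 'token_symbol': 'ETH', 'token_decimals': 18}]
```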
**Is your feature request related to a problem? Please describe.**
I cannot use modern dataframe techniques effectively with Subgrounds.

**Describe the solution you'd like**
I would like to leverage modern pandas (2.0) and polars, alongside the Arrow data format and DuckDB, directly when using Subgrounds. For example, a `query_arrow` method, or even `query(format="pandas")`; perhaps a generic query interface similar to how `PaginationStrategy` is decided. Theoretically, we could add a new argument to `query` and codify the existing `query` function with a default `legacy_query` callable or interface.
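The generic, format-dispatching `query` interface described above could be sketched as follows. All names here (`legacy_query`, the `format` argument, the formatter registry) are hypothetical and not part of the Subgrounds API:

```python
from typing import Any, Callable


def legacy_query(fields) -> list[dict[str, Any]]:
    """Hypothetical stand-in for the existing JSON-returning query."""
    return [{"field": f, "value": i} for i, f in enumerate(fields)]


# Registry mapping a format name to a converter over the raw JSON rows.
FORMATTERS: dict[str, Callable[[list[dict[str, Any]]], Any]] = {
    "json": lambda rows: rows,
    # "pandas": lambda rows: pandas.DataFrame(rows),   # if pandas is installed
    # "polars": lambda rows: polars.from_dicts(rows),  # if polars is installed
}


def query(fields, format: str = "json") -> Any:
    """Dispatch on `format`, defaulting to the legacy JSON behaviour."""
    rows = legacy_query(fields)
    try:
        return FORMATTERS[format](rows)
    except KeyError:
        raise ValueError(f"unknown format: {format!r}") from None


print(query(["a", "b"]))
# → [{'field': 'a', 'value': 0}, {'field': 'b', 'value': 1}]
```

A registry like this keeps the legacy behaviour as the default while letting optional dataframe backends register themselves only when their library is installed.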
**Describe alternatives you've considered**
You can use `query_json` to implement a custom `query`, but it's undocumented and quite obtuse. It's also quite awkward to navigate the python-ification of data types, which often has to be undone with polars, for example.
**Additional context**
`query_df` as we currently do. Theoretically, `query_df` could be switched to this new `query` interface as a shorthand to maintain backwards compatibility. `query_arrow` could easily be converted to a `pandas>=2.0` or `polars` dataframe without any conversion loss, etc.

**Implementation checklist**