Allow get_df on all data_types #887

lorenzomag · 2024-09-02T07:16:03Z

What is the problem / what does the code in this PR do

Pandas' data strucutres can contain any type of Python objects. Therefore, there should be no issues in having np.arrays as element of a Pandas DataFrame.
Currently, the implementation of the Pandas method to convert from strucutred arrays to Pandas DataFrames is not optimal (it just doesn't work well). While we wait for them to fix it, I propose a fix myself just for our code.

This PR adds a function to properly convert between Numpy's structured arrays and pandas' DataFrames, allowing the user to use the get_df() method on any data type.

The PR also adds a simple test in test_core.py. Further tests can be implemented if deemed necessary.

Can you briefly describe how it works?

For each column in the strucutred array, the function convert_structured_array_to_dataframe() in strax/utils.py checks the dimensionality of the column data (col.ndim). If col.ndim>1, convert the column data so that each row contains and np.ndarray object that can be held by the Pandas dataframe. Otherwise, leave as is.

For event-type data, nothing changes, and the overhead is basically null. For loading peak-type n-dimensional data, there might be a very slight increase in computation time to convert the structured array to Pandas DataFrame. However, this was not possible before thsi PR, so it's an improvement nonetheless.

Can you give a minimal working example (or illustrate with a figure)?

Setup in montecarlo-development using strax as in the PR:

import strax
import fuse
import getpass
user = getpass.getuser()

st = fuse.context.full_chain_context(
    output_folder=f"/scratch/midway3/{user}/fuse_data", corrections_version=15
)
st.set_config(
    {
        "path": "/project2/lgrandi/xenonnt/simulations/testing",
        "file_name": "pmt_neutrons_100.root",
        "entry_stop": 10,
    }
)

arr = st.get_array(["00000","00001"], "peaks")
df = st.get_df(["00001", "00000"], "peaks")

Both arr and df are populated with the same data.

type(arr[0]["area_per_channel"])

type(df["area_per_channel"][0])

type(df.loc[0, "area_per_channel"])

These all equal numpy.ndarray.

for more information, see https://pre-commit.ci

coveralls · 2024-09-02T07:35:07Z

coverage: 90.327% (+0.05%) from 90.274%
when pulling 471b235 on lorenzomag:df_all_types
into 0cfe543 on AxFoundation:master.

dachengx · 2024-09-03T15:06:08Z

To be honest I am wondering how this PR is necessary. but I will let another also comment about it.

lorenzomag · 2024-09-03T17:10:03Z

Hey, I think it's as necessary as the get_df method itself: it's not. We could have only get_array and have people convert the structured arrays to DataFrames themselves.

The changes in this PR are convenient for anyone who, like myself, prefers to work with pandas DataFrames, regardless of data types.
Currently, get_df is unnecessarily limited, and this removes that limit. That's my case 😊

yuema137

@lorenzomag Thanks a lot for initializing this. But I have the same opinion as @dachengx's. In my understanding, the PandaS data frame is designed for relational data on purpose like SQL, which makes it different from the structured array. If we implement a trick like this, many functions intended for relational data frames may fail in an implicit way, which is not ideal. Therefore, I think we probably need a stronger argument to implement this - do you have any examples of how this can solve certain types of practical problems?

lorenzomag · 2024-09-05T11:32:29Z

Hi @yuema137, thanks for your comment. I do understand the issue now.
It is indeed not very pandas-like...

No past use of get_df would be altered by these changes and all the risks you mentioned would arise only once someone uses a DataFrame containing complex python objects as elements without being aware.
I do not think these changes per-se cause any direct issues, but they remove a level of parenting for users who might not know the intended use of Pandas DataFrames and who might try to use, for instance, grouping or aggregations.

I am just more comfortable working with Pandas and would love to have my data returned directly as df regardless of data type, but undestand that you don't want to encourage this use of pandas.

yuema137 · 2024-09-20T00:18:56Z

@lorenzomag Hi Lorenzo, sorry for getting back to you after a long time...
I personally have no objection to this PR if you could replace the error message with a warning informing people that they are converting an array that contains non-scalar into a data frame.

lorenzomag · 2024-09-20T00:46:07Z

@yuema137, don't worry, this PR is not urgent at all!
And you're right, there should be a warning message to inform the user. Thanks for the suggestion.

I'll add that today.

for more information, see https://pre-commit.ci

yuema137

Thanks @lorenzomag !

lorenzomag and others added 3 commits September 1, 2024 23:01

Allow get_df on all data_types

3e11535

Add basic test for get_df on peaks data

908fb5d

[pre-commit.ci] auto fixes from pre-commit.com hooks

de4c778

for more information, see https://pre-commit.ci

dachengx requested review from yuema137 and MerzJohannes September 3, 2024 15:05

yuema137 requested changes Sep 4, 2024

View reviewed changes

Merge branch 'master' into df_all_types

cbd72b4

Merge branch 'master' into df_all_types

b41b787

lorenzomag and others added 2 commits September 19, 2024 23:39

Add warning message for non-scalar DataFrame entries

38c48de

[pre-commit.ci] auto fixes from pre-commit.com hooks

9d06ad1

for more information, see https://pre-commit.ci

lorenzomag requested a review from yuema137 September 20, 2024 04:45

lorenzomag and others added 2 commits September 20, 2024 00:21

Reduce line length

b7c981a

[pre-commit.ci] auto fixes from pre-commit.com hooks

471b235

for more information, see https://pre-commit.ci

yuema137 approved these changes Sep 23, 2024

View reviewed changes

dachengx merged commit 0696ef5 into AxFoundation:master Sep 23, 2024
8 checks passed

lorenzomag deleted the df_all_types branch September 25, 2024 01:42

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Allow get_df on all data_types #887

Allow get_df on all data_types #887

lorenzomag commented Sep 2, 2024

coveralls commented Sep 2, 2024 •

edited

Loading

dachengx commented Sep 3, 2024 •

edited

Loading

lorenzomag commented Sep 3, 2024 •

edited

Loading

yuema137 left a comment

lorenzomag commented Sep 5, 2024

yuema137 commented Sep 20, 2024

lorenzomag commented Sep 20, 2024

yuema137 left a comment

Allow get_df on all data_types #887

Allow get_df on all data_types #887

Conversation

lorenzomag commented Sep 2, 2024

What is the problem / what does the code in this PR do

Can you briefly describe how it works?

Can you give a minimal working example (or illustrate with a figure)?

coveralls commented Sep 2, 2024 • edited Loading

dachengx commented Sep 3, 2024 • edited Loading

lorenzomag commented Sep 3, 2024 • edited Loading

yuema137 left a comment

Choose a reason for hiding this comment

lorenzomag commented Sep 5, 2024

yuema137 commented Sep 20, 2024

lorenzomag commented Sep 20, 2024

yuema137 left a comment

Choose a reason for hiding this comment

coveralls commented Sep 2, 2024 •

edited

Loading

dachengx commented Sep 3, 2024 •

edited

Loading

lorenzomag commented Sep 3, 2024 •

edited

Loading