Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow get_df on all data_types #887

Merged
merged 9 commits into from
Sep 23, 2024
Merged

Conversation

lorenzomag
Copy link
Contributor

What is the problem / what does the code in this PR do

Pandas' data strucutres can contain any type of Python objects. Therefore, there should be no issues in having np.arrays as element of a Pandas DataFrame.
Currently, the implementation of the Pandas method to convert from strucutred arrays to Pandas DataFrames is not optimal (it just doesn't work well). While we wait for them to fix it, I propose a fix myself just for our code.

This PR adds a function to properly convert between Numpy's structured arrays and pandas' DataFrames, allowing the user to use the get_df() method on any data type.

The PR also adds a simple test in test_core.py. Further tests can be implemented if deemed necessary.

Can you briefly describe how it works?

For each column in the strucutred array, the function convert_structured_array_to_dataframe() in strax/utils.py checks the dimensionality of the column data (col.ndim). If col.ndim>1, convert the column data so that each row contains and np.ndarray object that can be held by the Pandas dataframe. Otherwise, leave as is.

For event-type data, nothing changes, and the overhead is basically null. For loading peak-type n-dimensional data, there might be a very slight increase in computation time to convert the structured array to Pandas DataFrame. However, this was not possible before thsi PR, so it's an improvement nonetheless.

Can you give a minimal working example (or illustrate with a figure)?

Setup in montecarlo-development using strax as in the PR:

import strax
import fuse
import getpass
user = getpass.getuser()

st = fuse.context.full_chain_context(
    output_folder=f"/scratch/midway3/{user}/fuse_data", corrections_version=15
)
st.set_config(
    {
        "path": "/project2/lgrandi/xenonnt/simulations/testing",
        "file_name": "pmt_neutrons_100.root",
        "entry_stop": 10,
    }
)

arr = st.get_array(["00000","00001"], "peaks")
df = st.get_df(["00001", "00000"], "peaks")

Both arr and df are populated with the same data.

type(arr[0]["area_per_channel"])

type(df["area_per_channel"][0])

type(df.loc[0, "area_per_channel"])

These all equal numpy.ndarray.

@coveralls
Copy link

coveralls commented Sep 2, 2024

Coverage Status

coverage: 90.327% (+0.05%) from 90.274%
when pulling 471b235 on lorenzomag:df_all_types
into 0cfe543 on AxFoundation:master.

@dachengx
Copy link
Collaborator

dachengx commented Sep 3, 2024

To be honest I am wondering how this PR is necessary. but I will let another also comment about it.

@lorenzomag
Copy link
Contributor Author

lorenzomag commented Sep 3, 2024

Hey, I think it's as necessary as the get_df method itself: it's not. We could have only get_array and have people convert the structured arrays to DataFrames themselves.

The changes in this PR are convenient for anyone who, like myself, prefers to work with pandas DataFrames, regardless of data types.
Currently, get_df is unnecessarily limited, and this removes that limit. That's my case 😊

Copy link
Collaborator

@yuema137 yuema137 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lorenzomag Thanks a lot for initializing this. But I have the same opinion as @dachengx's. In my understanding, the PandaS data frame is designed for relational data on purpose like SQL, which makes it different from the structured array. If we implement a trick like this, many functions intended for relational data frames may fail in an implicit way, which is not ideal. Therefore, I think we probably need a stronger argument to implement this - do you have any examples of how this can solve certain types of practical problems?

@lorenzomag
Copy link
Contributor Author

Hi @yuema137, thanks for your comment. I do understand the issue now.
It is indeed not very pandas-like...

No past use of get_df would be altered by these changes and all the risks you mentioned would arise only once someone uses a DataFrame containing complex python objects as elements without being aware.
I do not think these changes per-se cause any direct issues, but they remove a level of parenting for users who might not know the intended use of Pandas DataFrames and who might try to use, for instance, grouping or aggregations.

I am just more comfortable working with Pandas and would love to have my data returned directly as df regardless of data type, but undestand that you don't want to encourage this use of pandas.

@yuema137
Copy link
Collaborator

@lorenzomag Hi Lorenzo, sorry for getting back to you after a long time...
I personally have no objection to this PR if you could replace the error message with a warning informing people that they are converting an array that contains non-scalar into a data frame.

@lorenzomag
Copy link
Contributor Author

@yuema137, don't worry, this PR is not urgent at all!
And you're right, there should be a warning message to inform the user. Thanks for the suggestion.

I'll add that today.

Copy link
Collaborator

@yuema137 yuema137 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @lorenzomag !

@dachengx dachengx merged commit 0696ef5 into AxFoundation:master Sep 23, 2024
8 checks passed
@lorenzomag lorenzomag deleted the df_all_types branch September 25, 2024 01:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants