Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add pandas uncertainty array #184

Closed
wants to merge 7 commits into from
Closed

Conversation

andrewgsavage
Copy link
Contributor

@MichaelTiemannOSC here's where I got to with creating a pandas extensionarray.

pandas backports taken from
https://github.com/googleapis/python-db-dtypes-pandas/blob/main/db_dtypes/pandas_backports.py

@andrewgsavage
Copy link
Contributor Author

ideally would need to revive this pandas PR
pandas-dev/pandas#45544

the backport of the NDArrayMixin is a bit outdated causing some tests to fail in new pandas versions

@newville
Copy link
Member

newville commented Jan 2, 2024

@andrewgsavage Thanks - this looks interesting. I admit to not knowing much about Pandas development, but I might suggest that the goal here could be to target Pandas 2+ only (or even 2.1+), or perhaps 1.5+. Would that change what needed to be backported?

Perhaps can you give some idea of what versions of Pandas are targeted and/or supported here?

@andrewgsavage
Copy link
Contributor Author

I was using the same approach as python-db-dtypes-pandas, which uses the backported NDArrayMixin and NDArrayBackedExtensionArray to reduce the code needed for the ExtensionArray. The version I'd copied in the backport file is from a few pandas versions ago.

The NDArrayMixin and NDArrayBackedExtensionArray are part of pandas' private namespace pandas.core.arrays._mixins. I swapped to using that, but it is liable to change or move location in future pandas versions.

I'd only expect to support the current pandas version.

@newville
Copy link
Member

newville commented Jan 2, 2024

@andrewgsavage Thanks -- I guess that I would a bit hesitant about expecting to add pandas as a required dependency for uncertainties. The code here also seems to use quite a bit of pandas API code, implying that the people developing and maintaining uncertainties would need to be familiar with pandas API and follow its development.

I wonder if there might be a simpler way. That is, if uncertainties used https://numpy.org/neps/nep-0018-array-function-protocol.html (see #47) to create a UArray with __array_function__() methods, then maybe it would be easier for other nd-array-like projects (xarray, pint, pandas) to wrap "nd-array of values with uncertainties". It looks like this has been discussed off and on for a few years now.

I have to admit, I do not have any experience trying to subclass ndarray or using the __array_function__ protocol, but I would guess that if uncertainties put in the effort to "use numpy __array_function__" to create a modern UArray, then pandas, xarray, ... could use that in their code.

Does that seem reasonable?

@andrewgsavage
Copy link
Contributor Author

pandas would be an optional dependency, or the code in this PR can be moved to a seperate package if that's preferred. We went with the second option for pint-pandas.. If you go with a seperate package you can make me a maintainer for it as I'm following pandas' develop for pint-pandas anyway.

A UArray with __array_function__ is a different topic to this. I'm not familiar with subclassing either. The __array_function__ protocol allows a module to define how numpy functions behave. This is quite a bit of work; each function requires implementing and testing, and there are over 200 functions! Many functions behave in similar ways so code can be shared, but you still end up with a lot of code to write. eg for pint the bulk of the __array_function__ and __array_ufunc__ code is here: https://github.com/hgrecco/pint/blob/master/pint/facets/numpy/numpy_func.py

For interoperating with pandas, you'd still need code for a ExtensionArray similar to what's in this PR. Here the UncertaintyArray is stroing the ufloats in a object dtype numpy array, which could be changed to a UArray in the future. That is to say this PR does not preclude making a UArray.

FYI, pandas and xarray have both created apis to allow other modules like pint-pandas, pint-xarray to be created. These sit outside of pandas or xarray, so the core module maintainers have less to deal with.

I've mostly put this together because @MichaelTiemannOSC has been using uncertainties with pint-pandas and wanting to add code to support uncertainties to it, which I thought was better suited to living in an uncertainties module, and would benefit more people if it allowed people to use uncertainties with pandas without needing the unit support from pint-pandas.

I do need to run this through linting and such before it's properly reviewed! If we go with a seperate module for uncertainties-pandas, it might be easier to set up the CI for that there since this is using pytest but uncertainties is not at the moment.

@newville
Copy link
Member

newville commented Jan 4, 2024

@andrewgsavage OK, thanks. Even as an "optional dependency" (is it fair to ask "Well, is it optional or is it a dependency"?), having that code here implies that the authors/maintainers of uncertainties will assume the responsibility for understanding and maintaining this code.

I would guess that it might be more in keeping with the basic goals of this package to support an object for "ndarray of values with uncertainties" by subclassing ndarray and/or using the _array_function__ protocol - ther is kind of such an object, it just doesn't use that more modern (and presumably more inter-operable) interface. Yes, that would be some work to convert to using that../, but it might make it easier for those interested in making a "Pandas Series of values with uncertainties" and "pint object of value with uncertainties", or xarray, dask, etc.

Copy link

@MichaelTiemannOSC MichaelTiemannOSC left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's exciting to see this work becoming aligned with mainstream pandas (and other developments). My changes were just enough to make uncertainties work as wrappers within the NumPy / Pandas worlds. This is clearly next-level.

"sum",
"max",
"min",
"mean",

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mean is a non-trivail calculation in the world of uncertainties. I don't think uncertainties does it. I've created code that does it:

# https://stackoverflow.com/a/74137209/1291237                                                                                                                                                                                         
def umean(unquantified_data):
    """                                                                                                                                                                                                                                
    Assuming Gaussian statistics, uncertainties stem from Gaussian parent distributions. In such a case,                                                                                                                               
    it is standard to weight the measurements (nominal values) by the inverse variance.                                                                                                                                                
                                                                                                                                                                                                                                       
    Following the pattern of np.mean, this function is really nan_mean, meaning it calculates based on non-NaN values.                                                                                                                 
    If there are no such, it returns np.nan, just like np.mean does with an empty array.                                                                                                                                               
                                                                                                                                                                                                                                       
    This function uses error propagation on the to get an uncertainty of the weighted average.                                                                                                                                         
    :param: A set of uncertainty values                                                                                                                                                                                                
    :return: The weighted mean of the values, with a freshly calculated error term                                                                                                                                                     
    """
    arr = np.array(
	[v if isinstance(v, ITR.UFloat) else ITR.ufloat(v, 0) for v in unquantified_data if not ITR.isnan(v)]
    )
    N = len(arr)
    if N == 0:
        return np.nan
    if N == 1:
        return arr[0]
    nominals = ITR.nominal_values(arr)
    if any(ITR.std_devs(arr) == 0):
        # We cannot mix and match "perfect" measurements with uncertainties                                                                                                                                                            
        # Instead compute the mean and return the "standard error" as the uncertainty                                                                                                                                                  
        # e.g. ITR.umean([100, 200]) = 150 +/- 50                                                                                                                                                                                      
        w_mean = sum(nominals) / N
        w_std = np.std(nominals) / np.sqrt(N - 1)
    else:
        # Compute the "uncertainty of the weighted mean", which apparently                                                                                                                                                             
        # means ignoring whether or not there are large uncertainties                                                                                                                                                                  
        # that should be created by elements that disagree                                                                                                                                                                             
        # e.g. ITR.umean([100+/-1, 200+/-1]) = 150.0+/-0.7 (!)                                                                                                                                                                         
        w_sigma = 1 / sum([1 / (v.s**2) for v in arr])
        w_mean = sum([v.n / (v.s**2) for v in arr]) * w_sigma
        w_std = w_sigma * np.sqrt(sum([1 / (v.s**2) for v in arr]))
    result = ITR.ufloat(w_mean, w_std)
    return result

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the sort of thing that __array_function__ would help with, so I could do np.mean(UArray) without needing to understand the uncertainty logic

"max",
"min",
"mean",
# "prod",

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No reason we cannot reduce with prod.


def _validate_scalar(self, value):
"""
Validate and convert a scalar value to datetime64[ns] for storage in

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment should not reference datetime64[ns]

@andrewgsavage
Copy link
Contributor Author

Just looked at this and realised I'd need to add pandas to the testing matrix, which would slow tests down somewhat.
@lmfit/uncertainties-admins, @MichaelTiemannOSC, any thoughts as to whether this should be its own standalone module? I'm leaning towards a standalone module.

@MichaelTiemannOSC
Copy link

Just looked at this and realised I'd need to add pandas to the testing matrix, which would slow tests down somewhat. @lmfit/uncertainties-admins, @MichaelTiemannOSC, any thoughts as to whether this should be its own standalone module? I'm leaning towards a standalone module.

I would agree, as it would more than double the normal test time (double, because all the tests would have to run with or without uncertainties, and more than double because uncertain magnitudes are slower than float64 magnitudes).

@wshanks
Copy link
Collaborator

wshanks commented Mar 10, 2024

The tests are very fast right now, so test time is not too bad (though I haven't looked at how much time the new tests take). I would say to weigh how much the code would be coupled to uncertainties (sorry, I haven't looked at it closely yet). Will it need coordinated releases with uncertainties to add new features, or easily remain independent? How convenient would it be to have everything in one package for a user? Already there is numpy support included in an optional way.

@newville
Copy link
Member

It would probably be best to test with and without pandas. I don't think testing runtime is currently much of a concern.

But, as elsewhere, this does not seem as high a priority as getting a release with cleaned-up tests and code base. Can this wait?

@andrewgsavage
Copy link
Contributor Author

I've made this a standalone module and released it to pypi so that it's available for people to use now
https://github.com/andrewgsavage/uncertainties-pandas

@newville
Copy link
Member

newville commented Jul 7, 2024

@andrewgsavage Nice! What is your feeling on whether this should be kept as a separate package vs folded into uncertainties as "available if pandas can be imported"? As a "not very often" pandas user, I don't have a strong preference and I would defer to you.

@andrewgsavage
Copy link
Contributor Author

Either option work fine. I've a slight preference for a seperate package as it'll make changing things easier while the package isn't as mature as the rest of uncertainties

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants