Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support uncertainties as EA Dtype #53970

Closed

Conversation

MichaelTiemannOSC
Copy link
Contributor

@MichaelTiemannOSC MichaelTiemannOSC commented Jul 2, 2023

In parallel with Pint and Pint-Pandas, changes to support uncertainties as an EA Dtype.

See hgrecco/pint#1615
and hgrecco/pint-pandas#140

  • closes #xxxx (Replace xxxx with the GitHub issue number)
  • Tests added and passed if fixing a bug or adding a new feature
  • All code checks passed.
  • Added type annotations to new arguments/methods/functions.
  • Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

In parallel with Pint and Pint-Pandas, changes to support uncertainties as an EA Dtype.

See hgrecco/pint#1615
and hgrecco/pint-pandas#140

Signed-off-by: Michael Tiemann <[email protected]>
Make black and ruff happy.

Signed-off-by: Michael Tiemann <[email protected]>
Somewhere along the line, pandas-dev#39098 was fixed.  Remove the XFAIL mark so we can celebrate one more success!

Signed-off-by: Michael Tiemann <[email protected]>
Fix pylint complaint about useless parent delgation.

Signed-off-by: Michael Tiemann <[email protected]>
@jbrockmendel
Copy link
Member

jbrockmendel commented Jul 3, 2023

IIUC the problem here is isna not recognizing your custom scalar? I think the way to address this is directly in #27462/#27825

My last attempt at an implementation was #51378, which I still think is the way to go.

MichaelTiemannOSC added a commit to MichaelTiemannOSC/pint-pandas that referenced this pull request Jul 5, 2023
With these changes (and pandas-dev/pandas#53970 and hgrecco/pint#1615) the test suite passes or xpasses everything (no failures or error).  Indeed, the uncertainties code has essentially doubled the scope of the test suite (to test with and without it).

The biggest gotcha is that the EA for complex numbers is not compatible with the EA for uncertainties, due to incompatible hacks:

The hack for complex numbers is to np.nan (which is, technically, a complex number) for na_value across all numeric types.  But that doesn't work for uncertainties, because uncertainties doesn't accept np.nan as an uncertain value.

The hack for uncertainties is to use pd.NA for na_value.  This works for Int64, Float64, and uncertainties, but doesn't work for complex (which cannot tolerate NAType).

Some careful subclassing fills in what doesn't easily work, with fixtures to prevent the improper mixing of complex and uncertainty types in the same python environment.  Happy to discuss!

Signed-off-by: Michael Tiemann <[email protected]>
@MichaelTiemannOSC MichaelTiemannOSC marked this pull request as ready for review July 5, 2023 23:28
Documentation updated.

Signed-off-by: Michael Tiemann <[email protected]>
rst ``code`` is two backticks (which my pre-commit didn't catch).

Also, try to better guess the proper rst doc pattern for referencing test files in pandas/tests directory.

Signed-off-by: Michael Tiemann <[email protected]>
Try referencing pull request using html reference as there is no extlink definition for `:pull:`

Signed-off-by: Michael Tiemann <[email protected]>
@MichaelTiemannOSC
Copy link
Contributor Author

Please consider adding the https://github.com/pandas-dev/pandas/milestone/103 label to this pull-request. There should be no user-visible changes. The 2.1 milestone contains several enhancements that would further enhance the development versions of Pint and Pint-Pandas which await the changes from this pull request (see hgrecco/pint#1615 and hgrecco/pint-pandas#140).

if allow_fill and notna(fill_value):
# NaN with uncertainties is scalar but does not register as `isna`,
# so use fact that NaN != NaN
if allow_fill and notna(fill_value) and fill_value == fill_value:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how would you even get here with uncertainties? is your EA subclassing BaseMaskedArray?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The EA itself is a very vanilla EA. The test suite instantiates the EA in myriad ways for test purposes. In this case the EA gets instantiated as a MaskedArray (which derives from BaseMaskedArray) for testing MaskedArray functionality. I think the test suite has every right to rigorously instantiate EAs in every way imaginable, which it does.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

im still confused here. the EA i see in hgrecco/pint-pandas#140 subclasses ExtensionArray but not BaseMaskedArray, so i don't see how the code change here would affect it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK...I looked more deeply into this when I get back the computer where I did original work and didn't upload some notes to the cloud. I'm pretty sure it's due to some conditions created when trying to merge two DataFrames that I dummied up in my notes. I'll create a test case that can go into the Pandas test suite. It won't depend on uncertainties, but if and when used with uncertainties, it should trigger the path in question. Should be done by Monday.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, that was a lot more challenging to reconstruct than I expected, but I re-created the condition. TL;DR: now that I use pd.NA as my NA value for my particular EA, this change that caught your eye is indeed unnecessary and I can revert it.

But, since you asked, here's the path I took initially. Starting with PintType we define this:

    @property
    def na_value(self):
        if HAS_UNCERTAINTIES:
            return self.ureg.Quantity(_ufloat_nan, self.units)
        else:
            return self.ureg.Quantity(np.nan, self.units)

The idea (before settling on pd.NA, which works without this particular change) was to use a NaN to which we could cast any numerical type (we cannot downcast uncertainties to float).

In the test code we set up data_missing and data_missing_for_sorting to use ufloat(np.nan, 0) as the NA value where appropriate. The test case test_groupby_agg_extension uses data_for_grouping, which is called without numeric_dtype as None, and this is where we start going off the rails.

Here's the data_for_grouping code:

@pytest.fixture
def data_for_grouping(numeric_dtype, USE_UNCERTAINTIES):
    a = 1.0
    b = 2.0**32 + 1
    c = 2.0**32 + 10
    if USE_UNCERTAINTIES:
        a = a + ufloat(0, 0)
        b = b + ufloat(0, 0)
        c = c + ufloat(0, 0)
        _n = _ufloat_nan
        numeric_dtype = None
    elif numeric_dtype:
        numeric_dtype = dtypemap.get(numeric_dtype, numeric_dtype)
        _n = np.nan
    else:
        _n = pd.NA
    return PintArray.from_1darray_quantity(
        ureg.Quantity(pd.array([b, b, _n, _n, a, a, b, c], dtype=numeric_dtype), ureg.m)
    )

The way the test code is run, because numeric_dtype is None we get:

(Pdb) data_for_grouping
<PintArray>
[4294967297.0, 4294967297.0, <NA>, <NA>, 1.0, 1.0, 4294967297.0, 4294967306.0]
Length: 8, dtype: pint[meter]

When we execute expected = df.iloc[[0, 2, 4, 7]] there comes a point where we come here (/Users/michael/Documents/GitHub/MichaelTiemannOSC/pandas-dev/pandas/core/internals/managers.py(646)<listcomp>()):

644  	        else:
645  	            new_blocks = [
646  ->	                blk.take_nd(
647  	                    indexer,
648  	                    axis=1,
649  	                    fill_value=(
650  	                        fill_value if fill_value is not None else blk.fill_value
651  	                    ),

With fill_value as None, we inherit the fill value from the PintArray, which is the uncertain NaN. Here's the stack trace down from there:

  pandas/core/internals/managers.py(645)reindex_indexer()
-> new_blocks = [
  pandas/core/internals/managers.py(646)<listcomp>()
-> blk.take_nd(
  pandas/core/internals/blocks.py(975)take_nd()
-> new_values = algos.take_nd(
  pandas/core/array_algos/take.py(115)take_nd()
-> return arr.take(indexer, fill_value=fill_value, allow_fill=allow_fill)
  pint-pandas/pint_pandas/pint_array.py(523)take()
-> result = take(data, indices, fill_value=fill_value, allow_fill=allow_fill)
  pandas/core/algorithms.py(1307)take()
-> result = take_nd(
  pandas/core/array_algos/take.py(115)take_nd()
-> return arr.take(indexer, fill_value=fill_value, allow_fill=allow_fill)
> pandas/core/arrays/masked.py(879)take()
-> result[fill_mask] = fill_value

In this particular case, because none of the indexes are -1, no filling is going to happen. But should we have things to fill, they'd be filled with the uncertain nan. And because this code really doesn't want to fill add new NaNs to the array, I put this extra defensive code in. And that's how we hit this condition.

I'm going to take this extraneous test out (fill_value == fill_value). Does this answer your question?

Having observed that GH#39098 works in the current release, allow the test that previously XFAIL'd to run (and pass).  This is done to make CI/CD happy, and has nothing to do with changes otherwise present in this PR.

Signed-off-by: Michael Tiemann <[email protected]>
@MichaelTiemannOSC
Copy link
Contributor Author

MichaelTiemannOSC commented Jul 14, 2023

Looks like we now depend on #54110 for CI/CD to pass (which is not yet merged).

@MichaelTiemannOSC
Copy link
Contributor Author

I also don't understand this complaint from pylint:

pylint...................................................................Failed
- hook id: pylint
- duration: 793.25s
- exit code: 4

************* Module pandas.tests.extension.json.test_json
pandas/tests/extension/json/test_json.py:[359](https://github.com/pandas-dev/pandas/actions/runs/5553184370/jobs/10141428866?pr=53970#step:12:364):4: W0246: Useless parent or super() delegation in method 'test_groupby_agg_extension' (useless-parent-delegation)

The code I un-commented was original code, and it appears to follow precisely the pattern of all other code in the module. I'm at a loss as to how to proceed because I didn't create the TestGroupby class hierarchy and don't know why this particular case is getting singled out.

@jbrockmendel
Copy link
Member

The code I un-commented was original code

Again, just remove it and the parent class test will run.

@MichaelTiemannOSC
Copy link
Contributor Author

The code I un-commented was original code

Again, just remove it and the parent class test will run.

Ah, now I understand. I was afraid to remove a test that should be run, and did not grok that the test would run in the parent class. I'll re-submit, waiting a few hours in case there's an answer to my pyarrow question.

Removed the superfluous check testing whether fill_value was a NaN or not (in pandas/core/arrays/masked.py::take).

pylint properly flags an unnecessary implementation of TestGroupby.test_groupby_agg_extension (which merely calls the parent method by the same name).  By deleting that implementation, the parent method is called automatically.

Signed-off-by: Michael Tiemann <[email protected]>
Needed in order to fix broken reduce operations (pandas-dev#52788).

Signed-off-by: Michael Tiemann <[email protected]>
@@ -590,7 +590,7 @@ def nkeys(self) -> int:

def get_iterator(
self, data: NDFrameT, axis: AxisInt = 0
) -> Iterator[tuple[Hashable, NDFrameT]]:
) -> Iterator[tuple[Hashable, NDFrameT]]: # Doesn't work with non-hashable EA types
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pls remove

@@ -28,7 +28,7 @@


def make_data():
return [True, False] * 4 + [np.nan] + [True, False] * 44 + [np.nan] + [True, False]
return [True, False] * 4 + [pd.NA] + [True, False] * 44 + [pd.NA] + [True, False]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is this necessary?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made this change to track the identical changes in test_integer.py and test_floating.py. The it does not matter whether the NA values in this case are pd.NA, np.nan, or a mixture. But I did think it better to make the three files consistent. I'm happy to revert this change if you prefer.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed this should be removed. It's ideal to minimize extraneous changes

Patches in response to four review comments/questions received.

Signed-off-by: Michael Tiemann <[email protected]>
`_na_value` is a property of ExtensionDtype, not ExtensionArray.

Signed-off-by: Michael Tiemann <[email protected]>
Testing in Python 3.12 reveals that we *do* need to fix this after all.

Signed-off-by: Michael Tiemann <[email protected]>
Alas, _na_value is not a documented field of xarray.  Current code checks do not seem to cover this code path, so the first failure was found attribute checks in doc build process.  Also re-reverting test_boolean.py back to original state.  The python-dev checks first tripped on boolean REPL, not groupby checks.

Signed-off-by: Michael Tiemann <[email protected]>
@@ -160,6 +161,7 @@ Other enhancements
- :meth:`DataFrame.stack` gained the ``sort`` keyword to dictate whether the resulting :class:`MultiIndex` levels are sorted (:issue:`15105`)
- :meth:`DataFrame.unstack` gained the ``sort`` keyword to dictate whether the resulting :class:`MultiIndex` levels are sorted (:issue:`15105`)
- :meth:`DataFrameGroupby.agg` and :meth:`DataFrameGroupby.transform` now support grouping by multiple keys when the index is not a :class:`MultiIndex` for ``engine="numba"`` (:issue:`53486`)
- :meth:`GroupBy.first` if series contains only NA values (which might be NaN), return the first NA value, else return np.nan (see `PR <https://github.com/pandas-dev/pandas/pull/53970>`_)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure what this is decribing

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is describing a change of behavior of the GroupBy.first helper function to not assume that np.nan is a valid NA value for the EA in question. The idea of the change (which, if valid, should also be applied to GroupBy.last) is that if we cannot find any non-NA values in the array, but if there are NA values, the return value should be the first (or last) NA value in the array. I did this because I thought that my EA could only deal with NaN values that were UFloats (having a nominal value and a std_dev value). But as I traced through the code and observed things more closely, I see that the EA does have tolerance for np.nan as an element among UFloats, giving np.nan as the nominal value and zero as the std_dev. So I will remove this change (by being less strict in some of my own assumptions in checking EA consistency).

Not much left for this PR after all!

Signed-off-by: Michael Tiemann <[email protected]>
Clean up more doc changes missed in previous commit.

Signed-off-by: Michael Tiemann <[email protected]>
@MichaelTiemannOSC MichaelTiemannOSC marked this pull request as draft July 26, 2023 21:27
Turns out that this fixes GH#39098.  Maybe I should resubmit under that guise just this one patch?

Signed-off-by: Michael Tiemann <[email protected]>
@MichaelTiemannOSC
Copy link
Contributor Author

This PR is obsolete now.

@MichaelTiemannOSC MichaelTiemannOSC deleted the uncertainties branch July 29, 2023 16:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants