Question: can np.nan stand in for nan+/-0? #169
Thanks for the interesting details. Now, can you show an example of what you want to do? I'm not fully seeing the problem yet (in part because uncertainties automatically promotes NaN to NaN±0).
Here is some sample code. It does not play well with the current version of pint-pandas 0.3, but I have a pull-request that does make it work (hgrecco/pint-pandas#140). My latest iteration doesn't show the problems I think will exist due to having multiple
So... writing up my findings for the day: when Pint-Pandas makes the ndarray that holds the values of a PintArray, it's really best to allocate for ufloats if there are any ufloats to be seen, or if there are NaNs, which are the initial values in an "empty" array. If we allocate our ndarray for float64 too soon, especially when all we see are NaNs, those arrays cannot later hold ufloats.

The performance cost of allocating object arrays is well known, but I'm seeing general happiness whereby dataframes filled with PintArrays do what is expected. What's cool (?) is that one cannot tell off-hand whether a PintArray with dtype='pint[kg]' is a float64-based or uncertainties-based array. It. Just. Works. There are still lots of edge cases to work out, but at the end of this day, I have something that's largely behaving and it's not throwing 10,000+ warning messages about "units stripped" or "casting to float" or whatever. So that's progress.
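The allocation point above can be sketched with plain NumPy. `FakeUFloat` below is a hypothetical stand-in for `uncertainties.UFloat`, used only so the sketch runs without the uncertainties package:

```python
import numpy as np

class FakeUFloat:
    """Hypothetical stand-in for uncertainties.UFloat, for illustration only."""
    def __init__(self, nominal, std_dev):
        self.nominal_value = nominal
        self.std_dev = std_dev

# An "empty" array of NaNs allocated too soon as float64 ...
too_soon = np.array([np.nan, np.nan], dtype="float64")
try:
    too_soon[0] = FakeUFloat(1.0, 0.1)  # cannot hold a ufloat later
    held_ufloat = True
except (TypeError, ValueError):
    held_ufloat = False  # float64 storage only accepts real numbers

# ... whereas an object array holds NaNs, plain floats, and ufloats side by side.
mixed = np.array([np.nan, np.nan], dtype=object)
mixed[0] = FakeUFloat(1.0, 0.1)
mixed[1] = 2.5
```

This is why allocating object-dtype up front, whenever ufloats or only-NaNs are in play, avoids a later re-allocation.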
I'm trying to use uncertainties with Pandas, Pint, and Pint-Pandas. Pint-Pandas makes it easy to have quantified values on a column basis that don't interact much with other columns (or at least not badly).
uncertainties relies on wrappers to do its work, whereas Pint and Pint-Pandas now make thorough use of ExtensionArrays to interact with Pandas. ExtensionArrays define an na_value for their dtype, which for most numeric types is np.nan.
In my past dealings with uncertainties, the NaN for that has been nan+/-0, which has been fine, except that it now makes for difficult promotion rules. If I have an extension array of quantities (tons of CO2, millions of USD, whatever) with normal float64 magnitudes, the correct na_value for that is np.nan. But if I fill the array with uncertainties as magnitudes, the logical na_value would be nan+/-0. There is, however, no concept of multiple na_values depending on whether there are uncertainties in the mix.
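A per-column rule like this could, in principle, key the na_value off the backing dtype. A minimal sketch, assuming a hypothetical helper `na_value_for` and using a `(nominal, std_dev)` tuple as a placeholder for `uncertainties.ufloat(nan, 0)` so no extra package is needed:

```python
import numpy as np

def na_value_for(values: np.ndarray):
    """Hypothetical helper: choose the missing-value sentinel per column.

    float64-backed magnitudes use np.nan; object-backed (uncertainties)
    magnitudes would use ufloat(nan, 0), represented here by a
    (nominal, std_dev) tuple placeholder.
    """
    if values.dtype == np.dtype("float64"):
        return np.nan
    return (float("nan"), 0.0)  # placeholder for uncertainties.ufloat(nan, 0)

float_col = np.zeros(3, dtype="float64")   # plain-magnitude column
ufloat_col = np.zeros(3, dtype=object)     # uncertainties-magnitude column
```

The catch, as described above, is that pandas expects a single na_value per ExtensionDtype, not one per column.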
One solution is to just bite the bullet and say "if you use uncertainties anywhere, then every dataframe needs to honor them, meaning that the na_value for ANYTHING is nan+/-0 (and all magnitudes must promote to UFloat)." What I'd like to do is to manage that column-by-column.
Is there a world in which np.nan is a fully adequate value for uncertainties, with whatever promotions, substitutions, etc. happening within the wrappers? Or do I need to majorly rethink my approach of layering these various abstractions (uncertainties, quantities, DataFrames) together?
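The first option can be sketched as a decorator that promotes bare float NaNs on the way into an uncertainties-aware function. Both `promote_nan` and `UncertainNaN` are hypothetical stand-ins, not part of uncertainties' API:

```python
import functools
import math

class UncertainNaN:
    """Hypothetical stand-in for uncertainties.ufloat(nan, 0)."""
    nominal_value = float("nan")
    std_dev = 0.0

def promote_nan(func):
    """Promote bare float NaNs to an 'uncertain NaN' before calling func."""
    @functools.wraps(func)
    def wrapper(*args):
        promoted = [UncertainNaN() if isinstance(a, float) and math.isnan(a) else a
                    for a in args]
        return func(*promoted)
    return wrapper

@promote_nan
def nominal(x):
    # Behaves like an uncertainties-aware accessor: plain numbers pass through.
    return getattr(x, "nominal_value", x)
```

Under this approach np.nan stays the universal sentinel at the pandas layer, and the promotion to nan+/-0 happens only at the wrapper boundary.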