-
Notifications
You must be signed in to change notification settings - Fork 42
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
docs #191
docs #191
Conversation
Would it be useful to document the general implications and idioms of using ExtensionArrays? I'm gradually learning that when converting an ndarray into a PandasArray, np.nan becomes NA (and vice-versa in the other direction). Helping people understand how NA and np.nan play inside of Quantities, and the most efficient idioms for dealing with them correctly (pd.isna vs. np.isnan) could be very helpful. I could help write it if you tell me where you think it belongs. |
I don't see many issues relating to nans so I'm wondering if you're encountering this because you're doing less typical workflows. It would be worth making an issue with your findings to understand where they're coming from. I expect it to be sometihng to do with PintArrays either having PandasArrays or some form of np.array holding the values. I wonder if a better way for uncertainties would be to create an UncertaintyArray that the PintArray can use for values? |
You are one step ahead of me. Last night I put my finger on what seems to be the last problem in my own test cases (the pint_pandas test cases don't trip it). When pd.merge needs to fill unmatched values with NaNs, it was creating invalid ndarrays due to the NaN value I've created. I'll write up findings when I have more to report, but I think I have a handle on a way forward. Thanks! |
I've made a lot of progress working with pd.NA and reading through dtype("O") and validating values as UFLoat. I think it might be more elegant to create and use an UncertaintyArray, but I want to try to finish what I've almost got working, then discuss how to possibly make it more elegant with an UncertaintyArray type. The test cases I'm looking at right now are the complex128 test cases, which, because they are actually EAs, and not ComplexArray types, are tickling what I've done in unexpected ways. Which is a good way to ensure the robustness of what I'm doing, rather than hiding behind a fresh type (I think). |
has anyone had a chance to look at this? |
I just submitted some fresh changes to enable testing of complex128 for Pandas 2.1.0rc0+96. Over the past few weeks the pandas team have progressively improved underlying code so that as of today, essentially no special adaptations are required. I still need to see if I can similarly simplify uncertainties, but I think that when 2.1 comes out things are going to be a lot simpler (both to document and implement). |
Is there a convenient way I can leave comments inline? Like comments on a pull request? |
in the .rst files that have been added |
OK, so I cloned the repo and made a change to getting started, which in my version reads:
The reason I'm telling you this in a comment and not with a PR is because I DON'T UNDERSTAND GITHUB!!! I really thought I did the right things in terms of cloning, forking, editing, etc., but GitHub insists on doing things most unintuitive to me. If I can get some help sorting out how to put my carefully placed andrewgsavage/pint-pandas repo into a properly described and defined git place that doesn't make it look like the twin of MichaelTiemannOSC/pint-pandas, I'd appreciate it. I do have hgrecco/pint and hgrecco/pint-pandas properly separated. I just somehow didn't say all the right magic when I tried to make a change relative to your repo as my upstream source. |
you can add comments inline by going file changed, clicking a file, then clicking the blue + after hovering over a line |
That's a good solution... |
if you want to make changes in a PR, you'll make a branch under MichaelTiemannOSC/pint-pandas that tracks andrewgsavage/pint-pandas:docs, then make a PR to andrewgsavage/pint-pandas:docs (ie go to https://github.com/andrewgsavage/pint-pandas/pulls ) |
I'll add this bit:
I think this bit is too in detail for the getting started section, but could fit elsewhere
|
Please pass through a spell-check first. I notice I misspelled mistakes! |
|
|
||
- `pint <https://github.com/hgrecco/pint>` Base package | ||
- `pint-pandas <https://github.com/hgrecco/pint-pandas>`_ Pandas integration | ||
- `pint-xarray <https://github.com/xarray-contrib/pint-xarray>`_ Xarray integration |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
openscm-units <https://github.com/openscm/openscm-units>
units related to simple climate modellingiam_units <https://github.com/IAMconsortium/units>
Pint-compatible definitions of energy, climate, and related units to supplement the SI and other units included in Pint's default_en.txt
|
||
The most common issue pint-pandas users encouter is that they have a DataFrame with column that aren't PintArrays. | ||
An obvious indicator is unit strings showing in cells when viewing the DataFrame. | ||
Several pandas operations return numpy arrays of ``Quantity`` objects, which can cause this. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Quantity objects within Pandas DataFrames (or Series) will behave like Quantities, meaning that they are subject to unit conversion rules and will raise errors when incompatible units are mixed. But these loose Quantities don't offer the elegance or performance optimizations that come from using PintArrays. And they may give strange error messages as Pandas tries to convert incompatible units to dimensionless magnitudes (which is often prohibited by Pint) rather than naming the incompatibility between the two Quantities in question.
Add:
Creating DataFrames from Series
The default operation of Pandas pd.concat
function is to perform row-wise concatenation. When given a list of Series, each of which is backed by a PintArray, this will inefficiently convert all the PintArrays to arrays of object
type, concatenate the several series into a DataFrame with that many rows, and then leave it up to you to convert that DataFrame back into column-wise PintArrays. A much more efficient approach is to concatenate Series in a column-wise fashion:
df = pd.concat(list_of_series, axis=1)
This will preserve all the PintArrays in each of the Series.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Quantity objects within Pandas DataFrames (or Series) will behave like Quantities, meaning that they are subject to unit conversion rules and will raise errors when incompatible units are mixed. But these loose Quantities don't offer the elegance or performance optimizations that come from using PintArrays. And they may give strange error messages as Pandas tries to convert incompatible units to dimensionless magnitudes (which is often prohibited by Pint) rather than naming the incompatibility between the two Quantities in question.
It took 2-3 reads for me to follow this, referring to quantity objects and loose quantities is ambiguous (could argue PintArray contains quantity objects since getitem returns them). Some code showing your points would be clearer. Can do that in another PR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that the benchmark PR I made makes this point really clearly (up to 1000x performance difference). So if/when that lands, we can refer to coding (and performance examples) from that.
Plot twist: the next version of Pandas (2.1.1? 2.2.0?) will allow EAs to support 2d values, which means that the one-dimensional explanations I've given above will no longer be quite correct. Of course pint-pandas could make the decision that PintArrays are only ever one-dimensional, and we can clean up the text to say that, but we could also allow for the possibility that a whole 2-dimensional DataFrame has quantities, and that both rows and columns both allow not only showing quantified rows and columns, but both can have values set within them via .loc and .iloc while retaining their EA nature. |
bors r+ |
Build succeeded! The publicly hosted instance of bors-ng is deprecated and will go away soon. If you want to self-host your own instance, instructions are here. If you want to switch to GitHub's built-in merge queue, visit their help page. |
pre-commit run --all-files
with no errorshttps://pint-pandas.readthedocs.io/en/docs/