
Parse uncertainties #1615

Merged: 28 commits into hgrecco:master on Sep 15, 2023

Conversation

MichaelTiemannOSC (Collaborator)

I'd like to discuss and review before committing to "completely cover" all automated test units. I have contributed two test cases that clearly show the filed issues working with these changes. I have run the full test suite, and I cannot explain the one unexpected failure (#783).

I'm also not quite sure how to approach the docs question. Pint documentation tends to be more on the minimal side, and I don't want to call unnecessary attention to things that should "just work".

@MichaelTiemannOSC (Collaborator, Author)

Related: hgrecco/pint-pandas#140

@hgrecco (Owner) commented Oct 22, 2022

First, I completely agree with the need for parsing measurements in Pint. I have been looking for this for a while. So thanks, and awesome work.

Having said this, I am not sure about the implementation. Why is parsing 8**4 structurally different from parsing 8+/-4?

In both cases, you have a couple of numbers “connected” by a multi-character operator. I guess the difference is that ** is known to the Python tokenizer, while +/- is not.

Take a look at this:

>>> from pint.compat import tokenizer
>>> for p in tokenizer("8**4"):
...  print(p)
...
TokenInfo(type=2 (NUMBER), string='8', start=(1, 0), end=(1, 1), line='8**4')
TokenInfo(type=54 (OP), string='**', start=(1, 1), end=(1, 3), line='8**4')
TokenInfo(type=2 (NUMBER), string='4', start=(1, 3), end=(1, 4), line='8**4')
TokenInfo(type=4 (NEWLINE), string='', start=(1, 4), end=(1, 5), line='')
TokenInfo(type=0 (ENDMARKER), string='', start=(2, 0), end=(2, 0), line='')

So you changed the tokenizer, but in that process it became more difficult to understand, maintain and debug. I wonder if a more sensible approach would not be to add another layer on top of the tokenizer that scans for +/-.

To understand my proposal, let me show you the output of the current tokenizer:

>>> for p in tokenizer("8 + / - 4"):
...  print(p)
...
TokenInfo(type=2 (NUMBER), string='8', start=(1, 0), end=(1, 1), line='8 + / - 4')
TokenInfo(type=54 (OP), string='+', start=(1, 2), end=(1, 3), line='8 + / - 4')
TokenInfo(type=54 (OP), string='/', start=(1, 4), end=(1, 5), line='8 + / - 4')
TokenInfo(type=54 (OP), string='-', start=(1, 6), end=(1, 7), line='8 + / - 4')
TokenInfo(type=2 (NUMBER), string='4', start=(1, 8), end=(1, 9), line='8 + / - 4')
TokenInfo(type=4 (NEWLINE), string='', start=(1, 9), end=(1, 10), line='')
TokenInfo(type=0 (ENDMARKER), string='', start=(2, 0), end=(2, 0), line='')

If we could create a function that transforms the previous list into:

>>> for p in tokenizer("8 + / - 4"):
...  print(p)
...
TokenInfo(type=2 (NUMBER), string='8', start=(1, 0), end=(1, 1), line='8 + / - 4')
TokenInfo(type=<NEW NUMBER> (OP), string='+ / -', start=(1, 2), end=(1, 7), line='8 + / - 4')
TokenInfo(type=2 (NUMBER), string='4', start=(1, 8), end=(1, 9), line='8 + / - 4')
TokenInfo(type=4 (NEWLINE), string='', start=(1, 9), end=(1, 10), line='')
TokenInfo(type=0 (ENDMARKER), string='', start=(2, 0), end=(2, 0), line='')

(this could also handle the parenthesis)

Then the new tokenizer will just be:

def tokenizer(s):
    return uncertainty_layer(old_tokenizer(s))

(the names are crap, but this is just an example). This way it is easy to debug and to turn on/off if necessary.

If you go this way, I think it will be wise to move the tokenizer function out of the compat module and into pint_eval.
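
A minimal sketch of what such a layer could look like (the function names are hypothetical, and this assumes only the standard-library tokenize module, which the existing pint tokenizer wraps):

import tokenize
from io import BytesIO

def plain_tokenizer(input_string):
    # Stand-in for the existing tokenizer: the stdlib tokenizer applied to a string.
    return tokenize.tokenize(BytesIO(input_string.encode("utf-8")).readline)

def uncertainty_layer(tokens):
    # Merge the three OP tokens '+', '/', '-' produced for "+/-" into a single token;
    # every other token (ENCODING, NUMBER, NEWLINE, ...) passes through untouched.
    tokens = list(tokens)
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if (
            tok.type == tokenize.OP
            and tok.string == "+"
            and i + 2 < len(tokens)
            and tokens[i + 1].string == "/"
            and tokens[i + 2].string == "-"
        ):
            yield tokenize.TokenInfo(tokenize.OP, "+/-", tok.start, tokens[i + 2].end, tok.line)
            i += 3
        else:
            yield tok
            i += 1

def tokenizer(s):
    return uncertainty_layer(plain_tokenizer(s))

With that in place, for p in tokenizer("8 +/- 4"): print(p) shows a single OP token with string '+/-' between the two NUMBER tokens.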

@hgrecco (Owner) commented Oct 22, 2022

I also think we need to clearly define which uncertainties format we are going to support (and which we are not).

@MichaelTiemannOSC (Collaborator, Author)

Thank you so much for giving it a look.

On the choice of creating a tighter binding than **, consider `x ** 2+/-0 ** (1/2)`. If +/- binds the same as **, it will fail when there's a string of tokens that all have the same precedence (but different associativity).

I don't disagree with you at all that there are better ways to do this. What I wanted to do first was a proof of concept, to see whether the bones of Pint and Pint-Pandas were ready for this sort of thing. And I think they are. My principal concern in modifying the tokenizer was out of respect for the fact that Pint does talk to itself using str() and repr(), and as such we need to be prepared to handle whatever sorts of formats users may throw at us (including the N(S) notation and parenthesized ± as a Unicode symbol).

Please keep the thoughts coming!

@hgrecco (Owner) commented Oct 22, 2022

On the choice of creating a tighter binding than **, consider `x ** 2+/-0 ** (1/2)`. If +/- binds the same as **, it will fail when there's a string of tokens that all have the same precedence (but different associativity).

I am not sure that I am proposing the same thing. Consider parsing 8 + / - 4:

TokenInfo(type=2 (NUMBER), string='8', start=(1, 0), end=(1, 1), line='8 + / - 4')
TokenInfo(type=54 (OP), string='+', start=(1, 2), end=(1, 3), line='8 + / - 4')
TokenInfo(type=54 (OP), string='/', start=(1, 4), end=(1, 5), line='8 + / - 4')
TokenInfo(type=54 (OP), string='-', start=(1, 6), end=(1, 7), line='8 + / - 4')
TokenInfo(type=2 (NUMBER), string='4', start=(1, 8), end=(1, 9), line='8 + / - 4')

which is transformed to:

TokenInfo(type=2 (NUMBER), string='8', start=(1, 0), end=(1, 1), line='8 + / - 4')
TokenInfo(type=<NEW NUMBER> (OP), string='+ / -', start=(1, 2), end=(1, 7), line='8 + / - 4')
TokenInfo(type=2 (NUMBER), string='4', start=(1, 8), end=(1, 9), line='8 + / - 4')

which is then converted by build_eval_tree to:

ufloat_fromstr("8 +/- 4")

This would automatically support every format supported by ufloat_fromstr. I think we should limit ourselves to those (at least for the time being).
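
For reference, a hedged illustration of a few of the textual forms ufloat_fromstr already accepts (this assumes the uncertainties package is installed):

from uncertainties import ufloat_fromstr

print(ufloat_fromstr("8.0+/-4.0"))  # 8.0+/-4.0
print(ufloat_fromstr("8.0(4.0)"))   # shorthand N(S) notation, also 8.0+/-4.0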


Now consider parsing (8 +/- 4) m:

TokenInfo(type=54 (OP), string='(', start=(1, 0), end=(1, 1), line='(8 +/- 4) m')
TokenInfo(type=2 (NUMBER), string='8', start=(1, 1), end=(1, 2), line='(8 +/- 4) m')
TokenInfo(type=54 (OP), string='+', start=(1, 3), end=(1, 4), line='(8 +/- 4) m')
TokenInfo(type=54 (OP), string='/', start=(1, 4), end=(1, 5), line='(8 +/- 4) m')
TokenInfo(type=54 (OP), string='-', start=(1, 5), end=(1, 6), line='(8 +/- 4) m')
TokenInfo(type=2 (NUMBER), string='4', start=(1, 7), end=(1, 8), line='(8 +/- 4) m')
TokenInfo(type=54 (OP), string=')', start=(1, 8), end=(1, 9), line='(8 +/- 4) m')
TokenInfo(type=1 (NAME), string='m', start=(1, 10), end=(1, 11), line='(8 +/- 4) m')

which will then be converted to:

TokenInfo(type=54 (OP), string='(', start=(1, 0), end=(1, 1), line='(8 +/- 4) m')
TokenInfo(type=2 (NUMBER), string='8', start=(1, 1), end=(1, 2), line='(8 +/- 4) m')
TokenInfo(type=<NEW NUMBER> (OP), string='+ / -', start=(1, 3), end=(1, 6), line='(8 +/- 4) m')
TokenInfo(type=2 (NUMBER), string='4', start=(1, 7), end=(1, 8), line='(8 +/- 4) m')
TokenInfo(type=54 (OP), string=')', start=(1, 8), end=(1, 9), line='(8 +/- 4) m')
TokenInfo(type=1 (NAME), string='m', start=(1, 10), end=(1, 11), line='(8 +/- 4) m')

which is then converted by build_eval_tree to:

ufloat_fromstr("8 +/- 4") * ureg.meter

(I am 'paraphrasing' here)

Please note that Pint supports more numerical types than uncertainties does. We need to make this clear in the docs.

@MichaelTiemannOSC (Collaborator, Author) commented Nov 17, 2022

I have returned from a long trip and am ready to tackle this now. I like the proposed approach (of creating a layer that can be enabled and disabled) and will give it a go. However, this idea--as I understand it--is a no-go, because OP is (for better or worse) token number 54:

TokenInfo(type=<NEW NUMBER> (OP), string='+ / -', start=(1, 3), end=(1, 6), line='(8 +/- 4) m')

To give this string a new number would mean it's no longer an OP, but something entirely different, which would mean we'd need to special-case its handling as something-other-than-OP. Would it be your preference to insert a new kind of OP (such as NUMBER_CONSTRUCTOR)? Or, given that OP is OP is OP, leave that alone and focus only on implementing a pre-processor as a general function and then parsing and evaluating the results of the preprocessor separately?

But also, before I go too far, I have to point out that a very common format to get back from Pint (when dealing with large quantities) is more like this:

(8.0 + / - 4.0)e6 t CO2

This creates the problem of deciding when and how to distribute the factor of a million (e6) so as to end up with the result:

ufloat(8.0e6, 4.0e6) t CO2

We cannot simply process what's between the parentheses; rather, some additional look-ahead and processing is necessary. So I'm looking at that.
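
For what it's worth, ufloat_fromstr appears to distribute a trailing global exponent itself when handed the whole parenthesized string; a hedged check (assuming the uncertainties package):

from uncertainties import ufloat_fromstr

u = ufloat_fromstr("(8.0+/-4.0)e6")
print(u.nominal_value, u.std_dev)  # 8000000.0 4000000.0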

I should add that I'm a rank novice when it comes to pytest and friends. My plan is to test just tokenization and basic processing of uncertain quantities within the context of test_measurement.py.

I will leave it to others to flip the magic switches so that test_umath and friends can, if they want to, test the actual mechanics of uncertainties within Pint. Hope that's OK!

Based on feedback, tokenize uncertainties on top of default tokenizer, not instead of default tokenizer.

Signed-off-by: MichaelTiemann <[email protected]>
@MichaelTiemannOSC (Collaborator, Author)

The original pint tokenizer is _plain_tokenizer and the tokenizer that layers on top to handle uncertainties is uncertainty_tokenizer. I've created a tokenizer variable so that users can switch away from the default this way:

from pint import pint_eval
pint_eval.tokenizer = pint_eval.uncertainty_tokenizer

I'm very open to changing names and/or mechanics of how the tokenizer should be selected. I've not yet started to flesh out some better test cases for test_measurement.py. I'd like to get feedback on these changes first.
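
A quick way to see the effect of the switch (a sketch assuming this branch, where both tokenizers live in pint.pint_eval):

from pint import pint_eval

pint_eval.tokenizer = pint_eval.uncertainty_tokenizer
for tok in pint_eval.tokenizer("(8.0 +/- 4.0)e6 m"):
    print(tok)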

from .errors import DefinitionSyntaxError

# For controlling order of operations
_OP_PRIORITY = {
    "±": 4,

MichaelTiemannOSC (Collaborator, Author) commented on this diff:

This line can go away: we change ± to +/- to avoid gratuitous problems with ERRORTOKEN.

@MichaelTiemannOSC (Collaborator, Author)

Any comments on this PR? I think I made the changes that were requested.

@MichaelTiemannOSC (Collaborator, Author)

I'm still waiting, and my project is still waiting to know if this is ready to enter final approach. We'd really like to add uncertainties to our pint-based work, but are loath to ask users to install a completely custom version of pint. If there are other changes you would like to see, I would be happy to receive feedback!

_BINARY_OPERATOR_MAP = {
    "±": _ufloat,

MichaelTiemannOSC (Collaborator, Author) commented on this diff:

The line "±": _ufloat, can probably also go away, as above.

    if tokinfo.type != tokenlib.ENCODING:
        yield tokinfo

def uncertainty_tokenizer(input_string):

MichaelTiemannOSC (Collaborator, Author) commented on this diff:

Would uncertainties_tokenizer be a better name?

Signed-off-by: Michael Tiemann <[email protected]>
Fixes problems parsing currency symbols that also show up when dealing with uncertainties.

Signed-off-by: Michael Tiemann <[email protected]>
Handle negative numbers using uncertainty parenthesis notation.

Signed-off-by: Michael Tiemann <[email protected]>
Ahem...use walrus operator for side-effect, not truth value.

Signed-off-by: Michael Tiemann <[email protected]>
… of the exponent, not just in the parsing of the exponent.

i.e., (5.01+/-0.07)e+04

Signed-off-by: Michael Tiemann <[email protected]>
@MichaelTiemannOSC (Collaborator, Author)

Can somebody please review?

@jakeogh commented Apr 16, 2023

While trying to understand how to fix #1646 I pulled this branch and ran the tests:

Before parse-uncertainties:

platform linux -- Python 3.10.11, pytest-7.3.0, pluggy-1.0.0
rootdir: /home/cfg/_myapps/pint
plugins: typeguard-3.0.2, pkgcore-0.12.20, anyio-3.6.1
collected 2080 items / 2 skipped
<snip>
2 failed, 2024 passed, 13 skipped, 11 xfailed, 11 warnings, 32 errors

After parse-uncertainties:

platform linux -- Python 3.10.11, pytest-7.3.0, pluggy-1.0.0
rootdir: /home/cfg/_myapps/pint
plugins: typeguard-3.0.2, pkgcore-0.12.20, anyio-3.6.1
collected 2083 items / 2 skipped
<snip>
11 failed, 2018 passed, 13 skipped, 11 xfailed, 13 warnings, 32 errors

pint-convert is unfortunately still broken (I rebased, so no surprise):

$ pint-convert 1ft meter
Traceback (most recent call last):
  File "/usr/lib/python-exec/python3.10/pint-convert", line 109, in <module>
    ureg._units["R_inf"].converter.scale = R_i
  File "<string>", line 4, in __setattr__
dataclasses.FrozenInstanceError: cannot assign to field 'scale'

MichaelTiemannOSC added a commit to MichaelTiemannOSC/pandas that referenced this pull request Jul 2, 2023
In parallel with Pint and Pint-Pandas, changes to support uncertainties as an EA Dtype.

See hgrecco/pint#1615
and hgrecco/pint-pandas#140

Signed-off-by: Michael Tiemann <[email protected]>
MichaelTiemannOSC added a commit to MichaelTiemannOSC/pint-pandas that referenced this pull request Jul 5, 2023
With these changes (and pandas-dev/pandas#53970 and hgrecco/pint#1615), the test suite passes or xpasses everything (no failures or errors).  Indeed, the uncertainties code has essentially doubled the scope of the test suite (to test with and without it).

The biggest gotcha is that the EA for complex numbers is not compatible with the EA for uncertainties, due to incompatible hacks:

The hack for complex numbers is to use np.nan (which is, technically, a complex number) as the na_value across all numeric types.  But that doesn't work for uncertainties, because uncertainties doesn't accept np.nan as an uncertain value.

The hack for uncertainties is to use pd.NA for na_value.  This works for Int64, Float64, and uncertainties, but doesn't work for complex (which cannot tolerate NAType).

Some careful subclassing fills in what doesn't easily work, with fixtures to prevent the improper mixing of complex and uncertainty types in the same python environment.  Happy to discuss!

Signed-off-by: Michael Tiemann <[email protected]>
@MichaelTiemannOSC (Collaborator, Author)

Shall we merge?

@hgrecco (Owner) commented Sep 14, 2023

Yes, but any idea why the test suite is not running for this PR?

@MichaelTiemannOSC (Collaborator, Author)

I don't know how the CI/CD works. I certainly haven't tried to do anything to evade it. Does anybody else know?

@keewis (Contributor) commented Sep 15, 2023

as far as I can tell, for PRs from new contributors (as far as the commit history is concerned), the maintainers have to manually approve CI every time there are new commits.

This is configurable (we have switched it off in xarray), but I think GitHub's intention is, by default, to avoid running CI on potentially malicious changes (like cryptocurrency mining) before a maintainer has reviewed them.

@keewis (Contributor) commented Sep 15, 2023

it is strange, though, that it doesn't tell you that. Maybe try closing / reopening the PR?

@MichaelTiemannOSC (Collaborator, Author)

Closing PR to reopen it (attempting soft reboot of CI/CD).

@MichaelTiemannOSC (Collaborator, Author)

Reopening PR to attempt soft reboot of CI/CD.

@MichaelTiemannOSC (Collaborator, Author)

It looks like there's a problem with a readthedocs key on Pint's side.

@keewis (Contributor) commented Sep 15, 2023

no, it's just that the configuration (.readthedocs.yaml) needs an update. As far as I can tell, system_packages: false can be dropped without any change in behavior. This would be an easy fix if you want to send in a PR.

After that is merged, you also wouldn't have to wait on maintainer approval for CI to run on this PR.

@hgrecco (Owner) commented Sep 15, 2023

Now it seems to be running. If all tests pass, I will merge right away!

@codspeed-hq bot commented Sep 15, 2023

CodSpeed Performance Report

Merging #1615 will not alter performance

Comparing MichaelTiemannOSC:parse-uncertainties (00f08f3) with master (2852f36)

Summary

✅ 421 untouched benchmarks

@hgrecco (Owner) commented Sep 15, 2023

Tests are failing. There seems to be an error due to a missing babel dependency for a test.

Here's hoping this fixes the CI/CD problem with test_1400.

Signed-off-by: Michael Tiemann <[email protected]>
@MichaelTiemannOSC (Collaborator, Author)

I'm taking a wild guess that because we now have a test case that depends on babel, we need babel as part of testbase. So I'm adding that there and saying a small prayer. If somebody knows how to actually fix this, I'm all ears!

Removing `system_packages: false` as suggested by @keewis

Signed-off-by: Michael Tiemann <[email protected]>
@MichaelTiemannOSC (Collaborator, Author)

Not my day: lint failed for a reason I don't understand/cannot reproduce: reserveCache failed: Cache already exists. Scope: refs/heads/parse-uncertainties, Key: [xxx], Version: 25d34710bdd63bed70a22b8204c3e1d0942dc237e098600e3896b4334c6d784f

@hgrecco (Owner) commented Sep 15, 2023

It is better that you use the decorator that is provided for this purpose:

requires_babel = pytest.mark.skipif(

Just import requires_babel and add @requires_babel to any test that requires babel.

Do not add babel as a requirement as it will make all tests slower.
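
A hedged sketch of the suggested pattern (the decorator name comes from the snippet above; the import path pint.testsuite.helpers is an assumption):

from pint.testsuite.helpers import requires_babel  # import path is an assumption

@requires_babel
def test_issue_1400():
    ...  # test body elided; the marker skips this test when babel is not installed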

@MichaelTiemannOSC (Collaborator, Author)

Thanks, I'll fix that. There are some new failures I need to fix, which I will:

FAILED pint/testsuite/test_compat.py::test_isnan[False] - TypeError: must be real number, not datetime.datetime
FAILED pint/testsuite/test_compat.py::test_isnan[True] - TypeError: must be real number, not datetime.datetime
FAILED pint/testsuite/test_compat.py::test_zero_or_nan[False] - TypeError: must be real number, not datetime.datetime
FAILED pint/testsuite/test_compat.py::test_zero_or_nan[True] - TypeError: must be real number, not datetime.datetime

Fix isnan to use unp.isnan as appropriate for both duck_array_type and objects of UFloat types.

Fix a minor typo in pint/facets/__init__.py comment.

In test_issue_1400, use decorators to ensure babel library is loaded when needed.

pyproject.toml: revert change to testbase; we fixed with decorators instead.

Signed-off-by: Michael Tiemann <[email protected]>
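
A hedged sketch of the kind of dispatch that commit describes (the actual change lives in pint's compat helpers and may differ):

import numpy as np
from uncertainties import UFloat, unumpy as unp

def _isnan(obj):
    # Route UFloat objects through uncertainties' own isnan; fall back to numpy,
    # and treat non-numeric objects (e.g. datetime.datetime) as not-NaN.
    if isinstance(obj, UFloat):
        return bool(unp.isnan(obj))
    try:
        return np.isnan(obj)
    except TypeError:
        return False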
@hgrecco (Owner) commented Sep 15, 2023

Looks good! Shall we merge @MichaelTiemannOSC ?

@MichaelTiemannOSC (Collaborator, Author)

Yes, thanks 🙏

@hgrecco merged commit 07646d0 into hgrecco:master on Sep 15, 2023
34 checks passed
@hgrecco (Owner) commented Sep 15, 2023

Congratulations and thanks for this excellent work!!

Successfully merging this pull request may close these issues: Pint + uncertainties in 2022.