
Pydantic v2 Overhaul [DNM yet] #321

Open
wants to merge 22 commits into master

Conversation

@Lnaden (Collaborator) commented Aug 9, 2023

Description

This PR upgrades the entire QCElemental package to use the Pydantic v2 system, entirely removing any and all calls to the v1 API.

This PR also preserves QCElemental's API as best it can, emitting deprecation warnings in places where we may want to upgrade, which can be a point of discussion.
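As a concrete illustration, here is a minimal sketch of the shim pattern in question, assuming a ProtoModel built on pydantic v2's BaseModel (the warning text matches what basemodels.py emits later in this thread; the rest is simplified):

import warnings

from pydantic import BaseModel

class ProtoModel(BaseModel):
    def dict(self, **kwargs):
        # v1-style entry point kept for compatibility; delegates to the v2 API
        warnings.warn("The `dict` method is deprecated; use `model_dump` instead.", DeprecationWarning)
        return self.model_dump(**kwargs)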

All tests pass locally; however, this should NOT be merged until a formal discussion can be had. Given the massive change this PR introduces, and considering how many packages depend on this repo, I want everyone to be sure about a full-stack upgrade before this is merged. This is a medium-priority but no-rush PR (@loriab and @bennybp specifically).

Happy to talk through and explain anything and everything.

Changelog description

Updates to Pydantic v2, rendering v1 incompatible.

Status

  • Code base linted
  • Ready to go

@Lnaden requested review from loriab and bennybp August 9, 2023 14:21
qcelemental/models/common_models.py (code scanning alert fixed)
qcelemental/models/types.py (code scanning alert fixed)
Lnaden added 2 commits August 10, 2023 12:05
…ires python 3.7.1+, but apparently pinning python to `^3.7`, which is supposed to be equivalent to `>=3.7, <4.0`, doesn't let the stack resolve, so pinning python to `^3.7.1` was the correct solution, even though it should have resolved...
codecov bot commented Aug 10, 2023

Codecov Report

Merging #321 (75e6aee) into master (c9e308c) will increase coverage by 0.04%.
Report is 2 commits behind head on master.
The diff coverage is 82.91%.


Lnaden added 3 commits August 10, 2023 12:09
…ther. I moved the commented block in case we needed to change it later.
…te, so split it up the hard way since there doesn't appear to be any other way to access the name.

This is the single most frustrating bug to track down, since the error is thrown in pydantic but doesn't backtrace through the pydantic.__getattr__ -> `AutoPydanticDocGenerator` -> `__get__` -> `doc_formatter` -> `annotation.__name__` chain because all the exceptions are caught the whole way up. GYA!
@Lnaden (Collaborator, Author) commented Aug 10, 2023

e32570b might be the most frustrating little bug I have ever fixed, because we and pydantic do such a good job of trapping exceptions all the way up the stack, and the actual issue was Python versions below 3.10.

@loriab (Collaborator) left a comment


This all looks very good to me. (Though basemodels.py is beyond my understanding.) I think I've picked up enough to migrate psi4's pydantic usage.

I'm not sure what a good plan for next steps is. Merge to a pydantic-next branch on the MolSSI repo while the rest of the stack builds out? conda packages off qcarchive?

@@ -13,7 +13,6 @@ jobs:
fail-fast: true
matrix:
python-version: ["3.7", "3.9", "3.11"]
pydantic-version: ["1", "2"]
Review comment (Collaborator):

Might want to leave this line in for now, to help the CI lanes' "Required" labels cope.

@@ -65,18 +84,24 @@ def parse_raw(cls, data: Union[bytes, str], *, encoding: Optional[str] = None) -
raise TypeError("Input is neither str nor bytes, please specify an encoding.")

if encoding.endswith(("json", "javascript", "pickle")):
return super().parse_raw(data, content_type=encoding)
# return super().parse_raw(data, content_type=encoding)
return cls.model_validate_json(data)
Review comment (Collaborator):

Should model_validate_json work on javascript or pickle? (Not that I know who uses javascript or pickle with this.)

Review comment (Collaborator):

I switched over to autodoc-pydantic in Spring 2022 (or thereabouts), so I should check whether this is still in use at all. Native Sphinx on pydantic is awful, so DGAS had something better, but autodoc-pydantic is far better still.

Lnaden added 2 commits August 11, 2023 10:48
…ping and pre Python3.9 in tandem with typing_extensions Annotation

Source information: beartype/beartype#42

Annotated checks against instances of _GenericAlias to see if you can support Data[type_info], e.g. NDArray[int] (native to Python typing since 3.9, and kind of as _GenericAlias before that). Prior to 3.9, NumPy implemented their own version of _GenericAlias which isn't Python's _GenericAlias, so the types are not considered "Generic" when Annotated[NDArray, Metadata][type_info] does its thing. This fix does the heavy lifting to re-cast the NDArray type with _GenericAlias from Python's typing module. I've tried to reuse as much data from NDArray as I possibly can and still use np.ndarray (which is not np.typing.NDArray) to still correctly type hint np.ndarrays.
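For orientation, a rough version-guarded sketch of the shape of that re-cast (illustrative only; the real code in qcelemental/models/types.py carries over more NDArray metadata, and typing._GenericAlias is private API whose signature varies by Python version):

import sys
from typing import TypeVar

import numpy as np

if sys.version_info >= (3, 9):
    # NDArray is already built on typing's generic machinery here
    from numpy.typing import NDArray
else:
    # Pre-3.9, numpy's home-grown alias is not typing._GenericAlias, so
    # re-wrap np.ndarray in typing's own machinery so that
    # Annotated[NDArray, Metadata][type_info] sees a "real" generic
    from typing import _GenericAlias

    ScalarType = TypeVar("ScalarType")
    NDArray = _GenericAlias(np.ndarray, (ScalarType,))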
@Lnaden (Collaborator, Author) commented Aug 11, 2023

I have just spent two whole days chasing down the two most subtle interactions ever: one was in the autodoc code (a pre-Python 3.10 issue with annotations), and one was with Annotated + numpy.typing.NDArray on pre-3.9 Python, where Annotated does instance checking against Python's native _GenericAlias objects but NumPy implemented their own. GYAH.

I'll be able to cycle back around to your comments now @loriab.

@loriab (Collaborator) commented Aug 15, 2023

Thanks for all your work on this so far, @Lnaden. I've updated the stack through psi4 with the pydantic v2 API and hit a couple of serialization bugs/changes that I hope you can either fix or advise on.

(1) The Datum class doesn't like to be json-serialized. (This shows up in psi4 vibrational analysis -- test tu4.) In qcelemental/tests/test_datum.py, this can be recreated by swapping .dict() out for .json(). The error is shown below.

def test_to_dict(dataset):
    listans = [i * 4 / 3 for i in range(4)]
    ans = {"label": "an array", "units": "cm^-1", "data": listans, "comment": "freqs", "numeric": True}

    #dicary = dataset["ndarray"].dict()  # fine
    #dicary = dataset["ndarray"].model_dump()  # fine but fails on compare_recursive. error below
    #dicary = dataset["ndarray"].json()  # error below
    dicary = dataset["ndarray"].model_dump_json()  # error below
    assert compare_recursive(ans, dicary, 9)

# .json() or .model_dump_json() error
#E       pydantic_core._pydantic_core.PydanticSerializationError: Error serializing to JSON: PydanticSerializationError: Unable to serialize unknown type: array([0.        , 1.33333333, 2.66666667, 4.        ])

# .model_dump() error
#E       AssertionError: assert False
#E        +  where False = compare_recursive({'comment': 'freqs', 'data': [0.0, 1.3333333333333333, 2.6666666666666665, 4.0], 'label': 'an array', 'numeric': True, ...}, {'comment': 'freqs', 'data': array([0.        , 1.33333333, 2.66666667, 4.        ]), 'doi': None, 'glossary': '', ...}, 9)

(2) optking uses qcel.util.serialization.json_dumps to make an AtomicResult json-serializable: https://github.com/psi-rking/optking/blob/master/optking/compute_wrappers.py#L84-L85. If I switch it to ret = json.loads(ret.model_dump_json()), it's fine (that is, the optimization succeeds, but I didn't examine the intermediates). Should this line need to change? It'd be nice for optking (which has no pydantic dependence) not to need to branch based on qcel.

…lization wrappers together. Who knew? Also fixed an issue where the default `model_dump` wasn't calling the correct settings.
@Lnaden (Collaborator, Author) commented Aug 15, 2023

@loriab Thanks for testing the downstream things for me to find some of these bugs.

(1) That was indeed a bug; two separate ones, in fact. I had updated the dict method but not the model_dump method for Datum, which is not a derivative of ProtoModel, so the code paths were different depending on which you called. I also learned that you cannot chain together WrapSerializer objects the way I had them, so I combined them into one function, which fixes your use cases there.

(2) Not sure what exactly to do here, because I don't understand your expected outputs. Can you tell me what the function was expecting, type-wise, and what it gets out in this v2 version? Explicit outputs would be super helpful.

I'm almost certain I can fix this on my end, I just need more details. One difference between Pydantic v1 and v2 that might impact this is in JSON encoding. In v1, the .json method took the model and cast it to a fully string-ified JSON blob with output type str. In v2, the analogous dump casts the model to a JSON-compatible dict (string keys, JSON-type values) rather than a single string. I don't know if that's related to what you need here?
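For reference, a minimal illustration of the two v2 dump flavors on a hypothetical model (plain pydantic v2, nothing QCElemental-specific):

from pydantic import BaseModel

class Point(BaseModel):
    x: float
    y: float

p = Point(x=1.0, y=2.0)

print(p.model_dump_json())        # '{"x":1.0,"y":2.0}' -- a str, like v1 .json()
print(p.model_dump(mode="json"))  # {'x': 1.0, 'y': 2.0} -- a JSON-compatible dict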

Thanks again for testing downstream things; this is a huge help in finding subtle interactions like the one from your (1).

@loriab (Collaborator) commented Aug 15, 2023

Thanks very much!

(1b) Still on the Datum part: it now gets through serializing all the real ndarrays, but it balks on complex ones (freqs for transition states). If you change dict -> json in test_complex_array, it gives E pydantic_core._pydantic_core.PydanticSerializationError: Error serializing to JSON: PydanticSerializationError: Unable to serialize unknown type: 1j

@@ -133,7 +141,7 @@ def test_complex_array():
         "numeric": True,
     }
 
-    dicary = datum1.dict()
+    dicary = datum1.json()
     assert compare_recursive(ans, dicary, 9)

(1c) This next one (still on Datum) I don't need, and I don't know if it's a behavior change; I just thought I'd run it by you since you've got a feel for serialization. It doesn't look like one can recreate the model once serialized to json, because arrays end up as strings.

(in test_datum.py)

def test_to_dict(dataset):
    listans = [i * 4 / 3 for i in range(4)]
    ans = {"label": "an array", "units": "cm^-1", "data": listans, "comment": "freqs", "numeric": True}

    jstr = dataset["ndarray"].model_dump_json()
    print(jstr)
    dicary = json.loads(jstr)
#    dicary = dataset["ndarray"].model_dump()
    print(dicary)
    model = qcel.Datum(**dicary)
    print(model.model_dump())
    print(ans)

    assert compare_recursive(ans, dicary, 9)
self = <[AttributeError("'Datum' object has no attribute '__pydantic_extra__'") raised in repr()] Datum object at 0x7fa0246782d0>, label = 'an array', units = 'cm^-1'
data = '[0.0, 1.3333333333333333, 2.6666666666666665, 4.0]'

    def __init__(self, label, units, data, *, comment=None, doi=None, glossary=None, numeric=True):
        kwargs = {"label": label, "units": units, "data": data, "numeric": numeric}
        if comment is not None:
            kwargs["comment"] = comment
        if doi is not None:
            kwargs["doi"] = doi
        if glossary is not None:
            kwargs["glossary"] = glossary
    
>       super().__init__(**kwargs)
E       pydantic_core._pydantic_core.ValidationError: 1 validation error for Datum
E       data
E         Value error, Datum data should be float, Decimal, or np.ndarray, not <class 'str'>. [type=value_error, input_value='[0.0, 1.3333333333333333....6666666666666665, 4.0]', input_type=str]
E           For further information visit https://errors.pydantic.dev/2.1/v/value_error

qcelemental/datum.py:90: ValidationError
-------------------------------------------------------------------------------------------- Captured stdout call ---------------------------------------------------------------------------------------------
{"numeric":true,"label":"an array","units":"cm^-1","data":"[0.0, 1.3333333333333333, 2.6666666666666665, 4.0]","comment":"freqs"}
{'numeric': True, 'label': 'an array', 'units': 'cm^-1', 'data': '[0.0, 1.3333333333333333, 2.6666666666666665, 4.0]', 'comment': 'freqs'}

@Lnaden (Collaborator, Author) commented Aug 15, 2023

@loriab

Should have resolved 1b and 1c now.

(1b) Since serialization doesn't chain anymore, I have to handle all cases in one function instead of letting the type detection handle it. Complex NumPy arrays are now cast to list, and then the complex values are also cast to list. So this should be handled.
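A hedged sketch of that casting, assuming complex entries become [real, imag] pairs (the actual serializer in this PR may choose a different representation):

import numpy as np

def jsonable_array(arr: np.ndarray):
    # Flatten to a plain list, then make complex entries JSON-compatible
    data = arr.ravel().tolist()
    if np.iscomplexobj(arr):
        data = [[z.real, z.imag] for z in data]
    return data

print(jsonable_array(np.array([1j, 2 + 3j])))  # [[0.0, 1.0], [2.0, 3.0]]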

(1c) The output of JSON strings is now correctly a list instead of a string-ified list. Technically speaking, your code wouldn't work on master anyway, since Datum doesn't take a list input, but the error you pointed out where the json version of data was a str(list(...)) should now correctly be a list(...).

@loriab (Collaborator) commented Aug 15, 2023

1b definitely resolved, thanks!

(1d) What's your view on numpy floats? I can easily solve this in my code by casting to float, but as-is one can create a Datum with a numpy float that won't json-ize.

>>> import qcelemental as qcel
>>> rT=0.5
>>> import numpy as np
>>> asdf = qcel.Datum("a np float", "J/K", np.sum(rT / np.expm1(rT) - np.log(1 - np.exp(-rT))))
>>> print(asdf)
----------------------------------------
            Datum a np float            
----------------------------------------
Data:     1.7034991708355878
Units:    [J/K]
doi:      None
Comment:  
Glossary: 
----------------------------------------
>>> type(asdf.data)
<class 'numpy.float64'>
>>> asdf.model_dump()
{'numeric': True, 'label': 'a np float', 'units': 'J/K', 'data': 1.7034991708355878}
>>> asdf.model_dump_json()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/psi/toolchainconda/envs/py311pyd2/lib/python3.11/site-packages/pydantic/main.py", line 345, in model_dump_json
    return self.__pydantic_serializer__.to_json(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
pydantic_core._pydantic_core.PydanticSerializationError: Error serializing to JSON: PydanticSerializationError: Unable to serialize unknown type: 1.7034991708355878
>>> 

@Lnaden (Collaborator, Author) commented Aug 15, 2023

I think that's something I can fix on my end; it should be casting to float correctly, dunno why it's not. It probably has to do with the "automatically inferred output type of the serialization function based on the type hint" and my not being generous enough.

@Lnaden (Collaborator, Author) commented Aug 17, 2023

@loriab

(1d) Fixed this by forcing the NumPy scalars to be cast to Python types. Wheeeeeeeeeeeeeee edge cases

This had nothing to do with the type-hint return idea, unfortunately.

try:
    # Cast NumPy scalar data types to native Python data type
    v = v.item()
except (AttributeError, ValueError):
    pass

Check notice (Code scanning / CodeQL): Empty except

'except' clause does nothing but pass and there is no explanatory comment.
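For context, a self-contained sketch of where a cast like this can live in pydantic v2; the model and serializer names here are hypothetical, not this PR's code:

from typing import Any

import numpy as np
from pydantic import BaseModel, field_serializer

class DatumSketch(BaseModel):
    data: Any

    @field_serializer("data")
    def _coerce_numpy_scalars(self, v: Any) -> Any:
        try:
            v = v.item()  # NumPy scalar -> native Python type
        except (AttributeError, ValueError):
            pass  # not a NumPy scalar; leave unchanged
        return v

print(DatumSketch(data=np.float64(1.7)).model_dump_json())  # {"data":1.7}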
@loriab (Collaborator) commented Aug 17, 2023

Confirmed (1d) fixed! thanks, @Lnaden

@loriab (Collaborator) commented Aug 18, 2023

Nice strategy with the deprecation warning on .dict and .json, btw. The first time through the full psi4 test suite, I triggered 89k of them :-). Now fixed, except for some choice leftovers, so we know it won't break for the user.

@Lnaden (Collaborator, Author) commented Aug 18, 2023

Thanks. I tried to preserve the API on our side as much as possible while keeping the deprecation warning. I can remove the warning on our side and just let people use both. There really shouldn't be any harm in it, so I can cut that if it's obnoxious.

@loriab (Collaborator) commented Aug 18, 2023

> Thanks. I tried to preserve the API on our side as much as possible while keeping the deprecation warning. I can remove the warning on our side and just let people use both. There really shouldn't be any harm in it, so I can cut that if it's obnoxious.

I think it's good to have the warning. It makes for a nice warning hunt :-)

I agree on allowing both for the foreseeable future. It would have been vexing to hunt all the dict()s down before facing all the real pydantic v2 api issues.

And they're not obnoxious -- pytest collects them very nicely.

tests/pytests/test_addons.py: 3 warnings
tests/pytests/test_detci_opdm.py: 1 warning
tests/pytests/test_standard_suite.py: 26288 warnings
tests/pytests/test_raises.py: 4 warnings
tests/pytests/test_mcmurchie_davidson.py: 1 warning
tests/pytests/test_restart.py: 5 warnings
tests/pytests/test_mints.py: 2 warnings
tests/pytests/test_misc.py: 2 warnings
tests/pytests/test_scf_options.py: 16 warnings
tests/pytests/test_addons_qcschema.py: 9 warnings
tests/pytests/test_qcfractal.py: 224 warnings
tests/pytests/test_molden_writer.py: 9 warnings
tests/pytests/test_nbody.py: 708 warnings
tests/pytests/test_mp2.py: 6 warnings
tests/pytests/test_dft_benchmarks.py: 2208 warnings
tests/pytests/test_mp2d.py: 34 warnings
tests/pytests/test_ccresponse.py: 2 warnings
tests/pytests/test_qcschema.py: 26 warnings
tests/pytests/test_composite.py: 26 warnings
tests/pytests/test_compositejk.py: 112 warnings
tests/pytests/test_nbody_multi_level_qcschema.py: 252 warnings
tests/pytests/test_option.py: 1 warning
tests/pytests/test_optking.py: 177 warnings
tests/pytests/test_psi4.py: 30 warnings
tests/pytests/test_psi4_qcschema.py: 34 warnings
tests/pytests/test_pyside_cubegen.py: 1 warning
tests/pytests/test_dft_blocking.py: 10 warnings
tests/pytests/test_dipoles.py: 60 warnings
tests/pytests/test_erisieve.py: 11 warnings
tests/pytests/test_fchk_writer.py: 12 warnings
tests/pytests/test_fcidump_energy.py: 2 warnings
tests/pytests/test_geometric.py: 31 warnings
tests/pytests/test_gradients.py: 403 warnings
tests/pytests/test_tdscf_excitations.py: 70 warnings
tests/pytests/test_vibanalysis.py: 10374 warnings
tests/pytests/test_tdscf_products.py: 10 warnings
tests/pytests/test_weird_basis.py: 12 warnings
tests/pytests/test_wfn.py: 1 warning
  /psi/gits/QCElemental/qcelemental/models/basemodels.py:133: DeprecationWarning: The `dict` method is deprecated; use `model_dump` instead.
    warnings.warn("The `dict` method is deprecated; use `model_dump` instead.", DeprecationWarning)

…ich calls different logic overall than to_jsonable_python for some reason I don't understand
from typing import Any, Optional, Sequence, Tuple

try:
    from pydantic.v1.typing import ReprArgs
except ImportError:  # Will also trap ModuleNotFoundError
    from pydantic.typing import ReprArgs

ReprArgs = Sequence[Tuple[Optional[str], Any]]

Check notice (Code scanning / CodeQL): Unused global variable

The global variable 'ReprArgs' is not used.
@awvwgk (Contributor) commented Mar 13, 2024

Is this effort still active? Having QCElemental with Pydantic v2 support would be great; however, a smooth transition from the v1 to the v2 API would be appreciated. Currently QCElemental relies on the v1 API provided in Pydantic v2. Instead of directly switching to the v2 API and removing support for v1, would it be possible to have both available, depending on which Pydantic API is installed? An example of this could be the approach taken in the beanie package.
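A minimal sketch of one way to do that kind of dual-API guard (illustrative; beanie's actual mechanism may differ):

import pydantic

PYDANTIC_V2 = int(pydantic.VERSION.split(".")[0]) >= 2

if PYDANTIC_V2:
    from pydantic import BaseModel, field_validator  # native v2 API
else:
    from pydantic import BaseModel, validator  # native v1 API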

@loriab (Collaborator) commented Mar 15, 2024

> Is this effort still active? ...

It's still active, and it works (except for one line with optking). And downstream v2-proper PRs are largely ready for QCEngine and Psi4. It's frustrating that pydantic hasn't provided a per-model upgrade scheme, because, as is, whole software stacks need to be migrated together. Last I heard, OpenFF was having trouble with their v2 migration. Any updates, @mattwthompson?

Thanks for the beanie example. I think that'd work pretty cleanly for QCEngine and maybe Psi4. QCElemental might need some completely separate files for v1/v2, since it accesses low-level pydantic stuff more. But I agree something needs to be done -- I'm translating new v2 models back to v1 just so I can actually use them in the stack.

@mattwthompson (Contributor) commented

Our "proper" v2 migration is tied up in a huge mess of things and stalled; the crux of it is getting our custom pint registry to work in Pydantic models without breaking all of our downstream users' APIs:

https://github.com/openforcefield/openff-models/blob/main/openff/models/types.py#L20-L37

@loriab (Collaborator) commented Mar 16, 2024

Thanks for the update, @mattwthompson. After I get through these next papers, I'll try a dual v1/v2 version of qcel. But I bet one wants a triple version for complete compatibility -- v1, v2-as-v1, and v2.

Comment on lines +18 to +19
# Encoders, to be deprecated at some point
ndarray_encoder = {np.ndarray: lambda v: v.flatten().tolist()}
Review comment (Contributor):

Is this used in these models? I see similar code elsewhere (https://github.com/MolSSI/QCElemental/pull/321/files#diff-451348f65352b6379290ba87c966a55901f8fd840a77da1f09bcaf8fbdcc7485L46-L49) but not using this variable

@Lnaden (Collaborator, Author) replied:

That might be leftover from when I was doing the initial refactor pass. They changed the way they do encoders for custom stuff to such a degree that I might have added that, then had to change the internals later on and never cycled back to this line.

@Lnaden (Collaborator, Author) replied:

Basically see the entire "types" file where I had to add a bunch of new validation logic: https://github.com/MolSSI/QCElemental/pull/321/files#diff-47aa66e9bfa663f2f9b398bd05273dad682d226378e90ba94d633f0bbf0a7885R1-R96
