Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Pydantic side of reusing validators and serializers #10246

Draft
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

BoxyUwU
Copy link
Contributor

@BoxyUwU BoxyUwU commented Aug 27, 2024

Change Summary

Python side of pydantic/pydantic-core#1414, implements a new core schema variant for reusing schema, validators, and serializers for some sizeable performance wins right now and also lots of future possibilities for more wins.

Future work

We do not use the new nested-model schema in all applicable circumstances which leaves performance on the table. Below I've outlined what those cases are and why I did so:

  • When generating schema for union choices we fallback to the old schema generation strategy and do not use nested-model as the apply_discriminators logic requires being able to recurse into the models which is not possible with nested-model.
  • When generating schema for root models we do not use nested-model for the usage of the inner model as if we reused schema for it via __pydantic_core_schema__ for a union choice, then once again apply_discriminators logic would not be able to recurse into the model's fields.
  • We also do not use nested-model to refer to generic schemas, I do not believe there to be any problem with using nested-model here conceptually. However, in practice this hits KeyError happening during schema cleaning with generic models #10279 so we fallback to the old generation strategy when encountering usages of generic schemas.
  • The last exception is that when we have a field types as BaseModel, e.g:
    class MyModel(BaseModel):
        field: BaseModel
    We do not use a nested-model for the field's schema, this caused some issues (that I can't quite remember, though it ought to be simple to disable the exception and see what breaks).
  • The schema is only used for models but it should be capable of supporting arbitrary schema/validator/serializer reuse and then we can reuse these things for dataclasses and typed dicts. I avoided implementing this right now since it seemed like a lot of extra work. (Though the name nested-model should be changed before anything is landed here)

I would expect that removing these special cases is a good opportunity for performance improvements. My assumption would be that allowing nested-model in union choices, using nested-model to refer to generic schemas, and reusing schema/validators/serializers on dataclasses/typed dicts, would be the most important wins.

I have opened an issue #10394 to track this future work

Before merging

  • Most of the test failures at this point are due to incorrect handling of json schema generation. The current implementation when encountering a nested-model schema recurses into the model as if the core schema was inlined. This is incorrect when encountering cycles as it has cycle check and also does not create any $defs in the json schema.

    The solution here would be to search for any nested-model in the core schema and create a $def in the json schema for the referenced model. This would allow us to correctly handle mutually recursive model definitions.

  • Remove the commit that changes CI to build against `pydantic_core/boxy/validator_serializer_reuse


Related issue number

Checklist

  • The pull request title is a good summary of the changes - it will be used in the changelog
  • Unit tests for the changes exist
  • Tests pass on CI
  • Documentation reflects the changes where applicable
  • My PR is ready to review, please add a comment including the phrase "please review" to assign reviewers

@github-actions github-actions bot added the relnotes-fix Used for bugfixes. label Aug 27, 2024
Copy link

cloudflare-workers-and-pages bot commented Aug 28, 2024

Deploying pydantic-docs with  Cloudflare Pages  Cloudflare Pages

Latest commit: f2de5d5
Status: ✅  Deploy successful!
Preview URL: https://a1b64343.pydantic-docs.pages.dev
Branch Preview URL: https://boxy-validator-serializer-re.pydantic-docs.pages.dev

View logs

@davidhewitt davidhewitt force-pushed the boxy/validator_serializer_reuse branch 5 times, most recently from 33c12f5 to 84c1e55 Compare August 28, 2024 14:26
Copy link

codspeed-hq bot commented Aug 28, 2024

CodSpeed Performance Report

Merging #10246 will improve performances by 21.97%

Comparing boxy/validator_serializer_reuse (f2de5d5) with main (8b6d5fc)

Summary

⚡ 3 improvements
✅ 44 untouched benchmarks
🆕 1 new benchmarks
⁉️ 1 dropped benchmarks

⚠️ Please fix the performance issues or acknowledge them on CodSpeed.

Benchmarks breakdown

Benchmark main boxy/validator_serializer_reuse Change
test_fastapi_startup_perf 254 ms 208.3 ms +21.97%
test_fastapi_startup_perf 31 ms 28.3 ms +9.84%
test_construct_schema 4.9 ms 4.6 ms +5.99%
⁉️ test_nested_model_schema_generation 3.6 ms N/A N/A
🆕 test_nested_schema_generation N/A 3.6 ms N/A

@BoxyUwU BoxyUwU force-pushed the boxy/validator_serializer_reuse branch from 84c1e55 to 34e2488 Compare August 28, 2024 15:12
@BoxyUwU BoxyUwU force-pushed the boxy/validator_serializer_reuse branch from 0085c9d to 3777d9b Compare September 2, 2024 15:41
@BoxyUwU BoxyUwU force-pushed the boxy/validator_serializer_reuse branch from 3777d9b to 359dac5 Compare September 9, 2024 15:55
Comment on lines 837 to 840
def nested_model_schema(self, schema: core_schema.NestedModelSchema) -> JsonSchemaValue:
new_schema = cast('type[BaseModel]', schema['model']).__pydantic_core_schema__
json_schema = self.model_schema(new_schema)
return json_schema

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not the correct way of handling nested-model schemas when generating json schemas

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you elaborate more on what next steps we might take here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wrote about this in the PR description


def foo() -> tuple[CoreSchema, SchemaValidator, SchemaSerializer]:
if obj.__pydantic_core_schema__ is None or MockCoreSchema:
obj.model_rebuild()
Copy link
Contributor Author

@BoxyUwU BoxyUwU Sep 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a distinction between calling model_rebuild vs calling rebuild() on the actual Mock objects?

Regardless- we have to build the validator/serializer/schema when they're mocks as otherwise our model will not be able to be validated/serialized/etc as its schema would be incomplete in cases such as the following:

class A(BaseModel):
    field_a: List['B']

class B(BaseModel):
    field_b: A

# validate an instance of `B` before `A` gets a chance to be rebuilt

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we could just skip this optimization for cases like this to avoid calling model_rebuild from core. @davidhewitt might know.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For future readers: cc pydantic/pydantic-core#1414 (comment)

Comment on lines +35 to +36
ref: boxy/validator_serializer_reuse
path: pydantic-core
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool, we should probably do this against pydantic-core main sometimes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

definitely seems like the kind of thing that could make sense to do before a release

Copy link
Member

@sydney-runkle sydney-runkle left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

General questions :)

One thing that would be super helpful here is if you could write up an issue (on pydantic-core probably) detailing the high level ideas associated with these PRs - what change are you making? Why does it help / matter? What challenges are notable? What are the most important next steps?

pydantic/_internal/_generate_schema.py Outdated Show resolved Hide resolved

def foo() -> tuple[CoreSchema, SchemaValidator, SchemaSerializer]:
if obj.__pydantic_core_schema__ is None or MockCoreSchema:
obj.model_rebuild()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we could just skip this optimization for cases like this to avoid calling model_rebuild from core. @davidhewitt might know.

pydantic/_internal/_generate_schema.py Outdated Show resolved Hide resolved
Comment on lines 837 to 840
def nested_model_schema(self, schema: core_schema.NestedModelSchema) -> JsonSchemaValue:
new_schema = cast('type[BaseModel]', schema['model']).__pydantic_core_schema__
json_schema = self.model_schema(new_schema)
return json_schema

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you elaborate more on what next steps we might take here?

Comment on lines +803 to +806
# validating `MyModel` requires building `MyNestedModel` as its validator/serializer is reused.
assert isinstance(MyNestedModel.__pydantic_validator__, SchemaValidator)
assert isinstance(MyNestedModel.__pydantic_serializer__, SchemaSerializer)
assert generate_schema_calls.count == 2, 'Should not build duplicated core schemas'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why the removal of the defer_build check?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Validating/Serializing MyModel reuses the validator/serializer from MyNestedModel, so if they are mocks then we automatically rebuild them for use in validation of MyModel. This means that even though we defer the build of MyNestedModel it gets built anyway later on.

This is a good example of a case where this PR can reduce amount of work we do. MyModel always required the core schema of MyNestedModel in order to validate but previously it was generated twice, once inline in MyModel and once separately when rebuilding MyNestedModel later. Now the schema is only generated once and reused between MyModel and MyNestedModel

@BoxyUwU BoxyUwU force-pushed the boxy/validator_serializer_reuse branch from ab9b6c6 to f2de5d5 Compare September 12, 2024 14:56
@sydney-runkle
Copy link
Member

cc @Viicos, perhaps going to pick up this work after your current feature progress

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
relnotes-fix Used for bugfixes.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants