Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

detect breaking changes #394

Open
wants to merge 30 commits into
base: main
Choose a base branch
from
Open

detect breaking changes #394

wants to merge 30 commits into from

Conversation

sungwy
Copy link
Collaborator

@sungwy sungwy commented Feb 7, 2024

Implement #334

  • Uses griffe to detect breaking changes since the latest release
  • Allows users to specify an exclusion list that will silence test failures on the introduction of intended breaking changes

@sungwy sungwy marked this pull request as draft February 7, 2024 22:33
@sungwy
Copy link
Collaborator Author

sungwy commented Feb 8, 2024

The CI currently reports 58 breaking changes between 0.5.1 and 976c0ea. (I'm trying to understand in which PR these breaking changes were introduced)

When test_api.py was first written, it detected just 1 breaking change related to the backward incompatible Daft change:

Example stack trace:

assert not [{'kind': <BreakageKind.ATTRIBUTE_CHANGED_VALUE: 'Attribute value was changed'>, 'new_value': ExprDict(keys=['1', '2'], values=[ExprCall(function=ExprName(name='StructType', parent=Module(PosixPath('/tmp/griffe-worktree-iceberg-python-HEAD-1b_iv7st/griffe_HEAD/pyiceberg/manifest.py'))), arguments=[ExprCall(function=ExprName(name='NestedField', parent=Module(PosixPath('/tmp/griffe-worktree-iceberg-python-HEAD-1b_iv7st/griffe_HEAD/pyiceberg/manifest.py'))), arguments=[ExprKeyword(name='field_id', value='100', function=ExprName(name='NestedField', parent=Module(PosixPath('/tmp/griffe-worktree-iceberg-python-HEAD-1b_iv7st/griffe_HEAD/pyiceberg/manifest.py')))), ExprKeyword(name='name', value="'file_path'", function=ExprName(name='NestedField', parent=Module(PosixPath('/tmp/griffe-worktree-iceberg-python-HEAD-1b_iv7st/griffe_HEAD/pyiceberg/manifest.py')))), ExprKeyword(name='field_type', value=ExprCall(function=ExprName(name='StringType', parent=Module(PosixPath('/tmp/griffe-worktree-iceberg-python-HEAD-1b_iv7st/griffe_HEAD/pyiceberg/manifest.py'))), arguments=[]), function=ExprName(name='NestedField', parent=Module(PosixPath('/tmp/griffe-worktree-iceberg-python-HEAD-1b_iv7st/griffe_HEAD/pyiceberg/manifest.py')))), ExprKeyword(name='required', value='True', function=ExprName(name='NestedField', parent=Module(PosixPath('/tmp/griffe-worktree-iceberg-python-HEAD-1b_iv7st/griffe_HEAD/pyiceberg/manifest.py')))), ExprKeyword(name='doc', value="'Location URI with FS scheme'", function=ExprName(name='NestedField', parent=Module(PosixPath('/tmp/griffe-worktree-iceberg-python-HEAD-1b_iv7st/griffe_HEAD/pyiceberg/manifest.py'))))]), ExprCall(function=ExprName(name='NestedField', parent=Module(PosixPath('/tmp/griffe-worktree-iceberg-python-HEAD-1b_iv7st/griffe_HEAD/pyiceberg/manifest.py'))), arguments=[ExprKeyword(name='field_id', value='101', function=ExprName(name='NestedField', parent=Module(PosixPath('/tmp/griffe-worktree-iceberg-python-HEAD-1b_iv7st/griffe_HEAD/pyiceberg/manifest.py')))), ExprKeyword(name='name', value="'file_format'", function=ExprName(name='NestedField', parent=Module(PosixPath('/tmp/griffe-worktree-iceberg-python-HEAD-1b_iv7st/griffe_HEAD/pyiceberg/manifest.py')))), ExprKeyword(name='field_type', value=ExprCall(function=ExprName(name='StringType', parent=Module(PosixPath('/tmp/griffe-worktree-iceberg-python-HEAD-1b_iv7st/griffe_HEAD/pyiceberg/manifest.py'))), arguments=[]), function=ExprName(name='NestedField', parent=Module(PosixPath('/tmp/griffe-worktree-iceberg-python-HEAD-1b_iv7st/griffe_HEAD/pyiceberg/manifest.py')))), ExprKeyword(name='required', value='True', function=ExprName(name='NestedField', parent=Module(PosixPath('/tmp/griffe-worktree-iceberg-python-HEAD-1b_iv7st/griffe_HEAD/pyiceberg/manifest.py')))), ExprKeyword(name='doc', value="'File format name: avro, orc, or parquet'", function=ExprName(name='NestedField', parent=Module(PosixPath('/tmp/griffe-worktree-iceberg-python-HEAD-1b_iv7st/griffe_HEAD/pyiceberg/manifest.py'))))]), ExprCall(function=ExprName(name='NestedField', parent=Module(PosixPath('/tmp/griffe-worktree-iceberg-python-HEAD-1b_iv7st/griffe_HEAD/pyiceberg/manifest.py'))), arguments=[ExprKeyword(name='field_id', value='102', function=ExprName(name='NestedField', parent=Module(PosixPath('/tmp/griffe-worktree-iceberg-python-HEAD-1b_iv7st/griffe_HEAD/pyiceberg/manifest.py')))), ExprKeyword(name='name', value="'partition'", function=ExprName(name='NestedField', parent=Module(PosixPath('/tmp/griffe-worktree-iceberg-python-HEAD-1b_iv7st/griffe_HEAD/pyiceberg/manifest.py')))), ExprKeyword(name='field_type', value=ExprCall(function=ExprName(name='StructType', parent=Module(PosixPath('/tmp/griffe-worktree-iceberg-python-HEAD-1b_iv7st/griffe_HEAD/pyiceberg/manifest.py'))), arguments=[]), function=ExprName(name='NestedField', parent=Module(PosixPath('/tmp/griffe-worktree-iceberg-python-HEAD-1b_iv7st/griffe_HEAD/pyiceberg/manifest.py')))), ExprKeyword(name='required', value='True', function=ExprName(name='NestedField', parent=Module(PosixPath('/tmp/griffe-worktree-iceberg-python-HEAD-1b_iv7st/griffe_HEAD/pyiceberg/manifest.py')))), ExprKeyword(name='doc', value="'Partition data tuple, schema based on the partition spec'", function=ExprName(name='NestedField', parent=Module(PosixPath('/tmp/griffe-worktree-iceberg-python-HEAD-1b_iv7st/griffe_HEAD/pyiceberg/manifest.py'))))]), ExprCall(function=ExprName(name='NestedField', parent=Module(PosixPath('/tmp/griffe-worktree-iceberg-python-HEAD-1b_iv7st/griffe_HEAD/pyiceberg/manifest.py'))), arguments=[ExprKeyword(name='field_id', value='103', function=ExprName(name='NestedField', parent=Module(PosixPath('/tmp/griffe-worktree-iceberg-python-HEAD-1b_iv7st/griffe_HEAD/pyiceberg/manifest.py')))), ExprKeyword(name='name', value="'record_count'", function=ExprName(name='NestedField', parent=Module(PosixPath('/tmp/griffe-worktree-iceberg-python-HEAD-1b_iv7st/griffe_HEAD/pyiceberg/manifest.py')))), ExprKeyword(name='field_type', value=ExprCall(function=ExprName(name='LongType', parent=Module(PosixPath('/tmp/griffe-worktree-iceberg-python-HEAD-1b_iv7st/griffe_HEAD

@sungwy
Copy link
Collaborator Author

sungwy commented Feb 12, 2024

Which of these breaking change types should we include in our unit test?

I've currently removed ATTRIBUTE_CHANGED_VALUE type because it was creating a lot of noise in the output.

Example:

{'kind': <BreakageKind.ATTRIBUTE_CHANGED_VALUE: 'Attribute value was changed'>, 'object_path': 'pyiceberg.catalog.hive.DEFAULT_PROPERTIES', 'old_value': ExprDict(keys=["'write.parquet.compression-codec'"], values=["'zstd'"]), 'new_value': ExprDict(keys=[ExprAttribute(values=[ExprName(name='TableProperties', parent=Module(PosixPath('/tmp/griffe-worktree-iceberg-python-HEAD-v_k3_7xc/griffe_HEAD/pyiceberg/catalog/hive.py'))), ExprName(name='PARQUET_COMPRESSION', parent=ExprName(name='TableProperties', parent=Module(PosixPath('/tmp/griffe-worktree-iceberg-python-HEAD-v_k3_7xc/griffe_HEAD/pyiceberg/catalog/hive.py'))))])], values=[ExprAttribute(values=[ExprName(name='TableProperties', parent=Module(PosixPath('/tmp/griffe-worktree-iceberg-python-HEAD-v_k3_7xc/griffe_HEAD/pyiceberg/catalog/hive.py'))), ExprName(name='PARQUET_COMPRESSION_DEFAULT', parent=ExprName(name='TableProperties', parent=Module(PosixPath('/tmp/griffe-worktree-iceberg-python-HEAD-v_k3_7xc/griffe_HEAD/pyiceberg/catalog/hive.py'))))])])}

@sungwy
Copy link
Collaborator Author

sungwy commented Feb 13, 2024

Breakage Kind Description Ignore Rationale
PARAMETER_MOVED Positional parameter was moved
PARAMETER_REMOVED Parameter was removed
PARAMETER_CHANGED_KIND Parameter kind was changed
PARAMETER_CHANGED_DEFAULT Parameter default was changed
PARAMETER_CHANGED_REQUIRED Parameter is now required
PARAMETER_ADDED_REQUIRED Parameter was added as required
RETURN_CHANGED_TYPE Return types are incompatible
OBJECT_REMOVED Public object was removed
OBJECT_CHANGED_KIND Public object points to a different kind of object
ATTRIBUTE_CHANGED_TYPE Attribute types are incompatible
ATTRIBUTE_CHANGED_VALUE Attribute value was changed Too noisy, and in most cases not a serious breaking change (like changing a constant value for Table Property)
CLASS_REMOVED_BASE Base class was removed

@Fokko
Copy link
Contributor

Fokko commented Feb 14, 2024

Thanks for setting this up @syun64. This looks great. I think we can just give it a try after the 0.6.0 release and see how noisy it is.

@Fokko Fokko added this to the PyIceberg 0.7.0 release milestone Feb 14, 2024
sungwy and others added 7 commits February 17, 2024 15:16
Bumps [getdaft](https://github.com/Eventual-Inc/Daft) from 0.2.14 to 0.2.15.
- [Release notes](https://github.com/Eventual-Inc/Daft/releases)
- [Commits](Eventual-Inc/Daft@v0.2.14...v0.2.15)

---
updated-dependencies:
- dependency-name: getdaft
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Bumps [moto](https://github.com/getmoto/moto) from 5.0.1 to 5.0.2.
- [Release notes](https://github.com/getmoto/moto/releases)
- [Changelog](https://github.com/getmoto/moto/blob/master/CHANGELOG.md)
- [Commits](getmoto/moto@5.0.1...5.0.2)

---
updated-dependencies:
- dependency-name: moto
  dependency-type: direct:development
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Bumps [mkdocs-material](https://github.com/squidfunk/mkdocs-material) from 9.5.9 to 9.5.10.
- [Release notes](https://github.com/squidfunk/mkdocs-material/releases)
- [Changelog](https://github.com/squidfunk/mkdocs-material/blob/master/CHANGELOG)
- [Commits](squidfunk/mkdocs-material@9.5.9...9.5.10)

---
updated-dependencies:
- dependency-name: mkdocs-material
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* Make the snapshot creation part of the `Transaction`

This is also how it is done in Java, and I really like it
since it allows you to easily queue up updates in a transaction.
For example, an update to the schema.

* Extend the API
@sungwy sungwy marked this pull request as ready for review February 20, 2024 20:31
@Fokko
Copy link
Contributor

Fokko commented Feb 20, 2024

@syun64 I was on a merging spree, can you rebase once more? 😓

nastra and others added 5 commits February 20, 2024 22:24
…credentials/remote signing (apache#436)

* Send X-Iceberg-Access-Delegation header to signal support for vended credentials/remote signing

Clients can optionally send this header to signal which delegated access pattern it can support.
At this point the iceberg-python client can support `vended-credentials` and `remote-signing`,
thus we can always send this header.
Addtional details about this header can be found in the
REST OpenAPI spec: https://github.com/apache/iceberg/blob/main/open-api/rest-catalog-open-api.yaml#L1459-L1483

* Update rest.py
* Refresh Auth token on expiry

* Check call count

* Add test to cover retry logic

* Update poetry.lock with tenacity

* Fix tests for Python <= 3.9
* Reuse commit-uuid as the write-uuid

* Fix conflicts

* Cleanup

* cleanup
* update name-mapping

* Update __init__.py

Co-authored-by: Fokko Driesprong <[email protected]>

* Update pyiceberg/table/name_mapping.py

Co-authored-by: Fokko Driesprong <[email protected]>

* validation mode after

* type

---------

Co-authored-by: Fokko Driesprong <[email protected]>
* Feat: Add fail_if_exists param to create_table

* create create_table_if_not_exists method

* fix reset test

* fix mypy check
@Fokko
Copy link
Contributor

Fokko commented Feb 22, 2024

Looks like we broke something already 😸 Can we make a list to allow breaking changes? Similar to https://github.com/apache/parquet-mr/blob/d8396086b3e3fefc6829f8640917c3bbde0fa9c4/pom.xml#L581-L606

@sungwy
Copy link
Collaborator Author

sungwy commented Feb 22, 2024

Looks like we broke something already 😸 Can we make a list to allow breaking changes? Similar to https://github.com/apache/parquet-mr/blob/d8396086b3e3fefc6829f8640917c3bbde0fa9c4/pom.xml#L581-L606

Sure! Thank you for the suggestion. I took a stab at implementing this with a YAML file - let me know what you think!

@sungwy sungwy requested a review from Fokko February 22, 2024 16:59
exclude:
- obj_path: pyiceberg.avro.decoder_fast.CythonBinaryDecoder
kind: CLASS_REMOVED_BASE
- obj_path: pyiceberg.table.create_mapping_from_schema
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one seems correct, but not exactly what I expected: fd9dc88#diff-23e8153e0fd497a9212215bd2067068f3b56fa071770c7ef326db3d3d03cee9bL89

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that's interesting that it treats imports from other modules to be public objects as well

#

# The format of this file is documented at
# https://cwiki.apache.org/confluence/display/INFRA/Git+-+.asf.yaml+features
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:)

kind: OBJECT_REMOVED
- obj_path: pyiceberg.avro.decoder_fast.BinaryDecoder
kind: OBJECT_REMOVED
- obj_path: pyiceberg.avro.decoder_fast.CythonBinaryDecoder.tell
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm pretty sure that the Cython stuff didn't change.

Copy link
Collaborator Author

@sungwy sungwy Feb 28, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is getting pretty interesting!

As you mentioned, from the diff it looks like CythonBinaryDecoder wasn't updated, but when I run help() on pyiceberg.avro.decoder_fast.CythonBinaryDecoder, I'm loading:

class CythonBinaryDecoder(builtins.object)
 |  Implement a BinaryDecoder that reads from an in-memory buffer.
 |  
 |  Methods defined here:
 |  
 |  __reduce__ = __reduce_cython__(...)
 |  
 |  __setstate__ = __setstate_cython__(...)
 |  
 |  read(self, n: 'int')
 |      Read n bytes.
 |  
 |  read_boolean(self) -> 'bool'
 |      Reads a value from the stream as a boolean.
 |      
 |      A boolean is written as a single byte
 |      whose value is either 0 (false) or 1 (true).
 |  
 |  read_bytes(self)
 |      Bytes are encoded as a long followed by that many bytes of data.
 |  
 |  read_double(self)
 |      Reads a value from the stream as a double.
 |      
 |      A double is written as 8 bytes.
 |      The double is converted into a 64-bit integer using a method equivalent to
 |      Java's doubleToLongBits and then encoded in little-endian format.
 |  
 |  read_float(self)
 |      Reads a value from the stream as a float.
 |      
 |      A float is written as 4 bytes.
 |      The float is converted into a 32-bit integer using a method equivalent to
 |      Java's floatToIntBits and then encoded in little-endian format.
...

Which might explain why methods like read_double, read_float are being reported to having their types changed, whereas read_boolean is being reported to having maintained the same interface.

Maybe there was a change that made Griffe take this definition into account, and it wasn't prior to that

Copy link
Collaborator Author

@sungwy sungwy Feb 28, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alright, so it looks like this false positive is being generated because we are using two separate approaches for loading the Package in griffe. We use load_git to load pyiceberg's latest release directly from Git, and then we use load to load the local implementation of pyiceberg with the coder's current change. Tracking the outputs of CythonBinaryDecoder for the two methods yields different interpretations of the Cython class's interface:

>>> current_git = griffe.load_git("pyiceberg")
>>> current_git['avro']['decoder_fast']['CythonBinaryDecoder'].relative_filepath
PosixPath('/tmp/griffe-worktree-iceberg-python-HEAD-v1wuhe7z/griffe_HEAD/pyiceberg/avro/decoder_fast.pyi')
>>> current_git.all_members['avro']['decoder_fast']['CythonBinaryDecoder'].all_members
{'__init__': Function('__init__', 21, 22), 'tell': Function('tell', 24, 25), 'read': Function('read', 27, 28), 'read_boolean': Function('read_boolean', 30, 31), 'read_int': Function('read_int', 33, 34), 'read_ints': Function('read_ints', 36, 37), 'read_int_bytes_dict': Function('read_int_bytes_dict', 39, 40), 'read_bytes': Function('read_bytes', 42, 43), 'read_float': Function('read_float', 45, 46), 'read_double': Function('read_double', 48, 49), 'read_utf8': Function('read_utf8', 51, 52), 'skip': Function('skip', 54, 55), 'skip_int': Function('skip_int', 57, 58), 'skip_boolean': Function('skip_boolean', 60, 61), 'skip_float': Function('skip_float', 63, 64), 'skip_double': Function('skip_double', 66, 67), 'skip_bytes': Function('skip_bytes', 69, 70), 'skip_utf8': Function('skip_utf8', 72, 73)}
>>> current_local = griffe.load("pyiceberg")
>>> current_local['avro']['decoder_fast']['CythonBinaryDecoder'].relative_filepath
PosixPath('pyiceberg/avro/decoder_fast.cpython-38-x86_64-linux-gnu.so')
>>> current_local.all_members['avro']['decoder_fast']['CythonBinaryDecoder'].all_members
{'__doc__': Attribute('__doc__', None, None), '__new__': Function('__new__', None, None), '__pyx_vtable__': Attribute('__pyx_vtable__', None, None), '__reduce__': Function('__reduce__', None, None), '__setstate__': Function('__setstate__', None, None), 'read': Function('read', None, None), 'read_boolean': Function('read_boolean', None, None), 'read_bytes': Function('read_bytes', None, None), 'read_double': Function('read_double', None, None), 'read_float': Function('read_float', None, None), 'read_int': Function('read_int', None, None), 'read_int_bytes_dict': Function('read_int_bytes_dict', None, None), 'read_ints': Function('read_ints', None, None), 'read_utf8': Function('read_utf8', None, None), 'skip': Function('skip', None, None), 'skip_boolean': Function('skip_boolean', None, None), 'skip_bytes': Function('skip_bytes', None, None), 'skip_double': Function('skip_double', None, None), 'skip_float': Function('skip_float', None, None), 'skip_int': Function('skip_int', None, None), 'skip_utf8': Function('skip_utf8', None, None), 'tell': Function('tell', None, None)}

So it looks like only the local load is reading the .so, whereas load_git is reading the pyi file.

Maybe we exclude cython APIs from the test while we figure out how to resolve this issue? @Fokko

@Fokko
Copy link
Contributor

Fokko commented Feb 28, 2024

Hey @syun64 Thanks for adding the yaml, that looks neat. What's your gist of Griffe? It looks like there are already some false positives.

@sungwy
Copy link
Collaborator Author

sungwy commented Feb 29, 2024

Hey @syun64 Thanks for adding the yaml, that looks neat. What's your gist of Griffe? It looks like there are already some false positives.

Thank you for all the feedback @Fokko ! My general impression with griffe is that it's working pretty well, and after carefully reviewing and adding even more breaking changes we've introduced since the last release, I feel convinced that we should include some implementation of this tool.

There are some edge cases that I thought were worth summarizing:

  1. Cython Class interface inference isn't perfect. As I noted in this comment, either the .so or .pyi files are being used. And when .so file is used to interprete the interface, it returns None return types for all of its members:
>>> griffe.load("pyiceberg")['avro']['decoder_fast']['CythonBinaryDecoder'].relative_filepath
PosixPath('pyiceberg/avro/decoder_fast.cpython-38-x86_64-linux-gnu.so')
>>> griffe.load("pyiceberg")['avro']['decoder_fast']['CythonBinaryDecoder']['tell'].returns
>>> griffe.load_git("pyiceberg")['avro']['decoder_fast']['CythonBinaryDecoder'].relative_filepath
PosixPath('/tmp/griffe-worktree-iceberg-python-HEAD-nu9f66g2/griffe_HEAD/pyiceberg/avro/decoder_fast.pyi')
>>> griffe.load_git("pyiceberg")['avro']['decoder_fast']['CythonBinaryDecoder']['tell'].returns
ExprName(name='int', parent=Class('CythonBinaryDecoder', 20, 73))
  1. Import removal is also considered as a breaking change because it's a public function/Class that was available on the module, that's been removed.

Outside of these two edge cases, I think all the exclusions listed are very accurate, and I think the test will require the contributors to think twice about making a breaking change, or to document an intended breaking change into the yaml file so we developers are able to trace down breaking changes a lot easier.

Current pyiceberg-0.6.0 exclusion list:

exclude:
- pyiceberg.table.create_mapping_from_schema # Removed import (Alias) #441
- pyiceberg.table.Table.next_sequence_number # Table -> TableMetadata #471
- pyiceberg.table.Table.new_snapshot_id # Table -> TableMetadata #471
- pyiceberg.table.StaticTable.next_sequence_number # Table -> TableMetadata #471
- pyiceberg.table.StaticTable.new_snapshot_id # Table -> TableMetadata #471
- pyiceberg.table.Transaction.add_snapshot # removed in favor of directly using TableUpdate and TableRequirement #471
- pyiceberg.table.Transaction.set_ref_snapshot # removed in favor of directly using TableUpdate and TableRequirement #471
- pyiceberg.catalog.rest.TokenResponse.* # nullability change #466
- pyiceberg.io.pyarrow.write_file # Parameter change #471
- pyiceberg.cli.console.run # Positional parameter change #472

@sungwy sungwy requested a review from Fokko March 1, 2024 13:26
@sungwy
Copy link
Collaborator Author

sungwy commented Mar 26, 2024

@jaychia 'suggestion -> reorganize modules to the top level

@sungwy
Copy link
Collaborator Author

sungwy commented Apr 30, 2024

@Fokko @HonahX @jaychia - should we remove this from the 0.7.0 milestone? I think we'd benefit from a prolonged discussion and attempt in organizing our API. This feels like an item we just need to complete before the 1.0.0 release

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants