-
Notifications
You must be signed in to change notification settings - Fork 148
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
#1622: Fix issues with nested context manager calls #1626
base: main
Are you sure you want to change the base?
#1622: Fix issues with nested context manager calls #1626
Conversation
Thanks @richardt-engineb ! It's a complex change I will ask for a peer review in the chat |
# Summary ## Problem statement The `Resource` class is also a [Context Manager](https://docs.python.org/3/reference/datamodel.html#context-managers). That is, it implements the `__enter()__` and `__exit()__` methods to allow the use of `with Resource(...)` statements. Prior to this PR, there was no limit on nesting `with` statements on the same `Resource`, but this caused problems because while the second `__enter()__` allowed the `Resource` to already be open, the first `__exit()__` would `close()` the Resource while the higher level context would expect it to still be open. This would cause errors like "ValueError: I/O operation on closed file", or the iterator would appear to start from part way through a file rather than at the start of the file, and other similar behaviour depending on the exact locations of the nested functions. This was made more complex because these `with` statements were often far removed from each other in the code, hidden behind iterators driven by generators, etc. They also could have different behaviour depending on number of rows read, the type of Resource (local file vs inline, etc.), the different steps in a pipeline, etc. etc. All this meant that the problem was rare, hard to reduce down to an obvious reproduction case, and not realistic to expect developers to understand while developing new functionality. ## Solution This PR prevents nested contexts being created by throwing an exception when the second, nested, `with` is attempted. This means that code that risks these issues can be quickly identified and resolved during development. The best way to resolve it is to use `Resource.to_copy()` to copy so that the nested `with` is acting on an independent view of the same Resource, which is likely what is intended in most cases anyway. This PR also updates a number of the internal uses of `with` to work on a copy of the Resource they are passed so that they are independent of any external code and what it might have done with the Resource prior to the library methods being called. ## Breaking Change This is technically a breaking change as any external code that was developed using nested `with` statements - possibly deliberately, but more likely unknowingly not falling into the error cases - will have to be updated to use `to_copy()` or similar. However, the library functions have all been updated in a way that doesn't change their signature or their expected behaviour as documented by the unit tests. All pre-existing unit tests pass with no changes, and added unit tests for the specific updated behaviour do not require any unusual constructs. It is still possible that some undocumented and untested side effect behaviours are different than before and any code relying on those may also be affected (e.g. `to_petl()` iterators are now independent rather than causing changes in each other) So it is likely that very few actual impacts will occur in real world code, and the exception thrown does it's best to explain the issue and suggest resolutions. # Tests - All existing unit tests run and pass unchanged - New unit tests were added to cover the updated behaviour - These unit tests were confirmed to fail without the updates in this PR (where appropriate). - These unit tests now pass with the updated code. - The original script that identified the issue in frictionlessdata#1622 was run and now gives the correct result (all rows appropriately converted and saved to file)
# Summary This fixes formatting issues identified by the linter. No functional changes are included. # Tests - hatch run lint passes - hatch run test passes (and coverage is sufficient)
# Summary The CI tests identified some issues that don't show up on a normal test run. This commit fixes those issues. It also highlighted that there were numerous areas that didn't have sufficient test coverage for the case that the caller had already opened the resource. The indexer has some notable changes, but the biggest area affected is the parsers when writing from an already opened source. This commit adds unit tests for the index and all the parser formats for this case, and fixes the code to support the lack of nested contexts. # Tests - Setup the required databases for CI by copying the commands in the github actions - Run `hatch run +py=3.11 ci:test` and ensure all tests pass and coverage remains sufficient - Run `hatch run test` in case it is different and ensure all tests pass and coverage remains sufficient This also means that all linting etc. has been run too.
998e6d3
to
2d01036
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have never used nested with
statements with Resource
objects, so I do not think this would directly affect our own existing code.
However, by looking the code on this PR I can notice that a lot of internal code of the Frictionless Framework was changed to use Resource.to_copy()
when using context managers. Does copying a Resource object entail copying the data as well or just metadata? If this does indeed create an in-memory copy of the whole resource, including data, this is likely to increase memory usage, which might be an issue if the Resource includes a lot of data.
It would be nice to make a test comparing memory usage and running time before and after the change, especially with large Resources.
Thanks for the commens @augusto-herrmann (and sorry about the delay in replying)
It's good to hear that it wouldn't affect your own use of Frictionless. The main cases that would affect would be where things in the library happen to be called in such a way that nested
def to_copy(self, **options: Any) -> Self:
"""Create a copy from the resource"""
return super().to_copy(
data=self.data,
basepath=self.basepath,
detector=self.detector,
package=self.package,
**options,
) The main thing in there that could take up space is indeed the Are there any tests you would specifically like to see to increase your confidence, or use-cases where huge in-memory data sets are be used in ways that this would make harder? Even in this change, someone with huge in-memory data-sets could use I also considered adding a As I said, thanks for the feedback, and happy to look at any additional tests etc. that you think would conclusively show a meaningful difference that would convince you either way. |
Summary
Problem statement
The
Resource
class is also a Context Manager. That is, it implements the__enter()__
and__exit()__
methods to allow the use ofwith Resource(...)
statements.Prior to this PR, there was no limit on nesting
with
statements on the sameResource
, but this caused problems because while the second__enter()__
allowed theResource
to already be open, the first__exit()__
wouldclose()
the Resource while the higher level context would expect it to still be open.This would cause errors like "ValueError: I/O operation on closed file", or the iterator would appear to start from part way through a file rather than at the start of the file, and other similar behaviour depending on the exact locations of the nested functions.
This was made more complex because these
with
statements were often far removed from each other in the code, hidden behind iterators driven by generators, etc. They also could have different behaviour depending on number of rows read, the type of Resource (local file vs inline, etc.), the different steps in a pipeline, etc. etc.All this meant that the problem was rare, hard to reduce down to an obvious reproduction case, and not realistic to expect developers to understand while developing new functionality.
Solution
This PR prevents nested contexts being created by throwing an exception when the second, nested,
with
is attempted. This means that code that risks these issues can be quickly identified and resolved during development.The best way to resolve it is to use
Resource.to_copy()
to copy so that the nestedwith
is acting on an independent view of the same Resource, which is likely what is intended in most cases anyway.This PR also updates a number of the internal uses of
with
to work on a copy of the Resource they are passed so that they are independent of any external code and what it might have done with the Resource prior to the library methods being called.Breaking Change
This is technically a breaking change as any external code that was developed using nested
with
statements - possibly deliberately, but more likely unknowingly not falling into the error cases - will have to be updated to useto_copy()
or similar.However, the library functions have all been updated in a way that doesn't change their signature or their expected behaviour as documented by the unit tests. All pre-existing unit tests pass with no changes, and added unit tests for the specific updated behaviour do not require any unusual constructs. It is still possible that some undocumented and untested side effect behaviours are different than before and any code relying on those may also be affected (e.g.
to_petl()
iterators are now independent rather than causing changes in each other)So it is likely that very few actual impacts will occur in real world code, and the exception thrown does it's best to explain the issue and suggest resolutions.
Tests
-- These unit tests were confirmed to fail without the updates in this PR (where appropriate).
-- These unit tests now pass with the updated code.