Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

#1622: Fix issues with nested context manager calls #1626

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

richardt-engineb
Copy link
Contributor

Summary

Problem statement

The Resource class is also a Context Manager. That is, it implements the __enter()__ and __exit()__ methods to allow the use of with Resource(...) statements.

Prior to this PR, there was no limit on nesting with statements on the same Resource, but this caused problems because while the second __enter()__ allowed the Resource to already be open, the first __exit()__ would close() the Resource while the higher level context would expect it to still be open.

This would cause errors like "ValueError: I/O operation on closed file", or the iterator would appear to start from part way through a file rather than at the start of the file, and other similar behaviour depending on the exact locations of the nested functions.

This was made more complex because these with statements were often far removed from each other in the code, hidden behind iterators driven by generators, etc. They also could have different behaviour depending on number of rows read, the type of Resource (local file vs inline, etc.), the different steps in a pipeline, etc. etc.

All this meant that the problem was rare, hard to reduce down to an obvious reproduction case, and not realistic to expect developers to understand while developing new functionality.

Solution

This PR prevents nested contexts being created by throwing an exception when the second, nested, with is attempted. This means that code that risks these issues can be quickly identified and resolved during development.

The best way to resolve it is to use Resource.to_copy() to copy so that the nested with is acting on an independent view of the same Resource, which is likely what is intended in most cases anyway.

This PR also updates a number of the internal uses of with to work on a copy of the Resource they are passed so that they are independent of any external code and what it might have done with the Resource prior to the library methods being called.

Breaking Change

This is technically a breaking change as any external code that was developed using nested with statements - possibly deliberately, but more likely unknowingly not falling into the error cases - will have to be updated to use to_copy() or similar.

However, the library functions have all been updated in a way that doesn't change their signature or their expected behaviour as documented by the unit tests. All pre-existing unit tests pass with no changes, and added unit tests for the specific updated behaviour do not require any unusual constructs. It is still possible that some undocumented and untested side effect behaviours are different than before and any code relying on those may also be affected (e.g. to_petl() iterators are now independent rather than causing changes in each other)

So it is likely that very few actual impacts will occur in real world code, and the exception thrown does it's best to explain the issue and suggest resolutions.

Tests

  • All existing unit tests run and pass unchanged
  • New unit tests were added to cover the updated behaviour
    -- These unit tests were confirmed to fail without the updates in this PR (where appropriate).
    -- These unit tests now pass with the updated code.
  • The original script that identified the issue in frictionless transform unhandled exception on 200 line csv, but not 197 line csv #1622 was run and now gives the correct result (all rows appropriately converted and saved to file)

@roll
Copy link
Member

roll commented Jan 29, 2024

Thanks @richardt-engineb !

It's a complex change I will ask for a peer review in the chat

# Summary

## Problem statement

The `Resource` class is also a [Context Manager](https://docs.python.org/3/reference/datamodel.html#context-managers).  That is, it implements the
`__enter()__` and `__exit()__` methods to allow the use of
`with Resource(...)` statements.

Prior to this PR, there was no limit on nesting `with` statements on
the same `Resource`, but this caused problems because while the second
`__enter()__` allowed the `Resource` to already be open, the first
`__exit()__` would `close()` the Resource while the higher level context
would expect it to still be open.

This would cause errors like "ValueError: I/O operation on closed file",
or the iterator would appear to start from part way through a file rather
than at the start of the file, and other similar behaviour depending on
the exact locations of the nested functions.

This was made more complex because these `with` statements were often
far removed from each other in the code, hidden behind iterators driven
by generators, etc. They also could have different behaviour depending on
number of rows read, the type of Resource (local file vs inline, etc.),
the different steps in a pipeline, etc. etc.

All this meant that the problem was rare, hard to reduce down to an
obvious reproduction case, and not realistic to expect developers to
understand while developing new functionality.

## Solution

This PR prevents nested contexts being created by throwing an exception
when the second, nested, `with` is attempted.  This means that code that
risks these issues can be quickly identified and resolved during development.

The best way to resolve it is to use `Resource.to_copy()` to copy so that
the nested `with` is acting on an independent view of the same Resource,
which is likely what is intended in most cases anyway.

This PR also updates a number of the internal uses of `with` to work
on a copy of the Resource they are passed so that they are independent
of any external code and what it might have done with the Resource prior
to the library methods being called.

## Breaking Change

This is technically a breaking change as any external code that was
developed using nested `with` statements - possibly deliberately, but
more likely unknowingly not falling into the error cases - will have to
be updated to use `to_copy()` or similar.

However, the library functions have all been updated in a way that
doesn't change their signature or their expected behaviour as documented
by the unit tests. All pre-existing unit tests pass with no changes,
and added unit tests for the specific updated behaviour do not require
any unusual constructs.  It is still possible that some undocumented
and untested side effect behaviours are different than before and any
code relying on those may also be affected (e.g. `to_petl()` iterators are now independent rather than causing changes in each other)

So it is likely that very few actual impacts will occur in real world
code, and the exception thrown does it's best to explain the issue
and suggest resolutions.

# Tests

- All existing unit tests run and pass unchanged
- New unit tests were added to cover the updated behaviour
    - These unit tests were confirmed to fail without the updates
      in this PR (where appropriate).
    - These unit tests now pass with the updated code.
- The original script that identified the issue in frictionlessdata#1622 was run and
  now gives the correct result (all rows appropriately converted and
  saved to file)
# Summary

This fixes formatting issues identified by the linter.  No functional changes are included.

# Tests

- hatch run lint passes
- hatch run test passes (and coverage is sufficient)
# Summary

The CI tests identified some issues that don't show up on a normal test run.  This commit fixes those issues.

It also highlighted that there were numerous areas
that didn't have sufficient test coverage for the case
that the caller had already opened the resource.

The indexer has some notable changes, but the biggest
area affected is the parsers when writing
from an already opened source.  This commit adds
unit tests for the index and all the parser formats for this case,
and fixes the code to support the lack of nested contexts.

# Tests

- Setup the required databases for CI by copying the commands in the github actions
- Run `hatch run +py=3.11 ci:test` and ensure all tests pass and coverage remains sufficient
- Run `hatch run test` in case it is different and ensure all tests pass and coverage remains sufficient

This also means that all linting etc. has been run too.
Copy link
Member

@augusto-herrmann augusto-herrmann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have never used nested with statements with Resource objects, so I do not think this would directly affect our own existing code.

However, by looking the code on this PR I can notice that a lot of internal code of the Frictionless Framework was changed to use Resource.to_copy() when using context managers. Does copying a Resource object entail copying the data as well or just metadata? If this does indeed create an in-memory copy of the whole resource, including data, this is likely to increase memory usage, which might be an issue if the Resource includes a lot of data.

It would be nice to make a test comparing memory usage and running time before and after the change, especially with large Resources.

@richardt-engineb
Copy link
Contributor Author

Thanks for the commens @augusto-herrmann (and sorry about the delay in replying)

I have never used nested with statements with Resource objects, so I do not think this would directly affect our own existing code.

It's good to hear that it wouldn't affect your own use of Frictionless. The main cases that would affect would be where things in the library happen to be called in such a way that nested with statements happen. E.g. when using a pipeline with steps, if they re-uses the same resource object the whole way through (and not copies) you can end up with nested with statements (usually related to to_petl()-based generators etc.)

However, by looking the code on this PR I can notice that a lot of internal code of the Frictionless Framework was changed to use Resource.to_copy() when using context managers. Does copying a Resource object entail copying the data as well or just metadata? If this does indeed create an in-memory copy of the whole resource, including data, this is likely to increase memory usage, which might be an issue if the Resource includes a lot of data.

It would be nice to make a test comparing memory usage and running time before and after the change, especially with large Resources.

to_copy() is:

def to_copy(self, **options: Any) -> Self:
        """Create a copy from the resource"""
        return super().to_copy(
            data=self.data,
            basepath=self.basepath,
            detector=self.detector,
            package=self.package,
            **options,
        )

The main thing in there that could take up space is indeed the data which could noticeably increase the space used. However the mitigation would be that if using very large datasets it seems likely that data would come from loading a file which is only using generators/iterators rather than caching the whole file in memory. And conversely, anywhere that full in-memory data is being used is likely to be small enough that the memory increase is not a concern overall.

Are there any tests you would specifically like to see to increase your confidence, or use-cases where huge in-memory data sets are be used in ways that this would make harder?

Even in this change, someone with huge in-memory data-sets could use open and close explicitly to avoid copies. That wouldn't affect the internals however, so it does become a question of correctness vs memory use. I did consider doing reference counting instead of copying, however that still had the problem where the state of the generators/iterators was inconsistent depending on what else had been called (e.g. pipeline ordering, etc). So e.g. the tests/table/test_to_petl.py tests that I added do not pass without the to_copy() changes I made (and would not pass in a reference counting approach).

I also considered adding a SAFE_ITERATORS or similar flag that would either copy or not copy depending on value, but that also seemed like asking for trouble in the future as when the issues happen they are very subtle and hard to understand. So it would almost need to be a DANGEROUS_THINGS_WILL_EVENTUALLY_GO_WRONG flag or something!

As I said, thanks for the feedback, and happy to look at any additional tests etc. that you think would conclusively show a meaningful difference that would convince you either way.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

frictionless transform unhandled exception on 200 line csv, but not 197 line csv
3 participants