PI: Don't load entire file into memory when passed file name #2520
Codecov Report
All modified and coverable lines are covered by tests ✅
Additional details and impacted files
@@ Coverage Diff @@
## main #2520 +/- ##
=======================================
Coverage 94.97% 94.97%
=======================================
Files 50 50
Lines 8331 8340 +9
Branches 1669 1669
=======================================
+ Hits 7912 7921 +9
Misses 260 260
Partials 159 159
Thanks for the PR. Are the stats correct? They show twice the memory after the change, which would indicate that this is not actually a performance improvement. Could you also please have a look at the failing tests? Your changes lead to new test parallelization issues on Windows, as each file can be open only once at any point in time.
Sorry, I got the before & after mixed up. Fixed.
Yeah, I can do that. I'll have a bit more difficulty fixing the Windows tests since I don't have a Windows box to test on easily, but I'll figure something out.
AFAIK the concurrent access issues will only occur on Windows, but I cannot really state how much this would indeed affect real use-cases. I am not really sure about the fixed tests either - explicitly calling |
pypdf/_reader.py
@@ -314,6 +314,7 @@ def __init__(

        if isinstance(stream, (str, Path)):
            stream = open(stream, "rb")  # noqa: SIM115
            # Wish I could just close stream in __del__ but that fails a test very strangely
Just out of curiosity: Do you have some details about the failure?
Yeah, I'm not sure how much of this was relevant to put in the commit, but: adding a self.stream.close() in a __del__ function does work most of the time.
The one test failure I was seeing was in tests/test_reader.py. The failing test was test_get_page_of_encrypted_file, but interestingly it would pass on its own. I narrowed the source of the issue down to the exception block of the previous test, test_issue297, where the PdfReader() initializer was failing (that's what the test is testing for) and the __del__ block wasn't being called because the exception happened in __init__.
It's very possible that the objects would be GC'd at some point, but the test failures were happening due to dangling file pointers at the following test.
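One way to sidestep the dangling handle described here, offered as a rough sketch rather than what this PR does: close the stream inside __init__ when parsing fails, so cleanup does not depend on when the half-built object is collected. SketchReader, _parse and write_temp_file are hypothetical names for illustration, not pypdf API.

```python
import os
import tempfile


class SketchReader:
    """Hypothetical stand-in for a reader that opens the file itself."""

    def __init__(self, path: str) -> None:
        self.stream = open(path, "rb")  # noqa: SIM115
        try:
            self._parse()
        except Exception:
            # Release the handle right away instead of waiting for __del__ or the GC.
            self.stream.close()
            raise

    def _parse(self) -> None:
        # Placeholder check only; a real parser does far more than this.
        if not self.stream.read(5).startswith(b"%PDF-"):
            raise ValueError("not a PDF")


def write_temp_file(data: bytes) -> str:
    """Write data to a temporary file and return its path."""
    fd, path = tempfile.mkstemp(suffix=".pdf")
    with os.fdopen(fd, "wb") as handle:
        handle.write(data)
    return path


# Usage: the failed initializer leaves no open handle behind,
# so the file can be removed immediately afterwards.
broken = write_temp_file(b"garbage")
try:
    SketchReader(broken)
except ValueError:
    pass
os.unlink(broken)
```

With this shape, a test that expects the initializer to fail would not leave an open file for the next test to trip over, at least in this simplified model.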
I'm going to add this to the commit
1a4b1af to a0415db
It's even worse than that, unfortunately! I'm not sure what the reference chain is from «Writer» -> «» If it's any consolation, the test failures are kind of an edge case where:
Sorry for jumping the gun on calling the tests solved! Still iterating on them.
I could potentially add a Making it a context manager might work too and would mirror |
5c25bc8 to 0786520
I don't want this merged as it currently is; calling garbage collection manually in tests feels yucky.
when you call |
See py-pdf#2520. Basically, this was the last failing test (Windows only): if the PdfReaders implicitly open file streams that don't get closed until they are garbage collected, the .unlink() calls produce file lock errors.
b105b76 to 5209fcd
I should also add using |
    def __deepcopy__(self, memo: Any) -> "IndirectObject":
        return IndirectObject(self.idnum, self.generation, self.pdf)

I'm not so fond of removing deepcopy: some people may use it, so this could be considered a regression. If we really want to remove it, we should go through the deprecation process.
@pubpub-zz deprecating doesn't really make sense, because with this change no objects will ever be deep-copyable: they will always hold a reference to a file pointer, which can't be pickled.
The only reason deep copies work now is that the entire source PDF bytestring gets copied over with them, and that only happens when a filename is passed. Deep copying has never worked with a passed file pointer.
The only way deprecation would work is if you deprecated it ahead of this PR and then merged these changes at a later date.
If you leave __deepcopy__ is there an error?
If I leave __deepcopy__ in place, along with the tests that cover it, there is an error, yes.
TypeError: cannot pickle '_io.FileIO' object
TypeError: cannot pickle '_io.FileIO' object
res = hook_impl.function(*args)
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/msirabella/fork/pypdf/venv/lib/python3.11/site-packages/_pytest/python.py", line 195, in pytest_pyfunc_call
result = testfunction(**testargs)
^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/msirabella/fork/pypdf/tests/test_page.py", line 168, in test_transformation_equivalence
page_box1 = deepcopy(page_box)
^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.11/copy.py", line 172, in deepcopy
y = _reconstruct(x, memo, *rv)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.11/copy.py", line 271, in _reconstruct
state = deepcopy(state, memo)
^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.11/copy.py", line 146, in deepcopy
y = copier(x, memo)
^^^^^^^^^^^^^^^
File "/usr/lib/python3.11/copy.py", line 231, in _deepcopy_dict
y[deepcopy(key, memo)] = deepcopy(value, memo)
^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.11/copy.py", line 172, in deepcopy
y = _reconstruct(x, memo, *rv)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.11/copy.py", line 271, in _reconstruct
state = deepcopy(state, memo)
^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.11/copy.py", line 146, in deepcopy
y = copier(x, memo)
^^^^^^^^^^^^^^^
File "/usr/lib/python3.11/copy.py", line 231, in _deepcopy_dict
y[deepcopy(key, memo)] = deepcopy(value, memo)
^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.11/copy.py", line 161, in deepcopy
rv = reductor(4)
^^^^^^^^^^^
TypeError: cannot pickle '_io.FileIO' object
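The failure above can be reproduced outside pypdf. The following is a minimal sketch, assuming only the standard library: any object that keeps an open file handle as an attribute cannot be deep-copied by the default machinery. HolderSketch and deepcopy_fails are hypothetical names for illustration.

```python
import copy


class HolderSketch:
    """Hypothetical object that keeps an open file handle as an attribute."""

    def __init__(self, path: str) -> None:
        self.stream = open(path, "rb")  # noqa: SIM115


def deepcopy_fails(path: str) -> bool:
    """Return True if deep-copying a handle-holding object raises TypeError."""
    holder = HolderSketch(path)
    try:
        # deepcopy walks holder.__dict__ and tries to reduce the file object.
        copy.deepcopy(holder)
    except TypeError:
        return True
    finally:
        holder.stream.close()
    return False
```

Calling deepcopy_fails on any readable file is expected to return True, since open file objects refuse to be reduced for copying or pickling.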
Have you been able to make progress on your proposal?
Hi, sorry, I've been taking a break from things due to mental health but plan to be back on them sometime later next month. Moving this back to draft for now.
See py-pdf#2520. Basically, this was the last failing test (Windows only): if the PdfReaders implicitly open file streams that don't get closed until they are garbage collected, the .unlink() calls produce file lock errors.
This breaks if PdfReader contains any unpicklable attributes (such as file pointers).
It was only ever being used unintentionally in the tests and doesn't really make sense. Use .clone() instead.
This halves allocated memory when doing a simple PdfWriter(clone_from=«str»). I can't just close self.stream in `__del__` because, for some strange reason, the unit tests mark it as unflagged even after the test block ends. It seems to be something about `__del__` finalizers being run on a second pass while `weakref.finalize()` callbacks are run on the first pass.
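For readers unfamiliar with `weakref.finalize()`, here is a minimal sketch of the general technique, assuming a simplified reader that always opens the file itself. FinalizingReader is a hypothetical name and this is not the code from the commit.

```python
import weakref


class FinalizingReader:
    """Hypothetical reader whose file handle is closed by a finalizer."""

    def __init__(self, path: str) -> None:
        self.stream = open(path, "rb")  # noqa: SIM115
        # The callback is bound to the stream, not to self. A callback that
        # referenced self would keep the reader alive and never fire.
        self._finalizer = weakref.finalize(self, self.stream.close)

    def close(self) -> None:
        # A finalizer runs its callback at most once, so calling close()
        # repeatedly, or before garbage collection, is safe.
        self._finalizer()
```

The finalizer fires when the reader is garbage collected, or at interpreter exit if it is still alive then, and close() lets callers release the handle earlier.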
To mirror PdfWriter, also hints towards file pointer management now that we keep files open sometimes.
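A context manager along these lines could look roughly like the sketch below. It assumes the reader should close only the streams it opened itself and leave caller-supplied streams alone; ManagedReader and _owns_stream are hypothetical names, not the implementation in this PR.

```python
from pathlib import Path
from types import TracebackType
from typing import IO, Optional, Type, Union


class ManagedReader:
    """Hypothetical reader that closes only the files it opened itself."""

    def __init__(self, stream: Union[str, Path, IO[bytes]]) -> None:
        # Remember whether we opened the file, so close() knows what to do.
        self._owns_stream = isinstance(stream, (str, Path))
        if isinstance(stream, (str, Path)):
            stream = open(stream, "rb")  # noqa: SIM115
        self.stream = stream

    def close(self) -> None:
        # Streams passed in by the caller stay under the caller's control.
        if self._owns_stream:
            self.stream.close()

    def __enter__(self) -> "ManagedReader":
        return self

    def __exit__(
        self,
        exc_type: Optional[Type[BaseException]],
        exc: Optional[BaseException],
        tb: Optional[TracebackType],
    ) -> None:
        self.close()
```

Used as `with ManagedReader("file.pdf") as reader: ...`, the handle is released deterministically at the end of the block, which avoids relying on garbage collection before an unlink on Windows.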
This functionality was originally added back in ced2890
Reduces memory usage by size of loaded file.
Benchmark script
Before stats
After stats