Adding DOWNLOAD_MAXSIZE and DOWNLOAD_WARNSIZE #227
Conversation
docs/reference/settings.rst
Outdated
.. setting:: DOWNLOAD_MAXSIZE

DOWNLOAD_MAXSIZE
================
I don't think we need to copy the documentation here; we should just mention that these two standard Scrapy settings are supported.
scrapy_zyte_api/responses.py
Outdated
expected_size = None
for header in api_response.get("httpResponseHeaders"):
    if header["name"].lower() == "content-length":
        expected_size = int(header["value"])
        break
Why is computing expected size needed? We already got the httpResponseBody, and can check its real size.
The idea was to check it first (as this is faster) and, if "content-length" exceeds the limit, return without checking the length of the real body.
If you don't decode from base64, but use the 0.75 approximation, then using content-length will not be any faster; it'd be slower, and also less reliable, as content-length might lie.
Do you think we should remove this check for content-length entirely?
Actually, I added it because we're considering checking both compressed and decompressed data (in Scrapy this is also mentioned for DOWNLOAD_MAXSIZE), and the only way I found to check the compressed data was the content-length header.
Yes, drop it. In Scrapy it's different, because by checking content-length Scrapy can prevent the download before it happens.
For decompression there is also special support in Scrapy; it's unrelated to content-length. Scrapy decompresses in chunks and keeps track of the total size of the decompressed data. If the size grows over the limit, an exception is raised and decompression is stopped. See https://github.com/scrapy/scrapy/blob/6d65708cb7f7b49b72fc17486fecfc1caf62e0af/scrapy/utils/_compression.py#L53. This also looks like something we can't apply here.
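As a rough illustration of that chunked approach (just a sketch, not Scrapy's actual code from the link above; the function and exception names here are made up):

import zlib

class DecompressionMaxSizeExceeded(ValueError):
    """Raised when decompressed data grows beyond the configured limit."""

def decompress_with_limit(data: bytes, max_size: int, chunk_size: int = 65536) -> bytes:
    # Inflate zlib-compressed data in chunks, aborting as soon as the
    # accumulated output exceeds max_size (0 means no limit).
    decompressor = zlib.decompressobj()
    output = bytearray()
    to_decompress = data
    while to_decompress:
        # max_length caps how much output this call may produce; any input
        # that was not consumed yet is kept in decompressor.unconsumed_tail.
        chunk = decompressor.decompress(to_decompress, chunk_size)
        output.extend(chunk)
        if max_size and len(output) > max_size:
            raise DecompressionMaxSizeExceeded(
                f"Decompressed data exceeded the {max_size}-byte limit"
            )
        to_decompress = decompressor.unconsumed_tail
    return bytes(output)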
Got it, thanks!
scrapy_zyte_api/responses.py
Outdated
    (maxsize and expected_size < maxsize)
    and (warnsize and expected_size < warnsize)
):
    expected_size = len(b64decode(api_response.get("httpResponseBody", b"")))
Is there a way to get size of base64 data without decoding it? Decoding can be costly.
* 0.75, i.e. multiply the base64 length by 0.75.
(assuming no linebreaks or ignoring them)
It looks fine not to be byte-precise.
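For example, something like this (a sketch; the helper name is made up) gives the estimate without calling b64decode:

def estimated_decoded_size(base64_body: str) -> int:
    # Base64 encodes 3 bytes as 4 characters, so the decoded size is roughly
    # len(data) * 0.75; padding and possible line breaks make this slightly
    # off, which is acceptable since the check need not be byte-precise.
    return int(len(base64_body) * 0.75)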
scrapy_zyte_api/responses.py
Outdated
warnsize = request.meta.get("download_warnsize", default_warnsize)

if "browserHtml" in api_response:
    expected_size = len(api_response["browserHtml"].encode(_DEFAULT_ENCODING))
Here, while trying to limit memory usage with DOWNLOAD_MAXSIZE, we might create an additional memory spike, because we temporarily create another copy of browserHtml in memory.
It seems we need to ensure _response_max_size_exceeded never makes copies of large data from the response.
I think you can either use sys.getsizeof (and maybe subtract the fixed overhead Python Unicode objects have), or consider the length of the unicode object instead of the length of the actual data; that could be good enough as well (though worse). Maybe there is some other solution.
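For instance (a sketch under that assumption; the helper name is hypothetical):

import sys

def estimated_html_size(browser_html: str) -> int:
    # sys.getsizeof includes the fixed overhead of the str object itself;
    # subtracting the size of an empty string approximates the payload size
    # without building an encoded copy of the whole document.
    return sys.getsizeof(browser_html) - sys.getsizeof("")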
We also calculate the size of the response body here
https://github.com/scrapy-plugins/scrapy-zyte-api/blob/main/scrapy_zyte_api/responses.py#L114
and here
https://github.com/scrapy-plugins/scrapy-zyte-api/blob/main/scrapy_zyte_api/responses.py#L145
What do you think about moving the check into these two functions and checking there separately? Then return None if the size is too big.
In that case we would only need to additionally check content-length in ZyteAPIResponse.
Also, we already decode base64 in two places:
https://github.com/scrapy-plugins/scrapy-zyte-api/blob/main/scrapy_zyte_api/responses.py#L197
and
https://github.com/scrapy-plugins/scrapy-zyte-api/blob/main/scrapy_zyte_api/responses.py#L127
But I'm not sure whether this is an issue we need to fix in this PR.
Another approach is to calculate the size and the decoded/encoded version of the response body here, before calling from_api_response, and pass the prepared body to from_api_response. In that case we make this expensive calculation only once and can reuse the calculated body here too: https://github.com/scrapy-plugins/scrapy-zyte-api/blob/main/scrapy_zyte_api/responses.py#L197.
I was also thinking about moving it down the stack: check the size of the API response received by the client library, before the JSON decoding.
But it could make the library less compatible with Scrapy. Let's say you have an existing spider which uses some download limit. You switch to scrapy-zyte-api for downloads, and maybe also enable data extraction. The API response is then larger than the raw response size, so the limit becomes more aggressive, and you might drop some pages which were working before.
Because of this, the approach you've taken originally (checking the httpResponseBody size and the browserHtml size, ignoring everything else, e.g. structured data sizes or screenshot sizes) makes sense to me.
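Putting the ideas above together, the estimation could look roughly like this (a sketch based on the assumptions discussed in this thread; the function name is made up and this is not the code that ended up in the PR):

import sys
from typing import Optional

def estimated_response_size(api_response: dict) -> Optional[int]:
    # Only httpResponseBody and browserHtml are considered, so the limits
    # keep roughly the same meaning as in plain Scrapy; structured data,
    # screenshots and other fields are deliberately ignored.
    if api_response.get("httpResponseBody"):
        # httpResponseBody is base64-encoded; decoded size is ~0.75 of its length.
        return int(len(api_response["httpResponseBody"]) * 0.75)
    if api_response.get("browserHtml"):
        return sys.getsizeof(api_response["browserHtml"]) - sys.getsizeof("")
    return None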
Ok, let me prepare another version, implementing the check here
https://github.com/scrapy-plugins/scrapy-zyte-api/blob/main/scrapy_zyte_api/responses.py#L197
and here
https://github.com/scrapy-plugins/scrapy-zyte-api/blob/main/scrapy_zyte_api/responses.py#L127
@kmike a new version is here (in this PR).
Now the number of encoding/decoding operations is the same as before. No additional calculations except for getting the length.
Looks good overall.
@kmike please take a look - now the check is in handler_download_request.
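A rough outline of what such a handler-level check could look like (an illustrative sketch only; the function name and log messages below are assumptions, not the actual PR code):

import logging

logger = logging.getLogger(__name__)

def response_size_ok(url: str, size: int, maxsize: int, warnsize: int) -> bool:
    # Warn when the estimated size crosses warnsize; reject when it crosses
    # maxsize (a return value of False means the response should be dropped).
    if warnsize and size > warnsize:
        logger.warning(
            "Expected response size (%d) larger than download warn size (%d) for %s.",
            size, warnsize, url,
        )
    if maxsize and size > maxsize:
        logger.error(
            "Dropping response for %s: expected size (%d) larger than download max size (%d).",
            url, size, maxsize,
        )
        return False
    return True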
Codecov Report: all modified and coverable lines are covered by tests ✅

Additional details and impacted files:

@@            Coverage Diff             @@
##             main     #227      +/-   ##
==========================================
+ Coverage   97.85%   97.87%   +0.01%
==========================================
  Files          14       14
  Lines        1585     1597      +12
  Branches      293      296       +3
==========================================
+ Hits         1551     1563      +12
  Misses         14       14
  Partials      20       20
Co-authored-by: Mikhail Korobov <[email protected]>
@kmike @Gallaecio the PR is ready to be merged. Is it ok to squash+merge it?
The initial version, without fixing tests. The docs are still being fixed.