
Adding DOWNLOAD_MAXSIZE and DOWNLOAD_WARNSIZE #227

Merged: 22 commits merged into main on Nov 12, 2024
Conversation

@PyExplorer (Collaborator) commented Oct 18, 2024

This is the initial version, without the test fixes. The docs are still being updated.

.. setting:: DOWNLOAD_MAXSIZE

DOWNLOAD_MAXSIZE
================

Member:

I don't think we need to copy the documentation here; we should just mention that these two standard Scrapy settings are supported.

Comment on lines 207 to 211
expected_size = None
for header in api_response.get("httpResponseHeaders"):
if header["name"].lower() == "content-length":
expected_size = int(header["value"])
break

Member:

Why is computing the expected size needed? We already have the httpResponseBody, and can check its real size.

PyExplorer (Collaborator, author):

The idea was to check it first (as this is faster) and, if content-length exceeds the limit, return without checking the length of the real body.

kmike (Member), Oct 18, 2024:

If you don't decode from base64 but use the 0.75 approximation, then using content-length will not be any faster - it'd be slower, and also less reliable, as content-length might lie.

PyExplorer (Collaborator, author):

Do you think we should remove this content-length check entirely?
Actually, I added it because we are supposed to check both compressed and decompressed data (this is also mentioned for DOWNLOAD_MAXSIZE in Scrapy), and the only way I found to check the compressed size was to check content-length.

Member:

Yes, drop it. In Scrapy it's different, because by checking content-length Scrapy can prevent the download before it happens.

For decompression, there is also special support in Scrapy; it's unrelated to content-length. Scrapy decompresses in chunks and keeps track of the total size of the decompressed data. If the size grows over the limit, an exception is raised and decompression is stopped. See https://github.com/scrapy/scrapy/blob/6d65708cb7f7b49b72fc17486fecfc1caf62e0af/scrapy/utils/_compression.py#L53. This also looks like something we can't apply here.
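
For readers following along, here is a minimal sketch of the chunked-decompression pattern described above, assuming raw zlib/deflate input and a caller-supplied max_size. It only illustrates the idea behind Scrapy's scrapy/utils/_compression.py; it is not that code, nor anything merged in this PR.

```python
import zlib


class DecompressionSizeExceeded(Exception):
    """Raised when decompressed data grows over the allowed limit."""


def safe_decompress(data: bytes, max_size: int, chunk_size: int = 64 * 1024) -> bytes:
    # Decompress incrementally and track the running decompressed size,
    # so oversized payloads are rejected before they are fully inflated.
    decompressor = zlib.decompressobj()
    output = bytearray()
    to_process = data
    while to_process:
        output.extend(decompressor.decompress(to_process, chunk_size))
        if len(output) > max_size:
            raise DecompressionSizeExceeded(
                f"decompressed data exceeded {max_size} bytes"
            )
        # Input not consumed yet (because of the chunk_size cap) is kept here.
        to_process = decompressor.unconsumed_tail
    output.extend(decompressor.flush())
    if len(output) > max_size:
        raise DecompressionSizeExceeded(f"decompressed data exceeded {max_size} bytes")
    return bytes(output)
```

As noted in the comment above, this only works when you control the decompression step, which is not the case for scrapy-zyte-api, where the API already returns decompressed (base64-encoded) data.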

PyExplorer (Collaborator, author):

Got it, thanks!

(maxsize and expected_size < maxsize)
and (warnsize and expected_size < warnsize)
):
expected_size = len(b64decode(api_response.get("httpResponseBody", b"")))

Member:

Is there a way to get the size of base64 data without decoding it? Decoding can be costly.

Contributor:

*.75

Contributor:

(assuming no linebreaks or ignoring them)

Member:

It looks fine not to be byte-precise.
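
For illustration, the 0.75 approximation suggested above can be applied to the base64 string length directly, avoiding the decode entirely. The helper name below is made up; the estimate ignores padding and line breaks, which is fine since byte precision is not required here.

```python
def approx_decoded_size(b64_body: str) -> int:
    # Every 4 base64 characters encode 3 bytes of payload, so the decoded
    # size is roughly len * 0.75; padding ("=") and any line breaks make
    # this a slight overestimate, which errs on the safe side for a limit.
    return int(len(b64_body) * 0.75)
```

For example, a 10 MB httpResponseBody arrives as roughly 13.3 MB of base64 text, and the estimate recovers about 10 MB without allocating the decoded copy.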

warnsize = request.meta.get("download_warnsize", default_warnsize)

if "browserHtml" in api_response:
expected_size = len(api_response["browserHtml"].encode(_DEFAULT_ENCODING))

Member:

Here, while trying to limit memory usage with DOWNLOAD_MAXSIZE, we might create an additional memory spike, because we temporarily create another copy of browserHtml in memory.

It seems we need to ensure _response_max_size_exceeded never makes copies of large data from the response.

Member:

I think you can either use sys.getsizeof (and maybe subtract the fixed overhead Python Unicode objects have), or consider the length of the unicode object instead of the length of the encoded data; that could be good enough as well (though less accurate). Maybe there is some other solution.
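
Two possible ways to approximate the browserHtml size without encoding a temporary UTF-8 copy, as suggested above; both are sketches with different accuracy trade-offs, and neither is presented as the code that was merged.

```python
import sys


def html_size_getsizeof(html: str) -> int:
    # Memory footprint of the str object minus the fixed per-object
    # overhead of an empty string; approximates the stored character data.
    return sys.getsizeof(html) - sys.getsizeof("")


def html_size_len(html: str) -> int:
    # Number of code points; a lower bound on the UTF-8 encoded size
    # (each code point encodes to at least one byte), so it undercounts
    # for non-ASCII pages but never copies the data.
    return len(html)
```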

PyExplorer (Collaborator, author), Oct 18, 2024:

We also calculate the size of the response body here:
https://github.com/scrapy-plugins/scrapy-zyte-api/blob/main/scrapy_zyte_api/responses.py#L114
and here:
https://github.com/scrapy-plugins/scrapy-zyte-api/blob/main/scrapy_zyte_api/responses.py#L145
What do you think about moving the check into these two functions and checking separately, returning None if the size is too big? In that case we would only need the additional content-length check in ZyteAPIResponse.

PyExplorer (Collaborator, author), Oct 18, 2024:

Another approach is to calculate the size and the decoded/encoded version of the response body here, before calling from_api_response, and pass the prepared body to from_api_response. In this case we make this expensive calculation only once and can reuse the calculated body here too: https://github.com/scrapy-plugins/scrapy-zyte-api/blob/main/scrapy_zyte_api/responses.py#L197.

kmike (Member), Oct 18, 2024:

I was also thinking about moving it down the stack - checking the size of the API response received by the client library, before the JSON decoding.

But that could make the library less compatible with Scrapy. Let's say you have an existing spider which uses some download limit. You switch to scrapy-zyte-api for downloads, and maybe also enable data extraction. The API response is then larger than the raw response size, so the limit becomes more aggressive, and you might drop some pages which were working before.

Because of this, the approach you've taken originally - checking httpResponseBody size and browserHtml size, ignoring everything else (e.g. structured data sizes or screenshot sizes) - makes sense to me.

PyExplorer (Collaborator, author), Oct 18, 2024:

@kmike a new version is here (in this PR).
Now the number of encoding/decoding operations is the same as before; no additional calculations except for getting the length.
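
To make the approach discussed in this thread concrete, here is a rough sketch of a size check that looks only at httpResponseBody (estimated from the base64 length) and browserHtml (string length), warns above warnsize, and rejects above maxsize. The function names, logging, and wiring are illustrative assumptions, not the merged implementation.

```python
import logging

logger = logging.getLogger(__name__)


def _body_size(api_response: dict) -> int:
    # Only body-like fields count toward the limit; structured data,
    # screenshots, etc. are deliberately ignored (see the discussion above).
    if "browserHtml" in api_response:
        return len(api_response["browserHtml"])
    if "httpResponseBody" in api_response:
        # Estimate the decoded size from the base64 length without decoding.
        return int(len(api_response["httpResponseBody"]) * 0.75)
    return 0


def _max_size_exceeded(api_response: dict, url: str, maxsize: int, warnsize: int) -> bool:
    size = _body_size(api_response)
    if warnsize and size > warnsize:
        logger.warning("Response for %s is larger than warnsize (%d > %d).", url, size, warnsize)
    if maxsize and size > maxsize:
        logger.error("Dropping response for %s: body exceeds maxsize (%d > %d).", url, size, maxsize)
        return True
    return False
```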

@kmike (Member) left a comment:

Looks good overall.

Review comments on scrapy_zyte_api/handler.py and scrapy_zyte_api/responses.py (outdated, resolved).

PyExplorer (Collaborator, author):

@kmike please take a look - now the check is in handler_download_request.

codecov bot commented Nov 11, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 97.87%. Comparing base (1269ebb) to head (b78cccd).
Report is 1 commit behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #227      +/-   ##
==========================================
+ Coverage   97.85%   97.87%   +0.01%     
==========================================
  Files          14       14              
  Lines        1585     1597      +12     
  Branches      293      296       +3     
==========================================
+ Hits         1551     1563      +12     
  Misses         14       14              
  Partials       20       20              
Files with missing lines Coverage Δ
scrapy_zyte_api/handler.py 95.13% <100.00%> (+0.33%) ⬆️

@PyExplorer PyExplorer requested a review from kmike November 11, 2024 16:43
Further review comments on scrapy_zyte_api/handler.py (outdated, resolved).
Co-authored-by: Mikhail Korobov <[email protected]>

PyExplorer (Collaborator, author):

@kmike @Gallaecio the PR is ready to be merged. Is it ok to squash+merge it?

@kmike merged commit 0d18fb0 into main on Nov 12, 2024
19 checks passed