Adding DOWNLOAD_MAXSIZE and DOWNLOAD_WARNSIZE #227
@@ -1,3 +1,5 @@
import logging

from base64 import b64decode
from copy import copy
from datetime import datetime
@@ -15,6 +17,9 @@
    _RESPONSE_HAS_PROTOCOL,
)

logger = logging.getLogger(__name__)


_DEFAULT_ENCODING = "utf-8"
@@ -166,10 +171,67 @@ def from_api_response(cls, api_response: Dict, *, request: Request = None):
_API_RESPONSE = Dict[str, _JSON]


def _check_response_size_limits(
    expected_size: int,
    warnsize: Optional[int],
    maxsize: Optional[int],
    request_url: str,
) -> bool:
    if warnsize and expected_size > warnsize:
        logger.warning(
            f"Expected response size {expected_size} larger than "
            f"download warn size {warnsize} in request {request_url}."
        )

    if maxsize and expected_size > maxsize:
        logger.warning(
            f"Cancelling download of {request_url}: expected response size "
            f"{expected_size} larger than download max size {maxsize}."
        )
        return False
    return True


def _response_max_size_exceeded(
    api_response: _API_RESPONSE,
    request: Request,
    default_maxsize: Optional[int],
    default_warnsize: Optional[int],
) -> bool:
    maxsize = request.meta.get("download_maxsize", default_maxsize)
    warnsize = request.meta.get("download_warnsize", default_warnsize)

    if "browserHtml" in api_response:
        expected_size = len(api_response["browserHtml"].encode(_DEFAULT_ENCODING))
Review thread on this line:

- Here, while trying to limit memory usage with DOWNLOAD_MAXSIZE, we might create an additional memory spike, because we temporarily create another copy of browserHtml in memory. It seems we need to ensure _response_max_size_exceeded never makes copies of large data from the response.
- I think you can either use sys.getsizeof (and maybe subtract the fixed overhead Python unicode objects have), or consider the length of the unicode object instead of the length of the actual encoded data; that could be good enough as well (though less accurate). Maybe there is some other solution.
- We also calculate the size of the response body here.
- And we already calculate base64 in two places.
- Another approach is to calculate the size and the decoded/encoded version of the response body here, before calling …
- I was also thinking about moving it down the stack: check the size of the API response received by the client library before the JSON decoding. But that could make the library less compatible with Scrapy. Say you have an existing spider which uses some download limit. You switch to scrapy-zyte-api for downloads, and maybe also enable data extraction. The API response is then larger than the raw response size, so the limit becomes more aggressive, and you might drop some pages which were working before. Because of this, the approach you've taken originally (checking httpResponseBody size and browserHtml size, ignoring everything else, e.g. structured data sizes or screenshot sizes) makes sense to me.
- Ok, let me prepare another version implementing the check here: https://github.com/scrapy-plugins/scrapy-zyte-api/blob/main/scrapy_zyte_api/responses.py#L197
- @kmike a new version is here (in this PR).
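For illustration only, a minimal sketch of estimating the browserHtml size without building a second copy via str.encode(), along the lines suggested above (hypothetical helper, not part of this PR):

import sys

def _estimated_html_size(html: str) -> int:
    # sys.getsizeof() reports the in-memory size of the str object (minus the
    # fixed empty-string overhead here), while len() counts code points, which
    # is a lower bound on the UTF-8 byte size. Either avoids allocating a
    # temporary bytes duplicate of a potentially large HTML document.
    return max(sys.getsizeof(html) - sys.getsizeof(""), len(html))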
    elif api_response.get("httpResponseHeaders") and api_response.get("httpResponseBody"):
        expected_size = None
        for header in api_response.get("httpResponseHeaders"):
            if header["name"].lower() == "content-length":
                expected_size = int(header["value"])
                break
Review thread on this line:

- Why is computing the expected size needed? We already have the httpResponseBody and can check its real size.
- The idea was to check it first (as this is faster), and if "content-length" already exceeds the limit, return without checking the length of the real body.
- If you don't decode from base64 but use the 0.75 approximation, then using content-length will not be any faster; it would be slower, and also less reliable, as content-length might lie.
- Do you think we should remove this check for …
- Yes, drop it. In Scrapy it's different, because by checking content-length Scrapy can prevent the download before it happens. For decompression there is also special support in Scrapy; it's unrelated to content-length. Scrapy decompresses in chunks and keeps track of the total size of decompressed data. If the size grows over the limit, an exception is raised and decompression is stopped. See https://github.com/scrapy/scrapy/blob/6d65708cb7f7b49b72fc17486fecfc1caf62e0af/scrapy/utils/_compression.py#L53. This also looks like something we can't apply here.
- Got it, thanks!
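For context, a rough sketch of the chunked-decompression pattern described above (a hypothetical helper illustrating the idea, not Scrapy's actual code; assumes zlib/deflate input):

import zlib

class DecompressionMaxSizeExceeded(Exception):
    pass

def decompress_with_limit(data: bytes, max_size: int, chunk_size: int = 65536) -> bytes:
    # Decompress in bounded chunks, tracking the total decompressed size and
    # aborting as soon as it grows over max_size, so a small compressed
    # payload cannot expand into an unbounded amount of memory.
    decompressor = zlib.decompressobj()
    decompressed = b""
    to_process = data
    while to_process:
        chunk = decompressor.decompress(to_process, chunk_size)
        decompressed += chunk
        if len(decompressed) > max_size:
            raise DecompressionMaxSizeExceeded(
                f"Decompressed data exceeded {max_size} bytes"
            )
        to_process = decompressor.unconsumed_tail
    # flush() may emit a little more output; count it against the limit too.
    decompressed += decompressor.flush()
    if len(decompressed) > max_size:
        raise DecompressionMaxSizeExceeded(
            f"Decompressed data exceeded {max_size} bytes"
        )
    return decompressed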
        if expected_size is None or (
            (maxsize and expected_size < maxsize)
            and (warnsize and expected_size < warnsize)
        ):
            expected_size = len(b64decode(api_response.get("httpResponseBody", b"")))
Review thread on this line:

- Is there a way to get the size of base64 data without decoding it? Decoding can be costly.
- Multiply the encoded length by 0.75.
- (assuming no line breaks, or ignoring them)
- It looks fine not to be byte-precise.
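A small sketch of that approximation (illustrative helper name, assuming standard unwrapped base64):

def _approx_decoded_size(b64_data: str) -> int:
    # Standard base64 encodes every 3 input bytes as 4 output characters,
    # so the decoded size is roughly len * 3 / 4; subtracting the '='
    # padding makes it exact for base64 without line breaks.
    padding = b64_data[-2:].count("=")
    return (len(b64_data) * 3) // 4 - padding

Applied to the line above, this would avoid calling b64decode on the (potentially large) httpResponseBody just to measure it.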
    else:
        return False

    if expected_size is not None and not _check_response_size_limits(
        expected_size, warnsize, maxsize, request.url
    ):
        return True

    return False


def _process_response(
    api_response: _API_RESPONSE,
    request: Request,
    cookie_jars: Optional[Dict[Any, CookieJar]],
    default_maxsize: Optional[int],
    default_warnsize: Optional[int],
) -> Optional[Union[ZyteAPITextResponse, ZyteAPIResponse]]:
    """Given a Zyte API Response and the ``scrapy.Request`` that asked for it,
    this returns either a ``ZyteAPITextResponse`` or ``ZyteAPIResponse`` depending

@@ -184,6 +246,9 @@ def _process_response(

    _process_cookies(api_response, request, cookie_jars)

    if _response_max_size_exceeded(api_response, request, default_maxsize, default_warnsize):
        return None

    if api_response.get("browserHtml"):
        # Using TextResponse because browserHtml always returns a browser-rendered page
        # even when requesting files (like images)
Review comment on the documentation changes:

- I don't think we need to copy the documentation here; we should just mention that these two standard Scrapy settings are supported.
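As a usage illustration, a sketch assuming the settings behave like Scrapy's standard DOWNLOAD_MAXSIZE / DOWNLOAD_WARNSIZE and that the per-request meta keys read by _response_max_size_exceeded work as overrides (spider name and URL are placeholders):

import scrapy

class ExampleSpider(scrapy.Spider):
    name = "example"

    # Project-wide limits; same names and meaning as Scrapy's built-in settings.
    custom_settings = {
        "DOWNLOAD_MAXSIZE": 10 * 1024 * 1024,  # drop responses larger than ~10 MB
        "DOWNLOAD_WARNSIZE": 1 * 1024 * 1024,  # only warn above ~1 MB
    }

    def start_requests(self):
        # Per-request overrides via Request.meta, read above through
        # request.meta.get("download_maxsize") / ("download_warnsize").
        yield scrapy.Request(
            "https://example.com",
            meta={
                "download_maxsize": 2 * 1024 * 1024,
                "download_warnsize": 512 * 1024,
            },
        )

    def parse(self, response):
        pass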