
Adding DOWNLOAD_MAXSIZE and DOWNLOAD_WARNSIZE #227

Merged · 22 commits · Nov 12, 2024
Changes from 4 commits
36 changes: 36 additions & 0 deletions docs/reference/settings.rst
@@ -206,6 +206,42 @@ when :setting:`ZYTE_API_LOG_REQUESTS` is enabled, excluding object keys.
To disable truncation, set this to ``0``.


.. setting:: DOWNLOAD_MAXSIZE

DOWNLOAD_MAXSIZE
================
Member:
I don't think we need to copy the documentation here; we should just mention that these two standard Scrapy settings are supported.


Default: ``1073741824`` (1 GiB)

The maximum response body size (in bytes) allowed. Bigger responses are
aborted and ignored.

This applies both before and after compression. If decompressing a response
body would exceed this limit, decompression is aborted and the response is
ignored.

Use ``0`` to disable this limit.

This limit can be set per spider using the :attr:`download_maxsize` spider
attribute and per request using the :reqmeta:`download_maxsize` Request.meta
key.

.. setting:: DOWNLOAD_WARNSIZE

DOWNLOAD_WARNSIZE
=================

Default: ``33554432`` (32 MiB)

If the size of a response exceeds this value, before or after compression, a
warning will be logged about it.

Use ``0`` to disable this limit.

This limit can be set per spider using the :attr:`download_warnsize` spider
attribute and per request using the :reqmeta:`download_warnsize` Request.meta
key.
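For example, both limits can be overridden for a single request through ``Request.meta`` (a minimal sketch; the URL and spider name are placeholders):

.. code-block:: python

    import scrapy


    class BigPageSpider(scrapy.Spider):
        name = "big_page"

        def start_requests(self):
            yield scrapy.Request(
                "https://example.com/large-page",
                meta={
                    # Allow up to 64 MiB for this request only.
                    "download_maxsize": 64 * 1024 * 1024,
                    # Log a warning once the response exceeds 16 MiB.
                    "download_warnsize": 16 * 1024 * 1024,
                },
            )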

.. setting:: ZYTE_API_MAX_COOKIES

ZYTE_API_MAX_COOKIES
7 changes: 6 additions & 1 deletion scrapy_zyte_api/handler.py
@@ -92,6 +92,9 @@ def __init__(
f"({self._truncate_limit}) is invalid. It must be 0 or a "
f"positive integer."
)
self._default_maxsize = settings.getint("DOWNLOAD_MAXSIZE")
self._default_warnsize = settings.getint("DOWNLOAD_WARNSIZE")

crawler.signals.connect(self.engine_started, signal=signals.engine_started)
self._crawler = crawler
self._fallback_handler = None
@@ -231,7 +234,9 @@ async def _download_request(
finally:
self._update_stats(api_params)

return _process_response(api_response, request, self._cookie_jars)
return _process_response(
api_response, request, self._cookie_jars, self._default_maxsize, self._default_warnsize
)

def _process_request_error(self, request, error):
detail = (error.parsed.data or {}).get("detail", error.message)
65 changes: 65 additions & 0 deletions scrapy_zyte_api/responses.py
@@ -1,3 +1,5 @@
import logging

from base64 import b64decode
from copy import copy
from datetime import datetime
@@ -15,6 +17,9 @@
_RESPONSE_HAS_PROTOCOL,
)

logger = logging.getLogger(__name__)


_DEFAULT_ENCODING = "utf-8"


@@ -166,10 +171,67 @@ def from_api_response(cls, api_response: Dict, *, request: Request = None):
_API_RESPONSE = Dict[str, _JSON]


def _check_response_size_limits(
expected_size: int,
warnsize: Optional[int],
maxsize: Optional[int],
request_url: str,
) -> bool:
if warnsize and expected_size > warnsize:
logger.warning(
f"Expected response size {expected_size} larger than "
f"download warn size {warnsize} in request {request_url}."
)

if maxsize and expected_size > maxsize:
logger.warning(
f"Cancelling download of {request_url}: expected response size "
f"{expected_size} larger than download max size {maxsize}."
)
return False
return True


def _response_max_size_exceeded(
api_response: _API_RESPONSE,
request: Request,
default_maxsize: Optional[int],
default_warnsize: Optional[int],
) -> bool:
maxsize = request.meta.get("download_maxsize", default_maxsize)
warnsize = request.meta.get("download_warnsize", default_warnsize)

if "browserHtml" in api_response:
expected_size = len(api_response["browserHtml"].encode(_DEFAULT_ENCODING))

Member:
Here while trying to limit the memory using DOWNLOAD_MAXSIZE we might create an additional memory spike, because we create another duplicate of browserHtml in memory, temporarily.

It seems we need to ensure _response_max_size_exceeded never makes copies of large data from the response.

Member:
I think you can either use sys.getsizeof (and maybe subtract the fixed overhead Python Unicode objects have), or consider the length of the unicode object instead of the length of the actual data; it could be good enough as well (though worse). Maybe there is some other solution.
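A rough sketch of the two estimates suggested above (``api_response`` is assumed to hold the decoded Zyte API JSON; neither option copies or re-encodes the HTML):

```python
import sys

html = api_response["browserHtml"]  # a Python str

# Option 1: memory footprint of the str object, minus the fixed overhead
# of an empty str, as an approximation of the data size.
approx_size = sys.getsizeof(html) - sys.getsizeof("")

# Option 2: character count; cheaper still, but undercounts multi-byte
# UTF-8 characters compared to len(html.encode("utf-8")).
approx_size = len(html)
```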

Collaborator Author (@PyExplorer), Oct 18, 2024:
We also calculate the size of the response body here
https://github.com/scrapy-plugins/scrapy-zyte-api/blob/main/scrapy_zyte_api/responses.py#L114
and here
https://github.com/scrapy-plugins/scrapy-zyte-api/blob/main/scrapy_zyte_api/responses.py#L145
What do you think if we just move the check into these two functions and check separately? Then return None if the size is too big.
In this case we would only additionally need to check content-length in ZyteAPIResponse.


Collaborator Author (@PyExplorer), Oct 18, 2024:

Another approach is to calculate the size and the decoded/encoded version of the response body here, before calling from_api_response, and pass the prepared body to from_api_response. In this case we make this expensive calculation only once and can also use the calculated body here: https://github.com/scrapy-plugins/scrapy-zyte-api/blob/main/scrapy_zyte_api/responses.py#L197.

Member (@kmike), Oct 18, 2024:

I was also thinking about moving it down the stack - check the size of the API response received by the client library before the json decoding.

But it could make the library less compatible with Scrapy. Let's say you have an existing spider, which uses some download limit. You switch to scrapy-zyte-api for downloads, and maybe also enable data extraction. But the API response is larger than that raw response size. So, the limit becomes more aggressive, and you might drop some pages which were working before.

Because of this, the approach you've taken originally - checking httpResponseBody size and browserHtml size, ignoring everything else (e.g. structured data sizes, or screenshot sizes) makes sense to me.


Collaborator Author (@PyExplorer), Oct 18, 2024:

@kmike a new version is here (in this PR).
Now the number of encoding/decoding operations is the same as before. No additional calculations except for getting the length.

elif api_response.get("httpResponseHeaders") and api_response.get("httpResponseBody"):
expected_size = None
for header in api_response.get("httpResponseHeaders"):
if header["name"].lower() == "content-length":
expected_size = int(header["value"])
break

Member:
Why is computing expected size needed? We already got the httpResponseBody, and can check its real size.

Collaborator Author:
The idea was to check it first (as this is faster), and if "content-length" exceeds the limit, return without checking the length of the real body.

Member (@kmike), Oct 18, 2024:
If you don't decode from base64, but use 0.75 approximation, then using content-length will not be any faster - it'd be slower, and also less reliable, as content-length might lie.

Collaborator Author:
Do you think we should remove this content-length check entirely?
Actually, I added it because we are supposed to check both compressed and decompressed data (this is also mentioned for DOWNLOAD_MAXSIZE in Scrapy), and the only way I found to check the compressed size was to check content-length.

Member:
Yes, drop it. In Scrapy it's different, because by checking content-length Scrapy can prevent the download before it happens.

For decompression, there is also special support in Scrapy; it's unrelated to content-length. Scrapy decompresses in chunks and keeps track of the total size of the decompressed data. If the size grows over the limit, an exception is raised and decompression is stopped. See https://github.com/scrapy/scrapy/blob/6d65708cb7f7b49b72fc17486fecfc1caf62e0af/scrapy/utils/_compression.py#L53. This also looks like something we can't apply here.
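For reference, a minimal sketch of that chunked pattern (assuming raw zlib input; Scrapy's real helper differs in detail and, as noted above, doesn't apply here because Zyte API already returns decompressed data):

```python
import zlib
from io import BytesIO

_CHUNK_SIZE = 16 * 1024  # decompress input in 16 KiB slices


class DecompressionMaxSizeExceeded(ValueError):
    pass


def decompress_with_limit(data: bytes, max_size: int = 0) -> bytes:
    """Decompress zlib data chunk by chunk, aborting as soon as the total
    decompressed size grows past max_size (0 disables the limit)."""
    decompressor = zlib.decompressobj()
    input_stream = BytesIO(data)
    output = bytearray()
    while True:
        input_chunk = input_stream.read(_CHUNK_SIZE)
        output.extend(decompressor.decompress(input_chunk))
        if max_size and len(output) > max_size:
            raise DecompressionMaxSizeExceeded(
                f"decompressed size exceeded the {max_size}-byte limit"
            )
        if not input_chunk:  # end of input reached
            output.extend(decompressor.flush())
            return bytes(output)
```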

Collaborator Author:
Got it, thanks!


if expected_size is None or (
(maxsize and expected_size < maxsize)
and (warnsize and expected_size < warnsize)
):
expected_size = len(b64decode(api_response.get("httpResponseBody", b"")))

Member:
Is there a way to get size of base64 data without decoding it? Decoding can be costly.

Contributor:
*.75

Contributor:
(assuming no linebreaks or ignoring them)

Member:
It looks fine not to be byte-precise.
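A minimal sketch of that estimate (assumes standard base64 with ``=`` padding; linebreaks, if present, are stripped first):

```python
def approx_base64_decoded_size(encoded: str) -> int:
    """Estimate the decoded size of base64 data without decoding it:
    every 4 base64 characters encode 3 bytes, minus the '=' padding."""
    stripped = encoded.replace("\n", "").replace("\r", "")
    return (len(stripped) * 3) // 4 - stripped.count("=")


# e.g. expected_size = approx_base64_decoded_size(api_response["httpResponseBody"])
```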

else:
return False

if expected_size is not None and not _check_response_size_limits(
expected_size, warnsize, maxsize, request.url
):
return True

return False


def _process_response(
api_response: _API_RESPONSE,
request: Request,
cookie_jars: Optional[Dict[Any, CookieJar]],
default_maxsize: Optional[int],
default_warnsize: Optional[int],
) -> Optional[Union[ZyteAPITextResponse, ZyteAPIResponse]]:
"""Given a Zyte API Response and the ``scrapy.Request`` that asked for it,
this returns either a ``ZyteAPITextResponse`` or ``ZyteAPIResponse`` depending
@@ -184,6 +246,9 @@

_process_cookies(api_response, request, cookie_jars)

if _response_max_size_exceeded(api_response, request, default_maxsize, default_warnsize):
return None

if api_response.get("browserHtml"):
# Using TextResponse because browserHtml always returns a browser-rendered page
# even when requesting files (like images)