Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Document how to handle action failures #157

Open
wants to merge 12 commits into
base: main
Choose a base branch
from
1 change: 1 addition & 0 deletions docs/index.rst
Original file line number Diff line number Diff line change
Expand Up @@ -28,6 +28,7 @@ either :ref:`globally <transparent>` or :ref:`per request <automap>`, or
usage/scrapy-poet
usage/session
usage/stats
usage/actions
usage/fingerprint
usage/proxy

Expand Down
39 changes: 39 additions & 0 deletions docs/reference/settings.rst
Original file line number Diff line number Diff line change
Expand Up @@ -6,6 +6,44 @@ Settings

:ref:`Settings <topics-settings>` for scrapy-zyte-api.

.. setting:: ZYTE_API_ACTION_ERROR_RETRY_ENABLED

ZYTE_API_ACTION_ERROR_RETRY_ENABLED
===================================

Default: ``True``

Enables retries of Zyte API requests if responses contain an action error.

Maximum retries and priority adjustment are handled with the same settings and
request metadata keywords as regular retries in Scrapy, see
:setting:`RETRY_TIMES <scrapy:RETRY_TIMES>` and
:setting:`RETRY_PRIORITY_ADJUST <scrapy:RETRY_PRIORITY_ADJUST>`.

See also :setting:`ZYTE_API_ACTION_ERROR_HANDLING`.


.. setting:: ZYTE_API_ACTION_ERROR_HANDLING

ZYTE_API_ACTION_ERROR_HANDLING
==============================

Default: ``"pass"``

Determines how to handle Zyte API responses that contain an action error:

- ``"pass"``: Responses are treated as valid responses, i.e. they will reach
your spider callback.

- ``"ignore"``: Responses are dropped, i.e. they will not reach your spider.

- ``"err"``: :class:`~scrapy_zyte_api.exceptions.ActionError` is raised, and
will reach your spider errback if you set one.

.. autoexception:: scrapy_zyte_api.exceptions.ActionError
:members:


.. setting:: ZYTE_API_AUTO_FIELD_STATS

ZYTE_API_AUTO_FIELD_STATS
Expand Down Expand Up @@ -40,6 +78,7 @@ If for any request a custom page object class is used to override some
.. note:: :func:`zyte_common_items.fields.is_auto_field` is used to determine
whether a field has been overridden or not.


.. setting:: ZYTE_API_AUTOMAP_PARAMS

ZYTE_API_AUTOMAP_PARAMS
Expand Down
47 changes: 47 additions & 0 deletions docs/usage/actions.rst
Original file line number Diff line number Diff line change
@@ -0,0 +1,47 @@
.. _actions:

======================
Handling action errors
======================
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a lot of code in this section; I wonder if we should make it a part of the library itself instead.

It seems like a general theme: if action failed, should the response be considered a success or a failure? The current default is success, but it seems that making it a failure is better (with an option to get back), as it allows to detect more issues.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

deedfff does this for the middleware. Do we want to do something about the cache policy, or is it OK to keep that documentation-only?


Even though Zyte API considers a response successful :ref:`even if a browser
action fails <zapi-successful-responses>`, scrapy-zyte-api retries such
responses by default. See :setting:`ZYTE_API_ACTION_ERROR_RETRY_ENABLED`.

You can also use :setting:`ZYTE_API_ACTION_ERROR_HANDLING` to determine how
such responses are handled when they are not retried or when retries are
exceeded: treated as a success (default), ignored, or treated as an error.

Action error caching
====================

If you use
:class:`~scrapy.downloadermiddlewares.httpcache.HttpCacheMiddleware`, you might
want to use a custom :setting:`HTTPCACHE_POLICY <scrapy:HTTPCACHE_POLICY>` to
prevent responses with failed actions (i.e. after exceeding retries) to be
cached:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any interaction with the ZYTE_API_ACTION_ERROR_HANDLING / ZYTE_API_ACTION_ERROR_RETRY_ENABLED settings? E.g. do you need this for all values option values? Or do some of these option values require caching to work?

Copy link
Contributor Author

@Gallaecio Gallaecio Dec 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess you only need this for the default handing, "pass", as I don’t imagine caching caches exceptions. But I will test this out manually.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don’t imagine caching caches exceptions.

I was wrong.

I tested all 6 scenarios, default policy/custom policy, pass/ignore/err, and pass/ignore/err make no difference, with the default policy responses are cached, with the custom policy they are not. Not sure if the current docs need any clarification, given the custom policy is relevant regardless of those settings.

Test details

I cloned https://github.com/zytedata/zyte-spider-templates-project and added 2 files:

# zyte_spider_templates_project/spiders/test.py
from scrapy import Request, Spider


class TestSpider(Spider):
    name = "test"

    custom_settings = {
        "HTTPCACHE_ENABLED": True,
    }

    def start_requests(self):
        yield Request(
            "https://toscrape.com",
            meta={
                "zyte_api_automap": {
                    "browserHtml": True,
                    "actions": [
                        {
                            "action": "waitForRequest",
                            "urlPattern": "unexising/pattern",
                            "timeout": 0,
                        },
                    ],
                },
            },
        )

    def parse(self, response):
        response_actions = getattr(response, "raw_api_response", {}).get("actions", None)
        self.logger.debug(response_actions)
# zyte_spider_templates_project/extensions.py
from scrapy import Request
from scrapy.extensions.httpcache import DummyPolicy
from scrapy.http import Response
from scrapy_zyte_api.responses import ZyteAPIResponse, ZyteAPITextResponse


class ZyteAPIFailedActionsPolicy(DummyPolicy):

    def should_cache_response(self, response: Response, request: Request):
        if isinstance(response, (ZyteAPIResponse, ZyteAPITextResponse)) and any(
            "error" in action for action in response.raw_api_response.get("actions", [])
        ):
            return False
        return super().should_cache_response(response, request)

And then tried all 6 scenarios.

Default cache policy, default action error handling ("pass"), responses are cached:

$ scrapy crawl test |& grep "\[test\] DEBUG"
2024-12-23 11:39:08 [test] DEBUG: [{'action': 'waitForRequest', 'elapsedTime': 0.102, 'status': 'returned', 'error': 'Timed out after waiting 100ms'}]
$ scrapy crawl test |& grep "\[test\] DEBUG"
2024-12-23 11:39:24 [test] DEBUG: None
$ rm -rf .scrapy/httpcache/

Default cache policy, "ignore" action error handling, responses are cached:

$ scrapy crawl test -s 'ZYTE_API_ACTION_ERROR_HANDLING=ignore' |& grep "\[test\] DEBUG"
2024-12-23 11:43:34 [test] DEBUG: [{'action': 'waitForRequest', 'elapsedTime': 0.101, 'status': 'returned', 'error': 'Timed out after waiting 100ms'}]
$ scrapy crawl test -s 'ZYTE_API_ACTION_ERROR_HANDLING=ignore' |& grep "\[test\] DEBUG"
2024-12-23 11:43:42 [test] DEBUG: None
$ rm -rf .scrapy/httpcache/

Default cache policy, "err" action error handling, responses are cached:

$ scrapy crawl test -s 'ZYTE_API_ACTION_ERROR_HANDLING=err' |& grep "\[test\] DEBUG"
2024-12-23 11:44:41 [test] DEBUG: [{'action': 'waitForRequest', 'elapsedTime': 0.101, 'status': 'returned', 'error': 'Timed out after waiting 100ms'}]
$ scrapy crawl test -s 'ZYTE_API_ACTION_ERROR_HANDLING=err' |& grep "\[test\] DEBUG"
2024-12-23 11:44:59 [test] DEBUG: None
$ rm -rf .scrapy/httpcache/

Custom cache policy, default action error handling ("pass"), responses are not cached:

$ scrapy crawl test -s 'HTTPCACHE_POLICY=zyte_spider_templates_project.extensions.ZyteAPIFailedActionsPolicy' |& grep "\[test\] DEBUG"
2024-12-23 11:50:26 [test] DEBUG: [{'action': 'waitForRequest', 'elapsedTime': 0.102, 'status': 'returned', 'error': 'Timed out after waiting 100ms'}]
$ scrapy crawl test -s 'HTTPCACHE_POLICY=zyte_spider_templates_project.extensions.ZyteAPIFailedActionsPolicy' |& grep "\[test\] DEBUG"
2024-12-23 11:50:39 [test] DEBUG: [{'action': 'waitForRequest', 'elapsedTime': 0.101, 'status': 'returned', 'error': 'Timed out after waiting 100ms'}]
$ rm -rf .scrapy/httpcache/

Custom cache policy, "ignore" action error handling, responses are not cached:

$ scrapy crawl test -s 'ZYTE_API_ACTION_ERROR_HANDLING=ignore' -s 'HTTPCACHE_POLICY=zyte_spider_templates_project.extensions.ZyteAPIFailedActionsPolicy' |& grep "\[test\] DEBUG"
2024-12-23 11:51:34 [test] DEBUG: [{'action': 'waitForRequest', 'elapsedTime': 0.102, 'status': 'returned', 'error': 'Timed out after waiting 100ms'}]
$ scrapy crawl test -s 'ZYTE_API_ACTION_ERROR_HANDLING=ignore' -s 'HTTPCACHE_POLICY=zyte_spider_templates_project.extensions.ZyteAPIFailedActionsPolicy' |& grep "\[test\] DEBUG"
2024-12-23 11:51:46 [test] DEBUG: [{'action': 'waitForRequest', 'elapsedTime': 0.102, 'status': 'returned', 'error': 'Timed out after waiting 100ms'}]
$ rm -rf .scrapy/httpcache/

Custom cache policy, "err" action error handling, responses are not cached:

$ scrapy crawl test -s 'ZYTE_API_ACTION_ERROR_HANDLING=err' -s 'HTTPCACHE_POLICY=zyte_spider_templates_project.extensions.ZyteAPIFailedActionsPolicy' |& grep "\[test\] DEBUG"
2024-12-23 11:52:14 [test] DEBUG: [{'action': 'waitForRequest', 'elapsedTime': 0.101, 'status': 'returned', 'error': 'Timed out after waiting 100ms'}]
$ scrapy crawl test -s 'ZYTE_API_ACTION_ERROR_HANDLING=err' -s 'HTTPCACHE_POLICY=zyte_spider_templates_project.extensions.ZyteAPIFailedActionsPolicy' |& grep "\[test\] DEBUG"
2024-12-23 11:52:28 [test] DEBUG: [{'action': 'waitForRequest', 'elapsedTime': 0.101, 'status': 'returned', 'error': 'Timed out after waiting 100ms'}]
$ rm -rf .scrapy/httpcache/

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(I’m not sure I fully understood your question, though, I might be missing some angle)


.. code-block:: python
:caption: myproject/extensions.py

from scrapy import Request
from scrapy.extensions.httpcache import DummyPolicy
from scrapy.http import Response
from scrapy_zyte_api.responses import ZyteAPIResponse, ZyteAPITextResponse


class ZyteAPIFailedActionsPolicy(DummyPolicy):

def should_cache_response(self, response: Response, request: Request):
if isinstance(response, (ZyteAPIResponse, ZyteAPITextResponse)) and any(
"error" in action for action in response.raw_api_response.get("actions", [])
):
return False
return super().should_cache_response(response, request)

And enable it in your settings:

.. code-block:: python
:caption: myproject/settings.py

HTTPCACHE_POLICY = "myproject.extensions.ZyteAPIFailedActionsPolicy"
6 changes: 3 additions & 3 deletions pyproject.toml
Original file line number Diff line number Diff line change
Expand Up @@ -15,12 +15,12 @@ files = [
]

[tool.pytest.ini_options]
filterwarnings = [
"ignore:::twisted",
]
junit_family = "xunit2"
testpaths = [
"scrapy_zyte_api/",
"tests/"
]
minversion = "6.0"
filterwarnings = [
"ignore::DeprecationWarning:twisted.web.http",
]
174 changes: 172 additions & 2 deletions scrapy_zyte_api/_middlewares.py
Original file line number Diff line number Diff line change
@@ -1,11 +1,116 @@
from logging import getLogger
from typing import cast
from typing import Optional, Union, cast

from scrapy import Request
from scrapy import Request, Spider
from scrapy.exceptions import IgnoreRequest
from scrapy.http import Response
from scrapy.utils.python import global_object_name
from zyte_api import RequestError

from ._params import _ParamParser
from .exceptions import ActionError
from .responses import ZyteAPIResponse, ZyteAPITextResponse

try:
from scrapy.downloadermiddlewares.retry import get_retry_request
except ImportError:
# Backport get_retry_request for Scrapy < 2.5.0

from logging import Logger
from typing import Type

from scrapy.downloadermiddlewares.retry import ( # type: ignore[attr-defined] # isort: skip
logger as retry_logger,
)

def get_retry_request(
request: Request,
*,
spider: Spider,
reason: Union[str, Exception, Type[Exception]] = "unspecified",
max_retry_times: Optional[int] = None,
priority_adjust: Optional[int] = None,
logger: Logger = retry_logger,
stats_base_key: str = "retry",
) -> Optional[Request]:
"""
Returns a new :class:`~scrapy.Request` object to retry the specified
request, or ``None`` if retries of the specified request have been
exhausted.

For example, in a :class:`~scrapy.Spider` callback, you could use it as
follows::

def parse(self, response):
if not response.text:
new_request_or_none = get_retry_request(
response.request,
spider=self,
reason='empty',
)
return new_request_or_none

*spider* is the :class:`~scrapy.Spider` instance which is asking for the
retry request. It is used to access the :ref:`settings <topics-settings>`
and :ref:`stats <topics-stats>`, and to provide extra logging context (see
:func:`logging.debug`).

*reason* is a string or an :class:`Exception` object that indicates the
reason why the request needs to be retried. It is used to name retry stats.

*max_retry_times* is a number that determines the maximum number of times
that *request* can be retried. If not specified or ``None``, the number is
read from the :reqmeta:`max_retry_times` meta key of the request. If the
:reqmeta:`max_retry_times` meta key is not defined or ``None``, the number
is read from the :setting:`RETRY_TIMES` setting.

*priority_adjust* is a number that determines how the priority of the new
request changes in relation to *request*. If not specified, the number is
read from the :setting:`RETRY_PRIORITY_ADJUST` setting.

*logger* is the logging.Logger object to be used when logging messages

*stats_base_key* is a string to be used as the base key for the
retry-related job stats
"""
settings = spider.crawler.settings
assert spider.crawler.stats
stats = spider.crawler.stats
retry_times = request.meta.get("retry_times", 0) + 1
if max_retry_times is None:
max_retry_times = request.meta.get("max_retry_times")

Check warning on line 81 in scrapy_zyte_api/_middlewares.py

View check run for this annotation

Codecov / codecov/patch

scrapy_zyte_api/_middlewares.py#L81

Added line #L81 was not covered by tests
if max_retry_times is None:
max_retry_times = settings.getint("RETRY_TIMES")

Check warning on line 83 in scrapy_zyte_api/_middlewares.py

View check run for this annotation

Codecov / codecov/patch

scrapy_zyte_api/_middlewares.py#L83

Added line #L83 was not covered by tests
if retry_times <= max_retry_times:
logger.debug(
"Retrying %(request)s (failed %(retry_times)d times): %(reason)s",
{"request": request, "retry_times": retry_times, "reason": reason},
extra={"spider": spider},
)
new_request: Request = request.copy()
new_request.meta["retry_times"] = retry_times
new_request.dont_filter = True
if priority_adjust is None:
priority_adjust = settings.getint("RETRY_PRIORITY_ADJUST")

Check warning on line 94 in scrapy_zyte_api/_middlewares.py

View check run for this annotation

Codecov / codecov/patch

scrapy_zyte_api/_middlewares.py#L94

Added line #L94 was not covered by tests
new_request.priority = request.priority + priority_adjust

if callable(reason):
reason = reason()

Check warning on line 98 in scrapy_zyte_api/_middlewares.py

View check run for this annotation

Codecov / codecov/patch

scrapy_zyte_api/_middlewares.py#L98

Added line #L98 was not covered by tests
if isinstance(reason, Exception):
reason = global_object_name(reason.__class__)

Check warning on line 100 in scrapy_zyte_api/_middlewares.py

View check run for this annotation

Codecov / codecov/patch

scrapy_zyte_api/_middlewares.py#L100

Added line #L100 was not covered by tests

stats.inc_value(f"{stats_base_key}/count")
stats.inc_value(f"{stats_base_key}/reason_count/{reason}")
return new_request
stats.inc_value(f"{stats_base_key}/max_reached")
logger.error(
"Gave up retrying %(request)s (failed %(retry_times)d times): "
"%(reason)s",
{"request": request, "retry_times": retry_times, "reason": reason},
extra={"spider": spider},
)
return None


logger = getLogger(__name__)
_start_requests_processed = object()
Expand Down Expand Up @@ -49,6 +154,13 @@
self._forbidden_domain_start_request_count = 0
self._total_start_request_count = 0

self._retry_action_errors = crawler.settings.getbool(
"ZYTE_API_ACTION_ERROR_RETRY_ENABLED", True
)
self._max_retry_times = crawler.settings.getint("RETRY_TIMES")
self._priority_adjust = crawler.settings.getint("RETRY_PRIORITY_ADJUST")
self._load_action_error_handling()

self._max_requests = crawler.settings.getint("ZYTE_API_MAX_REQUESTS")
if self._max_requests:
logger.info(
Expand All @@ -62,6 +174,18 @@
self._start_requests_processed, signal=_start_requests_processed
)

def _load_action_error_handling(self):
value = self._crawler.settings.get("ZYTE_API_ACTION_ERROR_HANDLING", "pass")
if value in ("pass", "ignore", "err"):
self._action_error_handling = value
else:
fallback_value = "pass"
logger.error(
f"Setting ZYTE_API_ACTION_ERROR_HANDLING got an unexpected "
f"value: {value!r}. Falling back to {fallback_value!r}."
)
self._action_error_handling = fallback_value

def _get_spm_mw(self):
spm_mw_classes = []

Expand Down Expand Up @@ -162,6 +286,52 @@
self._crawler.spider, "failed_forbidden_domain"
)

def _handle_action_error(self, response):
if self._action_error_handling == "pass":
return response
elif self._action_error_handling == "ignore":
raise IgnoreRequest
else:
assert self._action_error_handling == "err"
raise ActionError(response)

def process_response(
self, request: Request, response: Response, spider: Spider
) -> Union[Request, Response]:
if not isinstance(response, (ZyteAPIResponse, ZyteAPITextResponse)):
return response

assert response.raw_api_response is not None
action_error = any(
"error" in action for action in response.raw_api_response.get("actions", [])
)
if not action_error:
return response

if not self._retry_action_errors or request.meta.get("dont_retry", False):
return self._handle_action_error(response)

return self._retry(
request, reason="action-error", spider=spider
) or self._handle_action_error(response)

def _retry(
self,
request: Request,
*,
reason: str,
spider: Spider,
) -> Optional[Request]:
max_retry_times = request.meta.get("max_retry_times", self._max_retry_times)
priority_adjust = request.meta.get("priority_adjust", self._priority_adjust)
return get_retry_request(
request,
reason=reason,
spider=spider,
max_retry_times=max_retry_times,
priority_adjust=priority_adjust,
)


class ScrapyZyteAPISpiderMiddleware(_BaseMiddleware):
def __init__(self, crawler):
Expand Down
17 changes: 17 additions & 0 deletions scrapy_zyte_api/exceptions.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,17 @@
from typing import Union

from .responses import ZyteAPIResponse, ZyteAPITextResponse


class ActionError(ValueError):
"""Exception raised when a Zyte API response contains an action error."""

def __init__(self, response, *args, **kwargs):
super().__init__(*args, **kwargs)

#: Offending Zyte API response.
#:
#: You can inspect the outcome of actions in the ``"actions"`` key of
#: :attr:`response.raw_api_response
#: <scrapy_zyte_api.responses.ZyteAPITextResponse.raw_api_response>`.
self.response: Union[ZyteAPIResponse, ZyteAPITextResponse] = response
Loading
Loading