
Document how to handle action failures #157

Open · wants to merge 12 commits into main
1 change: 1 addition & 0 deletions docs/index.rst
@@ -28,6 +28,7 @@ either :ref:`globally <transparent>` or :ref:`per request <automap>`, or
usage/retry
usage/scrapy-poet
usage/stats
usage/actions
usage/fingerprint
usage/proxy

111 changes: 111 additions & 0 deletions docs/usage/actions.rst
@@ -0,0 +1,111 @@
.. _actions:

======================
Handling action errors
======================
Member:
There is a lot of code in this section; I wonder if we should make it a part of the library itself instead.

It seems like a general theme: if an action failed, should the response be considered a success or a failure? The current default is success, but it seems that making it a failure is better (with an option to get the old behavior back), as it allows detecting more issues.

Contributor Author:

deedfff does this for the middleware. Do we want to do something about the cache policy, or is it OK to keep that documentation-only?


Zyte API responses are considered successful :ref:`even if some browser actions
fail <zyte-api-successful-responses>`.
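
A failed action is reported as an entry with an ``error`` key in
``raw_api_response["actions"]``. As a minimal illustrative sketch (the callback
below is not part of the library), you can inspect failed actions directly in a
spider callback:

.. code-block:: python

def parse(self, response):
    # "actions" is only present in raw_api_response when the request
    # included browser actions.
    actions = response.raw_api_response.get("actions", [])
    failed = [action for action in actions if "error" in action]
    if failed:
        # A failed action looks like:
        # {"action": "waitForRequest", "status": "returned", "error": "Timed out ..."}
        self.logger.warning("Failed actions: %r", failed)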

If you wish to retry requests whose response contains actions that failed, you
can define a custom Scrapy middleware as follows:

.. code-block:: python
:caption: myproject/middlewares.py

from typing import Optional, Type, Union

from scrapy import Request, Spider
from scrapy.crawler import Crawler
from scrapy.downloadermiddlewares.retry import get_retry_request
from scrapy.exceptions import NotConfigured
from scrapy.http import Response
from scrapy.settings import BaseSettings
from scrapy_zyte_api.responses import ZyteAPIResponse, ZyteAPITextResponse


class ZyteAPIFailedActionsRetryMiddleware:

    @classmethod
    def from_crawler(cls, crawler: Crawler):
        return cls(crawler.settings)

    def __init__(self, settings: BaseSettings):
        if not settings.getbool("RETRY_ENABLED"):
            raise NotConfigured
        self.max_retry_times = settings.getint("RETRY_TIMES")
        self.priority_adjust = settings.getint("RETRY_PRIORITY_ADJUST")

    def process_response(
        self, request: Request, response: Response, spider: Spider
    ) -> Union[Request, Response]:
        if not isinstance(response, (ZyteAPIResponse, ZyteAPITextResponse)):
            return response
        if request.meta.get("dont_retry", False):
            return response
        if any(
            "error" in action
            for action in response.raw_api_response.get("actions", [])
        ):
            reason = "An action failed"
            new_request = self._retry(request, reason, spider)
            if new_request:
                return new_request
            # Note: If you prefer requests that exceed all retries to be
            # dropped, raise scrapy.exceptions.IgnoreRequest here, instead
            # of returning the response.
            return response
        return response

    def _retry(
        self,
        request: Request,
        reason: Union[str, Exception, Type[Exception]],
        spider: Spider,
    ) -> Optional[Request]:
        max_retry_times = request.meta.get("max_retry_times", self.max_retry_times)
        priority_adjust = request.meta.get("priority_adjust", self.priority_adjust)
        return get_retry_request(
            request,
            reason=reason,
            spider=spider,
            max_retry_times=max_retry_times,
            priority_adjust=priority_adjust,
        )

And enable it in your settings:

.. code-block:: python
:caption: myproject/settings.py


DOWNLOADER_MIDDLEWARES = {
    "myproject.middlewares.ZyteAPIFailedActionsRetryMiddleware": 525,
}
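
If you prefer to drop requests that exceed all retries, as the note inside
``process_response`` suggests, you can raise
:exc:`~scrapy.exceptions.IgnoreRequest` instead of returning the response. A
minimal sketch of that variant as a subclass (the
``DropFailedActionsRetryMiddleware`` name is only illustrative):

.. code-block:: python

from scrapy.exceptions import IgnoreRequest
from scrapy_zyte_api.responses import ZyteAPIResponse, ZyteAPITextResponse

from myproject.middlewares import ZyteAPIFailedActionsRetryMiddleware


class DropFailedActionsRetryMiddleware(ZyteAPIFailedActionsRetryMiddleware):
    # Same as the middleware above, but once retries are exhausted the
    # request is dropped instead of the failed response being returned.

    def process_response(self, request, response, spider):
        if not isinstance(response, (ZyteAPIResponse, ZyteAPITextResponse)):
            return response
        if request.meta.get("dont_retry", False):
            return response
        if any(
            "error" in action
            for action in response.raw_api_response.get("actions", [])
        ):
            new_request = self._retry(request, "An action failed", spider)
            if new_request:
                return new_request
            raise IgnoreRequest("An action failed and retries were exhausted")
        return response

If you go this route, reference this class in ``DOWNLOADER_MIDDLEWARES``
instead.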

If you use
:class:`~scrapy.downloadermiddlewares.httpcache.HttpCacheMiddleware`, you might
want to use a custom :setting:`HTTPCACHE_POLICY <scrapy:HTTPCACHE_POLICY>` to
prevent responses with failed actions (i.e. after exceeding retries) from being
cached:
Member:

Is there any interaction with the ZYTE_API_ACTION_ERROR_HANDLING / ZYTE_API_ACTION_ERROR_RETRY_ENABLED settings? E.g. do you need this for all option values? Or do some of these option values require caching to work?

Contributor Author (@Gallaecio, Dec 23, 2024):

I guess you only need this for the default handling, "pass", as I don’t imagine caching caches exceptions. But I will test this out manually.
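
(For reference, the handling mode under discussion is selected with the
ZYTE_API_ACTION_ERROR_HANDLING setting; the tests below exercise its "pass",
"ignore" and "err" values, e.g.:)

# settings.py (illustrative)
ZYTE_API_ACTION_ERROR_HANDLING = "err"  # the default is "pass"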

Contributor Author:

I don’t imagine caching caches exceptions.

I was wrong.

I tested all 6 scenarios (default policy vs. custom policy × pass/ignore/err). The pass/ignore/err values make no difference: with the default policy, responses are cached; with the custom policy, they are not. I’m not sure the current docs need any clarification, given that the custom policy is relevant regardless of those settings.

Test details

I cloned https://github.com/zytedata/zyte-spider-templates-project and added 2 files:

# zyte_spider_templates_project/spiders/test.py
from scrapy import Request, Spider


class TestSpider(Spider):
    name = "test"

    custom_settings = {
        "HTTPCACHE_ENABLED": True,
    }

    def start_requests(self):
        yield Request(
            "https://toscrape.com",
            meta={
                "zyte_api_automap": {
                    "browserHtml": True,
                    "actions": [
                        {
                            "action": "waitForRequest",
                            "urlPattern": "unexising/pattern",
                            "timeout": 0,
                        },
                    ],
                },
            },
        )

    def parse(self, response):
        response_actions = getattr(response, "raw_api_response", {}).get("actions", None)
        self.logger.debug(response_actions)

# zyte_spider_templates_project/extensions.py
from scrapy import Request
from scrapy.extensions.httpcache import DummyPolicy
from scrapy.http import Response
from scrapy_zyte_api.responses import ZyteAPIResponse, ZyteAPITextResponse


class ZyteAPIFailedActionsPolicy(DummyPolicy):

    def should_cache_response(self, response: Response, request: Request):
        if isinstance(response, (ZyteAPIResponse, ZyteAPITextResponse)) and any(
            "error" in action for action in response.raw_api_response.get("actions", [])
        ):
            return False
        return super().should_cache_response(response, request)

And then tried all 6 scenarios.

Default cache policy, default action error handling ("pass"), responses are cached:

$ scrapy crawl test |& grep "\[test\] DEBUG"
2024-12-23 11:39:08 [test] DEBUG: [{'action': 'waitForRequest', 'elapsedTime': 0.102, 'status': 'returned', 'error': 'Timed out after waiting 100ms'}]
$ scrapy crawl test |& grep "\[test\] DEBUG"
2024-12-23 11:39:24 [test] DEBUG: None
$ rm -rf .scrapy/httpcache/

Default cache policy, "ignore" action error handling, responses are cached:

$ scrapy crawl test -s 'ZYTE_API_ACTION_ERROR_HANDLING=ignore' |& grep "\[test\] DEBUG"
2024-12-23 11:43:34 [test] DEBUG: [{'action': 'waitForRequest', 'elapsedTime': 0.101, 'status': 'returned', 'error': 'Timed out after waiting 100ms'}]
$ scrapy crawl test -s 'ZYTE_API_ACTION_ERROR_HANDLING=ignore' |& grep "\[test\] DEBUG"
2024-12-23 11:43:42 [test] DEBUG: None
$ rm -rf .scrapy/httpcache/

Default cache policy, "err" action error handling, responses are cached:

$ scrapy crawl test -s 'ZYTE_API_ACTION_ERROR_HANDLING=err' |& grep "\[test\] DEBUG"
2024-12-23 11:44:41 [test] DEBUG: [{'action': 'waitForRequest', 'elapsedTime': 0.101, 'status': 'returned', 'error': 'Timed out after waiting 100ms'}]
$ scrapy crawl test -s 'ZYTE_API_ACTION_ERROR_HANDLING=err' |& grep "\[test\] DEBUG"
2024-12-23 11:44:59 [test] DEBUG: None
$ rm -rf .scrapy/httpcache/

Custom cache policy, default action error handling ("pass"), responses are not cached:

$ scrapy crawl test -s 'HTTPCACHE_POLICY=zyte_spider_templates_project.extensions.ZyteAPIFailedActionsPolicy' |& grep "\[test\] DEBUG"
2024-12-23 11:50:26 [test] DEBUG: [{'action': 'waitForRequest', 'elapsedTime': 0.102, 'status': 'returned', 'error': 'Timed out after waiting 100ms'}]
$ scrapy crawl test -s 'HTTPCACHE_POLICY=zyte_spider_templates_project.extensions.ZyteAPIFailedActionsPolicy' |& grep "\[test\] DEBUG"
2024-12-23 11:50:39 [test] DEBUG: [{'action': 'waitForRequest', 'elapsedTime': 0.101, 'status': 'returned', 'error': 'Timed out after waiting 100ms'}]
$ rm -rf .scrapy/httpcache/

Custom cache policy, "ignore" action error handling, responses are not cached:

$ scrapy crawl test -s 'ZYTE_API_ACTION_ERROR_HANDLING=ignore' -s 'HTTPCACHE_POLICY=zyte_spider_templates_project.extensions.ZyteAPIFailedActionsPolicy' |& grep "\[test\] DEBUG"
2024-12-23 11:51:34 [test] DEBUG: [{'action': 'waitForRequest', 'elapsedTime': 0.102, 'status': 'returned', 'error': 'Timed out after waiting 100ms'}]
$ scrapy crawl test -s 'ZYTE_API_ACTION_ERROR_HANDLING=ignore' -s 'HTTPCACHE_POLICY=zyte_spider_templates_project.extensions.ZyteAPIFailedActionsPolicy' |& grep "\[test\] DEBUG"
2024-12-23 11:51:46 [test] DEBUG: [{'action': 'waitForRequest', 'elapsedTime': 0.102, 'status': 'returned', 'error': 'Timed out after waiting 100ms'}]
$ rm -rf .scrapy/httpcache/

Custom cache policy, "err" action error handling, responses are not cached:

$ scrapy crawl test -s 'ZYTE_API_ACTION_ERROR_HANDLING=err' -s 'HTTPCACHE_POLICY=zyte_spider_templates_project.extensions.ZyteAPIFailedActionsPolicy' |& grep "\[test\] DEBUG"
2024-12-23 11:52:14 [test] DEBUG: [{'action': 'waitForRequest', 'elapsedTime': 0.101, 'status': 'returned', 'error': 'Timed out after waiting 100ms'}]
$ scrapy crawl test -s 'ZYTE_API_ACTION_ERROR_HANDLING=err' -s 'HTTPCACHE_POLICY=zyte_spider_templates_project.extensions.ZyteAPIFailedActionsPolicy' |& grep "\[test\] DEBUG"
2024-12-23 11:52:28 [test] DEBUG: [{'action': 'waitForRequest', 'elapsedTime': 0.101, 'status': 'returned', 'error': 'Timed out after waiting 100ms'}]
$ rm -rf .scrapy/httpcache/

Contributor Author:

(I’m not sure I fully understood your question, though; I might be missing some angle.)


.. code-block:: python
:caption: myproject/extensions.py

from scrapy import Request
from scrapy.extensions.httpcache import DummyPolicy
from scrapy.http import Response
from scrapy_zyte_api.responses import ZyteAPIResponse, ZyteAPITextResponse

class ZyteAPIFailedActionsPolicy(DummyPolicy):

    def should_cache_response(self, response: Response, request: Request):
        if isinstance(response, (ZyteAPIResponse, ZyteAPITextResponse)) and any(
            "error" in action
            for action in response.raw_api_response.get("actions", [])
        ):
            return False
        return super().should_cache_response(response, request)

And enable it in your settings:

.. code-block:: python
:caption: myproject/settings.py

HTTPCACHE_POLICY = "myproject.extensions.ZyteAPIFailedActionsPolicy"