-
Notifications
You must be signed in to change notification settings - Fork 20
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Document how to handle action failures #157
base: main
Are you sure you want to change the base?
Changes from 3 commits
7cc483b
47197ee
00b2c2a
20f3715
deedfff
75faf68
cdc0aa1
f19e157
62c9ae6
ea2da8f
e1eb2b1
12c7907
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,111 @@ | ||
.. _actions: | ||
|
||
====================== | ||
Handling action errors | ||
====================== | ||
|
||
Zyte API responses are considered successful :ref:`even if some browser actions | ||
fail <zyte-api-successful-responses>`. | ||
|
||
If you wish to retry requests whose response contains actions that failed, you | ||
can define a custom Scrapy middleware as follows: | ||
|
||
.. code-block:: python | ||
:caption: myproject/middlewares.py | ||
|
||
from typing import Optional, Type, Union | ||
|
||
from scrapy import Request, Spider | ||
from scrapy.crawler import Crawler | ||
from scrapy.http import Response | ||
from scrapy.downloadermiddlewares.retry import get_retry_request | ||
from scrapy.settings import BaseSettings | ||
from scrapy_zyte_api.responses import ZyteAPIResponse, ZyteAPITextResponse | ||
|
||
class ZyteAPIFailedActionsRetryMiddleware: | ||
|
||
@classmethod | ||
def from_crawler(cls, crawler: Crawler): | ||
return cls(crawler.settings) | ||
|
||
def __init__(self, settings: BaseSettings): | ||
if not settings.getbool("RETRY_ENABLED"): | ||
raise NotConfigured | ||
self.max_retry_times = settings.getint("RETRY_TIMES") | ||
self.priority_adjust = settings.getint("RETRY_PRIORITY_ADJUST") | ||
|
||
def process_response( | ||
self, request: Request, response: Response, spider: Spider | ||
) -> Union[Request, Response]: | ||
if not isinstance(response, (ZyteAPIResponse, ZyteAPITextResponse)): | ||
return response | ||
if request.meta.get("dont_retry", False): | ||
return response | ||
if any("error" in action for action in response.raw_api_response["actions"]): | ||
reason = "An action failed" | ||
new_request = self._retry(request, reason, spider) | ||
if new_request: | ||
return new_request | ||
else: | ||
return response | ||
# Note: If you prefer requests that exceed all retries to | ||
# be dropped, raise scrapy.exceptions.IgnoreRequest here, | ||
# instead of returning the response. | ||
return response | ||
|
||
def _retry( | ||
self, | ||
request: Request, | ||
reason: Union[str, Exception, Type[Exception]], | ||
spider: Spider, | ||
) -> Optional[Request]: | ||
max_retry_times = request.meta.get("max_retry_times", self.max_retry_times) | ||
priority_adjust = request.meta.get("priority_adjust", self.priority_adjust) | ||
return get_retry_request( | ||
request, | ||
reason=reason, | ||
spider=spider, | ||
max_retry_times=max_retry_times, | ||
priority_adjust=priority_adjust, | ||
) | ||
|
||
And enable it in your settings: | ||
|
||
.. code-block:: python | ||
:caption: myproject/settings.py | ||
|
||
|
||
DOWNLOADER_MIDDLEWARES = { | ||
"myproject.middlewares.ZyteAPIFailedActionsRetryMiddleware": 525, | ||
} | ||
|
||
If you use | ||
:class:`~scrapy.downloadermiddlewares.httpcache.HttpCacheMiddleware`, you might | ||
want to use a custom :setting:`HTTPCACHE_POLICY <scrapy:HTTPCACHE_POLICY>` to | ||
prevent responses with failed actions (i.e. after exceeding retries) to be | ||
cached: | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Is there any interaction with the ZYTE_API_ACTION_ERROR_HANDLING / ZYTE_API_ACTION_ERROR_RETRY_ENABLED settings? E.g. do you need this for all values option values? Or do some of these option values require caching to work? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I guess you only need this for the default handing, There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
I was wrong. I tested all 6 scenarios, default policy/custom policy, pass/ignore/err, and pass/ignore/err make no difference, with the default policy responses are cached, with the custom policy they are not. Not sure if the current docs need any clarification, given the custom policy is relevant regardless of those settings. Test detailsI cloned https://github.com/zytedata/zyte-spider-templates-project and added 2 files: # zyte_spider_templates_project/spiders/test.py
from scrapy import Request, Spider
class TestSpider(Spider):
name = "test"
custom_settings = {
"HTTPCACHE_ENABLED": True,
}
def start_requests(self):
yield Request(
"https://toscrape.com",
meta={
"zyte_api_automap": {
"browserHtml": True,
"actions": [
{
"action": "waitForRequest",
"urlPattern": "unexising/pattern",
"timeout": 0,
},
],
},
},
)
def parse(self, response):
response_actions = getattr(response, "raw_api_response", {}).get("actions", None)
self.logger.debug(response_actions) # zyte_spider_templates_project/extensions.py
from scrapy import Request
from scrapy.extensions.httpcache import DummyPolicy
from scrapy.http import Response
from scrapy_zyte_api.responses import ZyteAPIResponse, ZyteAPITextResponse
class ZyteAPIFailedActionsPolicy(DummyPolicy):
def should_cache_response(self, response: Response, request: Request):
if isinstance(response, (ZyteAPIResponse, ZyteAPITextResponse)) and any(
"error" in action for action in response.raw_api_response.get("actions", [])
):
return False
return super().should_cache_response(response, request) And then tried all 6 scenarios. Default cache policy, default action error handling (
Default cache policy,
Default cache policy,
Custom cache policy, default action error handling (
Custom cache policy,
Custom cache policy,
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. (I’m not sure I fully understood your question, though, I might be missing some angle) |
||
|
||
.. code-block:: python | ||
:caption: myproject/extensions.py | ||
|
||
from scrapy import Request | ||
from scrapy.extensions.httpcache import DummyPolicy | ||
from scrapy.http import Response | ||
from scrapy_zyte_api.responses import ZyteAPIResponse, ZyteAPITextResponse | ||
|
||
class ZyteAPIFailedActionsPolicy(DummyPolicy): | ||
|
||
def should_cache_response(self, response: Response, request: Request): | ||
if ( | ||
isinstance(response, (ZyteAPIResponse, ZyteAPITextResponse)) | ||
and any("error" in action for action in response.raw_api_response["actions"]) | ||
): | ||
return False | ||
return super().should_cache_response(response, request) | ||
|
||
And enable it in your settings: | ||
|
||
.. code-block:: python | ||
:caption: myproject/settings.py | ||
|
||
HTTPCACHE_POLICY = "myproject.extensions.ZyteAPIFailedActionsPolicy" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is a lot of code in this section; I wonder if we should make it a part of the library itself instead.
It seems like a general theme: if action failed, should the response be considered a success or a failure? The current default is success, but it seems that making it a failure is better (with an option to get back), as it allows to detect more issues.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
deedfff does this for the middleware. Do we want to do something about the cache policy, or is it OK to keep that documentation-only?