-
Notifications
You must be signed in to change notification settings - Fork 20
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Document how to handle action failures #157
base: main
Are you sure you want to change the base?
Changes from 10 commits
7cc483b
47197ee
00b2c2a
20f3715
deedfff
75faf68
cdc0aa1
f19e157
62c9ae6
ea2da8f
e1eb2b1
12c7907
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,47 @@ | ||
.. _actions: | ||
|
||
====================== | ||
Handling action errors | ||
====================== | ||
|
||
Even though Zyte API considers a response successful :ref:`even if a browser | ||
action fails <zyte-api-successful-responses>`, scrapy-zyte-api retries such | ||
responses by default. See :ref:`ZYTE_API_ACTION_ERROR_RETRY_ENABLED`. | ||
|
||
You can also use :ref:`ZYTE_API_ACTION_ERROR_HANDLING` to determine how such | ||
responses are handled when they are not retried or when retries are exceeded: | ||
treated as a success (default), ignored, or treated as an error. | ||
|
||
Action error caching | ||
==================== | ||
|
||
If you use | ||
:class:`~scrapy.downloadermiddlewares.httpcache.HttpCacheMiddleware`, you might | ||
want to use a custom :setting:`HTTPCACHE_POLICY <scrapy:HTTPCACHE_POLICY>` to | ||
prevent responses with failed actions (i.e. after exceeding retries) to be | ||
cached: | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Is there any interaction with the ZYTE_API_ACTION_ERROR_HANDLING / ZYTE_API_ACTION_ERROR_RETRY_ENABLED settings? E.g. do you need this for all values option values? Or do some of these option values require caching to work? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I guess you only need this for the default handing, There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
I was wrong. I tested all 6 scenarios, default policy/custom policy, pass/ignore/err, and pass/ignore/err make no difference, with the default policy responses are cached, with the custom policy they are not. Not sure if the current docs need any clarification, given the custom policy is relevant regardless of those settings. Test detailsI cloned https://github.com/zytedata/zyte-spider-templates-project and added 2 files: # zyte_spider_templates_project/spiders/test.py
from scrapy import Request, Spider
class TestSpider(Spider):
name = "test"
custom_settings = {
"HTTPCACHE_ENABLED": True,
}
def start_requests(self):
yield Request(
"https://toscrape.com",
meta={
"zyte_api_automap": {
"browserHtml": True,
"actions": [
{
"action": "waitForRequest",
"urlPattern": "unexising/pattern",
"timeout": 0,
},
],
},
},
)
def parse(self, response):
response_actions = getattr(response, "raw_api_response", {}).get("actions", None)
self.logger.debug(response_actions) # zyte_spider_templates_project/extensions.py
from scrapy import Request
from scrapy.extensions.httpcache import DummyPolicy
from scrapy.http import Response
from scrapy_zyte_api.responses import ZyteAPIResponse, ZyteAPITextResponse
class ZyteAPIFailedActionsPolicy(DummyPolicy):
def should_cache_response(self, response: Response, request: Request):
if isinstance(response, (ZyteAPIResponse, ZyteAPITextResponse)) and any(
"error" in action for action in response.raw_api_response.get("actions", [])
):
return False
return super().should_cache_response(response, request) And then tried all 6 scenarios. Default cache policy, default action error handling (
Default cache policy,
Default cache policy,
Custom cache policy, default action error handling (
Custom cache policy,
Custom cache policy,
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. (I’m not sure I fully understood your question, though, I might be missing some angle) |
||
|
||
.. code-block:: python | ||
:caption: myproject/extensions.py | ||
|
||
from scrapy import Request | ||
from scrapy.extensions.httpcache import DummyPolicy | ||
from scrapy.http import Response | ||
from scrapy_zyte_api.responses import ZyteAPIResponse, ZyteAPITextResponse | ||
|
||
class ZyteAPIFailedActionsPolicy(DummyPolicy): | ||
|
||
def should_cache_response(self, response: Response, request: Request): | ||
if ( | ||
isinstance(response, (ZyteAPIResponse, ZyteAPITextResponse)) | ||
and any("error" in action for action in response.raw_api_response.get("actions", [])) | ||
): | ||
return False | ||
return super().should_cache_response(response, request) | ||
|
||
And enable it in your settings: | ||
|
||
.. code-block:: python | ||
:caption: myproject/settings.py | ||
|
||
HTTPCACHE_POLICY = "myproject.extensions.ZyteAPIFailedActionsPolicy" |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,17 @@ | ||
from typing import Union | ||
|
||
from .responses import ZyteAPIResponse, ZyteAPITextResponse | ||
|
||
|
||
class ActionError(ValueError): | ||
"""Exception raised when a Zyte API response contains an action error.""" | ||
|
||
def __init__(self, response, *args, **kwargs): | ||
super().__init__(*args, **kwargs) | ||
|
||
#: Offending Zyte API response. | ||
#: | ||
#: You can inspect the outcome of actions in the ``"actions"`` key of | ||
#: :attr:`response.raw_api_response | ||
#: <scrapy_zyte_api.responses.ZyteAPITextResponse.raw_api_response>`. | ||
self.response: Union[ZyteAPIResponse, ZyteAPITextResponse] = response |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is a lot of code in this section; I wonder if we should make it a part of the library itself instead.
It seems like a general theme: if action failed, should the response be considered a success or a failure? The current default is success, but it seems that making it a failure is better (with an option to get back), as it allows to detect more issues.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
deedfff does this for the middleware. Do we want to do something about the cache policy, or is it OK to keep that documentation-only?