-
Notifications
You must be signed in to change notification settings - Fork 20
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Document how to handle action failures #157
base: main
Are you sure you want to change the base?
Conversation
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #157 +/- ##
==========================================
- Coverage 97.89% 97.40% -0.50%
==========================================
Files 14 15 +1
Lines 1618 1693 +75
Branches 297 309 +12
==========================================
+ Hits 1584 1649 +65
- Misses 14 20 +6
- Partials 20 24 +4
|
|
||
====================== | ||
Handling action errors | ||
====================== |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is a lot of code in this section; I wonder if we should make it a part of the library itself instead.
It seems like a general theme: if action failed, should the response be considered a success or a failure? The current default is success, but it seems that making it a failure is better (with an option to get back), as it allows to detect more issues.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
deedfff does this for the middleware. Do we want to do something about the cache policy, or is it OK to keep that documentation-only?
:class:`~scrapy.downloadermiddlewares.httpcache.HttpCacheMiddleware`, you might | ||
want to use a custom :setting:`HTTPCACHE_POLICY <scrapy:HTTPCACHE_POLICY>` to | ||
prevent responses with failed actions (i.e. after exceeding retries) to be | ||
cached: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there any interaction with the ZYTE_API_ACTION_ERROR_HANDLING / ZYTE_API_ACTION_ERROR_RETRY_ENABLED settings? E.g. do you need this for all values option values? Or do some of these option values require caching to work?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess you only need this for the default handing, "pass"
, as I don’t imagine caching caches exceptions. But I will test this out manually.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don’t imagine caching caches exceptions.
I was wrong.
I tested all 6 scenarios, default policy/custom policy, pass/ignore/err, and pass/ignore/err make no difference, with the default policy responses are cached, with the custom policy they are not. Not sure if the current docs need any clarification, given the custom policy is relevant regardless of those settings.
Test details
I cloned https://github.com/zytedata/zyte-spider-templates-project and added 2 files:
# zyte_spider_templates_project/spiders/test.py
from scrapy import Request, Spider
class TestSpider(Spider):
name = "test"
custom_settings = {
"HTTPCACHE_ENABLED": True,
}
def start_requests(self):
yield Request(
"https://toscrape.com",
meta={
"zyte_api_automap": {
"browserHtml": True,
"actions": [
{
"action": "waitForRequest",
"urlPattern": "unexising/pattern",
"timeout": 0,
},
],
},
},
)
def parse(self, response):
response_actions = getattr(response, "raw_api_response", {}).get("actions", None)
self.logger.debug(response_actions)
# zyte_spider_templates_project/extensions.py
from scrapy import Request
from scrapy.extensions.httpcache import DummyPolicy
from scrapy.http import Response
from scrapy_zyte_api.responses import ZyteAPIResponse, ZyteAPITextResponse
class ZyteAPIFailedActionsPolicy(DummyPolicy):
def should_cache_response(self, response: Response, request: Request):
if isinstance(response, (ZyteAPIResponse, ZyteAPITextResponse)) and any(
"error" in action for action in response.raw_api_response.get("actions", [])
):
return False
return super().should_cache_response(response, request)
And then tried all 6 scenarios.
Default cache policy, default action error handling ("pass"
), responses are cached:
$ scrapy crawl test |& grep "\[test\] DEBUG"
2024-12-23 11:39:08 [test] DEBUG: [{'action': 'waitForRequest', 'elapsedTime': 0.102, 'status': 'returned', 'error': 'Timed out after waiting 100ms'}]
$ scrapy crawl test |& grep "\[test\] DEBUG"
2024-12-23 11:39:24 [test] DEBUG: None
$ rm -rf .scrapy/httpcache/
Default cache policy, "ignore"
action error handling, responses are cached:
$ scrapy crawl test -s 'ZYTE_API_ACTION_ERROR_HANDLING=ignore' |& grep "\[test\] DEBUG"
2024-12-23 11:43:34 [test] DEBUG: [{'action': 'waitForRequest', 'elapsedTime': 0.101, 'status': 'returned', 'error': 'Timed out after waiting 100ms'}]
$ scrapy crawl test -s 'ZYTE_API_ACTION_ERROR_HANDLING=ignore' |& grep "\[test\] DEBUG"
2024-12-23 11:43:42 [test] DEBUG: None
$ rm -rf .scrapy/httpcache/
Default cache policy, "err"
action error handling, responses are cached:
$ scrapy crawl test -s 'ZYTE_API_ACTION_ERROR_HANDLING=err' |& grep "\[test\] DEBUG"
2024-12-23 11:44:41 [test] DEBUG: [{'action': 'waitForRequest', 'elapsedTime': 0.101, 'status': 'returned', 'error': 'Timed out after waiting 100ms'}]
$ scrapy crawl test -s 'ZYTE_API_ACTION_ERROR_HANDLING=err' |& grep "\[test\] DEBUG"
2024-12-23 11:44:59 [test] DEBUG: None
$ rm -rf .scrapy/httpcache/
Custom cache policy, default action error handling ("pass"
), responses are not cached:
$ scrapy crawl test -s 'HTTPCACHE_POLICY=zyte_spider_templates_project.extensions.ZyteAPIFailedActionsPolicy' |& grep "\[test\] DEBUG"
2024-12-23 11:50:26 [test] DEBUG: [{'action': 'waitForRequest', 'elapsedTime': 0.102, 'status': 'returned', 'error': 'Timed out after waiting 100ms'}]
$ scrapy crawl test -s 'HTTPCACHE_POLICY=zyte_spider_templates_project.extensions.ZyteAPIFailedActionsPolicy' |& grep "\[test\] DEBUG"
2024-12-23 11:50:39 [test] DEBUG: [{'action': 'waitForRequest', 'elapsedTime': 0.101, 'status': 'returned', 'error': 'Timed out after waiting 100ms'}]
$ rm -rf .scrapy/httpcache/
Custom cache policy, "ignore"
action error handling, responses are not cached:
$ scrapy crawl test -s 'ZYTE_API_ACTION_ERROR_HANDLING=ignore' -s 'HTTPCACHE_POLICY=zyte_spider_templates_project.extensions.ZyteAPIFailedActionsPolicy' |& grep "\[test\] DEBUG"
2024-12-23 11:51:34 [test] DEBUG: [{'action': 'waitForRequest', 'elapsedTime': 0.102, 'status': 'returned', 'error': 'Timed out after waiting 100ms'}]
$ scrapy crawl test -s 'ZYTE_API_ACTION_ERROR_HANDLING=ignore' -s 'HTTPCACHE_POLICY=zyte_spider_templates_project.extensions.ZyteAPIFailedActionsPolicy' |& grep "\[test\] DEBUG"
2024-12-23 11:51:46 [test] DEBUG: [{'action': 'waitForRequest', 'elapsedTime': 0.102, 'status': 'returned', 'error': 'Timed out after waiting 100ms'}]
$ rm -rf .scrapy/httpcache/
Custom cache policy, "err"
action error handling, responses are not cached:
$ scrapy crawl test -s 'ZYTE_API_ACTION_ERROR_HANDLING=err' -s 'HTTPCACHE_POLICY=zyte_spider_templates_project.extensions.ZyteAPIFailedActionsPolicy' |& grep "\[test\] DEBUG"
2024-12-23 11:52:14 [test] DEBUG: [{'action': 'waitForRequest', 'elapsedTime': 0.101, 'status': 'returned', 'error': 'Timed out after waiting 100ms'}]
$ scrapy crawl test -s 'ZYTE_API_ACTION_ERROR_HANDLING=err' -s 'HTTPCACHE_POLICY=zyte_spider_templates_project.extensions.ZyteAPIFailedActionsPolicy' |& grep "\[test\] DEBUG"
2024-12-23 11:52:28 [test] DEBUG: [{'action': 'waitForRequest', 'elapsedTime': 0.101, 'status': 'returned', 'error': 'Timed out after waiting 100ms'}]
$ rm -rf .scrapy/httpcache/
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(I’m not sure I fully understood your question, though, I might be missing some angle)
Resolves #118, resolves #119, resolves #120.
To do: