Document how to handle action failures #157

Open · wants to merge 12 commits into main
Conversation

@Gallaecio Gallaecio (Contributor) commented Dec 29, 2023

Resolves #118, resolves #119, resolves #120.

To do:

  • Test both snippets manually and make sure they work as expected.

@Gallaecio Gallaecio requested review from kmike and wRAR December 29, 2023 13:55

codecov bot commented Dec 29, 2023

Codecov Report

Attention: Patch coverage is 87.01299% with 10 lines in your changes missing coverage. Please review.

Project coverage is 97.40%. Comparing base (cc3cfdd) to head (12c7907).

Files with missing lines          Patch %   Lines
scrapy_zyte_api/_middlewares.py   85.91%    6 Missing and 4 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #157      +/-   ##
==========================================
- Coverage   97.89%   97.40%   -0.50%     
==========================================
  Files          14       15       +1     
  Lines        1618     1693      +75     
  Branches      297      309      +12     
==========================================
+ Hits         1584     1649      +65     
- Misses         14       20       +6     
- Partials       20       24       +4     
Files with missing lines          Coverage Δ
scrapy_zyte_api/exceptions.py     100.00% <100.00%> (ø)
scrapy_zyte_api/_middlewares.py   93.40% <85.91%> (-4.26%) ⬇️


======================
Handling action errors
======================
Member

There is a lot of code in this section; I wonder if we should make it a part of the library itself instead.

It seems like a general theme: if an action failed, should the response be considered a success or a failure? The current default is success, but it seems that making it a failure is better (with an option to get the old behavior back), as it allows detecting more issues.

Contributor Author

deedfff does this for the middleware. Do we want to do something about the cache policy, or is it OK to keep that documentation-only?

:class:`~scrapy.downloadermiddlewares.httpcache.HttpCacheMiddleware`, you might
want to use a custom :setting:`HTTPCACHE_POLICY <scrapy:HTTPCACHE_POLICY>` to
prevent responses with failed actions (i.e. after exceeding retries) from being
cached:
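(The snippet that follows this sentence in the docs is not included in this excerpt. As a minimal sketch, such a policy could look like the one below; the class name FailedActionsPolicy is illustrative, and the ZyteAPIFailedActionsPolicy in the test details further down takes the same approach.)

from scrapy import Request
from scrapy.extensions.httpcache import DummyPolicy
from scrapy.http import Response
from scrapy_zyte_api.responses import ZyteAPIResponse, ZyteAPITextResponse


class FailedActionsPolicy(DummyPolicy):
    """Cache policy that refuses to cache responses with failed actions."""

    def should_cache_response(self, response: Response, request: Request):
        # A failed action is reported as an "error" key in one of the
        # entries of the "actions" list of the raw Zyte API response.
        if isinstance(response, (ZyteAPIResponse, ZyteAPITextResponse)) and any(
            "error" in action for action in response.raw_api_response.get("actions", [])
        ):
            return False
        return super().should_cache_response(response, request)

Such a policy would then be enabled through the HTTPCACHE_POLICY setting, as in the test commands below.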
Member

Is there any interaction with the ZYTE_API_ACTION_ERROR_HANDLING / ZYTE_API_ACTION_ERROR_RETRY_ENABLED settings? E.g. do you need this for all option values? Or do some of these option values require caching to work?

Contributor Author

@Gallaecio Gallaecio commented Dec 23, 2024

I guess you only need this for the default handling, "pass", as I don’t imagine caching caches exceptions. But I will test this out manually.
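(For context, this is how the mode under discussion would be selected; the three values exercised in the tests below are "pass", the default, "ignore" and "err".)

# settings.py of a Scrapy project (hypothetical example)
ZYTE_API_ACTION_ERROR_HANDLING = "pass"  # or "ignore", or "err"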

Contributor Author

> I don’t imagine caching caches exceptions.

I was wrong.

I tested all 6 scenarios (default policy vs. custom policy, each with pass/ignore/err). The pass/ignore/err values make no difference: with the default policy responses are cached, and with the custom policy they are not. I am not sure the current docs need any clarification, given that the custom policy is relevant regardless of those settings.
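In table form:

Cache policy     pass         ignore       err
default          cached       cached       cached
custom           not cached   not cached   not cached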

Test details

I cloned https://github.com/zytedata/zyte-spider-templates-project and added 2 files:

# zyte_spider_templates_project/spiders/test.py
from scrapy import Request, Spider


class TestSpider(Spider):
    name = "test"

    custom_settings = {
        "HTTPCACHE_ENABLED": True,
    }

    def start_requests(self):
        yield Request(
            "https://toscrape.com",
            meta={
                "zyte_api_automap": {
                    "browserHtml": True,
                    "actions": [
                        {
                            "action": "waitForRequest",
                            "urlPattern": "unexising/pattern",
                            "timeout": 0,
                        },
                    ],
                },
            },
        )

    def parse(self, response):
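        # Logs the "actions" list of the raw Zyte API response; responses
        # served from the HTTP cache have no raw_api_response, so this
        # logs None for cache hits.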
        response_actions = getattr(response, "raw_api_response", {}).get("actions", None)
        self.logger.debug(response_actions)

# zyte_spider_templates_project/extensions.py
from scrapy import Request
from scrapy.extensions.httpcache import DummyPolicy
from scrapy.http import Response
from scrapy_zyte_api.responses import ZyteAPIResponse, ZyteAPITextResponse


class ZyteAPIFailedActionsPolicy(DummyPolicy):

    def should_cache_response(self, response: Response, request: Request):
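        # Do not cache Zyte API responses that report at least one failed
        # action (an "error" key in any entry of the "actions" list).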
        if isinstance(response, (ZyteAPIResponse, ZyteAPITextResponse)) and any(
            "error" in action for action in response.raw_api_response.get("actions", [])
        ):
            return False
        return super().should_cache_response(response, request)

And then tried all 6 scenarios.

Default cache policy, default action error handling ("pass"), responses are cached:

$ scrapy crawl test |& grep "\[test\] DEBUG"
2024-12-23 11:39:08 [test] DEBUG: [{'action': 'waitForRequest', 'elapsedTime': 0.102, 'status': 'returned', 'error': 'Timed out after waiting 100ms'}]
$ scrapy crawl test |& grep "\[test\] DEBUG"
2024-12-23 11:39:24 [test] DEBUG: None
$ rm -rf .scrapy/httpcache/

Default cache policy, "ignore" action error handling, responses are cached:

$ scrapy crawl test -s 'ZYTE_API_ACTION_ERROR_HANDLING=ignore' |& grep "\[test\] DEBUG"
2024-12-23 11:43:34 [test] DEBUG: [{'action': 'waitForRequest', 'elapsedTime': 0.101, 'status': 'returned', 'error': 'Timed out after waiting 100ms'}]
$ scrapy crawl test -s 'ZYTE_API_ACTION_ERROR_HANDLING=ignore' |& grep "\[test\] DEBUG"
2024-12-23 11:43:42 [test] DEBUG: None
$ rm -rf .scrapy/httpcache/

Default cache policy, "err" action error handling, responses are cached:

$ scrapy crawl test -s 'ZYTE_API_ACTION_ERROR_HANDLING=err' |& grep "\[test\] DEBUG"
2024-12-23 11:44:41 [test] DEBUG: [{'action': 'waitForRequest', 'elapsedTime': 0.101, 'status': 'returned', 'error': 'Timed out after waiting 100ms'}]
$ scrapy crawl test -s 'ZYTE_API_ACTION_ERROR_HANDLING=err' |& grep "\[test\] DEBUG"
2024-12-23 11:44:59 [test] DEBUG: None
$ rm -rf .scrapy/httpcache/

Custom cache policy, default action error handling ("pass"), responses are not cached:

$ scrapy crawl test -s 'HTTPCACHE_POLICY=zyte_spider_templates_project.extensions.ZyteAPIFailedActionsPolicy' |& grep "\[test\] DEBUG"
2024-12-23 11:50:26 [test] DEBUG: [{'action': 'waitForRequest', 'elapsedTime': 0.102, 'status': 'returned', 'error': 'Timed out after waiting 100ms'}]
$ scrapy crawl test -s 'HTTPCACHE_POLICY=zyte_spider_templates_project.extensions.ZyteAPIFailedActionsPolicy' |& grep "\[test\] DEBUG"
2024-12-23 11:50:39 [test] DEBUG: [{'action': 'waitForRequest', 'elapsedTime': 0.101, 'status': 'returned', 'error': 'Timed out after waiting 100ms'}]
$ rm -rf .scrapy/httpcache/

Custom cache policy, "ignore" action error handling, responses are not cached:

$ scrapy crawl test -s 'ZYTE_API_ACTION_ERROR_HANDLING=ignore' -s 'HTTPCACHE_POLICY=zyte_spider_templates_project.extensions.ZyteAPIFailedActionsPolicy' |& grep "\[test\] DEBUG"
2024-12-23 11:51:34 [test] DEBUG: [{'action': 'waitForRequest', 'elapsedTime': 0.102, 'status': 'returned', 'error': 'Timed out after waiting 100ms'}]
$ scrapy crawl test -s 'ZYTE_API_ACTION_ERROR_HANDLING=ignore' -s 'HTTPCACHE_POLICY=zyte_spider_templates_project.extensions.ZyteAPIFailedActionsPolicy' |& grep "\[test\] DEBUG"
2024-12-23 11:51:46 [test] DEBUG: [{'action': 'waitForRequest', 'elapsedTime': 0.102, 'status': 'returned', 'error': 'Timed out after waiting 100ms'}]
$ rm -rf .scrapy/httpcache/

Custom cache policy, "err" action error handling, responses are not cached:

$ scrapy crawl test -s 'ZYTE_API_ACTION_ERROR_HANDLING=err' -s 'HTTPCACHE_POLICY=zyte_spider_templates_project.extensions.ZyteAPIFailedActionsPolicy' |& grep "\[test\] DEBUG"
2024-12-23 11:52:14 [test] DEBUG: [{'action': 'waitForRequest', 'elapsedTime': 0.101, 'status': 'returned', 'error': 'Timed out after waiting 100ms'}]
$ scrapy crawl test -s 'ZYTE_API_ACTION_ERROR_HANDLING=err' -s 'HTTPCACHE_POLICY=zyte_spider_templates_project.extensions.ZyteAPIFailedActionsPolicy' |& grep "\[test\] DEBUG"
2024-12-23 11:52:28 [test] DEBUG: [{'action': 'waitForRequest', 'elapsedTime': 0.101, 'status': 'returned', 'error': 'Timed out after waiting 100ms'}]
$ rm -rf .scrapy/httpcache/

Contributor Author

(I’m not sure I fully understood your question, though; I might be missing some angle.)
