support AnyResponse #161
Conversation
Codecov Report
Additional details and impacted files

Coverage Diff (main vs. #161):
  Coverage   98.43% → 98.57%   (+0.14%)
  Files      11     → 11
  Lines      892    → 910      (+18)
  Hits       878    → 897      (+19)
  Misses     14     → 13       (-1)
Added test cases and identified new scenarios that aren't handled properly, so the tests on providers would still break until the implementation is updated.
Force-pushed from bcba999 to 5f1d104.
Force-pushed from fa80f6a to b69580e.
Force-pushed from b69580e to 3cb2290.
scrapy_zyte_api/providers.py (outdated)
    elif options_name in zyte_api_meta:
        extract_from = zyte_api_meta[options_name].get("extractFrom")
    elif item_type in to_provide_stripped and http_response_needed:
        zyte_api_meta[options_name] = {"extractFrom": "httpResponseBody"}
To check my understanding: the logic here is that if browserHtml is not requested, but a data type is requested, and there is AnyResponse, then we switch the extractFrom from default to httpResponseBody for this data type?
It'd be, if:
- there's no explicit extraction source requested for the given item_type (e.g. Product, ProductNavigation, etc.), and
- AnyResponse is one of the requested dependencies, and
- neither BrowserResponse nor BrowserHtml is requested, and
- HttpResponse has not been created by previous providers (i.e. HttpResponseProvider),
then we use httpResponseBody as the extraction source (see the sketch below).
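A minimal sketch of that decision logic, following the variable names in the diff above; the helper function itself is hypothetical and not the actual provider code:

    # Sketch only: choose the extraction source for one item type.
    # options_name, item_type, zyte_api_meta, to_provide_stripped and
    # http_response_needed follow the diff above.
    def resolve_extract_from(options_name, item_type, zyte_api_meta,
                             to_provide_stripped, http_response_needed):
        if options_name in zyte_api_meta:
            # An explicit extraction source was requested for this item type; keep it.
            return zyte_api_meta[options_name].get("extractFrom")
        if item_type in to_provide_stripped and http_response_needed:
            # AnyResponse needs a plain HTTP response and no browser rendering
            # was requested, so extract from the HTTP response body.
            zyte_api_meta[options_name] = {"extractFrom": "httpResponseBody"}
            return "httpResponseBody"
        # Otherwise fall back to the Zyte API default extraction source.
        return None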
scrapy_zyte_api/providers.py (outdated)
    param_parser = _ParamParser(crawler)
    param_parser._transparent_mode = True
    http_request_params = param_parser.parse(request)
    del http_request_params["url"]
    zyte_api_meta.update(http_request_params)
TBH, this logic is the least clear to me. Could you please add a comment to explain a bit how it works?
- In the docs we write that the scrapy-poet integration ignores default parameters. But it seems here they are applied, and ZYTE_API_PROVIDER_PARAMS are ignored instead? Or is that not happening for some reason?
- We parse the original request to get the additional keywords to add to zyte_api_meta. I wonder how it works in cases where the original request contains zyte_api meta itself, and whether the behavior differs from how the provider usually works in such cases.
- del http_request_params["url"] is mysterious to me :) Why delete the url? Are there other parameters which need to be deleted?
Is the idea here to handle cookies, headers, etc. in a more consistent way, as compared to just setting httpRequestBody and httpRequestHeaders to True, without invoking ParamParser?
What are the actual differences? What breaks if ParamParser is not used?
In the docs we write that scrapy-poet integration ignores default parameters. But it seems here they are applied, and ZYTE_API_PROVIDER_PARAMS are ignored instead? Or is it not happening because of some reason?
That's a good point. I forgot about this, and so using the ParamParser was a way to make handling the headers consistent across requests. I can't say what the actual differences are in practice. For now, we can go with the simplest approach of setting httpResponseBody and httpResponseHeaders to True. b341976
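A minimal sketch of that simpler approach; the http_response_needed flag is a hypothetical stand-in for the surrounding provider logic, while httpResponseBody and httpResponseHeaders are standard Zyte API request fields:

    # Sketch only: instead of re-parsing the original request with _ParamParser,
    # directly ask Zyte API for the raw response body and headers.
    if http_response_needed:  # hypothetical flag from the surrounding provider logic
        zyte_api_meta["httpResponseBody"] = True
        zyte_api_meta["httpResponseHeaders"] = True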
@BurnzZ simply setting httpResponseBody/Headers looks way easier to understand, as it's similar to how all other parameters are handled.
@Gallaecio what do you think about this? Do you see any edge cases with using ParamParser vs just requesting a httpResponseBody/Headers? Any concerns about using the parameters directly, without ParamParser?
The only case I can think of, and would have been a problem already with the pre-existing code, is the case where a cookie included in the request is necessary to get the right content, or some actions are necessary for extraction to work properly, and the server side cannot (yet) inject those automatically.
Still, it may be better to keep things simple for now, and figure out how we wish to solve these issues when we get to that. Because even if we decide to use ParamParser, things are more complicated: it should only be used if automatic parameter parsing is being used, if (raw) zyte_api is used for the source request then some parameters may also need to be copied from there…
Good point about the necessary cookies. Fortunately, I haven't encountered this yet in my experiments.
+1 on keeping things simple for now.
…i into http-or-browser-response
…i into http-or-browser-response
tests/test_providers.py (outdated)
    results = yield provide({AnyResponse, Product})
    assert len(results) == 2
    assert type(results[0]) == AnyResponse
    assert type(results[1]) == Product
Should there be an assert for type(results[0].response) == HttpResponse? Or should it be BrowserResponse here (i.e. the comment above is outdated)?
I've removed this test since it duplicates the other test cases that follow it. Not to mention it also lacks certain cases. df32f14
    # The issue here is that HttpResponseProvider runs earlier than ScrapyZyteAPI.
    # HttpResponseProvider doesn't know that it should not run since ScrapyZyteAPI
    # could provide HttpResponse in any case.
What is the failure? Sorry, I haven't checked the logs :)
The issue happens if HttpResponse is explicitly declared as a dependency, say in the PO. Since HttpResponseProvider runs much earlier than ZyteApiProvider, it would make a request to ZAPI. When it's ZyteApiProvider's turn to fulfill dependencies, it would make a 2nd request to ZAPI to fulfill the AnyResponse + Product dependencies.
The ideal scenario would be having only a single ZAPI request that would fulfill all three of the HttpResponse, AnyResponse, and Product dependencies.
It could be the case that in another PR, we can combine HttpResponseProvider and ZyteApiProvider together, or perhaps create a decision mechanism to determine which providers should run, resulting in more optimal dependency creation.
I'm not sure how often this would occur in practice though, since if you have AnyResponse, there's not much need to declare an HttpResponse dependency. With that, it should be easy to avoid (a sketch follows below).
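A hypothetical page object sketch of that workaround: depend on AnyResponse rather than HttpResponse, so a single Zyte API request can satisfy everything. Import paths and the AnyResponse attributes used here are assumptions about the web-poet release this PR targets, not taken from this PR:

    # Sketch only: declare AnyResponse instead of HttpResponse in the page object,
    # so HttpResponseProvider has nothing to provide and ZyteApiProvider can
    # fulfill AnyResponse + Product with a single Zyte API request.
    import attrs
    from web_poet import ItemPage, field
    from web_poet.page_inputs import AnyResponse  # assumed import location
    from zyte_common_items import Product

    @attrs.define
    class MyProductPage(ItemPage[Product]):
        response: AnyResponse  # instead of an explicit HttpResponse dependency

        @field
        def url(self) -> str:
            # Assumes AnyResponse exposes the underlying response's URL.
            return str(self.response.url)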
I've added a few comments, but it looks good overall @BurnzZ - great work! +1 to merge after updating to the released scrapy-poet.
Co-authored-by: Mikhail Korobov <[email protected]>
A way to address zytedata/zyte-spider-templates#25.
Related PRs:
TODO:
web-poet
scrapy-poet