
support AnyResponse #161

Merged: 25 commits into main, Feb 8, 2024
Conversation

@BurnzZ (Member) commented on Jan 16, 2024

codecov bot commented on Jan 16, 2024

Codecov Report

Merging #161 (382dced) into main (4adedc8) will increase coverage by 0.14%.
Report is 3 commits behind head on main.
The diff coverage is 100.00%.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #161      +/-   ##
==========================================
+ Coverage   98.43%   98.57%   +0.14%     
==========================================
  Files          11       11              
  Lines         892      910      +18     
==========================================
+ Hits          878      897      +19     
+ Misses         14       13       -1     
Files                          Coverage Δ
scrapy_zyte_api/providers.py   98.31% <100.00%> (+1.26%) ⬆️

... and 1 file with indirect coverage changes

@BurnzZ changed the title from POC: support HttpOrBrowserResponse to POC: support AnyResponse on Jan 17, 2024
@BurnzZ (Member, Author) commented on Jan 17, 2024

Added test cases and identified new scenarios that aren't handled properly; the provider tests will keep failing until the implementation is updated.

@BurnzZ changed the title from POC: support AnyResponse to support AnyResponse on Jan 18, 2024
@BurnzZ marked this pull request as ready for review on January 18, 2024 at 14:24
@BurnzZ requested a review from wRAR on January 18, 2024 at 14:31
elif options_name in zyte_api_meta:
extract_from = zyte_api_meta[options_name].get("extractFrom")
elif item_type in to_provide_stripped and http_response_needed:
zyte_api_meta[options_name] = {"extractFrom": "httpResponseBody"}
A reviewer (Member) commented:

To check my understanding: the logic here is that if browserHtml is not requested, but a data type is requested, and there is AnyResponse, then we switch the extractFrom from default to httpResponseBody for this data type?

@BurnzZ (Member, Author) replied:

We use httpResponseBody as the extraction source if:

  • there's no explicit extraction source requested for the given item_type (e.g. Product, ProductNavigation, etc.), and
  • AnyResponse is one of the requested dependencies, and
  • neither BrowserResponse nor BrowserHtml is requested, and
  • an HttpResponse has not already been created by earlier providers (i.e. HttpResponseProvider).
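The conditions above can be sketched as a small stand-alone function. The class names and the helper itself are illustrative placeholders, not the actual identifiers in scrapy_zyte_api/providers.py:

```python
# Hedged sketch: placeholder classes stand in for the real web-poet /
# scrapy-zyte-api types; only the decision logic mirrors the discussion.
class AnyResponse: ...
class BrowserResponse: ...
class BrowserHtml: ...


def pick_extract_from(to_provide, zyte_api_meta, options_name, http_response_created):
    """Return the extractFrom value to use, or None to keep the server default."""
    # An explicit extraction source in the request meta always wins.
    explicit = zyte_api_meta.get(options_name, {}).get("extractFrom")
    if explicit:
        return explicit
    # Fall back to httpResponseBody only when AnyResponse is requested,
    # no browser-based dependency is requested, and no earlier provider
    # (e.g. HttpResponseProvider) has already produced an HttpResponse.
    if (
        AnyResponse in to_provide
        and BrowserResponse not in to_provide
        and BrowserHtml not in to_provide
        and not http_response_created
    ):
        return "httpResponseBody"
    return None
```

For example, `pick_extract_from({AnyResponse}, {}, "productOptions", False)` falls back to `"httpResponseBody"`, while adding BrowserHtml to the dependency set keeps the server default.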

Comment on lines 149 to 153
param_parser = _ParamParser(crawler)
param_parser._transparent_mode = True
http_request_params = param_parser.parse(request)
del http_request_params["url"]
zyte_api_meta.update(http_request_params)
A reviewer (Member) commented:
TBH, this logic is the least clear to me. Could you please add a comment, to explain a bit how it works?

  1. In the docs we write that the scrapy-poet integration ignores default parameters. But it seems they are applied here, and ZYTE_API_PROVIDER_PARAMS is ignored instead? Or is that not happening for some reason?
  2. We parse the original request to get the additional keywords to add to zyte_api_meta. I wonder how this works in cases where the original request contains zyte_api_meta itself, and whether the behavior differs from how the provider usually works in such cases.
  3. del http_request_params["url"] is mysterious to me :) Why delete the url? Are there other parameters that need to be deleted?

A reviewer (Member) commented:

Is the idea here to handle cookies, headers, etc. in a more consistent way, as compared to just setting httpRequestBody and httpRequestHeaders to True, without invoking ParamParser?

What are the actual differences? What breaks if ParamParser is not used?

@BurnzZ (Member, Author) replied:

> In the docs we write that scrapy-poet integration ignores default parameters. But it seems here they are applied, and ZYTE_API_PROVIDER_PARAMS are ignored instead? Or is it not happening because of some reason?

That's a good point. I forgot about this, and using the ParamParser was a way to make handling the headers consistent across requests. I can't speak to the actual differences in practice. For now, we can go with the simplest approach of setting httpResponseBody and httpResponseHeaders to True. b341976
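A minimal sketch of that simpler approach, assuming a hypothetical helper (the real provider builds its parameters inline; the helper name and shape here are illustrative):

```python
def build_any_response_params(url, zyte_api_meta=None):
    """Build Zyte API request params that fetch a raw HTTP response.

    Hypothetical helper: instead of deriving headers, cookies, etc. from
    the Scrapy request via _ParamParser, just ask Zyte API for the
    response body and headers directly.
    """
    params = dict(zyte_api_meta or {})
    params["url"] = url
    # The "simplest approach" from the thread: request body and headers
    # as plain boolean flags, like any other Zyte API output field.
    params["httpResponseBody"] = True
    params["httpResponseHeaders"] = True
    return params
```

This keeps the provider's behavior consistent with how all other parameters are handled, at the cost of not replicating request-specific details such as cookies.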

A reviewer (Member) commented:

@BurnzZ simply setting httpResponseBody/Headers looks way easier to understand, as it's similar to how all other parameters are handled.

@Gallaecio what do you think about this? Do you see any edge cases with using ParamParser vs just requesting a httpResponseBody/Headers? Any concerns about using the parameters directly, without ParamParser?

@Gallaecio (Contributor) replied:

The only case I can think of, and it would have been a problem with the pre-existing code already, is one where a cookie included in the request is necessary to get the right content, or where some actions are necessary for extraction to work properly, and the server side cannot (yet) inject those automatically.

Still, it may be better to keep things simple for now and figure out how we want to solve these issues when we get to them. Even if we decide to use ParamParser, things are more complicated: it should only be used if automatic parameter parsing is being used, and if (raw) zyte_api is used for the source request, some parameters may also need to be copied from there…

@BurnzZ (Member, Author) replied:

Good point about the necessary cookies. Fortunately, I haven't encountered this yet in my experiments.

+1 on keeping things simple for now.

results = yield provide({AnyResponse, Product})
assert len(results) == 2
assert type(results[0]) == AnyResponse
assert type(results[1]) == Product
A reviewer (Member) commented:

Should there be an assert for type(results[0].response) == HttpResponse? Or should it be BrowserResponse here (i.e. the comment above is outdated)?

@BurnzZ (Member, Author) replied:

I've removed this test since it duplicates the other test cases that follow it, and it also lacks certain cases. df32f14

Comment on lines +695 to +697
# The issue here is that HttpResponseProvider runs earlier than ScrapyZyteAPI.
# HttpResponseProvider doesn't know that it should not run, since ScrapyZyteAPI
# could provide HttpResponse in any case.
A reviewer (Member) commented:

What is the failure? Sorry, I haven't checked the logs :)

@BurnzZ (Member, Author) replied:

The issue happens if HttpResponse is explicitly declared as a dependency, say in the page object.

Since HttpResponseProvider runs much earlier than ZyteApiProvider, it makes a request to Zyte API. When it's ZyteApiProvider's turn to fulfill dependencies, it makes a second Zyte API request to fulfill the AnyResponse + Product dependencies.

The ideal scenario would be a single Zyte API request that fulfills all three of the HttpResponse, AnyResponse, and Product dependencies.

In another PR, we could combine HttpResponseProvider and ZyteApiProvider, or perhaps create a decision mechanism to determine which providers should run, resulting in more optimal dependency creation.

I'm not sure how often this would occur in practice though, since if you have AnyResponse, there's not much need to declare an HttpResponse dependency. With that, it should be easy to avoid.
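The duplication described above can be illustrated with a toy model of the two providers. The classes and the counting function are plain placeholders, not the real scrapy-poet machinery:

```python
# Placeholder dependency types; the real ones come from web-poet and
# zyte-common-items.
class HttpResponse: ...
class AnyResponse: ...
class Product: ...


def count_zyte_api_requests(deps):
    """Toy model: count the Zyte API requests the two providers would make."""
    requests = 0
    # HttpResponseProvider runs first and fetches HttpResponse on its own,
    # unaware that ZyteApiProvider could supply it anyway.
    if HttpResponse in deps:
        requests += 1
    # ZyteApiProvider then makes its own request for the remaining deps.
    if deps & {AnyResponse, Product}:
        requests += 1
    return requests
```

Declaring `{HttpResponse, AnyResponse, Product}` yields two requests under this model, while `{AnyResponse, Product}` needs only one, which is why dropping the explicit HttpResponse dependency sidesteps the issue.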

@kmike (Member) left a review:

I've added a few comments, but it looks good overall @BurnzZ - great work! +1 to merge after updating to the released scrapy-poet.

@BurnzZ merged commit 87de258 into main on Feb 8, 2024
18 checks passed
@wRAR deleted the http-or-browser-response branch on April 24, 2024 at 06:12