
support AnyResponse #161

Merged: 25 commits into main, Feb 8, 2024
Conversation

@BurnzZ (Member) commented on Jan 16, 2024

codecov bot commented on Jan 16, 2024

Codecov Report

Merging #161 (382dced) into main (4adedc8) will increase coverage by 0.14%.
Report is 3 commits behind head on main.
The diff coverage is 100.00%.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #161      +/-   ##
==========================================
+ Coverage   98.43%   98.57%   +0.14%     
==========================================
  Files          11       11              
  Lines         892      910      +18     
==========================================
+ Hits          878      897      +19     
+ Misses         14       13       -1     
Files                          Coverage Δ
scrapy_zyte_api/providers.py   98.31% <100.00%> (+1.26%) ⬆️

... and 1 file with indirect coverage changes

@BurnzZ changed the title from POC: support HttpOrBrowserResponse to POC: support AnyResponse on Jan 17, 2024
@BurnzZ (Member, Author) commented on Jan 17, 2024

Added test cases and identified new scenarios that aren't handled properly; the provider tests will keep failing until the implementation is updated.

@BurnzZ changed the title from POC: support AnyResponse to support AnyResponse on Jan 18, 2024
@BurnzZ marked this pull request as ready for review on January 18, 2024 at 14:24
@BurnzZ requested a review from wRAR on January 18, 2024 at 14:31
elif options_name in zyte_api_meta:
extract_from = zyte_api_meta[options_name].get("extractFrom")
elif item_type in to_provide_stripped and http_response_needed:
zyte_api_meta[options_name] = {"extractFrom": "httpResponseBody"}
A reviewer (Member) commented:

To check my understanding: the logic here is that if browserHtml is not requested, but a data type is requested, and there is AnyResponse, then we switch the extractFrom from default to httpResponseBody for this data type?

@BurnzZ (Member, Author) replied:

We use httpResponseBody as the extraction source if:

  • there's no explicit extraction source requested for the given item_type (e.g. Product, ProductNavigation, etc.), and
  • AnyResponse is one of the requested dependencies, and
  • neither BrowserResponse nor BrowserHtml is requested, and
  • an HttpResponse has not already been created by earlier providers (i.e. HttpResponseProvider).
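The conditions above can be sketched as a small stand-alone function. The class names and the helper itself are illustrative placeholders, not the actual identifiers in scrapy_zyte_api/providers.py:

```python
# Hedged sketch: placeholder classes stand in for the real web-poet /
# scrapy-zyte-api types; only the decision logic mirrors the discussion.
class AnyResponse: ...
class BrowserResponse: ...
class BrowserHtml: ...


def pick_extract_from(to_provide, zyte_api_meta, options_name, http_response_created):
    """Return the extractFrom value to use, or None to keep the server default."""
    # An explicit extraction source in the request meta always wins.
    explicit = zyte_api_meta.get(options_name, {}).get("extractFrom")
    if explicit:
        return explicit
    # Fall back to httpResponseBody only when AnyResponse is requested,
    # no browser-based dependency is requested, and no earlier provider
    # (e.g. HttpResponseProvider) has already produced an HttpResponse.
    if (
        AnyResponse in to_provide
        and BrowserResponse not in to_provide
        and BrowserHtml not in to_provide
        and not http_response_created
    ):
        return "httpResponseBody"
    return None
```

For example, `pick_extract_from({AnyResponse}, {}, "productOptions", False)` falls back to `"httpResponseBody"`, while adding BrowserHtml to the dependency set keeps the server default.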

Comment on lines 149 to 153
param_parser = _ParamParser(crawler)
param_parser._transparent_mode = True
http_request_params = param_parser.parse(request)
del http_request_params["url"]
zyte_api_meta.update(http_request_params)
A reviewer (Member) commented:
TBH, this logic is the least clear to me. Could you please add a comment, to explain a bit how it works?

  1. In the docs we write that the scrapy-poet integration ignores default parameters. But it seems they are applied here, and ZYTE_API_PROVIDER_PARAMS is ignored instead? Or is that not happening for some reason?
  2. We parse the original request to get the additional keywords to add to zyte_api_meta. I wonder how this works in cases where the original request contains zyte_api_meta itself, and whether the behavior differs from how the provider usually works in such cases.
  3. del http_request_params["url"] is mysterious to me :) Why delete the url? Are there other parameters that need to be deleted?

A reviewer (Member) commented:

Is the idea here to handle cookies, headers, etc. in a more consistent way, as compared to just setting httpRequestBody and httpRequestHeaders to True, without invoking ParamParser?

What are the actual differences? What breaks if ParamParser is not used?

@BurnzZ (Member, Author) replied:

> In the docs we write that scrapy-poet integration ignores default parameters. But it seems here they are applied, and ZYTE_API_PROVIDER_PARAMS are ignored instead? Or is it not happening because of some reason?

That's a good point. I forgot about this, and using the ParamParser was a way to make handling the headers consistent across requests. I can't speak to the actual differences in practice. For now, we can go with the simplest approach of setting httpResponseBody and httpResponseHeaders to True. b341976
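A minimal sketch of that simpler approach, assuming a hypothetical helper (the real provider builds its parameters inline; the helper name and shape here are illustrative):

```python
def build_any_response_params(url, zyte_api_meta=None):
    """Build Zyte API request params that fetch a raw HTTP response.

    Hypothetical helper: instead of deriving headers, cookies, etc. from
    the Scrapy request via _ParamParser, just ask Zyte API for the
    response body and headers directly.
    """
    params = dict(zyte_api_meta or {})
    params["url"] = url
    # The "simplest approach" from the thread: request body and headers
    # as plain boolean flags, like any other Zyte API output field.
    params["httpResponseBody"] = True
    params["httpResponseHeaders"] = True
    return params
```

This keeps the provider's behavior consistent with how all other parameters are handled, at the cost of not replicating request-specific details such as cookies.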

A reviewer (Member) commented:

@BurnzZ simply setting httpResponseBody/Headers looks way easier to understand, as it's similar to how all other parameters are handled.

@Gallaecio what do you think about this? Do you see any edge cases with using ParamParser vs just requesting a httpResponseBody/Headers? Any concerns about using the parameters directly, without ParamParser?

@Gallaecio (Contributor) replied:

The only case I can think of, and it would have been a problem with the pre-existing code already, is one where a cookie included in the request is necessary to get the right content, or where some actions are necessary for extraction to work properly, and the server side cannot (yet) inject those automatically.

Still, it may be better to keep things simple for now and figure out how we want to solve these issues when we get to them. Even if we decide to use ParamParser, things are more complicated: it should only be used if automatic parameter parsing is being used, and if (raw) zyte_api is used for the source request, some parameters may also need to be copied from there…

@BurnzZ (Member, Author) replied:

Good point about the necessary cookies. Fortunately, I haven't encountered this yet in my experiments.

+1 on keeping things simple for now.

results = yield provide({AnyResponse, Product})
assert len(results) == 2
assert type(results[0]) == AnyResponse
assert type(results[1]) == Product
A reviewer (Member) commented:

Should there be an assert for type(results[0].response) == HttpResponse? Or should it be BrowserResponse here (i.e. the comment above is outdated)?

@BurnzZ (Member, Author) replied:

I've removed this test since it duplicates the other test cases that follow it, and it also lacks certain cases. df32f14

Comment on lines +695 to +697
# The issue here is that HttpResponseProvider runs earlier than ScrapyZyteAPI.
# HttpResponseProvider doesn't know that it should not run, since ScrapyZyteAPI
# could provide HttpResponse in any case.
A reviewer (Member) commented:

What is the failure? Sorry, I haven't checked the logs :)

@BurnzZ (Member, Author) replied:

The issue happens if HttpResponse is explicitly declared as a dependency, say in the page object.

Since HttpResponseProvider runs much earlier than ZyteApiProvider, it makes a request to Zyte API. When it's ZyteApiProvider's turn to fulfill dependencies, it makes a second Zyte API request to fulfill the AnyResponse + Product dependencies.

The ideal scenario would be a single Zyte API request that fulfills all three of the HttpResponse, AnyResponse, and Product dependencies.

In another PR, we could combine HttpResponseProvider and ZyteApiProvider, or perhaps create a decision mechanism to determine which providers should run, resulting in more optimal dependency creation.

I'm not sure how often this would occur in practice though, since if you have AnyResponse, there's not much need to declare an HttpResponse dependency. With that, it should be easy to avoid.
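The duplication described above can be illustrated with a toy model of the two providers. The classes and the counting function are plain placeholders, not the real scrapy-poet machinery:

```python
# Placeholder dependency types; the real ones come from web-poet and
# zyte-common-items.
class HttpResponse: ...
class AnyResponse: ...
class Product: ...


def count_zyte_api_requests(deps):
    """Toy model: count the Zyte API requests the two providers would make."""
    requests = 0
    # HttpResponseProvider runs first and fetches HttpResponse on its own,
    # unaware that ZyteApiProvider could supply it anyway.
    if HttpResponse in deps:
        requests += 1
    # ZyteApiProvider then makes its own request for the remaining deps.
    if deps & {AnyResponse, Product}:
        requests += 1
    return requests
```

Declaring `{HttpResponse, AnyResponse, Product}` yields two requests under this model, while `{AnyResponse, Product}` needs only one, which is why dropping the explicit HttpResponse dependency sidesteps the issue.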

@kmike (Member) left a review:

I've added a few comments, but it looks good overall @BurnzZ - great work! +1 to merge after updating to the released scrapy-poet.

@BurnzZ merged commit 87de258 into main on Feb 8, 2024
18 checks passed
@wRAR deleted the http-or-browser-response branch on April 24, 2024 at 06:12