Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Discussion: Handle data type of response data type in HeuristicsProductNavigationPage (BrowserResponse vs HttpResponse) #25

Closed
BurnzZ opened this issue Jan 10, 2024 · 8 comments · Fixed by #28
Labels
question Further information is requested

Comments

@BurnzZ
Copy link
Contributor

BurnzZ commented Jan 10, 2024

Problem

Currently, HeuristicsProductNavigationPage uses HttpResponse as one of the dependencies (code ref). This causes the spider to send 2 separate requests to ZAPI if the extract_from parameter is set to "browserHtml":

  • {"productNavigationOptions": {"extractFrom": "browserHtml"}, "productNavigation": true, "url": "https://books.toscrape.com/"}
  • {"httpResponseBody": true, "url": "https://books.toscrape.com/"}

This is also true when it's left to the default value which is extract_from=None.

Note that this issue is not related to scrapy-plugins/scrapy-zyte-api#91 since scrapinghub/scrapy-poet#175 has addressed it.

The problem that we face is that the dependency of HeuristicsProductNavigationPage should be dynamic depending on the extract_from input value:

  • response: BrowserResponse, if extract_from=browserHtml
  • response: HttpResponse, if extract_from=httpResponseBody
@BurnzZ BurnzZ added the question Further information is requested label Jan 10, 2024
@kmike
Copy link
Contributor

kmike commented Jan 10, 2024

I think scrapinghub/andi#30 should allow to express this kind of dependencies in page objects.

@BurnzZ
Copy link
Contributor Author

BurnzZ commented Jan 10, 2024

Yes, it would 👍

Though some new interface in page objects should be made to allow declaration of dynamic dependencies. For example, depending on the extract_from value, it would logically switch between BrowserResponse or HttpResponse.

@kmike
Copy link
Contributor

kmike commented Jan 10, 2024

For example, depending on the extract_from value, it would logically switch between BrowserResponse or HttpResponse.

Yeah, this case is more tricky than the original use case. I was thinking that it still can be handled implicitly: page object defines response: Reuse(BrowserResponse) | HttpResponse dependency (or whatever syntax is). Then, scrapy-poet +scrapy-zyte-api handle it - not just for the case BrowserResponse dependency is defined somewhere explicitly, but also for the case ExtractFrom dependency is used.

@PyExplorer
Copy link
Contributor

@BurnzZ I've checked this approach with AnyResponse for Articles and it's actually resolves handling the case with extract_from (browserHtml or httpResponseBody) argument for spider.
But the issue with doubling the requests is still here.

@PyExplorer
Copy link
Contributor

After double checking I see that this is working, thanks!
In my case, previously, doubled requests were only for HttpResponse, and for BrowserResponse everything was in one response.
Now, I'm just curious if we can handle HttpResponse the same way as BrowserResponse here https://github.com/scrapy-plugins/scrapy-zyte-api/pull/161/files#diff-67cea92ffa3989fe30ffd7f4bbe868f953810e0f4fdf2ddfe7f18bc54fba505eR81 - at this moment PO with HttpResponse is not even handled by ZyteApiProvider

@BurnzZ
Copy link
Contributor Author

BurnzZ commented Jan 17, 2024

Both browserHtml and httpResponseBody should be handled @PyExplorer . For the latter, was the extract_from spider parameter input updated to httpResponseBody?

@PyExplorer
Copy link
Contributor

Yes, extract_from is httpResponseBody, but I mean that here https://github.com/scrapy-plugins/scrapy-zyte-api/pull/161/files#diff-67cea92ffa3989fe30ffd7f4bbe868f953810e0f4fdf2ddfe7f18bc54fba505eR81 in to_provide we don't have HttpResponse (if we asked response: HttpResponse in PO) but we have BrowserResponse (if we ask response: BrowserResponse) in PO

@BurnzZ
Copy link
Contributor Author

BurnzZ commented Jan 19, 2024

in to_provide we don't have HttpResponse (if we asked response: HttpResponse in PO)

The PR in scrapy-plugins/scrapy-zyte-api#161 has been updated so that providers can also use the instances created by earlier providers. That is, HttpResponse is made by HttpResponseProvider. Later on, AnyResponse can be created by ZyteApiProvider as it can use the HttpResponse that was made earlier.

The new PR in scrapinghub/scrapy-poet#184 also made this possible.

@kmike kmike closed this as completed in #28 Feb 9, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
question Further information is requested
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants