Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

support AnyResponse #28

Merged
merged 13 commits into from
Feb 9, 2024
Merged

support AnyResponse #28

merged 13 commits into from
Feb 9, 2024

Conversation

BurnzZ
Copy link
Member

@BurnzZ BurnzZ commented Jan 16, 2024

@BurnzZ BurnzZ changed the title use HttpOrBrowserRespose POC: use HttpOrBrowserRespose Jan 16, 2024
@BurnzZ BurnzZ force-pushed the fix-dupe-requests branch 2 times, most recently from 1279996 to 300869d Compare January 16, 2024 10:47
"Whether to perform extraction using a browser request "
"(browserHtml) or an HTTP request (httpResponseBody)."
),
default=ExtractFrom.browserHtml,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please elaborate, why is this change needed?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since Zyte API's Auto-Configurator isn't released yet, we can't retrieve the extraction source that was used on a given page. This forces the a default page type to be included in the Zyte API request.

We can revert back to None when the Auto-Configurator is released soon as we'd have a way to get either browserHtml or httpResponseBody that was used.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this change fixing something, or is it done just in case?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's done so that ZyteApiProvider can properly request the proper extraction source when building the Zyte API Request: https://github.com/scrapy-plugins/scrapy-zyte-api/pull/161/files#diff-67cea92ffa3989fe30ffd7f4bbe868f953810e0f4fdf2ddfe7f18bc54fba505eR125-R135

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How would it work if a user changes it back to None?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, why would there be a regression, from scrapy-zyte-api point of view?

Sorry, I was referring to zyte-spider-templates.

I assume that the code which uses AnyResponse should work fine both with HttpResponse and BrowserResponse, that's why it's declaring AnyResponse dependency. So, when some code declares that it needs AnyResponse, and this code gets HttpResponse, we don't consider it as a regression - it's the expected behavior.

This is currently the case with scrapy-zyte-api and follows the expected behavior. 👍

If there are no other parameters, or no other parameters which require rendering, we can default to HttpResponse - it shouldn't affect anything else, just make AnyResponse use a cheaper method.

I guess my confusion comes from the fact that the ecommerce spider, uses HeuristicsProductNavigationPage which have the following dependencies:

  • ProductNavigation
  • AnyResponse
  • PageParams

In this case, there are no other parameters which require explicit rendering and so scrapy-zyte-api requests for HttpResponse.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this case, there are no other parameters which require explicit rendering and so scrapy-zyte-api requests for HttpResponse.

But currently ProductNavigation does require rendering, right?

Copy link
Member Author

@BurnzZ BurnzZ Feb 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently, if ProductNavigation is requested alongside AnyResponse, then it would request the following in Zyte API:

{
    "url": url,
    "productNavigation": True,
    "productNavigationOptions": {"extractFrom": "httpResponseBody"},
    "httpResponseBody": True,
    "httpResponseHeaders": True,
}

Okay hmmm, I guess I already understand what you mean from the previous comment: "1. For now, we use extractFrom=browserHtml when user doesn't state explicitly the extraction method, and AnyResponse is needed by a strategy."

So in this case, having the ProductNavigation and AnyResponse parameters should have:

{
    "url": url,
    "productNavigation": True,
    "productNavigationOptions": {"extractFrom": "browserHtml"},
    "browserHtml": True,
}

But if it's only AnyResponse, then it'd be httpResponseBody and httpResponseHeaders that are requested.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or maybe even

{
    "url": url,
    "producNavigation": True,
    "browserHtml": True,
}

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for all the clarifications @kmike !

Updated scrapy-zyte-api scrapy-plugins/scrapy-zyte-api@a51f961 and zyte-spider-templates 53b9dfd according to this.

@wRAR wRAR changed the title POC: use HttpOrBrowserRespose POC: use HttpOrBrowserResponse Jan 23, 2024
@codecov-commenter
Copy link

codecov-commenter commented Jan 31, 2024

Codecov Report

Merging #28 (ed384b4) into main (7df8e6c) will increase coverage by 0.00%.
The diff coverage is 100.00%.

❗ Current head ed384b4 differs from pull request most recent head 3ba3a98. Consider uploading reports for the commit 3ba3a98 to get more accurate results

❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files
@@           Coverage Diff           @@
##             main      #28   +/-   ##
=======================================
  Coverage   98.61%   98.61%           
=======================================
  Files          12       12           
  Lines         504      506    +2     
=======================================
+ Hits          497      499    +2     
  Misses          7        7           
Files Coverage Δ
...r_templates/pages/product_navigation_heuristics.py 100.00% <100.00%> (ø)
zyte_spider_templates/spiders/base.py 100.00% <100.00%> (ø)
zyte_spider_templates/spiders/ecommerce.py 98.79% <ø> (-0.10%) ⬇️

@BurnzZ BurnzZ changed the title POC: use HttpOrBrowserResponse POC: use AnyResponse Jan 31, 2024
@BurnzZ BurnzZ changed the title POC: use AnyResponse support AnyResponse Feb 2, 2024
@BurnzZ BurnzZ marked this pull request as ready for review February 2, 2024 11:00
@BurnzZ
Copy link
Member Author

BurnzZ commented Feb 7, 2024

Failing docs CI seems to be due to a cache issue in the rst files. I cannot reproduce it locally.

@Gallaecio
Copy link
Contributor

I cannot reproduce it either, but there should be no caching in CI, so I am quite puzzled here.

@BurnzZ
Copy link
Member Author

BurnzZ commented Feb 7, 2024

One of the great mysteries of GH 😄 I presume it would resolve itself on master.

@Gallaecio
Copy link
Contributor

Solved the mystery! Merge the main branch. (CI does not actually run in this branch, but in a virtual merge with the main one)

@BurnzZ
Copy link
Member Author

BurnzZ commented Feb 7, 2024

No luck there. I'm tempted to just create another PR once new versions of scrapy-poet and scrapy-zyte-api have been released. It'd be cleaner all together.

@Gallaecio
Copy link
Contributor

Gallaecio commented Feb 7, 2024

I can actually reproduce the issue now with 2a4af89.

@BurnzZ
Copy link
Member Author

BurnzZ commented Feb 7, 2024

Aha, I see the issue now. Thanks for the help @Gallaecio ! 🙌

@kmike kmike merged commit 8718a5f into main Feb 9, 2024
10 checks passed
@wRAR wRAR deleted the fix-dupe-requests branch February 9, 2024 09:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
6 participants