ZyteApiProvider could make an unneeded API request #91
Findings so far:
Yeah, the problem AFAIK is that `ItemProvider` calls `build_instances` itself. scrapinghub/scrapy-poet#151 is actually about a third request made in this or a similar use case.
We also thought the solution might involve the caching feature in `ItemProvider`, but we didn't investigate further.
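To illustrate that caching idea (a hypothetical sketch with made-up names, not the actual scrapy-poet `ItemProvider` API): memoizing results per request would let a repeated dependency-building call reuse the first API response instead of sending a second one.

```python
# Hypothetical sketch of per-request caching in a provider: memoize results
# keyed by (url, requested extraction keys) so a second build_instances-style
# call for the same request reuses the first API response.
class CachingProviderSketch:
    def __init__(self, send_request):
        self._send_request = send_request  # callable that does the real API call
        self._cache = {}

    def __call__(self, url, keys):
        cache_key = (url, frozenset(keys))
        if cache_key not in self._cache:
            self._cache[cache_key] = self._send_request(url, keys)
        return self._cache[cache_key]


# Usage: the second identical call is served from the cache,
# so only one "real" request is recorded.
calls = []
provider = CachingProviderSketch(lambda url, keys: calls.append(url) or {"url": url})
provider("https://books.toscrape.com", ["productNavigation"])
provider("https://books.toscrape.com", ["productNavigation"])
```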
New finding: Switching …
I looked into this further and it still occurs without any Page Objects involved. The sent Zyte API requests were determined by setting … Given the following spider:

```python
class BooksSpider(scrapy.Spider):
    name = "books"

    def start_requests(self):
        yield scrapy.Request(
            url="https://books.toscrape.com",
            callback=self.parse_nav,
            meta={"zyte_api": {"browserHtml": True}},
        )
```

**Case 1**

✅ The following callback setup is correct, since it results in only 1 request:

```python
# {"productNavigation": true, "url": "https://books.toscrape.com"}
def parse_nav(self, response: DummyResponse, navigation: ProductNavigation):
    ...
```

**Case 2**

❌ However, the following results in 2 separate requests:

```python
# {"browserHtml": true, "url": "https://books.toscrape.com"}
# {"productNavigation": true, "url": "https://books.toscrape.com"}
def parse_nav(self, response, navigation: ProductNavigation):
    ...
```

This case should not happen, since …

**Case 3**

However, if we introduce a Page Object to the same spider:

```python
@handle_urls("")
@attrs.define
class ProductNavigationPage(ItemPage[ProductNavigation]):
    response: BrowserResponse
    nav_item: ProductNavigation

    @field
    def url(self):
        return self.nav_item.url

    @field
    def categoryName(self) -> str:
        return f"(modified) {self.nav_item.categoryName}"
```

❌ then the following callback setup makes 3 separate Zyte API requests:

```python
# {"browserHtml": true, "url": "https://books.toscrape.com"}
# {"productNavigation": true, "url": "https://books.toscrape.com"}
# {"browserHtml": true, "url": "https://books.toscrape.com"}
def parse_nav(self, response: DummyResponse, navigation: ProductNavigation):
    ...
```

Note that the same series of 3 separate requests still occurs with:

```python
def parse_nav(self, response, navigation: ProductNavigation):
    ...
```
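A plain-Python model of what Cases 1 and 2 above show (illustrative only, not scrapy code, and the function name is made up): annotating the callback's `response` as `DummyResponse` skips the actual download, leaving only the provider's `productNavigation` request; without the annotation, the `meta`-supplied `browserHtml` request is sent as well. Case 3 then adds a third request through the page object's own `BrowserResponse` dependency.

```python
def observed_api_requests(response_is_dummy: bool) -> list:
    """Zyte API requests observed for a callback that needs ProductNavigation
    (Cases 1 and 2 above; this model of the behaviour is illustrative)."""
    requests = []
    if not response_is_dummy:
        # The real download still happens, using the browserHtml from meta.
        requests.append({"browserHtml": True})
    # The provider sends its own extraction request either way.
    requests.append({"productNavigation": True})
    return requests
```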
I wonder if some of the unexpected requests are related to #135.
Re-opening this since Case 2 is still occurring. Case 3 has been fixed, though.
@BurnzZ so, after your latest analysis, do you think Case 2 still happens or not?
@wRAR I can still reproduce Case 2. 👍
OK, so the difference between this use case and the ones we already test is having …
OTOH, I'm not sure that, even if we handle this in the provider, the request itself won't be sent.
@wRAR Let's try to focus on how Case 2 (or any of these cases) affects https://github.com/zytedata/zyte-spider-templates, not on the case itself. The priority of supporting `meta` is not clear to me now; it may or may not turn out to be necessary.
I've been working on converting a (working but incomplete) spider from using HttpResponse to BrowserResponse, and I seem to be getting bitten by this bug; I have yet to come up with a satisfactory workaround. I believe my scenario is basically Case 3 above, but since I'm not using zyte-spider-templates, the fix referenced above doesn't apply. Here's an attempt to simplify what I'm seeing:

```python
@handle_urls("")
@attrs.define
class MyBookPage(ItemPage[MyItem]):
    response: BrowserResponse

    @field
    def url(self):
        return self.response.url

    @field
    def something(self) -> int:
        return len(self.response.raw_api_response.get("browserHtml"))


class BooksSpider(scrapy.Spider):
    name = "books"
    parse_my_book_page = callback_for(MyBookPage)

    def start_requests(self):
        yield scrapy.Request(
            url="https://books.toscrape.com",
            callback=self.parse_my_book_page,
            meta={"zyte_api": {"browserHtml": True}},
        )
```

This always triggers two requests to Zyte API. You can replace the … If you drop the … Since …

```python
def parse_my_book_page(self, response: HttpResponse, page: MyBookPage):
    yield page.to_item()
    ...


class MyBookPage(ItemPage[MyItem]):
    response: AnyResponse
```

Am I going about this the wrong way? Should I be doing something other than yielding my own …?
After some more thought, I've found a successful albeit ugly workaround: providing my own provider for `BrowserResponse`:

```python
class BrowserResponseProvider(PageObjectInputProvider):
    provided_classes = {BrowserResponse}

    def __call__(self, to_provide: Set[Callable], response: Response) -> Sequence[BrowserResponse]:
        browser_html_str = getattr(response, "raw_api_response", {}).get("browserHtml") or response.body
        return [BrowserResponse(url=response.url, status=response.status, html=BrowserHtml(browser_html_str))]
```
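For completeness: a custom provider like this is registered through scrapy-poet's `SCRAPY_POET_PROVIDERS` setting, mapping the provider class to a priority (the priority value below is an arbitrary choice):

```python
# settings.py
SCRAPY_POET_PROVIDERS = {
    BrowserResponseProvider: 600,
}
```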
I think in your case the right approach is something like:

```python
@handle_urls("")
@attrs.define
class MyBookPage(ItemPage[MyItem]):
    response: BrowserResponse

    @field
    def url(self):
        return self.response.url

    @field
    def something(self) -> int:
        return len(self.response.raw_api_response.get("browserHtml"))


class BooksSpider(scrapy.Spider):
    name = "books"

    def start_requests(self):
        yield scrapy.Request(
            url="https://books.toscrape.com",
            callback=self.parse_my_book_page,
        )

    def parse_my_book_page(self, response: DummyResponse, page: MyBookPage, _response: AnyResponse):
        yield page.to_item()
```

i.e.: …
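My reading of why the suggested approach avoids the duplicate request (an illustrative plain-Python sketch with hypothetical names, not the scrapy-poet implementation): with `response: DummyResponse` there is no separate download for the callback, so every input, `BrowserResponse` for the page object and `AnyResponse` for the callback, is resolved by providers, which can derive both from a single raw API response.

```python
def resolve_inputs(needed: dict, api_call_log: list) -> dict:
    """Resolve each distinct raw payload once; inputs derived from the same
    payload share one API request. Names and shapes here are hypothetical."""
    raw = {}

    def fetch(kind):
        if kind not in raw:
            api_call_log.append(kind)    # one real API request per payload kind
            raw[kind] = f"<{kind} payload>"
        return raw[kind]

    return {name: fetch(source) for name, source in needed.items()}


# Both inputs map to the same underlying browserHtml payload,
# so only one API request is logged.
log = []
inputs = resolve_inputs(
    {"BrowserResponse": "browserHtml", "AnyResponse": "browserHtml"}, log
)
```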
In the example below, ZyteApiProvider makes 2 API requests instead of 1: