
ZyteApiProvider could make an unneeded API request #91

Open
kmike opened this issue Jun 12, 2023 · 16 comments · Fixed by #156

@kmike
Member

kmike commented Jun 12, 2023

In the example below ZyteApiProvider makes 2 API requests instead of 1:

@handle_urls("example.com")
@attrs.define
class MyPage(ItemPage[MyItem]):
    html: BrowserHtml
    # ...

class MySpider(scrapy.Spider):
    # ...
    def parse(self, response: DummyResponse, product: Product, my_item: MyItem):
        # ...
@Gallaecio
Contributor

Findings so far:

  • scrapinghub/scrapy-poet#151 (Remove ItemProvider’s Response dependency) won’t fix this.
  • This issue seems to be caused by Zyte API provided classes being resolved at different stages. If you request both product and browser_response directly in the callback, a single request is sent. Otherwise, first Product is injected, then MyItem resolves to MyPage, then BrowserHtml is injected. I am not sure yet how to best solve that.

@Gallaecio Gallaecio self-assigned this Jun 15, 2023
@wRAR
Contributor

wRAR commented Jun 15, 2023

Yeah, the problem AFAIK is that ItemProvider calls build_instances itself. scrapinghub/scrapy-poet#151 is actually about a third request done in this or similar use case.

@wRAR
Contributor

wRAR commented Jun 15, 2023

We also thought the solution may involve the caching feature in ItemProvider but didn't investigate further.
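The caching idea can be sketched in plain Python. This is a hypothetical illustration of the approach, not the actual ItemProvider code: memoize built dependencies per (url, class) so a second build, and thus a second API request, for the same key never happens.

```python
# Hypothetical sketch of the caching idea: reuse an already-built dependency
# for the same (url, class) key instead of building it (and sending another
# API request) again. Not the actual ItemProvider implementation.
class InstanceCache:
    def __init__(self):
        self._cache = {}

    def get_or_build(self, url, cls_name, build):
        key = (url, cls_name)
        if key not in self._cache:
            self._cache[key] = build()  # this is what would trigger the API request
        return self._cache[key]


cache = InstanceCache()
calls = []

def build_browser_html():
    calls.append(1)  # stands in for one Zyte API request
    return "<html>...</html>"

first = cache.get_or_build("https://example.com", "BrowserHtml", build_browser_html)
second = cache.get_or_build("https://example.com", "BrowserHtml", build_browser_html)
# build_browser_html ran only once; the second lookup hit the cache
```

With such a cache in place, injecting BrowserHtml for MyPage after Product has already been resolved would reuse the first response instead of fetching again.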

@Gallaecio
Contributor

Indeed.

@Gallaecio
Contributor

New finding: Switching MyItem to MyPage works, even if there is still some level of indirection. Could explain why scrapinghub/scrapy-poet#153 works.

Gallaecio added a commit to scrapinghub/scrapy-poet that referenced this issue Jun 20, 2023
@BurnzZ
Member

BurnzZ commented Oct 3, 2023

I looked into this further and it still occurs without any Page Objects involved.

The sent Zyte API requests were determined by setting ZYTE_API_LOG_REQUESTS=True.

Given the following spider:

class BooksSpider(scrapy.Spider):
    name = "books"

    def start_requests(self):
        yield scrapy.Request(
            url="https://books.toscrape.com",
            callback=self.parse_nav,
            meta={"zyte_api": {"browserHtml": True}},
        )

Case 1

✅ The following callback setup is correct, since it sends only 1 request:

# {"productNavigation": true, "url": "https://books.toscrape.com"}
def parse_nav(self, response: DummyResponse, navigation: ProductNavigation):
    ...

Case 2

❌ However, the following has 2 separate requests:

# {"browserHtml": true, "url": "https://books.toscrape.com"}
# {"productNavigation": true, "url": "https://books.toscrape.com"}
def parse_nav(self, response, navigation: ProductNavigation):
    ...

This case should not happen since browserHtml and productNavigation can both be present in the same Zyte API Request.
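For reference, both keys can indeed live in one payload; a minimal dict sketch (plain Python, no Scrapy dependency) of the single combined request the provider could send instead of two:

```python
# Keys requested via the request's zyte_api meta and keys needed by the
# provider can be merged into one Zyte API request payload instead of two.
meta_params = {"browserHtml": True}            # from meta["zyte_api"]
provider_params = {"productNavigation": True}  # from the callback dependency

combined = {"url": "https://books.toscrape.com", **meta_params, **provider_params}
print(combined)
# {'url': 'https://books.toscrape.com', 'browserHtml': True, 'productNavigation': True}
```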

Case 3

However, if we introduce a Page Object to the same spider:

@handle_urls("")
@attrs.define
class ProductNavigationPage(ItemPage[ProductNavigation]):
    response: BrowserResponse
    nav_item: ProductNavigation

    @field
    def url(self):
        return self.nav_item.url

    @field
    def categoryName(self) -> str:
        return f"(modified) {self.nav_item.categoryName}"

❌ Then, the following callback setup would result in 3 separate Zyte API requests:

# {"browserHtml": true, "url": "https://books.toscrape.com"}
# {"productNavigation": true, "url": "https://books.toscrape.com"}
# {"browserHtml": true, "url": "https://books.toscrape.com"}
def parse_nav(self, response: DummyResponse, navigation: ProductNavigation):
    ...

Note that the same series of 3 separate requests still occurs on:

def parse_nav(self, response, navigation: ProductNavigation):
    ...

@Gallaecio
Contributor

I wonder if some of the unexpected requests are related to #135.

@BurnzZ
Member

BurnzZ commented Jan 9, 2024

Re-opening this since Case 2 is still occurring. Case 3 has been fixed though.

@wRAR
Contributor

wRAR commented Jan 11, 2024

@BurnzZ so do you think after your latest analysis that case 2 still happens or not?

@BurnzZ
Member

BurnzZ commented Jan 12, 2024

@wRAR I can still reproduce Case 2. 👍

@wRAR
Contributor

wRAR commented Jan 12, 2024

OK, so the difference between this use case and ones that we already test is having "browserHtml": True in meta. Currently the provider doesn't check this at all. It looks like it should? cc: @kmike
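A rough sketch of the kind of check being suggested here (a hypothetical helper, not the actual provider code): before building its own Zyte API parameters, the provider could drop any keys the request meta already asks for, so its request adds no duplicates.

```python
def params_not_in_meta(meta_zyte_api, needed):
    """Hypothetical helper: keep only the keys that the request's zyte_api
    meta does not already request, so nothing is fetched twice."""
    return {k: v for k, v in needed.items() if k not in meta_zyte_api}


extra = params_not_in_meta(
    {"browserHtml": True},                             # from meta["zyte_api"]
    {"browserHtml": True, "productNavigation": True},  # what the callback needs
)
# extra == {"productNavigation": True}: browserHtml is already covered by meta
```

As the next comment notes, skipping the key in the provider only helps if the meta-driven request itself can be reused rather than sent separately.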

@wRAR
Contributor

wRAR commented Jan 12, 2024

OTOH, even if we handle this in the provider, I'm not sure the request itself won't still be sent.

@kmike
Member Author

kmike commented Jan 12, 2024

@wRAR Let's try to focus on how Case 2 (or any of these cases) affects https://github.com/zytedata/zyte-spider-templates, not on the case itself. The priority of supporting meta is not clear to me now; it may not be necessary in the end, or it could be.

@rvandam

rvandam commented Dec 20, 2024

I've been working on converting a (working but incomplete) spider from using HttpResponse to BrowserResponse, and I seem to be getting bitten by this bug; I have yet to come up with a satisfactory workaround. I believe my scenario is basically Case 3 above, but since I'm not using zyte-spider-templates, the fix referenced above doesn't apply.

Here's an attempt to simplify what I'm seeing:

class BooksSpider(scrapy.Spider):
    name = "books"
    parse_my_book_page = callback_for(MyBookPage)

    def start_requests(self):
        yield scrapy.Request(
            url="https://books.toscrape.com",
            callback=self.parse_my_book_page,
            meta={"zyte_api": {"browserHtml": True}},
        )

@handle_urls("")
@attrs.define
class MyBookPage(ItemPage[MyItem]):
    response: BrowserResponse

    @field
    def url(self):
        return self.response.url

    @field
    def something(self) -> str:
        return len(self.response.raw_api_response.get('browserHtml'))

This always triggers two requests to zyte api.

You can replace the meta={"zyte_api": {"browserHtml": True}}, line with any other documented method for requesting browserHtml and you will still always end up with 2 browserHtml requests as ZyteApiProvider doesn't make any attempt to check that it already has a response that can function as a valid BrowserResponse.

If you drop the browserHtml flag completely you'll still get 2 requests, one http and one browser.

Since ZyteApiProvider does allow an HttpResponse to fulfill an AnyResponse dependency, I thought I could trick it with this hack, but that fails (understandably) with the error "got multiple values for argument 'response'". And this approach would be very unintuitive if it did work.

def parse_my_book_page(self, response: HttpResponse, page: MyBookPage):
    yield page.to_item()
...
class MyBookPage(ItemPage[MyItem]):
    response: AnyResponse

Am I going about this the wrong way? Should I be doing something other than yielding my own browserHtml request? The second generated request is correct and does what I want (including the more complex actions and network capture that my real world scenario requires) but I don't see how to avoid the initial unnecessary request. Maybe I should go study the zyte-spider-templates code a little closer to see if I can lift that into my case without using the template.

@rvandam

rvandam commented Dec 20, 2024

After some more thought, I've found a successful albeit ugly workaround. By providing my own provider for BrowserResponse at a higher priority than ZyteApiProvider, I can avoid the code that generates the second request. This makes some strong assumptions that won't generalize well but it works well enough for my use case for now until someone proposes a better solution.

from typing import Callable, Sequence, Set

from scrapy.http import Response
from scrapy_poet import PageObjectInputProvider
from web_poet import BrowserHtml, BrowserResponse


class BrowserResponseProvider(PageObjectInputProvider):
    provided_classes = {BrowserResponse}

    def __call__(self, to_provide: Set[Callable], response: Response) -> Sequence[BrowserResponse]:
        # Reuse the response we already have instead of letting
        # ZyteApiProvider issue a second browserHtml request.
        browser_html_str = getattr(response, "raw_api_response", {}).get("browserHtml") or response.body
        return [BrowserResponse(url=response.url, status=response.status, html=BrowserHtml(browser_html_str))]

@Gallaecio
Contributor

Gallaecio commented Dec 20, 2024

I think in your case the right approach is something like:

class BooksSpider(scrapy.Spider):
    name = "books"

    def start_requests(self):
        yield scrapy.Request(
            url="https://books.toscrape.com",
            callback=self.parse_my_book_page,
        )
    
    def parse_my_book_page(self, response: DummyResponse, page: MyBookPage, _response: AnyResponse):
        yield page.to_item()

@handle_urls("")
@attrs.define
class MyBookPage(ItemPage[MyItem]):
    response: BrowserResponse

    @field
    def url(self):
        return self.response.url

    @field
    def something(self) -> str:
        return len(self.response.raw_api_response.get('browserHtml'))

i.e.:

  • Remove the request meta, and set the callback response parameter to DummyResponse, so that requests are only sent through dependency injection, and not also through regular Scrapy mechanisms.
  • Request the response in the callback signature with a different parameter (the first, response parameter is special and cannot be used for dependency injection). You can use AnyResponse in the callback to get whatever the page object dictates, and let the page object be the one forcing browser for the target URL and output type.
