
ZyteApiProvider could make an unneeded API request #91

Open
kmike opened this issue Jun 12, 2023 · 16 comments · Fixed by #156

@kmike
Member

kmike commented Jun 12, 2023

In the example below ZyteApiProvider makes 2 API requests instead of 1:

@handle_urls("example.com")
@attrs.define
class MyPage(ItemPage[MyItem]):
    html: BrowserHtml
    # ...

class MySpider(scrapy.Spider):
    # ...
    def parse(self, response: DummyResponse, product: Product, my_item: MyItem):
        # ...
@Gallaecio
Contributor

Findings so far:

  • scrapinghub/scrapy-poet#151 (Remove ItemProvider’s Response dependency) won’t fix this.
  • This issue seems to be caused by Zyte API provided classes being resolved at different stages. If you request both product and browser_response directly in the callback, a single request is sent. Otherwise, first Product is injected, then MyItem resolves to MyPage, then BrowserHtml is injected. I am not sure yet how to best solve that.

@Gallaecio Gallaecio self-assigned this Jun 15, 2023
@wRAR
Contributor

wRAR commented Jun 15, 2023

Yeah, the problem AFAIK is that ItemProvider calls build_instances itself. scrapinghub/scrapy-poet#151 is actually about a third request done in this or similar use case.

@wRAR
Contributor

wRAR commented Jun 15, 2023

We also thought the solution may involve the caching feature in ItemProvider but didn't investigate further.
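The caching idea can be sketched in plain Python. This is a hypothetical illustration of the approach, not the actual ItemProvider code: memoize built dependencies per (url, class) so a second build, and thus a second API request, for the same key never happens.

```python
# Hypothetical sketch of the caching idea: reuse an already-built dependency
# for the same (url, class) key instead of building it (and sending another
# API request) again. Not the actual ItemProvider implementation.
class InstanceCache:
    def __init__(self):
        self._cache = {}

    def get_or_build(self, url, cls_name, build):
        key = (url, cls_name)
        if key not in self._cache:
            self._cache[key] = build()  # this is what would trigger the API request
        return self._cache[key]


cache = InstanceCache()
calls = []

def build_browser_html():
    calls.append(1)  # stands in for one Zyte API request
    return "<html>...</html>"

first = cache.get_or_build("https://example.com", "BrowserHtml", build_browser_html)
second = cache.get_or_build("https://example.com", "BrowserHtml", build_browser_html)
# build_browser_html ran only once; the second lookup hit the cache
```

With such a cache in place, injecting BrowserHtml for MyPage after Product has already been resolved would reuse the first response instead of fetching again.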

@Gallaecio
Contributor

Indeed.

@Gallaecio
Contributor

New finding: Switching MyItem to MyPage works, even if there is still some level of indirection. Could explain why scrapinghub/scrapy-poet#153 works.

Gallaecio added a commit to scrapinghub/scrapy-poet that referenced this issue Jun 20, 2023
@BurnzZ
Member

BurnzZ commented Oct 3, 2023

I looked into this further and it still occurs without any Page Objects involved.

The sent Zyte API requests were determined by setting ZYTE_API_LOG_REQUESTS=True.

Given the following spider:

class BooksSpider(scrapy.Spider):
    name = "books"

    def start_requests(self):
        yield scrapy.Request(
            url="https://books.toscrape.com",
            callback=self.parse_nav,
            meta={"zyte_api": {"browserHtml": True}},
        )

Case 1

✅ The following callback setup is correct, since it sends only 1 request:

# {"productNavigation": true, "url": "https://books.toscrape.com"}
def parse_nav(self, response: DummyResponse, navigation: ProductNavigation):
    ...

Case 2

❌ However, the following has 2 separate requests:

# {"browserHtml": true, "url": "https://books.toscrape.com"}
# {"productNavigation": true, "url": "https://books.toscrape.com"}
def parse_nav(self, response, navigation: ProductNavigation):
    ...

This case should not happen since browserHtml and productNavigation can both be present in the same Zyte API Request.
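For reference, both keys can indeed live in one payload; a minimal dict sketch (plain Python, no Scrapy dependency) of the single combined request the provider could send instead of two:

```python
# Keys requested via the request's zyte_api meta and keys needed by the
# provider can be merged into one Zyte API request payload instead of two.
meta_params = {"browserHtml": True}            # from meta["zyte_api"]
provider_params = {"productNavigation": True}  # from the callback dependency

combined = {"url": "https://books.toscrape.com", **meta_params, **provider_params}
print(combined)
# {'url': 'https://books.toscrape.com', 'browserHtml': True, 'productNavigation': True}
```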

Case 3

However, if we introduce a Page Object to the same spider:

@handle_urls("")
@attrs.define
class ProductNavigationPage(ItemPage[ProductNavigation]):
    response: BrowserResponse
    nav_item: ProductNavigation

    @field
    def url(self):
        return self.nav_item.url

    @field
    def categoryName(self) -> str:
        return f"(modified) {self.nav_item.categoryName}"

❌ Then, the following callback setup would result in 3 separate Zyte API requests:

# {"browserHtml": true, "url": "https://books.toscrape.com"}
# {"productNavigation": true, "url": "https://books.toscrape.com"}
# {"browserHtml": true, "url": "https://books.toscrape.com"}
def parse_nav(self, response: DummyResponse, navigation: ProductNavigation):
    ...

Note that the same series of 3 separate requests still occurs on:

def parse_nav(self, response, navigation: ProductNavigation):
    ...

@Gallaecio
Contributor

I wonder if some of the unexpected requests are related to #135.

@BurnzZ
Member

BurnzZ commented Jan 9, 2024

Re-opening this since Case 2 is still occurring. Case 3 has been fixed though.

@wRAR
Contributor

wRAR commented Jan 11, 2024

@BurnzZ so do you think after your latest analysis that case 2 still happens or not?

@BurnzZ
Member

BurnzZ commented Jan 12, 2024

@wRAR I can still reproduce Case 2. 👍

@wRAR
Contributor

wRAR commented Jan 12, 2024

OK, so the difference between this use case and ones that we already test is having "browserHtml": True in meta. Currently the provider doesn't check this at all. It looks like it should? cc: @kmike
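A rough sketch of the kind of check being suggested here (a hypothetical helper, not the actual provider code): before building its own Zyte API parameters, the provider could drop any keys the request meta already asks for, so its request adds no duplicates.

```python
def params_not_in_meta(meta_zyte_api, needed):
    """Hypothetical helper: keep only the keys that the request's zyte_api
    meta does not already request, so nothing is fetched twice."""
    return {k: v for k, v in needed.items() if k not in meta_zyte_api}


extra = params_not_in_meta(
    {"browserHtml": True},                             # from meta["zyte_api"]
    {"browserHtml": True, "productNavigation": True},  # what the callback needs
)
# extra == {"productNavigation": True}: browserHtml is already covered by meta
```

As the next comment notes, skipping the key in the provider only helps if the meta-driven request itself can be reused rather than sent separately.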

@wRAR
Contributor

wRAR commented Jan 12, 2024

OTOH, even if we handle this in the provider, I'm not sure the request itself won't still be sent.

@kmike
Member Author

kmike commented Jan 12, 2024

@wRAR Let's try to focus on how Case 2 (or any of these cases) affects https://github.com/zytedata/zyte-spider-templates, not on the case itself. The priority of supporting meta is not clear to me now; it may not be necessary in the end, or it could be.

@rvandam

rvandam commented Dec 20, 2024

I've been working on converting a (working but incomplete) spider from using HttpResponse to BrowserResponse, and I seem to be getting bitten by this bug; I have yet to come up with a satisfactory workaround. I believe my scenario is basically Case 3 above, but since I'm not using zyte-spider-templates, the fix referenced above doesn't apply.

Here's an attempt to simplify what I'm seeing:

class BooksSpider(scrapy.Spider):
    name = "books"
    parse_my_book_page = callback_for(MyBookPage)

    def start_requests(self):
        yield scrapy.Request(
            url="https://books.toscrape.com",
            callback=self.parse_my_book_page,
            meta={"zyte_api": {"browserHtml": True}},
        )

@handle_urls("")
@attrs.define
class MyBookPage(ItemPage[MyItem]):
    response: BrowserResponse

    @field
    def url(self):
        return self.response.url

    @field
    def something(self) -> str:
        return len(self.response.raw_api_response.get('browserHtml'))

This always triggers two requests to zyte api.

You can replace the meta={"zyte_api": {"browserHtml": True}}, line with any other documented method for requesting browserHtml and you will still always end up with 2 browserHtml requests as ZyteApiProvider doesn't make any attempt to check that it already has a response that can function as a valid BrowserResponse.

If you drop the browserHtml flag completely you'll still get 2 requests, one http and one browser.

Since ZyteApiProvider does allow an HttpResponse to fulfill an AnyResponse dependency, I thought I could trick it with this hack, but that fails (understandably) with the error "got multiple values for argument 'response'". And this approach would be very unintuitive if it did work.

def parse_my_book_page(self, response: HttpResponse, page: MyBookPage):
    yield page.to_item()
...
class MyBookPage(ItemPage[MyItem]):
    response: AnyResponse

Am I going about this the wrong way? Should I be doing something other than yielding my own browserHtml request? The second generated request is correct and does what I want (including the more complex actions and network capture that my real world scenario requires) but I don't see how to avoid the initial unnecessary request. Maybe I should go study the zyte-spider-templates code a little closer to see if I can lift that into my case without using the template.

@rvandam

rvandam commented Dec 20, 2024

After some more thought, I've found a successful albeit ugly workaround. By providing my own provider for BrowserResponse at a higher priority than ZyteApiProvider, I can avoid the code that generates the second request. This makes some strong assumptions that won't generalize well but it works well enough for my use case for now until someone proposes a better solution.

from typing import Callable, Sequence, Set

from scrapy.http import Response
from scrapy_poet import PageObjectInputProvider
from web_poet import BrowserHtml, BrowserResponse


class BrowserResponseProvider(PageObjectInputProvider):
    provided_classes = {BrowserResponse}

    def __call__(self, to_provide: Set[Callable], response: Response) -> Sequence[BrowserResponse]:
        # Reuse the response we already have instead of letting
        # ZyteApiProvider issue a second browserHtml request.
        browser_html_str = getattr(response, "raw_api_response", {}).get("browserHtml") or response.body
        return [BrowserResponse(url=response.url, status=response.status, html=BrowserHtml(browser_html_str))]

@Gallaecio
Contributor

Gallaecio commented Dec 20, 2024

I think in your case the right approach is something like:

class BooksSpider(scrapy.Spider):
    name = "books"

    def start_requests(self):
        yield scrapy.Request(
            url="https://books.toscrape.com",
            callback=self.parse_my_book_page,
        )
    
    def parse_my_book_page(self, response: DummyResponse, page: MyBookPage, _response: AnyResponse):
        yield page.to_item()

@handle_urls("")
@attrs.define
class MyBookPage(ItemPage[MyItem]):
    response: BrowserResponse

    @field
    def url(self):
        return self.response.url

    @field
    def something(self) -> str:
        return len(self.response.raw_api_response.get('browserHtml'))

i.e.:

  • Remove the request meta, and set the callback response parameter to DummyResponse, so that requests are only sent through dependency injection, and not also through regular Scrapy mechanisms.
  • Request the response in the callback signature with a different parameter (the first, response parameter is special and cannot be used for dependency injection). You can use AnyResponse in the callback to get whatever the page object dictates, and let the page object be the one forcing browser for the target URL and output type.
