Remove ItemProvider’s Response dependency #151

Closed · wants to merge 5 commits · changes shown from 1 commit
scrapy_poet/api.py (9 changes: 8 additions & 1 deletion)

@@ -27,7 +27,14 @@ def parse(self, response: DummyResponse):
     :class:`~.DummyResponse` to your parser instead.
     """

-    def __init__(self, url: str, request=Optional[Request]):
+    def __init__(self, url: Optional[str] = None, request: Optional[Request] = None):
+        if url is None:
+            if request is None:
+                raise ValueError(
+                    "One of the parameters, url or request, must have a "
+                    "non-default value."
+                )
+            url = request.url
         super().__init__(url=url, request=request)
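For context, a minimal sketch of how the patched constructor behaves (illustrative usage assuming this change is applied; not part of the diff):

from scrapy import Request
from scrapy_poet.api import DummyResponse

DummyResponse(url="https://example.com")               # url alone works, as before
DummyResponse(request=Request("https://example.com"))  # url is now taken from the request

try:
    DummyResponse()  # passing neither parameter is rejected
except ValueError:
    pass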
scrapy_poet/page_input_providers.py (5 changes: 3 additions & 2 deletions)

@@ -44,6 +44,7 @@
 )
 from web_poet.pages import is_injectable

+from scrapy_poet.api import DummyResponse
 from scrapy_poet.downloader import create_scrapy_downloader
 from scrapy_poet.injection_errors import (
     MalformedProvidedClassesError,

@@ -365,7 +366,6 @@ async def __call__(
         self,
         to_provide: Set[Callable],
         request: Request,
-        response: Response,
     ) -> List[Any]:
         results = []
         for cls in to_provide:

@@ -392,9 +392,10 @@
                 externally_provided=self.injector.is_class_provided_by_any_provider,
             )

+            dummy_response = DummyResponse(request=request)
             try:
                 deferred_or_future = maybe_deferred_to_future(
-                    self.injector.build_instances(request, response, plan)
+                    self.injector.build_instances(request, dummy_response, plan)
                 )
             # RecursionError NOT raised when ``AsyncioSelectorReactor`` is used.
             # Could be related: https://github.com/python/cpython/issues/93837
tests/test_providers.py (15 changes: 11 additions & 4 deletions)

@@ -248,13 +248,20 @@ def test_page_params_provider(settings):
     assert results[0] == expected_data


-def test_item_provider_cache(settings):
+def test_item_provider(settings):
     """Note that the bulk of the tests for the ``ItemProvider`` alongside the
-    ``Injector`` is tested in ``tests/test_web_poet_rules.py``.
+    ``Injector`` is tested in ``tests/test_web_poet_rules.py``."""
+    crawler = get_crawler(Spider, settings)
+    injector = Injector(crawler)
+    provider = ItemProvider(injector)
+    request = scrapy.http.Request("https://example.com")

-    We'll only test its caching behavior here if its properly garbage collected.
-    """
+    # The fact that no exception is raised below proves that a Response
+    # parameter is not required by ItemProvider.
+    provider(set(), request)
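A note on why provider(set(), request) is a meaningful check even though the coroutine is never awaited: for async def functions, argument binding happens at call time, so a leftover required response parameter would raise TypeError immediately. A tiny standalone illustration (hypothetical stand-in, not PR code):

async def call(to_provide, request):  # stand-in for ItemProvider.__call__
    return []

coro = call(set(), "request")  # the signature binds fine at call time
coro.close()                   # avoid an un-awaited coroutine warning

try:
    call(set(), "request", "response")  # an extra argument fails to bind,
except TypeError:                       # just as a missing required one would
    pass  # raised immediately, before any await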
kmike (Member) commented on Jun 21, 2023:
Hey! Are there existing tests which ensure that

  1. the original issue is fixed, and
  2. some potential new issues don't appear?

Regarding (2), I was thinking about the following:

class MySpider(scrapy.Spider):
    def parse(self, response: DummyResponse, item: Product):
        # ....


@handle_urls("example.com")
class MyPage(ItemPage[Product]):
    response: HttpResponse
i.e. we start passing a DummyResponse to the provider: the response is not used by the callback, but a real response is still needed to create the page object that returns the item.
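For context, a fuller, self-contained version of that scenario could look like this (an illustrative sketch; the Product item definition and the field logic are assumptions, not code from this thread):

import attrs
import scrapy
from web_poet import HttpResponse, ItemPage, field, handle_urls

from scrapy_poet import DummyResponse


@attrs.define
class Product:  # hypothetical item class
    name: str


@handle_urls("example.com")
@attrs.define
class MyPage(ItemPage[Product]):
    # Declaring HttpResponse as a dependency means a real download is
    # needed to build this page object, and hence the Product item.
    response: HttpResponse

    @field
    def name(self):
        return self.response.css("h1::text").get()


class MySpider(scrapy.Spider):
    name = "example"

    # The DummyResponse annotation tells scrapy-poet that the callback
    # itself does not need the download, yet building ``item`` still does.
    def parse(self, response: DummyResponse, item: Product):
        yield item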

Gallaecio (Member, Author) commented on Nov 8, 2023:
OK, so you definitely picked up on an issue, and the “solution” I came up with is getting messy, so I would like to discuss it before moving further with it, because I might be missing a better solution.

The “solution” consists of having 2 separate item provider classes, one for responseless items and one for responseful items.

Things get more complicated, though. To properly determine if an item needs a response, we need to get the page object for the item, and then check if the dependencies of that page object (which might include other items) are provided by a provider that requires a response. Moreover, we need to take the request (URL) into account, as that can determine which page object is used for an item.

To be honest, it kind of feels like there should be no item provider, just as there is no page object provider, and instead item resolution should be moved closer to the core, and work the same as page object resolution, by somehow making andi realize how to resolve item dependencies. But I am not very familiar with the code base, and I am afraid of wasting too much time exploring in that direction.
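Very roughly, the split could look like the sketch below. Every name in it is hypothetical rather than scrapy-poet API, and the recursive check glosses over the URL-dependent rule matching mentioned above:

from typing import Dict, Optional, Set, Type, get_type_hints


class HttpResponse:  # stand-in for web_poet.HttpResponse
    pass


def needs_response(
    item_cls: Type,
    page_for_item: Dict[Type, Type],  # stand-in for the rule registry
    _seen: Optional[Set[Type]] = None,
) -> bool:
    # Hypothetical helper: does the page object registered for this item,
    # or anything in its transitive dependency tree (which may include
    # other items), require a real response?
    _seen = _seen or set()
    if item_cls in _seen:  # guard against dependency cycles
        return False
    _seen.add(item_cls)
    page_cls = page_for_item.get(item_cls)
    if page_cls is None:
        return False
    for dep in get_type_hints(page_cls).values():
        if dep is HttpResponse or needs_response(dep, page_for_item, _seen):
            return True
    return False


class Product: ...

class ProductPage:
    response: HttpResponse

assert needs_response(Product, {Product: ProductPage})

# A "responseless" provider would serve items for which needs_response()
# is False and keep working with a DummyResponse; a "responseful" one
# would cover the rest and force a real download first.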

A member replied:
> it kind of feels like there should be no item provider, just as there is no page object provider, and instead item resolution should be moved closer to the core, and work the same as page object resolution, by somehow making andi realize how to resolve item dependencies.

Yeah, that would solve scrapy-plugins/scrapy-zyte-api#91 automatically (AFAIK).



+def test_item_provider_cache(settings):
     crawler = get_crawler(Spider, settings)
     injector = Injector(crawler)
     provider = ItemProvider(injector)