Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add SwitchPage #103

Open
wants to merge 15 commits into
base: master
Choose a base branch
from
Open

Add SwitchPage #103

wants to merge 15 commits into from

Conversation

Gallaecio
Copy link
Member

@Gallaecio Gallaecio commented Nov 2, 2022

Changes:

  • Implement a SwitchPage class, that can be subclassed to create special page object classes that call other page object classes based on the received input.
  • New documentation page about webpage layouts as a web scraping concept, to provide context before introducing the new class.
    It took me a while to write it in a way that did not presume too much prior knowledge from the reader, and I am still unhappy with the result, so I would love some feedback here.

Arguments for the proposed API:

  • I think the code that determines which page object class to use should be centralized (switch method in the current proposal), as opposed to defining it on each page object class. Otherwise, handling priorities and corner cases could become cumbersome.
  • In regards to the eventual implementation of a mechanism to detect invalid layouts, I do not see a problem. Assuming that we implement it by raising some special exception, such an exception could be raised from switch if no right page object class is found, or from the to_item method of the selected page object class if, despite that page object class getting selected from switch, the input it gets does not match some other expectation.
  • For cases where different candidate page object classes have different inputs, the documentation asks for the switch page object class to be declared with a combination of all of them, and then the right ones would be fed into the selected page object class (to be implemented; I assume it is feasible, but I would like some input on how to approach it).

To do:

  • Discuss the proposed API.
    • Do we need to have some override mechanism for candidate page object classes? i.e. Does there need to be a way through rules to replace a candidate page object class instead of replacing the whole switch page object class? (out of scope)
  • Implement input passing to the selected candidate page object class (andi? inspect and simple type matching?). (layout page object classes are now declared as inputs, letting web-poet frameworks handle this automatically through dependency injection)
  • Provide a nice exception if there is a missing input? (not relevant given the resolution of the bullet point above)

@Gallaecio Gallaecio requested review from kmike, wRAR and BurnzZ November 2, 2022 19:33
@codecov
Copy link

codecov bot commented Nov 2, 2022

Codecov Report

Merging #103 (b32f92f) into master (f7e5be2) will increase coverage by 0.00%.
The diff coverage is 100.00%.

Additional details and impacted files
@@           Coverage Diff           @@
##           master     #103   +/-   ##
=======================================
  Coverage   99.56%   99.56%           
=======================================
  Files          25       25           
  Lines         921      927    +6     
=======================================
+ Hits          917      923    +6     
  Misses          4        4           
Impacted Files Coverage Δ
web_poet/pages.py 100.00% <100.00%> (ø)

docs/page-objects/layouts.rst Show resolved Hide resolved
docs/page-objects/layouts.rst Outdated Show resolved Hide resolved
web_poet/pages.py Outdated Show resolved Hide resolved
web_poet/pages.py Outdated Show resolved Hide resolved
@abc.abstractmethod
async def switch(self) -> Injectable:
"""Return the right :ref:`page object class <page-objects>` based on
the received input."""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a question about the expected behavior of the implementing framework (e.g. scrapy-poet) when the Switch PO has chosen the PO to parse the page.

Specifically, would the frameworks need to check if the chosen PO is declared in any of the rules' instead_of parameter?

Approach 1

One use case I can think of is:

if layout_1(self.response):
    return ProductLayout1Page
elif layout_2(self.response):
    return ProductLayout2Page
else:
    return AutomaticExtractionPage

In this example, the last resort is Automatic Extraction (e.g. by ML models). If the instead_of parameters are supported, then it would be of great convenience to the user to simply rely on the other overrides stored in the registry to return a more apt PO.

Approach 2

Alternatively, users can simply do:

if layout_1(self.response):
    return ProductLayout1Page
elif layout_2(self.response):
    return ProductLayout2Page
else:
    raise web_poet.DelegateFallback

(Reference: #26)

In this case, it'd be up to the implementing framework to decide if the instead_of parameters are used, or perhaps some other means of declaring and using the fallback PO via some user-defined settings.

Approach 3

The other approach is not to use the instead_of parameters at all since it should exactly follow what PO class the user has returned.

Other approaches

Could perhaps be a combination of the ideas above to give a finer grain of control to the user; or perhaps completely something else.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have no clear idea on what the best approach here is. I think your questions aligns a bit with one of my open question:

Do we need to have some override mechanism for candidate page object classes?

Copy link
Member

@kmike kmike Nov 8, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we don't need to support overrides for candidate page object classes, at least in the first version.

required input, and return the output of its
:meth:`~web_poet.pages.ItemPage.to_item` method."""
page_object_class = await self.switch()
# page_object = page_object_class(...) # TODO: pass the right inputs
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is tricky since dependencies are dynamic.

For example:

└── ProductSelectPage
    ├── ProductLayout1Page → needs httpResponseBody
    ├── ProductLayout2Page → needs browserHtml
    └── AutomaticExtractionPage → needs AutoExtractProductData

Providing all three of the dependencies should solve it but it could be expensive.

There should be a way for the Switch/Select/Choose PO to declare the dependency of the PO to use. This won't work in the current way of how dependencies are provided since the Switch/Select/Choose PO is already built as an instance and getting another dependency input is something we need to work on.


Alternatively, I'm thinking if we should glance at the perspective of solving this via overrides with some slight tweaks. For example:

ApplyRule(
    for_patterns="example.com",
    use=web_poet.DynamicPO,
    instead_of=ProductSelectPage,
    to_return=Product,
)

The web_poet.DynamicPO is simply a sentinel class which serves as a stand-in while we still don't know the final PO to use.

If this is present, then the behavior of ApplyRule changes a bit:

  1. instantiate whatever's in the instead_of parameter and resolve any dependencies that it needs (e.g. httpResponseBody)
  2. call the .switch() method (or other names we can think of for this)
  3. this determines the class to be placed in the use parameter
  4. use the ApplyRule as usual

The concept of overrides could fit in here since the Switch/Select/Choose PO is essentially overridden by the PO it has selected.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The proposed solution indeed works best for a scenario where the inputs are the same for all layout-specific page objects, and where the required layout is likely to be random (i.e. A/B testing scenario, as opposed to special URLs where the same URL is likely to get the same layout when you refresh).

In this random scenario, requesting httpResponseBody to determine the layout, determining that the layout is 2, and then fetching browserHtml to parse it with layout 2, would not work, because by the time you get layout 2 the response may be for layout 1.

I think an approach in line with the current proposal may still make sense, at least for some scenarios like the A/B test one.

I wonder if it would make sense to implement 2 different solutions for the 2 different scenarios. Specially because to allow for incremental request of inputs based on their underlying cost, we will probably need a significantly more complicated approach.

web_poet/pages.py Show resolved Hide resolved
web_poet/pages.py Show resolved Hide resolved
docs/page-objects/layouts.rst Outdated Show resolved Hide resolved
docs/page-objects/layouts.rst Outdated Show resolved Hide resolved
docs/page-objects/layouts.rst Outdated Show resolved Hide resolved
@Gallaecio
Copy link
Member Author

Note to self: sublayout support.

@Gallaecio
Copy link
Member Author

Applied the improvement proposal by @kmike. It is very clean, and saves me a lot of trouble. It also makes nesting (i.e. multi-layout page object classes that use other multi-layout page object classes) work by default, I believe.

Considering the problems of input mixing and override handling out of scope, my only remaining thoughts are:

  • Should we provide an example of layout nesting in the docs?
  • Should we provide some alternative or complementary API to make it easy to define layout-selection rules on layouts themselves instead of centralizing it on MultiLayoutPage? If not, should we at least provide an example on how to achieve this with the current API?

@@ -68,6 +68,25 @@ async def to_item(self) -> ItemT:
)


class MultiLayoutPage(ItemPage[ItemT]):
Copy link
Member

@kmike kmike Nov 9, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we should support fields for MultiLayoutPage, or not.
Currently all fields defined for MultiLayoutPage are silently ignored.

There is also a use case when you need a "partial" layout, layout of a region of a page. In this case, only some fields are extracted by the layout; others are common. It seems we have a few options:

  1. Use a base class with common fields. Create layouts which inherit from the base class and handle necessary regions. Use them in MultiLayoutPage.
  2. Allow to define fields on the MultiLayoutPage. When layout is picked, combine the fields from the layout with the fields defined in the class itself, to compute the final item. Supporting to_item for layouts can be more tricky in this case, as it's not clear which item they should use.

A separate, but related issue is if it's possible to use 2 or more regions, with different layouts, in the same page object.

If fields are supported, it seems it makes sense to move the logic to ItemPage. I think it may simplify typing, and inheritance as well. E.g. layouts can be used with ProductPage from zytedata/zyte-common-items#19 without using multiple inheritance.

It seems that if fields are not supported, it's better to keep MultiLayoutPage as a separate class, and probably raise an error or issue a warning if fields are defined. There is one argument for keeping it separate and not supporting fields: it'd allow to define fields named layout. If we provide a standard method named layout, and allow to define fields, it's not possible to have a field of the same name.

Sorry for a braindump :) What do you think?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A disadvantage of using MultiLayoutPage without fields: it's not possible to use fields in the code which uses the page object. So let's say we have ProductPage, it uses fields for data extraction. Then, it's refactored to use MultiLayoutPage as a base class. It means that the fields are no longer supported, and so the code which uses this page may break.

Copy link
Member Author

@Gallaecio Gallaecio Nov 9, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am slightly against field support in MultiLayoutPage.

In addition to those 2 options you mention, I can think of a 3rd that is a variation on 1 based precisely on how you amended my API proposal for MultiLayoutPage:

  1. Use composition rather than inheritance: define a page object class that can extract the common fields, make it a dependency of the layouts that extract additional fields, and use it in the to_item of those layouts (which would be the layouts which MultiLayoutPage handles).

We would still have the same problem as with the lack of fields in MultiLayoutPage, i.e. you could not access the fields of the dependency layout through the layout that uses it (other than accessing the dependency directly, e.g. layout.dependency.field). But I wonder if we could implement a getattr fallback mechanism to solve this issue for both scenarios: allow MultiLayoutPage to expose the fields of the object that its layout() method returns, and allow layouts with other layouts as dependencies to expose the fields of their dependencies.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm also not sure about support @fields in MultiLayoutPage.

It would seem it'd be best to keep it's task simple wherein it simply identifies and returns the PO instance based on the layout.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is one argument for keeping it separate and not supporting fields: it'd allow to define fields named layout. If we provide a standard method named layout, and allow to define fields, it's not possible to have a field of the same name.

This is a valid point, but I think we could switch to get_layout to avoid this issue, and it would be a better method name anyway.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Gallaecio can you please explain the last idea in more details, e.g. with some code?

fc7867f

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, this looks nice.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We were discussing it today on a call with @proway2. They implemented a multi-layout page object, tried a few approaches. In short - having fields on the final page object is a must :) That's the reason the documented approach here won't work well for them. Taking union of all dependencies is fine.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently they define all the fields, and call to the self._layout in each field. It's a lot of boilerplate; exactly something a library should be solving.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems having fields availabe is required, but not necessarily being able to define top-level fields.

docs/page-objects/layouts.rst Outdated Show resolved Hide resolved
docs/page-objects/layouts.rst Outdated Show resolved Hide resolved
docs/page-objects/layouts.rst Show resolved Hide resolved
tests/test_pages.py Outdated Show resolved Hide resolved
tests/test_pages.py Outdated Show resolved Hide resolved
docs/page-objects/layouts.rst Outdated Show resolved Hide resolved
docs/page-objects/layouts.rst Outdated Show resolved Hide resolved
docs/page-objects/layouts.rst Show resolved Hide resolved
web_poet/pages.py Outdated Show resolved Hide resolved
assert item2.url == url
assert item2.text == "b"


Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's also add a test for use cases when a MultiLayoutPage subclass also uses multiple MultiLayoutPages underneath. I think this could be a use case when users import POs from different packages and repacking them.

class MyMultiLayoutPage(MultiLayoutPage[SomeItem]):
    response: HttpResponse
    layout_page_us: LayoutPageUS
    layout_page_uk: LayoutPageUK

    async def get_layout(self) -> ItemPage[SomeItem]:
        if self.response.css(".origin::text") == "us":
            return self.layout_page_us.get_layout()
        return self.layout_page_uk.get_layout()

Might also be worth creating a doc about this as well.

page1: FullPage1
page2: FullPage2

async def get_layout(self) -> ItemPage[FullItem]:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
async def get_layout(self) -> ItemPage[FullItem]:
async def get_layout(self) -> ItemPage:

Removes the need for the type: ignore[return-value] below.

This should be okay since we're expecting ItemPage (and its subclasses) to be returned. For some reason, mypy fixates on the FullItem return type.

"""

@abc.abstractmethod
async def get_layout(self) -> ItemPage[ItemT]:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if this sounds right but how about to_layout()? In that way, we establish some sort of consistency with the PO's API naming, i.e. to_item().

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants