Add SwitchPage #103

Open: wants to merge 15 commits into base: master
Changes from 2 commits
1 change: 1 addition & 0 deletions docs/index.rst
@@ -41,6 +41,7 @@ and the motivation behind ``web-poet``, start with :ref:`from-ground-up`.
page-objects/additional-requests
page-objects/fields
page-objects/rules
Webpage layouts <page-objects/layouts>
page-objects/retries
page-objects/page-params

1 change: 1 addition & 0 deletions docs/page-objects/additional-requests.rst
@@ -1,4 +1,5 @@
.. _advanced-requests:
.. _page-objects:

===================
Additional Requests
154 changes: 154 additions & 0 deletions docs/page-objects/layouts.rst
@@ -0,0 +1,154 @@
.. _layouts:

===============
Webpage layouts
===============

Different webpages may show the same *type* of page, but different *data*. For
example, in an e-commerce website there are usually many product detail pages,
each showing data from a different product.

The code that those webpages have in common is their **webpage layout**.

Coding for webpage layouts
==========================

Webpage layouts should inform how you organize your data extraction code.

A good practice to keep your code maintainable is to have a separate :ref:`page
object class <page-objects>` per webpage layout.

Trying to support multiple webpage layouts with the same page object class can
make your class hard to maintain.


Identifying webpage layouts
===========================

There is no precise way to determine whether 2 webpages have the same or a
different webpage layout. You must decide based on what you know, and be ready
to adapt if things change.

It is also often difficult to identify webpage layouts before you start writing
extraction code. Completely different webpage layouts can have the same look,
and very similar webpage layouts can look completely different.

It can be a good starting point to assume that, for a given combination of
data type and website, there is going to be a single webpage layout. For
example, assume that all product pages of a given e-commerce website will have
the same webpage layout.

Then, as you write a :ref:`page object class <page-objects>` for that webpage
layout, you may find out more, and adapt.

When the same piece of information must be extracted from a different place for
different webpages, that is a sign that you may be dealing with more than 1
webpage layout. For example, if on some webpages the product name is in an
``h1`` element, but on some webpages it is in an ``h2`` element, chances are
there are at least 2 different webpage layouts.

However, whether you continue to work as if everything uses the same webpage
layout, or split your page object class into 2 page object classes, each
targeting one of the webpage layouts you have found, is entirely up to you.

Ask yourself: Is supporting all webpage layout differences making your page
object class implementation only a few lines of code longer, or is it making it
an unmaintainable bowl of spaghetti code?
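
As a rough illustration of that trade-off, the sketch below handles the
``h1``/``h2`` case described above inside a single extraction function. It
deliberately avoids web-poet and relies only on the standard-library
``html.parser`` module; ``FirstHeadingParser`` and ``extract_product_name`` are
hypothetical names used for this example only.

```python
from html.parser import HTMLParser
from typing import Dict, Optional


class FirstHeadingParser(HTMLParser):
    """Collect the text of the first <h1> and the first <h2> in a document."""

    def __init__(self) -> None:
        super().__init__()
        self._current: Optional[str] = None  # heading tag being read, if any
        self.headings: Dict[str, str] = {}  # tag name -> text of first occurrence

    def handle_starttag(self, tag, attrs):
        # Only the first occurrence of each heading tag is recorded.
        if tag in ("h1", "h2") and tag not in self.headings:
            self._current = tag
            self.headings[tag] = ""

    def handle_endtag(self, tag):
        if tag == self._current:
            self._current = None

    def handle_data(self, data):
        if self._current is not None:
            self.headings[self._current] += data


def extract_product_name(html: str) -> Optional[str]:
    """Support both layouts in one place: prefer <h1>, fall back to <h2>."""
    parser = FirstHeadingParser()
    parser.feed(html)
    name = parser.headings.get("h1") or parser.headings.get("h2")
    return name.strip() if name else None
```

With a single differing field, a fallback like this stays short. Once many
fields each need their own branch, separate page object classes are likely to
be easier to maintain.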


Mapping webpage layouts
=======================

Once you have written a :ref:`page object class <page-objects>` for a webpage
layout, you need to make it so that your page object class is used for webpages
that use that webpage layout.

URL patterns
------------

Webpage layouts are often associated with specific URL patterns. For example,
all the product detail pages of an e-commerce website usually have similar URLs,
such as ``https://example.com/product/<product ID>``.

When that is the case, you can :ref:`associate your page object class with the
corresponding URL pattern <rules-intro>`.
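
Conceptually, such an association is a lookup from URL patterns to page object
classes. The following is a framework-free sketch of that idea using only the
standard library; it is not how web-poet rules are implemented. ``URL_RULES``,
``page_class_for``, both placeholder classes, and the ``/category/`` URL shape
are assumptions made for the example.

```python
import re
from typing import Optional, Type


class ProductPage:
    """Hypothetical page object class for product detail pages."""


class CategoryPage:
    """Hypothetical page object class for category listing pages."""


# Each entry pairs a URL pattern with the class that handles that layout.
URL_RULES = [
    (re.compile(r"^https://example\.com/product/[^/]+/?$"), ProductPage),
    (re.compile(r"^https://example\.com/category/[^/]+/?$"), CategoryPage),
]


def page_class_for(url: str) -> Optional[Type]:
    """Return the first class whose pattern matches the URL, if any."""
    for pattern, page_class in URL_RULES:
        if pattern.match(url):
            return page_class
    return None
```

A URL that matches no pattern yields ``None``, which is where a default or
fallback page object class could be plugged in.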


.. _switch:

Switch page object classes
--------------------------

Sometimes it is impossible to know, based on the target URL, which webpage
layout you are getting. For example, during `A/B testing`_, you could get a
random webpage layout on every request.

.. _A/B testing: https://en.wikipedia.org/wiki/A/B_testing

For these scenarios, we recommend that you create a special “switch” page
object class, and use it to switch to the right page object class at run time
based on the input you receive.

Your switch page object class should:

#.  Request all the inputs that the candidate page object classes may need.

    For example, if there are 2 candidate page object classes, and 1 of them
    requires browser HTML as input, while the other one requires an HTTP
    response, your switch page object class must request both.

    If combining different inputs is a problem, consider refactoring the
    candidate page object classes to require similar inputs.

#.  In its :meth:`~web_poet.pages.ItemPage.to_item` method:

    #.  Determine, based on the inputs, which candidate page object class to
        use.

    #.  Create an instance of the selected candidate page object class with the
        necessary input, call its :meth:`~web_poet.pages.ItemPage.to_item`
        method, and return its result.

You may use :class:`~web_poet.pages.SwitchPage` as a base class for your switch
page object class, so you only need to implement the
:meth:`~web_poet.pages.SwitchPage.switch` method that determines which
candidate page object class to use. For example:

.. code-block:: python

    import attrs
    from web_poet import (
        HttpResponse,
        Injectable,
        ItemPage,
        SwitchPage,
        field,
        handle_urls,
    )


    @attrs.define
    class Header:
        text: str


    @attrs.define
    class H1Page(ItemPage[Header]):
        response: HttpResponse

        @field
        def text(self) -> str:
            return self.response.css("h1::text").get()


    @attrs.define
    class H2Page(ItemPage[Header]):
        response: HttpResponse

        @field
        def text(self) -> str:
            return self.response.css("h2::text").get()


    @handle_urls("example.com")
    @attrs.define
    class HeaderSwitchPage(SwitchPage[Header]):
        response: HttpResponse

        async def switch(self) -> Injectable:
            # Use the h1-based page object when the response has an h1
            # element, and the h2-based one otherwise.
            if self.response.css("h1::text"):
                return H1Page
            return H2Page

69 changes: 69 additions & 0 deletions tests/test_pages.py
@@ -10,6 +10,7 @@
ItemT,
ItemWebPage,
Returns,
SwitchPage,
WebPage,
is_injectable,
)
@@ -33,6 +34,74 @@ def to_item(self) -> dict:
}


@pytest.mark.asyncio
async def test_switch_page_object():

    @attrs.define
    class Header:
        text: str

    @attrs.define
    class H1Page(ItemPage[Header]):
        response: HttpResponse

        @field
        def text(self) -> str:
            return self.response.css("h1::text").get()

    @attrs.define
    class H2Page(ItemPage[Header]):
        response: HttpResponse

        @field
        def text(self) -> str:
            return self.response.css("h2::text").get()

    @attrs.define
    class HeaderSwitchPage(SwitchPage[Header]):
        response: HttpResponse

        async def switch(self) -> Injectable:
            if self.response.css("h1::text"):
                return H1Page
            return H2Page

    html_h1 = b"""
    <!DOCTYPE html>
    <html lang="en">
    <head>
    <title>h1</title>
    </head>
    <body>
    <h1>a</h1>
    </body>
    </html>
    """
    html_h2 = b"""
    <!DOCTYPE html>
    <html lang="en">
    <head>
    <title>h2</title>
    </head>
    <body>
    <h2>b</h2>
    </body>
    </html>
    """

    response1 = HttpResponse("https://example.com", body=html_h1)
    response2 = HttpResponse("https://example.com", body=html_h2)

    item1 = await HeaderSwitchPage(response=response1).to_item()
    item2 = await HeaderSwitchPage(response=response2).to_item()

    assert item1.text == "a"
    assert item2.text == "b"


Contributor

Let's also add a test for the use case where a MultiLayoutPage subclass itself uses multiple MultiLayoutPages underneath. I think this could happen when users import POs from different packages and repackage them.

class MyMultiLayoutPage(MultiLayoutPage[SomeItem]):
    response: HttpResponse
    layout_page_us: LayoutPageUS
    layout_page_uk: LayoutPageUK

    async def get_layout(self) -> ItemPage[SomeItem]:
        # .get() extracts the matched text so it can be compared to a string
        if self.response.css(".origin::text").get() == "us":
            return await self.layout_page_us.get_layout()
        return await self.layout_page_uk.get_layout()

It might also be worth creating a doc about this.

def test_web_page_object(book_list_html_response) -> None:
    class MyWebPage(WebPage):
        def to_item(self) -> dict:  # type: ignore
22 changes: 22 additions & 0 deletions web_poet/pages.py
@@ -68,6 +68,28 @@ async def to_item(self) -> ItemT:
)


class SwitchPage(Injectable, Returns[ItemT]):
    """Base class for :ref:`switch page object classes <switch>`.

    Subclasses must reimplement the :meth:`switch` method.
    """

    @abc.abstractmethod
    async def switch(self) -> Injectable:
        """Return the right :ref:`page object class <page-objects>` based on
        the received input."""
Contributor

I have a question about the expected behavior of the implementing framework (e.g. scrapy-poet) when the Switch PO has chosen the PO to parse the page.

Specifically, would the frameworks need to check if the chosen PO is declared in any of the rules' instead_of parameter?

Approach 1

One use case I can think of is:

if layout_1(self.response):
    return ProductLayout1Page
elif layout_2(self.response):
    return ProductLayout2Page
else:
    return AutomaticExtractionPage

In this example, the last resort is Automatic Extraction (e.g. by ML models). If the instead_of parameters are supported, then it would be of great convenience to the user to simply rely on the other overrides stored in the registry to return a more apt PO.

Approach 2

Alternatively, users can simply do:

if layout_1(self.response):
    return ProductLayout1Page
elif layout_2(self.response):
    return ProductLayout2Page
else:
    raise web_poet.DelegateFallback

(Reference: #26)

In this case, it'd be up to the implementing framework to decide if the instead_of parameters are used, or perhaps some other means of declaring and using the fallback PO via some user-defined settings.
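
A framework-free sketch of how Approach 2 could look from the framework's side. `DelegateFallback` below is a locally defined stand-in for the exception proposed in #26, not an existing web-poet API. The `switch` and `resolve_page_class` functions, the `fallback` argument, and the string checks that stand in for `layout_1`/`layout_2` are all hypothetical.

```python
from typing import Callable, Type


class DelegateFallback(Exception):
    """Stand-in for the proposed exception: no candidate page object fits."""


class ProductLayout1Page: ...
class ProductLayout2Page: ...
class AutomaticExtractionPage: ...


def switch(html: str) -> Type:
    """Mimic a switch method that only knows two layouts."""
    if 'class="layout-1"' in html:
        return ProductLayout1Page
    if 'class="layout-2"' in html:
        return ProductLayout2Page
    raise DelegateFallback


def resolve_page_class(
    switch_func: Callable[[str], Type], html: str, fallback: Type
) -> Type:
    """Framework side: use the switch result, or the configured fallback."""
    try:
        return switch_func(html)
    except DelegateFallback:
        return fallback
```

How `fallback` gets chosen (the instead_of parameters, a setting, or something else) is exactly the part that stays open here.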

Approach 3

The other approach is not to use the instead_of parameters at all since it should exactly follow what PO class the user has returned.

Other approaches

Could perhaps be a combination of the ideas above to give a finer grain of control to the user; or perhaps completely something else.

Member Author

I have no clear idea of what the best approach is here. I think your question aligns a bit with one of my open questions:

Do we need to have some override mechanism for candidate page object classes?

Member

@kmike, Nov 8, 2022

I think we don't need to support overrides for candidate page object classes, at least in the first version.

        raise NotImplementedError

    async def to_item(self) -> ItemT:
        """Create an instance of the class that :meth:`switch` returns with the
        required input, and return the output of its
        :meth:`~web_poet.pages.ItemPage.to_item` method."""
        page_object_class = await self.switch()
        # page_object = page_object_class(...)  # TODO: pass the right inputs
Contributor

This is tricky since dependencies are dynamic.

For example:

└── ProductSelectPage
    ├── ProductLayout1Page → needs httpResponseBody
    ├── ProductLayout2Page → needs browserHtml
    └── AutomaticExtractionPage → needs AutoExtractProductData

Providing all three of the dependencies should solve it but it could be expensive.

There should be a way for the Switch/Select/Choose PO to declare the dependencies of the PO to use. This won't work with the way dependencies are currently provided, since the Switch/Select/Choose PO is already built as an instance, and obtaining another dependency input at that point is something we would need to work on.


Alternatively, I wonder whether we should look at solving this via overrides, with some slight tweaks. For example:

ApplyRule(
    for_patterns="example.com",
    use=web_poet.DynamicPO,
    instead_of=ProductSelectPage,
    to_return=Product,
)

The web_poet.DynamicPO is simply a sentinel class which serves as a stand-in while we still don't know the final PO to use.

If this is present, then the behavior of ApplyRule changes a bit:

  1. instantiate whatever's in the instead_of parameter and resolve any dependencies that it needs (e.g. httpResponseBody)
  2. call the .switch() method (or other names we can think of for this)
  3. this determines the class to be placed in the use parameter
  4. use the ApplyRule as usual

The concept of overrides could fit in here since the Switch/Select/Choose PO is essentially overridden by the PO it has selected.
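
The four steps above could be sketched roughly as follows. This is only an illustration with stand-in types: `Rule` is a simplified placeholder for ApplyRule, `DynamicPO` is the proposed sentinel (it does not exist in web-poet), `resolve_rule` is a hypothetical helper, and `switch` is synchronous here to keep the example short.

```python
from dataclasses import dataclass, replace
from typing import Type


class DynamicPO:
    """Sentinel: the page object class to use is not known yet."""


class ProductLayout1Page: ...
class ProductLayout2Page: ...


class ProductSelectPage:
    """Switch-style page object that picks a layout from its input."""

    def __init__(self, response: str) -> None:
        self.response = response

    def switch(self) -> Type:
        return ProductLayout1Page if "<h1>" in self.response else ProductLayout2Page


@dataclass(frozen=True)
class Rule:
    """Simplified stand-in for ApplyRule with only the fields used here."""

    for_patterns: str
    use: Type
    instead_of: Type


def resolve_rule(rule: Rule, response: str) -> Rule:
    """Turn a rule whose ``use`` is DynamicPO into a concrete rule."""
    if rule.use is not DynamicPO:
        return rule  # ordinary rule, nothing to resolve
    selector = rule.instead_of(response=response)  # step 1: build with its inputs
    chosen = selector.switch()  # step 2: ask it to choose
    return replace(rule, use=chosen)  # step 3: fill in ``use``; step 4 is unchanged
```

The resolved rule can then go through the usual rule handling, which is step 4.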

Member Author

The proposed solution indeed works best for a scenario where the inputs are the same for all layout-specific page objects, and where the required layout is likely to be random (i.e. A/B testing scenario, as opposed to special URLs where the same URL is likely to get the same layout when you refresh).

In this random scenario, requesting httpResponseBody to determine the layout, determining that the layout is 2, and then fetching browserHtml to parse it with layout 2, would not work, because the browserHtml response that comes back may be for layout 1.
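
To make that race concrete, here is a small simulation using only the standard library. `fake_fetch` is a hypothetical stand-in for a site under A/B testing; it alternates layouts on consecutive requests instead of picking one at random, so that the outcome is reproducible.

```python
from itertools import cycle
from typing import Tuple

# Consecutive requests to the same URL alternate between two layouts
# (a real site under A/B testing would pick one at random).
_layouts = cycle(["<h1>Name</h1>", "<h2>Name</h2>"])


def fake_fetch(url: str) -> str:
    """Return the next layout in the cycle, regardless of the URL."""
    return next(_layouts)


def detect_layout(html: str) -> int:
    return 1 if "<h1>" in html else 2


def detect_then_refetch(url: str) -> Tuple[int, int]:
    """Detect on one response, then parse a second one: layouts can differ."""
    detected = detect_layout(fake_fetch(url))
    parsed = detect_layout(fake_fetch(url))
    return detected, parsed


def detect_and_parse_same_response(url: str) -> Tuple[int, int]:
    """Detect and parse the same response: the layouts always agree."""
    html = fake_fetch(url)
    return detect_layout(html), detect_layout(html)
```

The first function returns two different layout numbers in this simulation, which is the failure mode described above. The second cannot disagree because it never makes a second request.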

I think an approach in line with the current proposal may still make sense, at least for some scenarios like the A/B test one.

I wonder if it would make sense to implement 2 different solutions for the 2 different scenarios, especially because allowing inputs to be requested incrementally based on their underlying cost will probably require a significantly more complicated approach.

        page_object = page_object_class(response=self.response)
        return await page_object.to_item()


@attr.s(auto_attribs=True)
class WebPage(ItemPage[ItemT], ResponseShortcutsMixin):
"""Base Page Object which requires :class:`~.HttpResponse`