Add SwitchPage #103

Open: wants to merge 15 commits into base: master (showing changes from 9 commits)
4 changes: 4 additions & 0 deletions docs/api-reference.rst
@@ -55,6 +55,10 @@ Pages
:show-inheritance:
:members:

.. autoclass:: MultiLayoutPage
:show-inheritance:
:members: layout

Mixins
======

1 change: 1 addition & 0 deletions docs/index.rst
@@ -25,6 +25,7 @@ web-poet
page-objects/additional-requests
page-objects/fields
page-objects/rules
Webpage layouts <page-objects/layouts>
page-objects/retries
page-objects/page-params

170 changes: 170 additions & 0 deletions docs/page-objects/layouts.rst
@@ -0,0 +1,170 @@
.. _layouts:

===============
Webpage layouts
===============

Different webpages may show the same *type* of page, but different *data*. For
example, in an e-commerce website there are usually many product detail pages,
each showing data from a different product.

The code that those webpages have in common is their **webpage layout**.

Coding for webpage layouts
==========================

Webpage layouts should inform how you organize your data extraction code.

A good practice to keep your code maintainable is to have a separate :ref:`page
object class <page-objects>` per webpage layout.

Trying to support multiple webpage layouts with the same page object class can
make your class hard to maintain.


Identifying webpage layouts
===========================

There is no precise way to determine whether 2 webpages have the same or a
different webpage layout. You must decide based on what you know, and be ready
to adapt if things change.

It is also often difficult to identify webpage layouts before you start writing
extraction code. Completely different webpage layouts can have the same look,
and very similar webpage layouts can look completely different.

It can be a good starting point to assume that, for a given combination of
data type and website, there is going to be a single webpage layout. For
example, assume that all product pages of a given e-commerce website will have
the same webpage layout.

Then, as you write a :ref:`page object class <page-objects>` for that webpage
layout, you may find out more, and adapt.

When the same piece of information must be extracted from a different place for
different webpages, that is a sign that you may be dealing with more than 1
webpage layout. For example, if on some webpages the product name is in an
``h1`` element, but on some webpages it is in an ``h2`` element, chances are
there are at least 2 different webpage layouts.
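As a rough illustration of that sign, the following sketch checks which heading element a few sample pages use. It relies only on the Python standard library rather than on web-poet, and the names ``FirstHeadingFinder``, ``guess_layout`` and the sample HTML are hypothetical.

```python
from html.parser import HTMLParser
from typing import Optional


class FirstHeadingFinder(HTMLParser):
    """Record the tag name of the first h1 or h2 element seen."""

    def __init__(self) -> None:
        super().__init__()
        self.first_heading: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names in lowercase.
        if self.first_heading is None and tag in ("h1", "h2"):
            self.first_heading = tag


def guess_layout(html: str) -> Optional[str]:
    """Return "h1" or "h2" depending on which heading comes first, else None."""
    finder = FirstHeadingFinder()
    finder.feed(html)
    return finder.first_heading


# Hypothetical sample pages standing in for real responses.
samples = {
    "https://example.com/product/1": "<html><body><h1>Chair</h1></body></html>",
    "https://example.com/product/2": "<html><body><h2>Table</h2></body></html>",
}
layouts = {url: guess_layout(html) for url, html in samples.items()}
# More than one distinct value hints at more than one webpage layout.
distinct_layouts = set(layouts.values())
```

A check like this only surfaces a hint; whether the difference justifies a second page object class remains a judgment call.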

However, whether you continue to work as if everything uses the same webpage
layout or split your page object class into 2 page object classes, each
targeting one of the webpage layouts you have found, is entirely up to you.

Ask yourself: Is supporting all webpage layout differences making your page
object class implementation only a few lines of code longer, or is it making it
an unmaintainable bowl of spaghetti code?


Mapping webpage layouts
=======================

Once you have written a :ref:`page object class <page-objects>` for a webpage
layout, you need to make sure that your page object class is used for webpages
that have that webpage layout.

URL patterns
------------

Webpage layouts are often associated with specific URL patterns. For example,
all the product detail pages of an e-commerce website usually have similar
URLs, such as ``https://example.com/product/<product ID>``.

When that is the case, you can :ref:`associate your page object class with the
corresponding URL pattern <rules-intro>`.
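To picture the idea of mapping URL patterns to page object classes, here is a minimal standard-library sketch. It does not reproduce how web-poet rules are declared or matched; ``RULES``, ``page_class_for``, the two placeholder classes and the regular expressions are all assumptions made for illustration.

```python
import re
from typing import Dict, Optional, Pattern


class ProductPage:
    """Placeholder for a page object class targeting product detail pages."""


class CategoryPage:
    """Placeholder for a page object class targeting category pages."""


# Hypothetical registry: one URL pattern per webpage layout.
RULES: Dict[Pattern[str], type] = {
    re.compile(r"^https://example\.com/product/\d+$"): ProductPage,
    re.compile(r"^https://example\.com/category/[\w-]+$"): CategoryPage,
}


def page_class_for(url: str) -> Optional[type]:
    """Return the page object class registered for the URL, or None."""
    for pattern, page_class in RULES.items():
        if pattern.match(url):
            return page_class
    return None
```

With this registry, ``page_class_for("https://example.com/product/42")`` selects ``ProductPage``, while a URL that matches no pattern yields ``None``.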


.. _multi-layout:

Multi-layout page object classes
--------------------------------

Sometimes it is impossible to know, based on the target URL, which webpage
layout you are getting. For example, during `A/B testing`_, you could get a
random webpage layout on every request.

.. _A/B testing: https://en.wikipedia.org/wiki/A/B_testing

For these scenarios, we recommend that you create different page object classes
for the different layouts that you may get, and then write a special
“multi-layout” page object class that selects the right page object class at
run time based on the input you receive.

Your multi-layout page object class should:

#.  Declare attributes for the input that you will need to determine which page
    object class to use.

    For example, declare an :class:`HttpResponse` attribute to select a page
    object class based on the response content.

#.  Declare an attribute for every page object class that you may use depending
    on which webpage layout you get from the target website.

    They all should return the same type of :ref:`item <item-classes>` as your
    multi-layout page object class.

    Note that all inputs of all those page object classes will be resolved and
    requested along with the input of your multi-layout page object class. For
    example, if one page object class requires browser HTML as input, while
    another requires an HTTP response, your multi-layout page object class asks
    for both inputs.

    If combining different inputs is a problem, consider refactoring your page
    object classes to require similar inputs.

#.  In its :meth:`~web_poet.pages.ItemPage.to_item` method:

    #.  Determine, based on inputs, which page object to use.

    #.  Return the output of the :meth:`~web_poet.pages.ItemPage.to_item`
        method of that page object.
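The three steps above can be sketched without any base class. The following is a standard-library approximation, not web-poet code: plain dataclasses stand in for page object classes, a raw HTML string stands in for the response input, and the helper ``first_text`` (a regular expression used only for brevity) is hypothetical.

```python
import asyncio
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class Header:
    text: Optional[str]


def first_text(html: str, tag: str) -> Optional[str]:
    """Return the text of the first <tag> element, or None if there is none."""
    match = re.search(rf"<{tag}[^>]*>(.*?)</{tag}>", html, re.DOTALL)
    return match.group(1).strip() if match else None


@dataclass
class H1Page:
    html: str

    async def to_item(self) -> Header:
        return Header(text=first_text(self.html, "h1"))


@dataclass
class H2Page:
    html: str

    async def to_item(self) -> Header:
        return Header(text=first_text(self.html, "h2"))


@dataclass
class HeaderMultiLayoutPage:
    # Step 1: the input needed to decide which page object to use.
    html: str
    # Step 2: one attribute per candidate page object.
    h1: H1Page
    h2: H2Page

    async def to_item(self) -> Header:
        # Step 3: pick a page object based on the input, then delegate to it.
        page = self.h1 if first_text(self.html, "h1") is not None else self.h2
        return await page.to_item()


def extract(html: str) -> Header:
    """Build the multi-layout page object by hand and run its to_item()."""
    page = HeaderMultiLayoutPage(html=html, h1=H1Page(html), h2=H2Page(html))
    return asyncio.run(page.to_item())
```

Note that both candidate page objects are built up front even though only one is used, which mirrors the point above about all inputs being requested.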

You may use :class:`~web_poet.pages.MultiLayoutPage` as a base class for your
multi-layout page object class, so that you only need to implement the
:meth:`~web_poet.pages.MultiLayoutPage.layout` method that determines which
page object to use. For example:

.. code-block:: python

    import attrs
    from web_poet import field, handle_urls, HttpResponse, ItemPage, MultiLayoutPage, WebPage


    @attrs.define
    class Header:
        text: str


    @attrs.define
    class H1Page(WebPage[Header]):

        @field
        def text(self) -> str:
            return self.css("h1::text").get()


    @attrs.define
    class H2Page(WebPage[Header]):

        @field
        def text(self) -> str:
            return self.css("h2::text").get()


    @handle_urls("example.com")
    @attrs.define
    class HeaderMultiLayoutPage(MultiLayoutPage[Header]):
        response: HttpResponse
        h1: H1Page
        h2: H2Page

        async def layout(self) -> ItemPage[Header]:
            # Use the h1 layout when the response has an h1 with text.
            if self.response.css("h1::text"):
                return self.h1
            return self.h2

.. note:: If you use :func:`~web_poet.handle_urls` both for your multi-layout
          page object class and for any of the page object classes that it
          uses, you may need to :ref:`grant your multi-layout page object class
          a higher priority <rules-priority-resolution>`.
67 changes: 67 additions & 0 deletions tests/test_pages.py
@@ -9,6 +9,7 @@
ItemPage,
ItemT,
ItemWebPage,
MultiLayoutPage,
Returns,
WebPage,
is_injectable,
@@ -33,6 +34,72 @@ def to_item(self) -> dict:
}


@pytest.mark.asyncio
async def test_multi_layout_page_object():
    @attrs.define
    class Header:
        text: str

    @attrs.define
    class H1Page(WebPage[Header]):
        @field
        def text(self) -> Optional[str]:
            return self.css("h1::text").get()

    @attrs.define
    class H2Page(WebPage[Header]):
        @field
        def text(self) -> Optional[str]:
            return self.css("h2::text").get()

    @attrs.define
    class HeaderMultiLayoutPage(MultiLayoutPage[Header]):
        response: HttpResponse
        h1: H1Page
        h2: H2Page

        async def layout(self) -> ItemPage[Header]:
            if self.response.css("h1::text"):
                return self.h1
            return self.h2

    html_h1 = b"""
    <!DOCTYPE html>
    <html lang="en">
    <head>
    <title>h1</title>
    </head>
    <body>
    <h1>a</h1>
    </body>
    </html>
    """
    html_h2 = b"""
    <!DOCTYPE html>
    <html lang="en">
    <head>
    <title>h2</title>
    </head>
    <body>
    <h2>b</h2>
    </body>
    </html>
    """

    response1 = HttpResponse("https://example.com", body=html_h1)
    h1_1 = H1Page(response=response1)
    h2_1 = H2Page(response=response1)
    response2 = HttpResponse("https://example.com", body=html_h2)
    h1_2 = H1Page(response=response2)
    h2_2 = H2Page(response=response2)

    item1 = await HeaderMultiLayoutPage(response=response1, h1=h1_1, h2=h2_1).to_item()
    item2 = await HeaderMultiLayoutPage(response=response2, h1=h1_2, h2=h2_2).to_item()

    assert item1.text == "a"
    assert item2.text == "b"


def test_web_page_object(book_list_html_response) -> None:
class MyWebPage(WebPage):
def to_item(self) -> dict: # type: ignore
2 changes: 1 addition & 1 deletion tox.ini
@@ -1,5 +1,5 @@
[tox]
-envlist = py37,py38,py39,py310,py311,mypy,docs,types
+envlist = py37,py38,py39,py310,py311,mypy,docs,types,linters

[pytest]
asyncio_mode = strict
19 changes: 19 additions & 0 deletions web_poet/pages.py
@@ -68,6 +68,25 @@ async def to_item(self) -> ItemT:
)


class MultiLayoutPage(ItemPage[ItemT]):
Member

@kmike kmike Nov 9, 2022

I wonder if we should support fields for MultiLayoutPage, or not.
Currently all fields defined for MultiLayoutPage are silently ignored.

There is also a use case when you need a "partial" layout, layout of a region of a page. In this case, only some fields are extracted by the layout; others are common. It seems we have a few options:

  1. Use a base class with common fields. Create layouts which inherit from the base class and handle necessary regions. Use them in MultiLayoutPage.
  2. Allow to define fields on the MultiLayoutPage. When layout is picked, combine the fields from the layout with the fields defined in the class itself, to compute the final item. Supporting to_item for layouts can be more tricky in this case, as it's not clear which item they should use.

A separate, but related issue is if it's possible to use 2 or more regions, with different layouts, in the same page object.

If fields are supported, it seems it makes sense to move the logic to ItemPage. I think it may simplify typing, and inheritance as well. E.g. layouts can be used with ProductPage from zytedata/zyte-common-items#19 without using multiple inheritance.

It seems that if fields are not supported, it's better to keep MultiLayoutPage as a separate class, and probably raise an error or issue a warning if fields are defined. There is one argument for keeping it separate and not supporting fields: it'd allow to define fields named layout. If we provide a standard method named layout, and allow to define fields, it's not possible to have a field of the same name.

Sorry for a braindump :) What do you think?

Member

A disadvantage of using MultiLayoutPage without fields: it's not possible to use fields in the code which uses the page object. So let's say we have ProductPage, it uses fields for data extraction. Then, it's refactored to use MultiLayoutPage as a base class. It means that the fields are no longer supported, and so the code which uses this page may break.

Member Author

@Gallaecio Gallaecio Nov 9, 2022

I am slightly against field support in MultiLayoutPage.

In addition to those 2 options you mention, I can think of a 3rd that is a variation on 1 based precisely on how you amended my API proposal for MultiLayoutPage:

  1. Use composition rather than inheritance: define a page object class that can extract the common fields, make it a dependency of the layouts that extract additional fields, and use it in the to_item of those layouts (which would be the layouts which MultiLayoutPage handles).

We would still have the same problem as with the lack of fields in MultiLayoutPage, i.e. you could not access the fields of the dependency layout through the layout that uses it (other than accessing the dependency directly, e.g. layout.dependency.field). But I wonder if we could implement a getattr fallback mechanism to solve this issue for both scenarios: allow MultiLayoutPage to expose the fields of the object that its layout() method returns, and allow layouts with other layouts as dependencies to expose the fields of their dependencies.
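One way to picture the ``getattr`` fallback described above is the sketch below. It is not code from this pull request: it uses plain dataclasses instead of web-poet classes, properties instead of fields, and a synchronous, hypothetical ``get_layout`` method, since ``__getattr__`` cannot await the async ``layout()`` proposed here.

```python
from dataclasses import dataclass


@dataclass
class H1Page:
    html: str

    @property
    def text(self) -> str:
        return "from h1"


@dataclass
class H2Page:
    html: str

    @property
    def text(self) -> str:
        return "from h2"


@dataclass
class FallbackMultiLayoutPage:
    html: str
    h1: H1Page
    h2: H2Page

    def get_layout(self):
        # Hypothetical synchronous counterpart of layout().
        return self.h1 if "<h1" in self.html else self.h2

    def __getattr__(self, name: str):
        # Only called when normal attribute lookup fails. Guard the names that
        # get_layout() itself reads, so a missing one cannot recurse forever.
        if name.startswith("_") or name in ("html", "h1", "h2"):
            raise AttributeError(name)
        return getattr(self.get_layout(), name)


html = "<html><body><h1>a</h1></body></html>"
page = FallbackMultiLayoutPage(html=html, h1=H1Page(html), h2=H2Page(html))
```

Here ``page.text`` resolves through the selected layout, so code that reads fields on the multi-layout object keeps working, while attributes missing from the selected layout still raise ``AttributeError``.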

Contributor

I'm also not sure about supporting @fields in MultiLayoutPage.

It would seem best to keep its task simple: it identifies and returns the PO instance based on the layout.

Member Author

> There is one argument for keeping it separate and not supporting fields: it'd allow to define fields named layout. If we provide a standard method named layout, and allow to define fields, it's not possible to have a field of the same name.

This is a valid point, but I think we could switch to get_layout to avoid this issue, and it would be a better method name anyway.

Member Author

@Gallaecio can you please explain the last idea in more details, e.g. with some code?

fc7867f

Member

Yeah, this looks nice.

Member

We were discussing it today on a call with @proway2. They implemented a multi-layout page object, tried a few approaches. In short - having fields on the final page object is a must :) That's the reason the documented approach here won't work well for them. Taking union of all dependencies is fine.

Member

Currently they define all the fields, and call to the self._layout in each field. It's a lot of boilerplate; exactly something a library should be solving.

Member

It seems having fields available is required, but not necessarily being able to define top-level fields.

    """Base class for :ref:`multi-layout page object classes <multi-layout>`.

    Subclasses must reimplement the :meth:`layout` method.
    """

    @abc.abstractmethod
    async def layout(self) -> ItemPage[ItemT]:
        """Return the :ref:`page object <page-objects>` to use based on the
        received input."""

    async def to_item(self) -> ItemT:
        """Return the output of the :meth:`~web_poet.pages.ItemPage.to_item`
        method of the :ref:`page object <page-objects>` that :meth:`layout`
        returns."""
        page_object = await self.layout()
        return await page_object.to_item()


@attr.s(auto_attribs=True)
class WebPage(ItemPage[ItemT], ResponseShortcutsMixin):
"""Base Page Object which requires :class:`~.HttpResponse`