Merge pull request #56 from scrapinghub/url-matcher-integration
url-matcher integration with scrapy-poet
BurnzZ authored May 19, 2022
2 parents 581e0f6 + 0bc51b8 commit 53e5b92
Showing 20 changed files with 504 additions and 118 deletions.
7 changes: 7 additions & 0 deletions CHANGELOG.rst
@@ -6,6 +6,13 @@ TBR
---

* Use the new ``web_poet.HttpResponse`` which replaces ``web_poet.ResponseData``.
* We have these **backward incompatible** changes since the
``web_poet.OverrideRule`` follows a different structure:

* Deprecated ``PerDomainOverridesRegistry`` in lieu of the newer
``OverridesRegistry`` which provides a wide variety of features
for better URL matching.
* This results in a newer format in the ``SCRAPY_POET_OVERRIDES`` setting.


0.3.0 (2022-01-28)
3 changes: 2 additions & 1 deletion docs/conf.py
@@ -188,7 +188,8 @@
intersphinx_mapping = {
'python': ('https://docs.python.org/3', None, ),
'scrapy': ('https://docs.scrapy.org/en/latest', None, ),
'web_poet': ('https://web-poet.readthedocs.io/en/stable/', None),
'web-poet': ('https://web-poet.readthedocs.io/en/latest/', None),
'url-matcher': ('https://url-matcher.readthedocs.io/en/stable/', None),
}

autodoc_default_options = {
98 changes: 73 additions & 25 deletions docs/intro/tutorial.rst
@@ -9,7 +9,7 @@ system. If that’s not the case, see :ref:`intro-install`.

.. note::

This tutorial can be followed without reading `web-poet docs`_, but
This tutorial can be followed without reading `web-poet`_ docs, but
for a better understanding it is highly recommended to check them first.


@@ -26,7 +26,7 @@ This tutorial will walk you through these tasks:
If you're not already familiar with Scrapy, and want to learn it quickly,
the `Scrapy Tutorial`_ is a good resource.

.. _web-poet docs: https://web-poet.readthedocs.io/en/stable/
.. _web-poet: https://web-poet.readthedocs.io/en/stable/

Creating a spider
=================
@@ -125,8 +125,8 @@ To use ``scrapy-poet``, enable its downloader middleware in ``settings.py``:
``BookPage`` class we created previously can be used without ``scrapy-poet``,
and even without Scrapy (note that imports were from ``web_poet`` so far).

``scrapy-poet`` makes it easy to use ``web-poet`` Page Objects
(such as BookPage) in Scrapy spiders.
``scrapy-poet`` makes it easy to use `web-poet`_ Page Objects
(such as ``BookPage``) in Scrapy spiders.

Changing spider
===============
@@ -354,12 +354,10 @@ be done by configuring ``SCRAPY_POET_OVERRIDES`` into ``settings.py``:

.. code-block:: python
SCRAPY_POET_OVERRIDES = {
"toscrape.com": {
BookListPage: BTSBookListPage,
BookPage: BTSBookPage
}
}
SCRAPY_POET_OVERRIDES = [
("toscrape.com", BTSBookListPage, BookListPage),
("toscrape.com", BTSBookPage, BookPage)
]
The spider is back to life!
``SCRAPY_POET_OVERRIDES`` contains rules that override the Page Objects
@@ -390,15 +388,15 @@ to implement new ones:
class BPBookListPage(WebPage):
def book_urls(self):
return self.css('.article-info a::attr(href)').getall()
return self.css('article.post h4 a::attr(href)').getall()
class BPBookPage(ItemWebPage):
def to_item(self):
return {
'url': self.url,
'name': self.css(".book-data h4::text").get().strip(),
'name': self.css("body div > h1::text").get().strip(),
}
The last step is configuring the overrides so that these new Page Objects
@@ -408,32 +406,82 @@ are used for the domain

.. code-block:: python
SCRAPY_POET_OVERRIDES = {
"toscrape.com": {
BookListPage: BTSBookListPage,
BookPage: BTSBookPage
},
"bookpage.com": {
BookListPage: BPBookListPage,
BookPage: BPBookPage
}
}
SCRAPY_POET_OVERRIDES = [
("toscrape.com", BTSBookListPage, BookListPage),
("toscrape.com", BTSBookPage, BookPage),
("bookpage.com", BPBookListPage, BookListPage),
("bookpage.com", BPBookPage, BookPage)
]
The spider is now ready to extract books from both sites 😀.
The full example
`can be seen here <https://github.com/scrapinghub/scrapy-poet/tree/master/example/example/spiders/books_04_overrides_02.py>`_

On a surface, it looks just like a different way to organize Scrapy spider
On the surface, it looks just like a different way to organize Scrapy spider
code - and indeed, it *is* just a different way to organize the code,
but it opens some cool possibilities.

In the examples above we have been configuring the overrides
for a particular domain, but more complex URL patterns are also possible.
For example, the pattern ``books.toscrape.com/catalogue/category/``
is accepted and it would restrict the override only to category pages.

It is even possible to configure more complex patterns by using the
:py:class:`web_poet.overrides.OverrideRule` class instead of a triplet in
the configuration. Another way of declaring the earlier config
for ``SCRAPY_POET_OVERRIDES`` would be the following:

.. code-block:: python

    from url_matcher import Patterns
    from web_poet import OverrideRule

    SCRAPY_POET_OVERRIDES = [
        OverrideRule(for_patterns=Patterns(["toscrape.com"]), use=BTSBookListPage, instead_of=BookListPage),
        OverrideRule(for_patterns=Patterns(["toscrape.com"]), use=BTSBookPage, instead_of=BookPage),
        OverrideRule(for_patterns=Patterns(["bookpage.com"]), use=BPBookListPage, instead_of=BookListPage),
        OverrideRule(for_patterns=Patterns(["bookpage.com"]), use=BPBookPage, instead_of=BookPage),
    ]

As you can see, this could get verbose. The earlier tuple config simply offers
a shortcut to be more concise.
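
To make the relationship between the two formats concrete, the following is a minimal sketch of how a triplet could be expanded into a rule object. The ``Patterns`` and ``OverrideRule`` dataclasses below are simplified stand-ins written only for this example, and ``normalize_overrides`` is a hypothetical helper; none of them is the actual ``url-matcher``, ``web-poet`` or ``scrapy-poet`` code.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple, Union


@dataclass(frozen=True)
class Patterns:
    """Simplified stand-in for url_matcher.Patterns (illustration only)."""
    include: Tuple[str, ...]
    exclude: Tuple[str, ...] = ()


@dataclass(frozen=True)
class OverrideRule:
    """Simplified stand-in for web_poet.OverrideRule (illustration only)."""
    for_patterns: Patterns
    use: type
    instead_of: type


def normalize_overrides(
    entries: Iterable[Union[OverrideRule, Tuple[str, type, type]]]
) -> List[OverrideRule]:
    """Hypothetical helper: turn triplets into rule objects, keep rules as-is."""
    rules = []
    for entry in entries:
        if isinstance(entry, OverrideRule):
            rules.append(entry)
        else:
            # Triplet order follows the tutorial: (pattern, use, instead_of).
            pattern, use, instead_of = entry
            rules.append(OverrideRule(Patterns((pattern,)), use, instead_of))
    return rules


class BookPage: ...
class BTSBookPage(BookPage): ...


rules = normalize_overrides([
    ("toscrape.com", BTSBookPage, BookPage),
    OverrideRule(Patterns(("toscrape.com",)), BTSBookPage, BookPage),
])
```

Under these assumptions both entries end up as equal rule objects, which is why the two ways of writing the setting can be treated as interchangeable.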

.. note::

    Also see the `url-matcher <https://url-matcher.readthedocs.io/en/stable/>`_
    documentation for more information about the patterns syntax.

Manually defining overrides like this would be inconvenient, especially
for larger projects. Fortunately, `web-poet`_ provides the
:py:func:`web_poet.handle_urls` decorator to annotate Page Objects; it defines
and stores the :py:class:`web_poet.overrides.OverrideRule` for you. All of the
:py:class:`web_poet.overrides.OverrideRule` rules can then simply be read as:

.. code:: python

    from web_poet import default_registry, consume_modules

    # The consume_modules() must be called first if you need to properly import
    # rules from other packages. Otherwise, it can be omitted.
    # More info about this caveat on web-poet docs.
    consume_modules("external_package_A", "another_ext_package.lib")
    SCRAPY_POET_OVERRIDES = default_registry.get_overrides()

For more info on this, you can refer to these docs:

* ``scrapy-poet``'s :ref:`overrides` Tutorial section.
* External `web-poet`_ docs.

* Specifically, the :external:ref:`intro-overrides` Tutorial section.

Next steps
==========

Now that you know how ``scrapy-poet`` is supposed to work, what about trying to
apply it to an existing or new Scrapy project?

Also, please check :ref:`overrides`, :ref:`providers` and refer to spiders in the "example"
folder: https://github.com/scrapinghub/scrapy-poet/tree/master/example/example/spiders
Also, please check the :ref:`overrides` and :ref:`providers` sections as well as
refer to spiders in the "example" folder: https://github.com/scrapinghub/scrapy-poet/tree/master/example/example/spiders

.. _Scrapy Tutorial: https://docs.scrapy.org/en/latest/intro/tutorial.html
136 changes: 122 additions & 14 deletions docs/overrides.rst
@@ -8,6 +8,18 @@ on the request URL domain. Please have a look to :ref:`intro-tutorial` to
learn the basics about overrides before digging deeper in the content of this
page.

.. tip::

    Some real-world examples on this topic can be found in:

    - `Example 1 <https://github.com/scrapinghub/scrapy-poet/blob/master/example/example/spiders/books_04_overrides_01.py>`_:
      rules using tuples
    - `Example 2 <https://github.com/scrapinghub/scrapy-poet/blob/master/example/example/spiders/books_04_overrides_02.py>`_:
      rules using tuples and :py:class:`web_poet.overrides.OverrideRule`
    - `Example 3 <https://github.com/scrapinghub/scrapy-poet/blob/master/example/example/spiders/books_04_overrides_03.py>`_:
      rules using :py:func:`web_poet.handle_urls` decorator and retrieving them
      via :py:meth:`web_poet.overrides.PageObjectRegistry.get_overrides`

Page Objects refinement
=======================

@@ -47,13 +59,11 @@ And then override it for a particular domain using ``settings.py``:

.. code-block:: python
SCRAPY_POET_OVERRIDES = {
"example.com": {
BookPage: ISBNBookPage
}
}
SCRAPY_POET_OVERRIDES = [
("example.com", ISBNBookPage, BookPage)
]
This new Page Objects gets the original ``BookPage`` as dependency and enrich
This new Page Object gets the original ``BookPage`` as a dependency and enriches
the obtained item with the ISBN from the page HTML.

.. note::
@@ -80,20 +90,118 @@ the obtained item with the ISBN from the page HTML.
return item
Overrides rules
===============

The default way of configuring the override rules is using triplets
of the form (``url pattern``, ``override_type``, ``overridden_type``). But more
complex rules can be introduced if the class :py:class:`web_poet.overrides.OverrideRule`
is used. The following example configures an override that is only applied for
book pages from ``books.toscrape.com``:

.. code-block:: python

    from url_matcher import Patterns
    from web_poet import OverrideRule

    SCRAPY_POET_OVERRIDES = [
        OverrideRule(
            for_patterns=Patterns(
                include=["books.toscrape.com/catalogue/*index.html|"],
                exclude=["/catalogue/category/"]),
            use=MyBookPage,
            instead_of=BookPage
        )
    ]

Note how category pages are excluded by using an ``exclude`` pattern.
You can find more information about the patterns syntax in the
`url-matcher <https://url-matcher.readthedocs.io/en/stable/>`_
documentation.


Decorate Page Objects with the rules
====================================

Keeping the rules next to the Page Objects is a good idea,
as you can see at a glance what the Page Object does
and where it is applied. This can be done by decorating the
Page Objects with :py:func:`web_poet.handle_urls` provided by `web-poet`_.

.. tip::

    Make sure to read the :external:ref:`intro-overrides` Tutorial section of
    `web-poet`_ to learn about its other features that are not covered
    in this section.

Let's see an example:

.. code-block:: python

    from web_poet import handle_urls

    @handle_urls("toscrape.com", BookPage)
    class BTSBookPage(BookPage):

        def to_item(self):
            return {
                'url': self.url,
                'name': self.css("title::text").get(),
            }

The :py:func:`web_poet.handle_urls` decorator in this case is indicating that
the class ``BTSBookPage`` should be used instead of ``BookPage``
for the domain ``toscrape.com``.

In order to configure the ``scrapy-poet`` overrides automatically
using these annotations, you can directly interact with `web-poet`_'s
``default_registry`` (an instance of :py:class:`web_poet.overrides.PageObjectRegistry`).

For example:

.. code-block:: python

    from web_poet import default_registry, consume_modules

    # The consume_modules() must be called first if you need to properly import
    # rules from other packages. Otherwise, it can be omitted.
    # More info about this caveat on web-poet docs.
    consume_modules("external_package_A", "another_ext_package.lib")

    # To get all of the Override Rules that were declared via annotations.
    SCRAPY_POET_OVERRIDES = default_registry.get_overrides()

The :py:meth:`web_poet.overrides.PageObjectRegistry.get_overrides` method of the
``default_registry`` above returns the ``List[OverrideRule]`` of rules declared
using `web-poet`_'s :py:func:`web_poet.handle_urls` annotation. This is much
more convenient than manually defining all of the :py:class:`web_poet.overrides.OverrideRule`.

Take note that since ``SCRAPY_POET_OVERRIDES`` is structured as
``List[OverrideRule]``, you can easily modify it later on if needed.
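
The flow described above (a decorator records a rule, the registry hands back a list, and the list is edited afterwards) can be sketched with a toy registry. ``ToyRegistry`` and ``Rule`` are hypothetical names invented for this illustration; they only mimic the idea and are not `web-poet`_'s ``PageObjectRegistry`` or ``OverrideRule``.

```python
from typing import List, NamedTuple


class Rule(NamedTuple):
    """Hypothetical stand-in for web_poet's OverrideRule."""
    pattern: str
    use: type
    instead_of: type


class ToyRegistry:
    """Illustration only; not web_poet's PageObjectRegistry."""

    def __init__(self) -> None:
        self._rules: List[Rule] = []

    def handle_urls(self, pattern: str, overrides: type):
        def decorator(cls: type) -> type:
            # Record the rule at class-definition time.
            self._rules.append(Rule(pattern, use=cls, instead_of=overrides))
            return cls  # the decorated class itself is left unchanged
        return decorator

    def get_overrides(self) -> List[Rule]:
        # Return a copy so callers can edit the result freely.
        return list(self._rules)


registry = ToyRegistry()


class BookPage: ...


@registry.handle_urls("toscrape.com", BookPage)
class BTSBookPage(BookPage): ...


@registry.handle_urls("bookpage.com", BookPage)
class BPBookPage(BookPage): ...


# The result is a plain list, so ordinary list operations apply,
# for example dropping the rules of one site before using them.
overrides = [
    rule for rule in registry.get_overrides()
    if rule.pattern != "bookpage.com"
]
```

Because the registry returns a copy in this sketch, filtering ``overrides`` does not affect the rules still stored in the registry.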

.. note::

    For more info and advanced features of `web-poet`_'s :py:func:`web_poet.handle_urls`
    and its registry, see the `web-poet <https://web-poet.readthedocs.io>`_
    documentation, specifically its :external:ref:`intro-overrides` tutorial
    section.


Overrides registry
==================

The overrides registry is responsible for informing whether there exists an
override for a particular type for a given response. The default overrides
registry keeps a map of overrides for each domain and read this configuration
from settings ``SCRAPY_POET_OVERRIDES`` as has been seen in the :ref:`intro-tutorial`
override for a particular type for a given request. The default overrides
registry lets you configure these rules using patterns that follow the
`url-matcher <https://url-matcher.readthedocs.io/en/stable/>`_ syntax. These rules can be configured using the
``SCRAPY_POET_OVERRIDES`` setting, as seen in the :ref:`intro-tutorial`
example.

But the registry implementation can be changed at convenience. A different
registry implementation can be configured using the property
``SCRAPY_POET_OVERRIDES_REGISTRY`` in ``settings.py``. The new registry
must be a subclass of ``scrapy_poet.overrides.OverridesRegistryBase``
and must implement the method ``overrides_for``. As other Scrapy components,
it can be initialized from the ``from_crawler`` class method if implemented.
This might be handy to be able to access settings, stats, request meta, etc.

must be a subclass of :class:`scrapy_poet.overrides.OverridesRegistryBase` and
must implement the method :meth:`scrapy_poet.overrides.OverridesRegistryBase.overrides_for`.
Like other Scrapy components, it can be initialized from the ``from_crawler`` class
method if implemented. This can be handy for accessing settings, stats,
request meta, etc.
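
As a rough sketch of what such a custom registry could look like, the example below picks overrides by the host name of the request. To stay self-contained it defines its own ``OverridesRegistryBase`` stand-in instead of importing the real one; the ``request`` argument and mapping return value of ``overrides_for`` are assumptions about the real abstract method, and ``DomainSuffixRegistry`` and the ``MY_DOMAIN_OVERRIDES`` setting are names invented for this illustration.

```python
from types import SimpleNamespace
from typing import Any, Callable, Dict, Mapping
from urllib.parse import urlparse


class OverridesRegistryBase:
    """Stand-in for scrapy_poet.overrides.OverridesRegistryBase (assumed shape)."""

    def overrides_for(self, request: Any) -> Mapping[Callable, Callable]:
        raise NotImplementedError


class DomainSuffixRegistry(OverridesRegistryBase):
    """Hypothetical registry: selects overrides by the request's host name."""

    def __init__(self, per_domain: Dict[str, Mapping[Callable, Callable]]):
        self.per_domain = dict(per_domain)

    @classmethod
    def from_crawler(cls, crawler: Any) -> "DomainSuffixRegistry":
        # MY_DOMAIN_OVERRIDES is an invented setting name for this sketch.
        return cls(crawler.settings.get("MY_DOMAIN_OVERRIDES", {}))

    def overrides_for(self, request: Any) -> Mapping[Callable, Callable]:
        host = urlparse(request.url).hostname or ""
        for domain, mapping in self.per_domain.items():
            # Match the domain itself and any of its subdomains.
            if host == domain or host.endswith("." + domain):
                return mapping
        return {}


class BookPage: ...
class BTSBookPage(BookPage): ...


# Plain objects stand in for Scrapy's crawler and request in this demo.
crawler = SimpleNamespace(
    settings={"MY_DOMAIN_OVERRIDES": {"toscrape.com": {BookPage: BTSBookPage}}}
)
registry = DomainSuffixRegistry.from_crawler(crawler)
found = registry.overrides_for(SimpleNamespace(url="http://books.toscrape.com/index.html"))
missing = registry.overrides_for(SimpleNamespace(url="https://example.com/"))
```

In a real project the class would subclass the actual base class, and its absolute object path would be set in ``SCRAPY_POET_OVERRIDES_REGISTRY``.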
4 changes: 2 additions & 2 deletions docs/settings.rst
@@ -25,7 +25,7 @@ Default: ``None``

Mapping of overrides for each domain. The format of such a ``dict`` mapping
depends on the currently set Registry. The default is currently
:class:`~.PerDomainOverridesRegistry`. This can be overriden by the setting below:
:class:`~.OverridesRegistry`. This can be overridden by the setting below:
``SCRAPY_POET_OVERRIDES_REGISTRY``.

There are sections dedicated for this at :ref:`intro-tutorial` and :ref:`overrides`.
@@ -36,7 +36,7 @@ SCRAPY_POET_OVERRIDES_REGISTRY

Default: ``None``

Sets an alternative Registry to replace the default :class:`~.PerDomainOverridesRegistry`.
Sets an alternative Registry to replace the default :class:`~.OverridesRegistry`.
To use this, set a ``str`` which denotes the absolute object path of the new
Registry.
