url-matcher integration with scrapy-poet #56

Merged
merged 22 commits into from
May 19, 2022
b4ac789
url-matcher integration with scrapy-poet
ivanprado Dec 8, 2021
35e7876
Remove a print line
ivanprado Dec 9, 2021
a2902f5
Merge branch 'master' into url-matcher-integration
BurnzZ Dec 21, 2021
327139e
improve docs and example code
BurnzZ Dec 21, 2021
d85766e
deprecate PerDomainOverridesRegistry in lieu of OverridesRegistry
BurnzZ Dec 21, 2021
670715a
improve readability of OverridesRegistry's docs
BurnzZ Dec 21, 2021
706e4ac
improve type annotations and errors in OverridesRegistry
BurnzZ Dec 21, 2021
bf4e61b
improve test coverage
BurnzZ Dec 21, 2021
c865c60
update docs in-line with recent web-poet refactoring
BurnzZ Dec 23, 2021
63029dc
add integration tests for web-poet
BurnzZ Dec 23, 2021
5305da4
fix and improve docs
BurnzZ Jan 5, 2022
2d0c3bc
update docs to reflect new changes from web-poet
BurnzZ Jan 7, 2022
ce23923
update docs with respect to new Override Rules interface from web-poet
BurnzZ Jan 12, 2022
0c94cf6
update docs to reflect web-poet's new 'registry_pool'
BurnzZ Jan 13, 2022
1f52f3b
update docs with web-poet's new MVP version and POP definition
BurnzZ Mar 2, 2022
17689b5
Merge branch 'master' into url-matcher-integration
BurnzZ Mar 2, 2022
10ba139
slight doc improvements
BurnzZ Mar 25, 2022
da93452
improve docs after web-poet PR#27 has been merged
BurnzZ May 2, 2022
e305751
Merge branch 'master' of ssh://github.com/scrapinghub/scrapy-poet int…
BurnzZ May 16, 2022
dd2a302
update imports after web_poet refactoring
BurnzZ May 16, 2022
0588105
fix return type annotation of get_scrapy_data_path()
BurnzZ May 19, 2022
0bc51b8
add override examples using @handle_urls
BurnzZ May 19, 2022
7 changes: 7 additions & 0 deletions CHANGELOG.rst
Expand Up @@ -6,6 +6,13 @@ TBR
---

* Use the new ``web_poet.HttpResponse`` which replaces ``web_poet.ResponseData``.
* We have these **backward incompatible** changes since the
``web_poet.OverrideRule`` follows a different structure:

* Deprecated ``PerDomainOverridesRegistry`` in lieu of the newer
``OverridesRegistry`` which provides a wide variety of features
for better URL matching.
* This results in a newer format in the ``SCRAPY_POET_OVERRIDES`` setting.


0.3.0 (2022-01-28)
3 changes: 2 additions & 1 deletion docs/conf.py
Expand Up @@ -188,7 +188,8 @@
intersphinx_mapping = {
'python': ('https://docs.python.org/3', None, ),
'scrapy': ('https://docs.scrapy.org/en/latest', None, ),
'web_poet': ('https://web-poet.readthedocs.io/en/stable/', None),
'web-poet': ('https://web-poet.readthedocs.io/en/latest/', None),
'url-matcher': ('https://url-matcher.readthedocs.io/en/stable/', None),
}

autodoc_default_options = {
98 changes: 73 additions & 25 deletions docs/intro/tutorial.rst
Expand Up @@ -9,7 +9,7 @@ system. If that’s not the case, see :ref:`intro-install`.

.. note::

This tutorial can be followed without reading `web-poet docs`_, but
This tutorial can be followed without reading `web-poet`_ docs, but
for a better understanding it is highly recommended to check them first.


Expand All @@ -26,7 +26,7 @@ This tutorial will walk you through these tasks:
If you're not already familiar with Scrapy, and want to learn it quickly,
the `Scrapy Tutorial`_ is a good resource.

.. _web-poet docs: https://web-poet.readthedocs.io/en/stable/
.. _web-poet: https://web-poet.readthedocs.io/en/stable/

Creating a spider
=================
Expand Down Expand Up @@ -125,8 +125,8 @@ To use ``scrapy-poet``, enable its downloader middleware in ``settings.py``:
``BookPage`` class we created previously can be used without ``scrapy-poet``,
and even without Scrapy (note that imports were from ``web_poet`` so far).

``scrapy-poet`` makes it easy to use ``web-poet`` Page Objects
(such as BookPage) in Scrapy spiders.
``scrapy-poet`` makes it easy to use `web-poet`_ Page Objects
(such as ``BookPage``) in Scrapy spiders.

Changing spider
===============
Expand Down Expand Up @@ -354,12 +354,10 @@ be done by configuring ``SCRAPY_POET_OVERRIDES`` into ``settings.py``:

.. code-block:: python
SCRAPY_POET_OVERRIDES = {
"toscrape.com": {
BookListPage: BTSBookListPage,
BookPage: BTSBookPage
}
}
SCRAPY_POET_OVERRIDES = [
("toscrape.com", BTSBookListPage, BookListPage),
("toscrape.com", BTSBookPage, BookPage)
]
The spider is back to life!
``SCRAPY_POET_OVERRIDES`` contains rules that override the Page Objects
Expand Down Expand Up @@ -390,15 +388,15 @@ to implement new ones:
class BPBookListPage(WebPage):
def book_urls(self):
return self.css('.article-info a::attr(href)').getall()
return self.css('article.post h4 a::attr(href)').getall()
class BPBookPage(ItemWebPage):
def to_item(self):
return {
'url': self.url,
'name': self.css(".book-data h4::text").get().strip(),
'name': self.css("body div > h1::text").get().strip(),
}
The last step is configuring the overrides so that these new Page Objects
Expand All @@ -408,32 +406,82 @@ are used for the domain

.. code-block:: python
SCRAPY_POET_OVERRIDES = {
"toscrape.com": {
BookListPage: BTSBookListPage,
BookPage: BTSBookPage
},
"bookpage.com": {
BookListPage: BPBookListPage,
BookPage: BPBookPage
}
}
SCRAPY_POET_OVERRIDES = [
("toscrape.com", BTSBookListPage, BookListPage),
("toscrape.com", BTSBookPage, BookPage),
("bookpage.com", BPBookListPage, BookListPage),
("bookpage.com", BPBookPage, BookPage)
]
The spider is now ready to extract books from both sites 😀.
The full example
`can be seen here <https://github.com/scrapinghub/scrapy-poet/tree/master/example/example/spiders/books_04_overrides_02.py>`_

On a surface, it looks just like a different way to organize Scrapy spider
On the surface, it looks just like a different way to organize Scrapy spider
code - and indeed, it *is* just a different way to organize the code,
but it opens some cool possibilities.

In the examples above we have been configuring the overrides
for a particular domain, but more complex URL patterns are also possible.
For example, the pattern ``books.toscrape.com/catalogue/category/``
is accepted and it would restrict the override only to category pages.
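
To make the difference between a domain pattern and a path-restricted pattern concrete, here is a rough sketch that uses only the standard library. The ``simple_match`` helper is hypothetical and only approximates the idea (domain or subdomain plus path prefix); it is not the algorithm used by url-matcher.

```python
from urllib.parse import urlparse


def simple_match(pattern: str, url: str) -> bool:
    """Hypothetical helper: match on domain (or subdomain) plus path prefix."""
    # Split "domain/path/prefix" into its domain and path parts.
    domain, _, path_prefix = pattern.partition("/")
    parts = urlparse(url)
    host = parts.hostname or ""
    path = parts.path or "/"
    domain_ok = host == domain or host.endswith("." + domain)
    return domain_ok and path.startswith("/" + path_prefix)


category_url = "http://books.toscrape.com/catalogue/category/books/travel_2/index.html"
book_url = "http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html"

# A bare domain pattern covers every page of the site.
domain_hits = [simple_match("toscrape.com", u) for u in (category_url, book_url)]

# A pattern with a path only covers URLs under that path.
category_pattern = "books.toscrape.com/catalogue/category/"
category_hits = [simple_match(category_pattern, u) for u in (category_url, book_url)]
```

In this sketch the domain pattern matches both URLs, while the longer pattern matches only the category URL.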

It is even possible to configure more complex patterns by using the
:py:class:`web_poet.overrides.OverrideRule` class instead of a triplet in
the configuration. Another way of declaring the earlier config
for ``SCRAPY_POET_OVERRIDES`` would be the following:

.. code-block:: python

    from url_matcher import Patterns
    from web_poet import OverrideRule

    SCRAPY_POET_OVERRIDES = [
        OverrideRule(for_patterns=Patterns(["toscrape.com"]), use=BTSBookListPage, instead_of=BookListPage),
        OverrideRule(for_patterns=Patterns(["toscrape.com"]), use=BTSBookPage, instead_of=BookPage),
        OverrideRule(for_patterns=Patterns(["bookpage.com"]), use=BPBookListPage, instead_of=BookListPage),
        OverrideRule(for_patterns=Patterns(["bookpage.com"]), use=BPBookPage, instead_of=BookPage),
    ]

As you can see, this could get verbose. The earlier tuple config simply offers
a shortcut to be more concise.
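
One way to picture the relationship between the two notations is a normalization step that expands each triplet into a full rule object. The sketch below is an illustration only: ``ToyRule`` and ``normalize`` are hypothetical stand-ins, not classes or functions from web-poet or scrapy-poet, and the Page Object classes are empty placeholders.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple, Union


@dataclass(frozen=True)
class ToyRule:
    """Hypothetical stand-in for an override rule (not web_poet.OverrideRule)."""
    include: Tuple[str, ...]
    use: type
    instead_of: type


RuleSpec = Union[ToyRule, Tuple[str, type, type]]


def normalize(specs: Sequence[RuleSpec]) -> List[ToyRule]:
    """Expand (pattern, use, instead_of) triplets; pass full rules through."""
    rules: List[ToyRule] = []
    for spec in specs:
        if isinstance(spec, ToyRule):
            rules.append(spec)
        else:
            pattern, use, instead_of = spec
            rules.append(ToyRule((pattern,), use, instead_of))
    return rules


# Placeholder Page Objects, defined here only so the sketch is runnable.
class BookPage: ...
class BTSBookPage(BookPage): ...
class BPBookPage(BookPage): ...


mixed_config = [
    ("toscrape.com", BTSBookPage, BookPage),
    ToyRule(("bookpage.com",), BPBookPage, BookPage),
]
normalized = normalize(mixed_config)
```

After normalization every entry has the same shape, which is why the two notations can be mixed in one list in this sketch.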

.. note::

Also see the `url-matcher <https://url-matcher.readthedocs.io/en/stable/>`_
documentation for more information about the patterns syntax.

Manually defining overrides like this would be inconvenient, especially
for larger projects. Fortunately, `web-poet`_ provides the
:py:func:`web_poet.handle_urls` decorator, which annotates Page Objects and
defines and stores the :py:class:`web_poet.overrides.OverrideRule` for you. All of the
:py:class:`web_poet.overrides.OverrideRule` rules can then be read as:

.. code:: python

    from web_poet import default_registry, consume_modules

    # The consume_modules() must be called first if you need to properly import
    # rules from other packages. Otherwise, it can be omitted.
    # More info about this caveat on web-poet docs.
    consume_modules("external_package_A", "another_ext_package.lib")
    SCRAPY_POET_OVERRIDES = default_registry.get_overrides()

For more info on this, you can refer to these docs:

* ``scrapy-poet``'s :ref:`overrides` Tutorial section.
* External `web-poet`_ docs.

* Specifically, the :external:ref:`intro-overrides` Tutorial section.

Next steps
==========

Now that you know how ``scrapy-poet`` is supposed to work, what about trying to
apply it to an existing or new Scrapy project?

Also, please check :ref:`overrides`, :ref:`providers` and refer to spiders in the "example"
folder: https://github.com/scrapinghub/scrapy-poet/tree/master/example/example/spiders
Also, please check the :ref:`overrides` and :ref:`providers` sections as well as
refer to spiders in the "example" folder: https://github.com/scrapinghub/scrapy-poet/tree/master/example/example/spiders

.. _Scrapy Tutorial: https://docs.scrapy.org/en/latest/intro/tutorial.html
136 changes: 122 additions & 14 deletions docs/overrides.rst
Expand Up @@ -8,6 +8,18 @@ on the request URL domain. Please have a look to :ref:`intro-tutorial` to
learn the basics about overrides before digging deeper in the content of this
page.

.. tip::

Some real-world examples on this topic can be found in:

- `Example 1 <https://github.com/scrapinghub/scrapy-poet/blob/master/example/example/spiders/books_04_overrides_01.py>`_:
rules using tuples
- `Example 2 <https://github.com/scrapinghub/scrapy-poet/blob/master/example/example/spiders/books_04_overrides_02.py>`_:
rules using tuples and :py:class:`web_poet.overrides.OverrideRule`
- `Example 3 <https://github.com/scrapinghub/scrapy-poet/blob/master/example/example/spiders/books_04_overrides_03.py>`_:
rules using :py:func:`web_poet.handle_urls` decorator and retrieving them
via :py:meth:`web_poet.overrides.PageObjectRegistry.get_overrides`

Page Objects refinement
=======================

Expand Down Expand Up @@ -47,13 +59,11 @@ And then override it for a particular domain using ``settings.py``:

.. code-block:: python
SCRAPY_POET_OVERRIDES = {
"example.com": {
BookPage: ISBNBookPage
}
}
SCRAPY_POET_OVERRIDES = [
("example.com", ISBNBookPage, BookPage)
]
This new Page Objects gets the original ``BookPage`` as dependency and enrich
This new Page Object gets the original ``BookPage`` as a dependency and enriches
the obtained item with the ISBN from the page HTML.

.. note::
Expand All @@ -80,20 +90,118 @@ the obtained item with the ISBN from the page HTML.
return item
Overrides rules
===============

The default way of configuring the override rules is using triplets
of the form (``url pattern``, ``override_type``, ``overridden_type``). But more
complex rules can be introduced if the class :py:class:`web_poet.overrides.OverrideRule`
is used. The following example configures an override that is only applied for
book pages from ``books.toscrape.com``:

.. code-block:: python

    from url_matcher import Patterns
    from web_poet import OverrideRule

    SCRAPY_POET_OVERRIDES = [
        OverrideRule(
            for_patterns=Patterns(
                include=["books.toscrape.com/catalogue/*index.html|"],
                exclude=["/catalogue/category/"]),
            use=MyBookPage,
            instead_of=BookPage
        )
    ]

Note how category pages are excluded by using an ``exclude`` pattern.
You can find more information about the patterns syntax in the
`url-matcher <https://url-matcher.readthedocs.io/en/stable/>`_
documentation.
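
The interplay between ``include`` and ``exclude`` can be sketched with the standard library alone. The translation below is a deliberately simplified assumption made for this illustration (``*`` as a wildcard, a trailing ``|`` as an end anchor, a leading ``/`` meaning any host); ``pattern_to_regex`` and ``url_matches`` are hypothetical helpers, and the real url-matcher rules are richer and may differ.

```python
import re
from urllib.parse import urlparse


def pattern_to_regex(pattern: str) -> "re.Pattern[str]":
    """Translate a simplified pattern into a regex over 'host/path'."""
    anchored = pattern.endswith("|")
    body = pattern[:-1] if anchored else pattern
    # '*' becomes '.*'; everything else is matched literally.
    regex = ".*".join(re.escape(part) for part in body.split("*"))
    if body.startswith("/"):
        regex = r"[^/]*" + regex  # path-only pattern: accept any host
    else:
        regex = r"(?:[^/]*\.)?" + regex  # domain pattern: accept subdomains
    return re.compile("^" + regex + ("$" if anchored else ""))


def url_matches(url, include, exclude=()):
    """A URL matches if some include pattern hits and no exclude pattern does."""
    parts = urlparse(url)
    target = parts.netloc + parts.path
    if any(pattern_to_regex(p).search(target) for p in exclude):
        return False
    return any(pattern_to_regex(p).search(target) for p in include)


include = ["books.toscrape.com/catalogue/*index.html|"]
exclude = ["/catalogue/category/"]

book_hit = url_matches(
    "http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html",
    include, exclude)
category_hit = url_matches(
    "http://books.toscrape.com/catalogue/category/books/travel_2/index.html",
    include, exclude)
```

Under these assumptions the book page is matched and the category page is rejected, because an ``exclude`` hit takes precedence over any ``include`` hit.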


Decorate Page Objects with the rules
====================================

Having the rules along with the Page Objects is a good idea,
as you can see at a glance what the Page Object is doing
along with where it is applied. This can be done by decorating the
Page Objects with :py:func:`web_poet.handle_urls` provided by `web-poet`_.

.. tip::
Make sure to read the :external:ref:`intro-overrides` Tutorial section of
`web-poet`_ to learn about its other functionality that is not covered
in this section.

Let's see an example:

.. code-block:: python

    from web_poet import handle_urls

    @handle_urls("toscrape.com", BookPage)
    class BTSBookPage(BookPage):
        def to_item(self):
            return {
                'url': self.url,
                'name': self.css("title::text").get(),
            }

The :py:func:`web_poet.handle_urls` decorator in this case is indicating that
the class ``BTSBookPage`` should be used instead of ``BookPage``
for the domain ``toscrape.com``.

In order to configure the ``scrapy-poet`` overrides automatically
using these annotations, you can directly interact with `web-poet`_'s
``default_registry`` (an instance of :py:class:`web_poet.overrides.PageObjectRegistry`).

For example:

.. code-block:: python

    from web_poet import default_registry, consume_modules

    # The consume_modules() must be called first if you need to properly import
    # rules from other packages. Otherwise, it can be omitted.
    # More info about this caveat on web-poet docs.
    consume_modules("external_package_A", "another_ext_package.lib")

    # To get all of the Override Rules that were declared via annotations.
    SCRAPY_POET_OVERRIDES = default_registry.get_overrides()

The :py:meth:`web_poet.overrides.PageObjectRegistry.get_overrides` method of the
``default_registry`` above returns the ``List[OverrideRule]`` of rules that were declared
using `web-poet`_'s :py:func:`web_poet.handle_urls` annotation. This is much
more convenient than manually defining every :py:class:`web_poet.overrides.OverrideRule`.

Take note that since ``SCRAPY_POET_OVERRIDES`` is structured as
``List[OverrideRule]``, you can easily modify it later on if needed.
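
The decorator-plus-registry idea, and the fact that the result is a plain list that can be filtered afterwards, can be sketched as follows. This is a toy model under stated assumptions: ``ToyRegistry`` and ``ToyRule`` are hypothetical and do not reproduce web-poet's ``default_registry`` or ``OverrideRule``; only the overall shape (decorate, collect, read back a list) mirrors the text above.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class ToyRule:
    """Hypothetical rule record (not web_poet.OverrideRule)."""
    pattern: str
    use: type
    instead_of: type


class ToyRegistry:
    """Hypothetical registry that collects rules declared via a decorator."""

    def __init__(self) -> None:
        self._rules: List[ToyRule] = []

    def handle_urls(self, pattern: str, overrides: type) -> Callable[[type], type]:
        def decorator(cls: type) -> type:
            # Record the rule, then return the class unchanged.
            self._rules.append(ToyRule(pattern, cls, overrides))
            return cls
        return decorator

    def get_overrides(self) -> List[ToyRule]:
        # Return a copy so callers can modify the list freely.
        return list(self._rules)


registry = ToyRegistry()


class BookPage: ...


@registry.handle_urls("toscrape.com", BookPage)
class BTSBookPage(BookPage): ...


@registry.handle_urls("bookpage.com", BookPage)
class BPBookPage(BookPage): ...


rules = registry.get_overrides()
# Because the result is a plain list, it can be filtered before use.
toscrape_only = [rule for rule in rules if rule.pattern == "toscrape.com"]
```

The same list-comprehension style of filtering would apply to any list of rule objects assigned to ``SCRAPY_POET_OVERRIDES``.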

.. note::

For more info and advanced features of `web-poet`_'s :py:func:`web_poet.handle_urls`
and its registry, kindly read the `web-poet <https://web-poet.readthedocs.io>`_
documentation, specifically its :external:ref:`intro-overrides` tutorial
section.


Overrides registry
==================

The overrides registry is responsible for informing whether there exists an
override for a particular type for a given response. The default overrides
registry keeps a map of overrides for each domain and read this configuration
from settings ``SCRAPY_POET_OVERRIDES`` as has been seen in the :ref:`intro-tutorial`
override for a particular type for a given request. The default overrides
registry lets you configure these rules using patterns that follow the
`url-matcher <https://url-matcher.readthedocs.io/en/stable/>`_ syntax. These rules can be configured using the
``SCRAPY_POET_OVERRIDES`` setting, as seen in the :ref:`intro-tutorial`
example.

But the registry implementation can be changed at convenience. A different
registry implementation can be configured using the property
``SCRAPY_POET_OVERRIDES_REGISTRY`` in ``settings.py``. The new registry
must be a subclass of ``scrapy_poet.overrides.OverridesRegistryBase``
and must implement the method ``overrides_for``. As other Scrapy components,
it can be initialized from the ``from_crawler`` class method if implemented.
This might be handy to be able to access settings, stats, request meta, etc.

must be a subclass of :class:`scrapy_poet.overrides.OverridesRegistryBase` and
must implement the method :meth:`scrapy_poet.overrides.OverridesRegistryBase.overrides_for`.
Like other Scrapy components, it can be initialized from the ``from_crawler`` class
method if implemented. This might be handy to be able to access settings, stats,
request meta, etc.
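
As a rough sketch of what such a custom registry might look like, the example below keys overrides by exact host name. To stay runnable without Scrapy installed, ``RegistryBase`` is a stand-in for ``scrapy_poet.overrides.OverridesRegistryBase``, and the request and crawler are simulated with ``SimpleNamespace``. The ``overrides_for(request)`` signature, the mapping it returns, and the ``ExactHostRegistry`` name are assumptions made for this illustration.

```python
from abc import ABC, abstractmethod
from types import SimpleNamespace
from typing import Callable, Dict, Mapping
from urllib.parse import urlparse


class RegistryBase(ABC):
    """Stand-in for scrapy_poet.overrides.OverridesRegistryBase (assumption)."""

    @abstractmethod
    def overrides_for(self, request) -> Mapping[Callable, Callable]:
        ...


class ExactHostRegistry(RegistryBase):
    """Hypothetical registry: overrides apply only to an exact host name."""

    def __init__(self, by_host: Dict[str, Dict[Callable, Callable]]) -> None:
        self._by_host = by_host

    @classmethod
    def from_crawler(cls, crawler):
        # Read this registry's own configuration format from the settings.
        return cls(crawler.settings.get("SCRAPY_POET_OVERRIDES", {}))

    def overrides_for(self, request) -> Mapping[Callable, Callable]:
        host = urlparse(request.url).hostname or ""
        return self._by_host.get(host, {})


# Placeholder Page Objects and simulated Scrapy objects for the demo.
class BookPage: ...
class ISBNBookPage(BookPage): ...


crawler = SimpleNamespace(
    settings={"SCRAPY_POET_OVERRIDES": {"example.com": {BookPage: ISBNBookPage}}}
)
registry = ExactHostRegistry.from_crawler(crawler)

hit = registry.overrides_for(SimpleNamespace(url="https://example.com/book/1"))
miss = registry.overrides_for(SimpleNamespace(url="https://shop.example.com/book/1"))
```

In a real project the class would subclass the actual base class and be enabled by pointing ``SCRAPY_POET_OVERRIDES_REGISTRY`` at its import path, for example a hypothetical ``"myproject.registries.ExactHostRegistry"``.
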
4 changes: 2 additions & 2 deletions docs/settings.rst
Expand Up @@ -25,7 +25,7 @@ Default: ``None``

Mapping of overrides for each domain. The format of such ``dict`` mapping
depends on the currently set Registry. The default is currently
:class:`~.PerDomainOverridesRegistry`. This can be overriden by the setting below:
:class:`~.OverridesRegistry`. This can be overridden by the setting below:
``SCRAPY_POET_OVERRIDES_REGISTRY``.

There are sections dedicated for this at :ref:`intro-tutorial` and :ref:`overrides`.
Expand All @@ -36,7 +36,7 @@ SCRAPY_POET_OVERRIDES_REGISTRY

Default: ``None``

Sets an alternative Registry to replace the default :class:`~.PerDomainOverridesRegistry`.
Sets an alternative Registry to replace the default :class:`~.OverridesRegistry`.
To use this, set a ``str`` which denotes the absolute object path of the new
Registry.
