Merge pull request #56 from scrapinghub/url-matcher-integration

url-matcher integration with scrapy-poet
scrapinghub · May 19, 2022 · 53e5b92 · 53e5b92
2 parents 581e0f6 + 0bc51b8
commit 53e5b92
Showing 20 changed files with 504 additions and 118 deletions.
diff --git a/CHANGELOG.rst b/CHANGELOG.rst
@@ -6,6 +6,13 @@ TBR
 ---
 
 * Use the new ``web_poet.HttpResponse`` which replaces ``web_poet.ResponseData``.
+* We have these **backward incompatible** changes since the
+  ``web_poet.OverrideRule`` follows a different structure:
+
+    * Deprecated ``PerDomainOverridesRegistry`` in lieu of the newer
+      ``OverridesRegistry`` which provides a wide variety of features
+      for better URL matching.
+    * This results in a newer format in the ``SCRAPY_POET_OVERRIDES`` setting.
 
 
 0.3.0 (2022-01-28)

diff --git a/docs/conf.py b/docs/conf.py
@@ -188,7 +188,8 @@
 intersphinx_mapping = {
     'python': ('https://docs.python.org/3', None, ),
     'scrapy': ('https://docs.scrapy.org/en/latest', None, ),
-    'web_poet': ('https://web-poet.readthedocs.io/en/stable/', None),
+    'web-poet': ('https://web-poet.readthedocs.io/en/latest/', None),
+    'url-matcher': ('https://url-matcher.readthedocs.io/en/stable/', None),
 }
 
 autodoc_default_options = {

diff --git a/docs/intro/tutorial.rst b/docs/intro/tutorial.rst
@@ -9,7 +9,7 @@ system. If that’s not the case, see :ref:`intro-install`.
 
 .. note::
 
-    This tutorial can be followed without reading `web-poet docs`_, but
+    This tutorial can be followed without reading `web-poet`_ docs, but
     for a better understanding it is highly recommended to check them first.
 
 
@@ -26,7 +26,7 @@ This tutorial will walk you through these tasks:
 If you're not already familiar with Scrapy, and want to learn it quickly,
 the `Scrapy Tutorial`_ is a good resource.
 
-.. _web-poet docs: https://web-poet.readthedocs.io/en/stable/
+.. _web-poet: https://web-poet.readthedocs.io/en/stable/
 
 Creating a spider
 =================
@@ -125,8 +125,8 @@ To use ``scrapy-poet``, enable its downloader middleware in ``settings.py``:
 ``BookPage`` class we created previously can be used without ``scrapy-poet``,
 and even without Scrapy (note that imports were from ``web_poet`` so far).
 
-``scrapy-poet`` makes it easy to use ``web-poet`` Page Objects
-(such as BookPage) in Scrapy spiders.
+``scrapy-poet`` makes it easy to use `web-poet`_ Page Objects
+(such as ``BookPage``) in Scrapy spiders.
 
 Changing spider
 ===============
@@ -354,12 +354,10 @@ be done by configuring ``SCRAPY_POET_OVERRIDES`` into ``settings.py``:
 
 .. code-block:: python
 
-    SCRAPY_POET_OVERRIDES = {
-        "toscrape.com": {
-            BookListPage: BTSBookListPage,
-            BookPage: BTSBookPage
-        }
-    }
+    SCRAPY_POET_OVERRIDES = [
+        ("toscrape.com", BTSBookListPage, BookListPage),
+        ("toscrape.com", BTSBookPage, BookPage)
+    ]
 
 The spider is back to life!
 ``SCRAPY_POET_OVERRIDES`` contains rules that override the Page Objects
@@ -390,15 +388,15 @@ to implement new ones:
     class BPBookListPage(WebPage):
 
         def book_urls(self):
-            return self.css('.article-info a::attr(href)').getall()
+            return self.css('article.post h4 a::attr(href)').getall()
 
 
     class BPBookPage(ItemWebPage):
 
         def to_item(self):
             return {
                 'url': self.url,
-                'name': self.css(".book-data h4::text").get().strip(),
+                'name': self.css("body div > h1::text").get().strip(),
             }
 
 The last step is configuring the overrides so that these new Page Objects
@@ -408,32 +406,82 @@ are used for the domain
 
 .. code-block:: python
 
-    SCRAPY_POET_OVERRIDES = {
-        "toscrape.com": {
-            BookListPage: BTSBookListPage,
-            BookPage: BTSBookPage
-        },
-        "bookpage.com": {
-            BookListPage: BPBookListPage,
-            BookPage: BPBookPage
-        }
-    }
+    SCRAPY_POET_OVERRIDES = [
+        ("toscrape.com", BTSBookListPage, BookListPage),
+        ("toscrape.com", BTSBookPage, BookPage),
+        ("bookpage.com", BPBookListPage, BookListPage),
+        ("bookpage.com", BPBookPage, BookPage)
+    ]
 
 The spider is now ready to extract books from both sites 😀.
 The full example
 `can be seen here <https://github.com/scrapinghub/scrapy-poet/tree/master/example/example/spiders/books_04_overrides_02.py>`_
 
-On a surface, it looks just like a different way to organize Scrapy spider
+On the surface, it looks just like a different way to organize Scrapy spider
 code - and indeed, it *is* just a different way to organize the code,
 but it opens some cool possibilities.
 
+In the examples above we have been configuring the overrides
+for a particular domain, but more complex URL patterns are also possible.
+For example, the pattern ``books.toscrape.com/catalogue/category/``
+is accepted and it would restrict the override only to category pages.
+
+It is even possible to configure more complex patterns by using the
+:py:class:`web_poet.overrides.OverrideRule` class instead of a triplet in
+the configuration. Another way of declaring the earlier config
+for ``SCRAPY_POET_OVERRIDES`` would be the following:
+
+.. code-block:: python
+
+    from url_matcher import Patterns
+    from web_poet import OverrideRule
+
+
+    SCRAPY_POET_OVERRIDES = [
+        OverrideRule(for_patterns=Patterns(["toscrape.com"]), use=BTSBookListPage, instead_of=BookListPage),
+        OverrideRule(for_patterns=Patterns(["toscrape.com"]), use=BTSBookPage, instead_of=BookPage),
+        OverrideRule(for_patterns=Patterns(["bookpage.com"]), use=BPBookListPage, instead_of=BookListPage),
+        OverrideRule(for_patterns=Patterns(["bookpage.com"]), use=BPBookPage, instead_of=BookPage),
+    ]
+
+As you can see, this could get verbose. The earlier tuple config simply offers
+a shortcut to be more concise.
+
+.. note::
+
+    Also see the `url-matcher <https://url-matcher.readthedocs.io/en/stable/>`_
+    documentation for more information about the patterns syntax.
+
+Manually defining overrides like this would be inconvenient, especially
+for larger projects. Fortunately, `web-poet`_ provides the
+:py:func:`web_poet.handle_urls` decorator, which annotates Page Objects and
+defines and stores the :py:class:`web_poet.overrides.OverrideRule` for you. All of the
+:py:class:`web_poet.overrides.OverrideRule` rules could then be simply read as:
+
+.. code:: python
+
+    from web_poet import default_registry, consume_modules
+
+    # The consume_modules() must be called first if you need to properly import
+    # rules from other packages. Otherwise, it can be omitted.
+    # More info about this caveat on web-poet docs.
+    consume_modules("external_package_A", "another_ext_package.lib")
+    SCRAPY_POET_OVERRIDES = default_registry.get_overrides()
+
+For more info on this, you can refer to these docs:
+
+    * ``scrapy-poet``'s :ref:`overrides` Tutorial section.
+    * External `web-poet`_ docs.
+
+        * Specifically, the :external:ref:`intro-overrides` Tutorial section.
+
 Next steps
 ==========
 
 Now that you know how ``scrapy-poet`` is supposed to work, what about trying to
 apply it to an existing or new Scrapy project?
 
-Also, please check :ref:`overrides`, :ref:`providers` and refer to spiders in the "example"
-folder: https://github.com/scrapinghub/scrapy-poet/tree/master/example/example/spiders
+Also, please check the :ref:`overrides` and :ref:`providers` sections, and refer
+to the spiders in the "example" folder: https://github.com/scrapinghub/scrapy-poet/tree/master/example/example/spiders
 
 .. _Scrapy Tutorial: https://docs.scrapy.org/en/latest/intro/tutorial.html
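As a rough illustration of the triplet shortcut documented in the tutorial diff above, the sketch below expands ``(pattern, use, instead_of)`` triplets into rule objects and looks up the overrides that apply to a URL. It uses only the standard library and is a simplified stand-in: the ``Rule`` class, the ``normalize``, ``matches`` and ``overrides_for`` helpers, and the placeholder page classes are all hypothetical, and the domain-plus-path-prefix matching only approximates what url-matcher actually does.

```python
from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass(frozen=True)
class Rule:
    """Hypothetical stand-in for web_poet.OverrideRule (not the real class)."""
    pattern: str      # a domain, optionally followed by a path prefix
    use: type         # Page Object to use...
    instead_of: type  # ...in place of this one


def normalize(entries):
    """Expand (pattern, use, instead_of) triplets into Rule objects."""
    return [e if isinstance(e, Rule) else Rule(*e) for e in entries]


def matches(pattern, url):
    """Simplified matching: same domain (or a subdomain) plus a path prefix."""
    parsed = urlparse(url)
    domain, _, path = pattern.partition("/")
    host = parsed.hostname or ""
    if host != domain and not host.endswith("." + domain):
        return False
    return (parsed.path or "/").startswith("/" + path)


def overrides_for(rules, url):
    """Return {instead_of: use} for every rule whose pattern matches the URL."""
    return {r.instead_of: r.use for r in rules if matches(r.pattern, url)}


# Placeholder Page Objects, named after the ones in the tutorial.
class BookPage: ...
class BookListPage: ...
class BTSBookPage(BookPage): ...
class BTSBookListPage(BookListPage): ...


RULES = normalize([
    ("toscrape.com", BTSBookListPage, BookListPage),               # triplet
    Rule("books.toscrape.com/catalogue/", BTSBookPage, BookPage),  # explicit
])
```

With these rules, a URL under ``books.toscrape.com/catalogue/`` picks up both overrides, the site's front page only picks up the list-page override, and URLs on other domains get an empty mapping.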
diff --git a/docs/overrides.rst b/docs/overrides.rst
@@ -8,6 +8,18 @@ on the request URL domain. Please have a look to :ref:`intro-tutorial` to
 learn the basics about overrides before digging deeper in the content of this
 page.
 
+.. tip::
+
+    Some real-world examples on this topic can be found in:
+
+    - `Example 1 <https://github.com/scrapinghub/scrapy-poet/blob/master/example/example/spiders/books_04_overrides_01.py>`_:
+      rules using tuples
+    - `Example 2 <https://github.com/scrapinghub/scrapy-poet/blob/master/example/example/spiders/books_04_overrides_02.py>`_:
+      rules using tuples and :py:class:`web_poet.overrides.OverrideRule`
+    - `Example 3 <https://github.com/scrapinghub/scrapy-poet/blob/master/example/example/spiders/books_04_overrides_03.py>`_:
+      rules using :py:func:`web_poet.handle_urls` decorator and retrieving them
+      via :py:meth:`web_poet.overrides.PageObjectRegistry.get_overrides`
+
 Page Objects refinement
 =======================
 
@@ -47,13 +59,11 @@ And then override it for a particular domain using ``settings.py``:
 
 .. code-block:: python
 
-    SCRAPY_POET_OVERRIDES = {
-        "example.com": {
-            BookPage: ISBNBookPage
-        }
-    }
+    SCRAPY_POET_OVERRIDES = [
+        ("example.com", ISBNBookPage, BookPage)
+    ]
 
-This new Page Objects gets the original ``BookPage`` as dependency and enrich
+This new Page Object gets the original ``BookPage`` as a dependency and enriches
 the obtained item with the ISBN from the page HTML.
 
 .. note::
@@ -80,20 +90,118 @@ the obtained item with the ISBN from the page HTML.
                 return item
 
 
+Overrides rules
+===============
+
+The default way of configuring the override rules is using triplets
+of the form (``url pattern``, ``override_type``, ``overridden_type``). But more
+complex rules can be introduced if the class :py:class:`web_poet.overrides.OverrideRule`
+is used. The following example configures an override that is only applied for
+book pages from ``books.toscrape.com``:
+
+.. code-block:: python
+
+    from url_matcher import Patterns
+    from web_poet import OverrideRule
+
+
+    SCRAPY_POET_OVERRIDES = [
+        OverrideRule(
+            for_patterns=Patterns(
+                include=["books.toscrape.com/catalogue/*index.html|"],
+                exclude=["/catalogue/category/"]),
+            use=MyBookPage,
+            instead_of=BookPage
+        )
+    ]
+
+Note how category pages are excluded by using an ``exclude`` pattern.
+You can find more information about the patterns syntax in the
+`url-matcher <https://url-matcher.readthedocs.io/en/stable/>`_
+documentation.
+
+
+Decorate Page Objects with the rules
+====================================
+
+Having the rules along with the Page Objects is a good idea,
+as you can see at a glance what the Page Object is doing
+along with where it is applied. This can be done by decorating the
+Page Objects with :py:func:`web_poet.handle_urls` provided by `web-poet`_.
+
+.. tip::
+    Make sure to read the :external:ref:`intro-overrides` Tutorial section of
+    `web-poet`_ to learn all of its other functionalities that are not covered
+    in this section.
+
+Let's see an example:
+
+.. code-block:: python
+
+    from web_poet import handle_urls
+
+
+    @handle_urls("toscrape.com", BookPage)
+    class BTSBookPage(BookPage):
+
+        def to_item(self):
+            return {
+                'url': self.url,
+                'name': self.css("title::text").get(),
+            }
+
+The :py:func:`web_poet.handle_urls` decorator in this case is indicating that
+the class ``BTSBookPage`` should be used instead of ``BookPage``
+for the domain ``toscrape.com``.
+
+In order to configure the ``scrapy-poet`` overrides automatically
+using these annotations, you can directly interact with `web-poet`_'s
+``default_registry`` (an instance of :py:class:`web_poet.overrides.PageObjectRegistry`).
+
+For example:
+
+.. code-block:: python
+
+    from web_poet import default_registry, consume_modules
+
+    # The consume_modules() must be called first if you need to properly import
+    # rules from other packages. Otherwise, it can be omitted.
+    # More info about this caveat on web-poet docs.
+    consume_modules("external_package_A", "another_ext_package.lib")
+
+    # To get all of the Override Rules that were declared via annotations.
+    SCRAPY_POET_OVERRIDES = default_registry.get_overrides()
+
+The :py:meth:`web_poet.overrides.PageObjectRegistry.get_overrides` method of the
+``default_registry`` above returns the ``List[OverrideRule]`` of rules that were declared
+using `web-poet`_'s :py:func:`web_poet.handle_urls` annotation. This is much
+more convenient than manually defining all of the :py:class:`web_poet.overrides.OverrideRule`.
+
+Take note that since ``SCRAPY_POET_OVERRIDES`` is structured as
+``List[OverrideRule]``, you can easily modify it later on if needed.
+
+.. note::
+
+    For more info and advanced features of `web-poet`_'s :py:func:`web_poet.handle_urls`
+    and its registry, kindly read the `web-poet <https://web-poet.readthedocs.io>`_
+    documentation, specifically its :external:ref:`intro-overrides` tutorial
+    section.
+
+
 Overrides registry
 ==================
 
 The overrides registry is responsible for informing whether there exists an
-override for a particular type for a given response. The default overrides
-registry keeps a map of overrides for each domain and read this configuration
-from settings ``SCRAPY_POET_OVERRIDES`` as has been seen in the :ref:`intro-tutorial`
+override for a particular type for a given request. The default overrides
+registry lets you configure these rules using patterns that follow the
+`url-matcher <https://url-matcher.readthedocs.io/en/stable/>`_ syntax. These rules can be configured using the
+``SCRAPY_POET_OVERRIDES`` setting, as seen in the :ref:`intro-tutorial`
 example.
 
 But the registry implementation can be changed at convenience. A different
 registry implementation can be configured using the property
 ``SCRAPY_POET_OVERRIDES_REGISTRY`` in ``settings.py``. The new registry
-must be a subclass of ``scrapy_poet.overrides.OverridesRegistryBase``
-and must implement the method ``overrides_for``. As other Scrapy components,
-it can be initialized from the ``from_crawler`` class method if implemented.
-This might be handy to be able to access settings, stats, request meta, etc.
-
+must be a subclass of :class:`scrapy_poet.overrides.OverridesRegistryBase` and
+must implement the method :meth:`scrapy_poet.overrides.OverridesRegistryBase.overrides_for`.
+Like other Scrapy components, it can be initialized from the ``from_crawler`` class
+method if implemented. This might be handy to be able to access settings, stats,
+request meta, etc.
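To make the decorator and registry mechanism described in this file more concrete, here is a toy model written with only the standard library. It is not web-poet's implementation: ``Rule``, ``Registry`` and the placeholder page classes are hypothetical names that mimic the roles of ``OverrideRule``, ``PageObjectRegistry`` and the tutorial's Page Objects. The point it illustrates is that a rule is recorded only when the decorated class is defined, that is, when its module is imported, which is presumably why the docs ask for ``consume_modules()`` before reading rules from other packages.

```python
from typing import Callable, List, NamedTuple


class Rule(NamedTuple):
    """Hypothetical stand-in for web_poet.OverrideRule."""
    pattern: str
    use: type
    instead_of: type


class Registry:
    """Toy registry; the real one is web_poet.overrides.PageObjectRegistry."""

    def __init__(self) -> None:
        self._rules: List[Rule] = []

    def handle_urls(self, pattern: str, overrides: type) -> Callable[[type], type]:
        """Class decorator that records a rule and returns the class unchanged."""
        def wrapper(cls: type) -> type:
            self._rules.append(Rule(pattern, use=cls, instead_of=overrides))
            return cls
        return wrapper

    def get_overrides(self) -> List[Rule]:
        # Return a copy so callers can edit the list without touching the registry.
        return list(self._rules)


registry = Registry()


class BookPage: ...


# The rule is registered as a side effect of defining (importing) this class.
@registry.handle_urls("toscrape.com", BookPage)
class BTSBookPage(BookPage): ...


# Mirrors the settings.py line shown in the docs above.
SCRAPY_POET_OVERRIDES = registry.get_overrides()
```

Because ``get_overrides()`` returns a plain list in this sketch, the resulting setting can be filtered or extended afterwards, which matches the note above about modifying ``SCRAPY_POET_OVERRIDES`` later on.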
diff --git a/docs/settings.rst b/docs/settings.rst
@@ -25,7 +25,7 @@ Default: ``None``
 
 Mapping of overrides for each domain. The format of this ``dict`` mapping
 depends on the currently set Registry. The default is currently 
-:class:`~.PerDomainOverridesRegistry`. This can be overriden by the setting below:
+:class:`~.OverridesRegistry`. This can be overridden by the setting below:
 ``SCRAPY_POET_OVERRIDES_REGISTRY``.
 
 There are sections dedicated for this at :ref:`intro-tutorial` and :ref:`overrides`.
@@ -36,7 +36,7 @@ SCRAPY_POET_OVERRIDES_REGISTRY
 
 Default: ``None``
 
-Sets an alternative Registry to replace the default :class:`~.PerDomainOverridesRegistry`.
+Sets an alternative Registry to replace the default :class:`~.OverridesRegistry`.
 To use this, set a ``str`` which denotes the absolute object path of the new
 Registry.