Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

introduce PageObjectRegistry with @hande_urls annotations #27

Merged
merged 14 commits into from
Apr 10, 2022

Conversation

BurnzZ
Copy link
Contributor

@BurnzZ BurnzZ commented Feb 28, 2022

This is built as a light-weight version of the new features introduce by this PR: https://github.com/scrapinghub/web-poet/pull/16/files

In contrast, this only contains a minimal set of features that allows developers to:

  • import and use the default_registry's @handle_urls annotation for declaring OverrideRules.
  • use search_overrides() to find specific OverrideRules amased from importing multiple Page Object Projects (POP).

This means that other non-essential features from the old PR aren't included:

  • registry_pool which allows keeping track of other non-default PageObjectRegistry instances
  • CLI-tool for displaying all of the OverrideRules
  • copy_overrides_from() which allows copying of OverrideRules from multiple non-default PageObjectRegistry instances
  • remove_overrides() since an exclusion-list strategy may be brittle if the underlying source changes. This results in some unintentional OverrideRules to be present in the output if the developer isn't aware of them.
  • replace_overrides() since a more explicit alternative would be creating a new OverrideRule which has the intended attribute replacements.
  • the filter param of get_overrides() since it could break the developer's expectation if new OverrideRules are added to the filter module which could lead to unintentional OverrideRules to be present.

Progress:

  • Tests
  • rewrite docs to focus on more minimal usage

To address in other PRs:

@BurnzZ BurnzZ added the enhancement New feature or request label Feb 28, 2022
@codecov
Copy link

codecov bot commented Feb 28, 2022

Codecov Report

Merging #27 (67e7297) into master (256a0c3) will not change coverage.
The diff coverage is 100.00%.

@@            Coverage Diff            @@
##            master       #27   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files            5         6    +1     
  Lines          127       209   +82     
=========================================
+ Hits           127       209   +82     
Impacted Files Coverage Δ
web_poet/__init__.py 100.00% <100.00%> (ø)
web_poet/overrides.py 100.00% <100.00%> (ø)
web_poet/utils.py 100.00% <100.00%> (ø)

@BurnzZ BurnzZ marked this pull request as ready for review March 1, 2022 07:16
@BurnzZ BurnzZ requested a review from kmike March 1, 2022 07:17
----------------------------------

The :meth:`~.PageObjectRegistry.get_overrides` method from the ``web_poet.default_registry``
allows discovery and retrieval of all :class:`~.OverrideRule` from your project.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think get_overrides is not doing discovery, and also it doesn't limit rules to a certain project:

Suggested change
allows discovery and retrieval of all :class:`~.OverrideRule` from your project.
allows retrieval of all :class:`~.OverrideRule` in the registry.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. Updated this on 37e5ba1


:meth:`~.PageObjectRegistry.get_overrides` relies on the fact that all essential
packages/modules which contains the :func:`web_poet.handle_urls`
annotations are properly loaded.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
annotations are properly loaded.
annotations are properly loaded (i.e. imported).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall, I think it may be a bit more clear if in some places further in the docs instead of "loaded" is replaced with "imported". consume_modules is not doing anything fancy, it just imports modules.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated this on 37e5ba1

rules = default_registry.get_overrides()

# Fortunately, `get_overrides()` provides a shortcut for the lines above:
rules = default_registry.get_overrides(consume=["external_package_A.po", "another_ext_package.lib"])
Copy link
Member

@kmike kmike Mar 30, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because of the caveats I'm not sure it makes sense to provide consume keyword for get_overrides - the code above reads as it'd return only POs from the listed modules. With

consume_modules("external_package_A.po", "another_ext_package.lib")
rules = default_registry.get_overrides()

the caveat is more clear IMHO, and it's the same amount of code. Do you think we can remove this parameter?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, we can make it more explicit by having to access consume_modules() externally.

Removed this on 37e5ba1

docs/intro/overrides.rst Outdated Show resolved Hide resolved

There are two main ways we recommend in solving this.

**1. Priority Resolution**
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the main use case here could be the following:

  1. There is a library of page objects
  2. In your project you want to have a different version for one of POs available in the library

In this case there is no need to modify any override rules, it's a matter of using handle_urls decorator in your own project, and setting a priority higher than the default.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kmike to clarify, are you suggesting something like the following?

If we have the conflict like:

rules = default_registry.get_overrides()
print(rules)
# OverrideRule(for_patterns=Patterns(include=['site_2.com'], exclude=[], priority=500), use=<class 'ecommerce_page_objects.site_2.EcomSite2'>, instead_of=<class 'common_items.ProductGenericPage'>, meta={})
# OverrideRule(for_patterns=Patterns(include=['site_2.com'], exclude=[], priority=500), use=<class 'gadget_sites_page_objects.site_2.GadgetSite2'>, instead_of=<class 'common_items.ProductGenericPage'>, meta={})

It can be fixed in the project as something like:

import handle_urls

handle_urls(ProductGenericPage, overrides=ProductGenericPage, priority=600)(EcomSite2)

At the moment, this won't work since the PageObjectRegistry stores the PO class as the key, and if a rule is already existing for a given PO, it ignores new rules attempted to be registered. Reference.

Do you think it's worth refactoring this from Dict[<PO class>, OverrideRule] into Dict[<PO class>, List[OverrideRule]] instead?

Though this would affect the other implementations like search_overrides() and how it's being read in scrapy-poet as well. Nonetheless, we're still quite early in the process and we can still accommodate this change.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's say we have ShopProductPage defined in a common library, along with many other POs:

# shop_crawling_lib

@handle_urls("shop.com", overrides=GenericProductPage)
class ShopProductPage(ItemWebPage):
    # ...

@handle_urls("shop2.com", overrides=GenericProductPage)
class Shop2ProductPage(ItemWebPage):
    # ...

@handle_urls("shop3.com", overrides=GenericProductPage)
class Shop2ProductPage(ItemWebPage):
    # ...

A developer installs and uses this library:

consume_modules("shop_crawling_lib")

Most POs are fine, but the PO for shop.com doesn't fit the bill. So, developer creates a new PO for this:

@handle_urls("shop.com", overrides=GenericProductPage, priority=600)
class ShopProductPage(ItemWebPage):
    # ...

Then we need to make sure all the annotations are discovered:

consume_modules("shop_crawling_lib")
consume_modules("my_project")  # order doesn't matter

rules = default_registry.get_overrides()

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kmike that makes sense! Thanks for explaining it as I initially had the wrong idea. This use case is indeed a common one and it deserves it's own section which I've added in 7944956.

were recently added. This could lead to a `silent-error` of receiving a different
set of rules than expected.

For this approach, you can use the :meth:`~.PageObjectRegistry.search_overrides`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, I was thinking about importing POs directly. How would you do inclusion in search_overrides?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added another in-depth example for search_overrides() in 7457331.

web_poet/overrides.py Outdated Show resolved Hide resolved
web_poet/overrides.py Outdated Show resolved Hide resolved

- ``my_page_obj_project`` `(since it's the same module as the file above)`
- ``other_external_pkg.po``
- ``another_pkg.lib``
Copy link
Member

@kmike kmike Mar 30, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Plus any other modules imported in the same process. I think the example only works if the code above is a whole script, not a module within a larger package.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good one. Updated this on 37e5ba1


If the ``default_registry`` had other ``@handle_urls`` annotations outside
of the packages/modules listed above, then the corresponding
:class:`~.OverrideRule` won't be returned.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, this is only true if they're not imported in some other way.

I think it may be fine to just explain that consume_module imports modules recursively.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated this in 37e5ba1

@kmike
Copy link
Member

kmike commented Mar 30, 2022

@BurnzZ the PR looks good to me. There are some ways to improve the docs, but it can be done later.

+1 to merge after removing "consume" option from get_overrides method.

@BurnzZ
Copy link
Contributor Author

BurnzZ commented Apr 4, 2022

Thanks for the review @kmike ! I've addressed all of your comments alongside the specific commit for the change. Let me know if there's anything left to update here. 🙏

@BurnzZ BurnzZ requested a review from kmike April 4, 2022 13:39
"""
if value is None:
return []
if not isinstance(value, list):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It'd be nice to add a test for tuple, e.g. what should as_list(("foo", "bar", 123)) return? Should it be a single-item list, or a tuple converted to a list?

Copy link
Contributor Author

@BurnzZ BurnzZ Apr 6, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point! After looking into this further, I've added more input types that as_list() should also be able to handle and convert. Updated in 67e7297

@BurnzZ BurnzZ force-pushed the handle_urls_mvp branch from 5eab374 to 67e7297 Compare April 6, 2022 06:00
@kmike kmike merged commit 831754b into master Apr 10, 2022
@kmike
Copy link
Member

kmike commented Apr 10, 2022

Thanks @BurnzZ!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants