Rules

Rules are ApplyRule objects that tell web-poet which page object class to use based on user input, i.e. the target URL and the requested output class (a page object class or an item class).

Rules are necessary if you want to request an item instance, because rules tell web-poet which page object class to use to generate your item instance. Rules can also be useful as documentation or to get information about page object classes programmatically.

Rule precedence can also be useful, for example to implement generic page object classes that you can override for specific websites.

Defining rules

The handle_urls() decorator is the simplest way to define a rule for a page object. For example:

from web_poet import ItemPage, handle_urls

from my_items import MyItem

@handle_urls("example.com")
class MyPage(ItemPage[MyItem]):
    ...

The code above tells web-poet to use the MyPage page object class when it is given a URL on the example.com domain and asked for MyPage or MyItem as the output class.

Alternatively, you can manually create and register ApplyRule objects:

from url_matcher import Patterns
from web_poet import ApplyRule, ItemPage, default_registry

from my_items import MyItem

class MyPage(ItemPage[MyItem]):
    ...

rule = ApplyRule(
    for_patterns=Patterns(include=['example.com']),
    use=MyPage,
    to_return=MyItem,
)
default_registry.add_rule(rule)

URL patterns

Every rule defines a url_matcher.Patterns object that determines if any given URL is a match for the rule.

Patterns objects offer a simple but powerful syntax for URL matching. For example:

Pattern                    Behavior
(empty string)             Matches any URL
example.com                Matches any URL on the example.com domain and subdomains
example.com/products/      Matches example.com URLs under the /products/ path
example.com?productId=*    Matches example.com URLs with productId=… in their query string

For details and more examples, see the url-matcher documentation.

When using the handle_urls() decorator, its include, exclude, and priority parameters are used to create a Patterns object. When creating an ApplyRule object manually, you must create a Patterns object yourself and pass it to the for_patterns parameter of ApplyRule.
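
To make the pattern behaviors listed above more concrete, here is a minimal standard-library sketch that mimics only those four cases. It is not url-matcher and does not use its API; toy_match is a hypothetical helper, and the real library supports more syntax and has its own precise matching rules, so treat this purely as an illustration.

```python
from urllib.parse import parse_qs, urlsplit


def toy_match(pattern: str, url: str) -> bool:
    """Toy approximation of the four example patterns (NOT url-matcher)."""
    # An empty pattern matches any URL.
    if pattern == "":
        return True

    # Prefix "//" so urlsplit reads the pattern's leading part as a host.
    wanted = urlsplit("//" + pattern)
    actual = urlsplit(url)

    # Domain: the host must be the domain itself or one of its subdomains.
    host = actual.hostname or ""
    domain = wanted.hostname or ""
    if domain and host != domain and not host.endswith("." + domain):
        return False

    # Path: treated as a simple prefix in this sketch.
    if wanted.path and not actual.path.startswith(wanted.path):
        return False

    # Query: every parameter named in the pattern must be present;
    # a "*" value accepts any value for that parameter.
    actual_query = parse_qs(actual.query, keep_blank_values=True)
    wanted_query = parse_qs(wanted.query, keep_blank_values=True)
    for name, values in wanted_query.items():
        if name not in actual_query:
            return False
        if values != ["*"] and not set(values) <= set(actual_query[name]):
            return False
    return True


print(toy_match("example.com", "https://shop.example.com/cart"))
print(toy_match("example.com/products/", "https://example.com/about"))
```

For authoritative matching behavior, rely on url_matcher.Patterns itself rather than on this approximation.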

Rule precedence

Often you define rules so that a given user input, i.e. a combination of a target URL and an output class, can only match 1 rule. However, there are scenarios where it can be useful to define 2 or more rules that can all match a given user input.

For example, you might want to define a “generic” page object class with some default implementation of field extraction, e.g. based on semantic markup or machine learning, and be able to override it based on the input URL, e.g. for specific websites or URL patterns, with a more specific page object class.

For a given user input, when 2 or more rules are a match, web-poet breaks the tie as follows:

  • One rule can indicate that its page object class overrides another page object class.

    This is specified by ApplyRule.instead_of. When using the handle_urls() decorator, the value comes from the instead_of parameter of the decorator.

    For example, the following page object class would override MyPage from above:

    @handle_urls("example.com", instead_of=MyPage)
    class OverridingPage(ItemPage[MyItem]):
        ...
    

    That is:

    • If the requested output class is MyPage, an instance of OverridingPage is returned instead.

    • If the requested output class is MyItem, an instance of OverridingPage is created, and used to build an instance of MyItem, which is returned.

  • One rule can declare a higher priority than another rule, taking precedence.

    Rule priority is determined by the value of ApplyRule.for_patterns.priority. When using the handle_urls() decorator, the value comes from the priority parameter of the decorator. Rule priority is 500 by default.

    For example, given the following page object class:

    @handle_urls("example.com", priority=510)
    class PriorityPage(ItemPage[MyItem]):
        ...
    

    The following would happen:

    • If the requested output class is MyItem, an instance of PriorityPage is created, and used to build an instance of MyItem, which is returned.

    • If the requested output class is MyPage, an instance of MyPage is returned, since PriorityPage is not defined as an override for MyPage.

instead_of trumps priority: If a rule overrides another rule using instead_of, it does not matter if the overridden rule had a higher priority.

When multiple rules override the same page object class, though, priority can break the tie.

If none of those tie breakers are in place, the first rule added to the registry takes precedence. However, relying on registration order is discouraged, and you will get a warning if you register 2 or more rules with the same URL patterns, same output item class, same priority, and no instead_of value. See also Rule conflicts.
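
The tie-breaking order described above can be summarized with a toy model. This is a minimal sketch under stated assumptions, not web-poet's implementation: ToyRule and pick_page are hypothetical names, classes are represented by plain strings, and every rule passed in is assumed to have already matched the target URL.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class ToyRule:
    """Hypothetical stand-in for a rule; class names are plain strings."""
    use: str
    to_return: Optional[str] = None
    instead_of: Optional[str] = None
    priority: int = 500  # same default as described in the text


def pick_page(rules: Sequence[ToyRule], requested: str) -> Optional[str]:
    """Return the name of the page object class that wins for `requested`."""
    # Rules are relevant if they use, return, or override the requested class.
    candidates = [
        r for r in rules if requested in (r.use, r.to_return, r.instead_of)
    ]
    if not candidates:
        return None

    # 1. instead_of trumps priority: drop rules whose class is overridden.
    overridden = {r.instead_of for r in candidates if r.instead_of is not None}
    survivors = [r for r in candidates if r.use not in overridden] or candidates

    # 2. The highest priority wins. 3. max() keeps the first maximal element,
    # so registration order is the last tie breaker.
    return max(survivors, key=lambda r: r.priority).use


my_page = ToyRule(use="MyPage", to_return="MyItem")
overriding = ToyRule(use="OverridingPage", to_return="MyItem", instead_of="MyPage")
priority_page = ToyRule(use="PriorityPage", to_return="MyItem", priority=510)

print(pick_page([my_page, overriding], "MyPage"))
print(pick_page([my_page, priority_page], "MyItem"))
print(pick_page([my_page, priority_page], "MyPage"))
```

The three calls mirror the OverridingPage and PriorityPage scenarios from this section: the override wins when MyPage is requested, the higher priority wins when MyItem is requested, and MyPage is still returned when it is requested and nothing overrides it.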

Rule registries

Rules should be stored in a RulesRegistry object.

web-poet defines a default, global RulesRegistry object at web_poet.default_registry. Rules defined with the handle_urls() decorator are added to this registry.

Loading rules

For a framework to apply your rules, you need to make sure that the code that adds those rules to web_poet.default_registry is executed.

When using the handle_urls() decorator, that usually means that you need to make sure that Python imports the files where the decorator is used.

You can use the consume_modules() function in some entry point of your code for that:

from web_poet import consume_modules

consume_modules("my_package.pages", "external_package.pages")

The ideal location for this function depends on your framework. Check the documentation of your framework for more information.

Rule conflicts

A rule conflict occurs when multiple rules have the same instead_of and priority values and can match the same URL.

When it affects rules defined in your code base, solve the conflict by adjusting those instead_of and priority values as needed.

When it affects rules from an external package, you have the following options to solve the conflict:

  • Subclass one of the conflicting page object classes in your code base, using a similar rule except for a tie-breaking change to its instead_of or priority value.

    For example, if package1.A and package2.B are page object classes with conflicting rules, with a default priority (500), and you want package1.A to take precedence, declare a new page object class as follows:

    from package1 import A
    from web_poet import handle_urls
    
    @handle_urls(..., priority=510)
    class NewA(A):
        pass
    
  • If your framework allows defining a custom list of rules, you could use web_poet.default_registry methods like get_rules() or search() to build such a list, including only rules that have no conflicts.
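
As a rough sketch of the last option, the helper below splits a list of rules into non-conflicting and conflicting ones. split_conflicts and fake_rule are hypothetical names, not part of web-poet. The sketch assumes that each rule exposes for_patterns (with include, exclude, and priority), instead_of, and to_return, and it only treats rules as conflicting when their patterns are identical, which is narrower than "can match the same URL". The demo uses stand-in objects so that it runs without web-poet installed; in real code the input would come from something like default_registry.get_rules().

```python
from collections import defaultdict
from types import SimpleNamespace


def split_conflicts(rules):
    """Return (kept, conflicting) lists; hypothetical helper, simplified check."""
    groups = defaultdict(list)
    for rule in rules:
        patterns = rule.for_patterns
        # Rules sharing all of these values cannot be told apart by the
        # tie breakers, so they are grouped together.
        key = (
            tuple(patterns.include),
            tuple(patterns.exclude),
            patterns.priority,
            rule.instead_of,
            rule.to_return,
        )
        groups[key].append(rule)

    kept, conflicting = [], []
    for group in groups.values():
        (kept if len(group) == 1 else conflicting).extend(group)
    return kept, conflicting


def fake_rule(use, include, priority=500, instead_of=None, to_return="Item"):
    """Build a stand-in object shaped like a rule (for this demo only)."""
    patterns = SimpleNamespace(include=include, exclude=[], priority=priority)
    return SimpleNamespace(
        for_patterns=patterns, use=use, instead_of=instead_of, to_return=to_return
    )


rules = [
    fake_rule("package1.A", ["example.com"]),
    fake_rule("package2.B", ["example.com"]),
    fake_rule("package3.C", ["example.com"], priority=510),
]
kept, conflicting = split_conflicts(rules)
print([r.use for r in kept])         # ['package3.C']
print([r.use for r in conflicting])  # ['package1.A', 'package2.B']
```

Whether and how the resulting list can be handed to your framework depends on the framework, so check its documentation before relying on this approach.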