Rules
Rules are ApplyRule
objects that tell web-poet which page
object class to use based on user input, i.e. the target
URL and the requested output class (a page object class or an item class).
Rules are necessary if you want to request an item instance, because rules tell web-poet which page object class to use to generate your item instance. Rules can also be useful as documentation or to get information about page object classes programmatically.
Rule precedence can also be useful. For example, to implement generic page object classes that you can override for specific websites.
Defining rules
The handle_urls()
decorator is the simplest way to define a rule for
a page object. For example:
from web_poet import ItemPage, handle_urls
from my_items import MyItem
@handle_urls("example.com")
class MyPage(ItemPage[MyItem]):
...
The code above tells web-poet to use the MyPage
page object class when given a URL pointing to the example.com
domain
name and being asked for MyPage
or MyItem
as output class.
Alternatively, you can manually create and register ApplyRule
objects:
from url_matcher import Patterns
from web_poet import ApplyRule, ItemPage, default_registry
from my_items import MyItem
class MyPage(ItemPage[MyItem]):
...
rule = ApplyRule(
for_patterns=Patterns(include=['example.com']),
use=MyPage,
to_return=MyItem,
)
default_registry.add_rule(rule)
URL patterns
Every rule defines a url_matcher.Patterns
object that determines if
any given URL is a match for the rule.
Patterns
objects offer a simple but powerful syntax for
URL matching. For example:
Pattern |
Behavior |
---|---|
(empty string) |
Matches any URL |
example.com |
Matches any URL on the example.com domain and subdomains |
example.com/products/ |
Matches example.com URLs under the /products/ path |
example.com?productId=* |
Matches example.com URLs with productId=… in their query string |
For details and more examples, see the url-matcher documentation.
When using the handle_urls()
decorator, its include
, exclude
,
and priority
parameters are used to create a Patterns
object. When creating an ApplyRule
object manually, you must create
a Patterns
object yourself and pass it to the
for_patterns
parameter of ApplyRule
.
Rule precedence
Often you define rules so that a given user input, i.e. a combination of a target URL and an output class, can only match 1 rule. However, there are scenarios where it can be useful to define 2 or more rules that can all match a given user input.
For example, you might want to define a “generic” page object class with some default implementation of field extraction, e.g. based on semantic markup or machine learning, and be able to override it based on the input URL, e.g. for specific websites or URL patterns, with a more specific page object class.
For a given user input, when 2 or more rules are a match, web-poet breaks the tie as follows:
One rule can indicate that its page object class overrides another page object class.
This is specified by
ApplyRule.instead_of
. When using thehandle_urls()
decorator, the value comes from theinstead_of
parameter of the decorator.For example, the following page object class would override
MyPage
from above:@handle_urls("example.com", instead_of=MyPage) class OverridingPage(ItemPage[MyItem]): ...
That is:
If the requested output class is
MyPage
, an instance ofOverridingPage
is returned instead.If the requested output class is
MyItem
, an instance ofOverridingPage
is created, and used to build an instance ofMyItem
, which is returned.
One rule can declare a higher priority than another rule, taking precedence.
Rule priority is determined by the value of
ApplyRule.for_patterns.priority
. When using thehandle_urls()
decorator, the value comes from thepriority
parameter of the decorator. Rule priority is 500 by default.For example, given the following page object class:
@handle_urls("example.com", priority=510) class PriorityPage(ItemPage[MyItem]): ...
The following would happen:
If the requested output class is
MyItem
, an instance ofPriorityPage
is created, and used to build an instance ofMyItem
, which is returned.If the requested output class is
MyPage
, an instance ofMyPage
is returned, sincePriorityPage
is not defined as an override forMyPage
.
instead_of
triumphs priority
: If a rule overrides another rule using
instead_of
, it does not matter if the overridden rule had a higher
priority.
When multiple rules override the same page object class, through, priority
can break the tie.
If none of those tie breakers are in place, the first rule added to the
registry takes precedence. However, relying on registration order is
discouraged, and you will get a warning if you register 2 or more rules with
the same URL patterns, same output item class, same priority, and no
instead_of
value. See also Rule conflicts.
Rule registries
Rules should be stored in a RulesRegistry
object.
web-poet defines a default, global RulesRegistry
object at
web_poet.default_registry
. Rules defined with the handle_urls()
decorator are added to this registry.
Loading rules
For a framework to apply your rules, you need to make sure
that your code that adds those rules to web_poet.default_registry
is
executed.
When using the handle_urls()
decorator, that usually means that
you need to make sure that Python imports the files where the decorator is
used.
You can use the consume_modules()
function in some entry
point of your code for that:
from web_poet import consume_modules
consume_modules("my_package.pages", "external_package.pages")
The ideal location for this function depends on your framework. Check the documentation of your framework for more information.
Rule conflicts
A rule conflict occurs when multiple rules have the same instead_of
and
priority
values and can match the same URL.
When it affects rules defined in your code base, solve the conflict adjusting
those instead_of
and priority
values as needed.
When it affects rules from a external package, you have the following options to solve the conflict:
Subclass one of the conflicting page object classes in your code base, using a similar rule except for a tie-breaking change to its
instead_of
orpriority
value.For example, if
package1.A
andpackage2.B
are page object classes with conflicting rules, with a default priority (500), and you wantpackage1.A
to take precedence, declare a new page object class as follows:from package1 import A from web_poet import handle_urls @handle_urls(..., priority=510) class NewA(A): pass
If your framework allows defining a custom list of rules, you could use
web_poet.default_registry
methods likeget_rules()
orsearch()
to build such a list, including only rules that have no conflicts.