API Reference

Page Inputs

class web_poet.page_inputs.browser.BrowserHtml[source]

Bases: SelectableMixin, str

HTML returned by a web browser, i.e. snapshot of the DOM tree in HTML format.

css(query) SelectorList

A shortcut to .selector.css().

property selector: Selector

Cached instance of parsel.selector.Selector.

xpath(query, **kwargs) SelectorList

A shortcut to .selector.xpath().

class web_poet.page_inputs.http.RequestUrl(*args, **kwargs)

Bases: RequestUrl

class web_poet.page_inputs.http.ResponseUrl(*args, **kwargs)

Bases: ResponseUrl

class web_poet.page_inputs.http.HttpRequestBody[source]

Bases: bytes

A container for holding the raw HTTP request body in bytes format.

class web_poet.page_inputs.http.HttpResponseBody[source]

Bases: bytes

A container for holding the raw HTTP response body in bytes format.

declared_encoding() Optional[str][source]

Return the encoding specified in the meta tags of the HTML body, or None if no suitable encoding is found.

json()[source]

Deserialize a JSON document to a Python object.

class web_poet.page_inputs.http.HttpRequestHeaders[source]

Bases: _HttpHeaders

A container for holding the HTTP request headers.

It’s able to accept instantiation via an Iterable of Tuples:

>>> pairs = [("Content-Encoding", "gzip"), ("content-length", "648")]
>>> HttpRequestHeaders(pairs)
<HttpRequestHeaders('Content-Encoding': 'gzip', 'content-length': '648')>

It also accepts a mapping of key-value pairs:

>>> pairs = {"Content-Encoding": "gzip", "content-length": "648"}
>>> headers = HttpRequestHeaders(pairs)
>>> headers
<HttpRequestHeaders('Content-Encoding': 'gzip', 'content-length': '648')>

Note that this also supports case insensitive header-key lookups:

>>> headers.get("content-encoding")
'gzip'
>>> headers.get("Content-Length")
'648'

These are just a few of the functionalities it inherits from multidict.CIMultiDict. For more info on its other features, read the API spec of multidict.CIMultiDict.

copy()

Return a copy of itself.

classmethod from_name_value_pairs(arg: List[Dict]) T_headers

An alternative constructor for instantiation using a List[Dict], where each dict stores the header name under the ‘name’ key and the header value under the ‘value’ key.

>>> pairs = [
...     {"name": "Content-Encoding", "value": "gzip"},
...     {"name": "content-length", "value": "648"}
... ]
>>> headers = _HttpHeaders.from_name_value_pairs(pairs)
>>> headers
<_HttpHeaders('Content-Encoding': 'gzip', 'content-length': '648')>

class web_poet.page_inputs.http.HttpResponseHeaders[source]

Bases: _HttpHeaders

A container for holding the HTTP response headers.

It’s able to accept instantiation via an Iterable of Tuples:

>>> pairs = [("Content-Encoding", "gzip"), ("content-length", "648")]
>>> HttpResponseHeaders(pairs)
<HttpResponseHeaders('Content-Encoding': 'gzip', 'content-length': '648')>

It also accepts a mapping of key-value pairs:

>>> pairs = {"Content-Encoding": "gzip", "content-length": "648"}
>>> headers = HttpResponseHeaders(pairs)
>>> headers
<HttpResponseHeaders('Content-Encoding': 'gzip', 'content-length': '648')>

Note that this also supports case insensitive header-key lookups:

>>> headers.get("content-encoding")
'gzip'
>>> headers.get("Content-Length")
'648'

These are just a few of the functionalities it inherits from multidict.CIMultiDict. For more info on its other features, read the API spec of multidict.CIMultiDict.

classmethod from_bytes_dict(arg: Dict[AnyStr, Union[AnyStr, List, Tuple[AnyStr, ...]]], encoding: str = 'utf-8') T_headers[source]

An alternative constructor for instantiation where the header-value pairs could be in raw bytes form.

This supports multiple header values in the form of List[bytes] and Tuple[bytes], alongside a plain bytes value. A str value also works and does not break the decoding process.

By default, it converts the bytes value using “utf-8”. However, this can easily be overridden using the encoding parameter.

>>> raw_values = {
...     b"Content-Encoding": [b"gzip", b"br"],
...     b"Content-Type": [b"text/html"],
...     b"content-length": b"648",
... }
>>> headers = HttpResponseHeaders.from_bytes_dict(raw_values)
>>> headers
<HttpResponseHeaders('Content-Encoding': 'gzip', 'Content-Encoding': 'br', 'Content-Type': 'text/html', 'content-length': '648')>

declared_encoding() Optional[str][source]

Return the encoding detected from the Content-Type header, or None if no encoding is found.
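As a rough illustration of reading an encoding from a Content-Type header value, the sketch below uses only the Python standard library. The helper name charset_from_content_type is hypothetical, and this is not web-poet's own implementation, which may apply additional rules:

```python
from email.message import Message
from typing import Optional


def charset_from_content_type(content_type: str) -> Optional[str]:
    """Return the charset parameter of a Content-Type value, or None."""
    message = Message()
    message["Content-Type"] = content_type
    # get_content_charset() lower-cases the value and returns None when
    # the header carries no charset parameter.
    return message.get_content_charset()


with_charset = charset_from_content_type("text/html; charset=UTF-8")
without_charset = charset_from_content_type("text/html")
```

Here with_charset holds the lower-cased charset name, while without_charset is None because the header declares no charset.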

copy()

Return a copy of itself.

classmethod from_name_value_pairs(arg: List[Dict]) T_headers

An alternative constructor for instantiation using a List[Dict], where each dict stores the header name under the ‘name’ key and the header value under the ‘value’ key.

>>> pairs = [
...     {"name": "Content-Encoding", "value": "gzip"},
...     {"name": "content-length", "value": "648"}
... ]
>>> headers = _HttpHeaders.from_name_value_pairs(pairs)
>>> headers
<_HttpHeaders('Content-Encoding': 'gzip', 'content-length': '648')>

class web_poet.page_inputs.http.HttpRequest(url: Union[str, _Url], *, method: str = 'GET', headers=NOTHING, body=NOTHING)[source]

Bases: object

Represents a generic HTTP request used by other web-poet features, such as HttpClient.

url: RequestUrl
method: str
headers: HttpRequestHeaders
body: HttpRequestBody

urljoin(url: Union[str, RequestUrl, ResponseUrl]) RequestUrl[source]

Return url as an absolute URL.

If the url argument is relative, it is made absolute relative to the url attribute of this request.
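The general resolution rule can be illustrated with the standard library's urllib.parse.urljoin. This is only a sketch of how relative URLs resolve against a base; the example URLs are made up, and web-poet's method returns a RequestUrl rather than a plain str:

```python
from urllib.parse import urljoin

# Hypothetical URL of the request that acts as the base.
request_url = "https://example.com/catalog/page1.html"

# A relative path replaces the last path segment of the base.
relative = urljoin(request_url, "item/42")

# A root-relative path replaces the whole path of the base.
root_relative = urljoin(request_url, "/about")

# An absolute URL is returned unchanged.
absolute = urljoin(request_url, "https://other.example/x")
```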

class web_poet.page_inputs.http.HttpResponse(url: Union[str, _Url], body, *, status: Optional[int] = None, headers=NOTHING, encoding: Optional[str] = None)[source]

Bases: SelectableMixin

A container for the contents of a response, downloaded directly using an HTTP client.

url should be a URL of the response (after all redirects), not a URL of the request, if possible.

body contains the raw HTTP response body.

The following parameters are optional, since their availability depends on the source of the HttpResponse. For example, a response could come from a local HTML file, which has no headers or status.

status should represent the int status code of the HTTP response.

headers should contain the HTTP response headers.

encoding is the encoding of the response. If None (default), the encoding is auto-detected from the headers and body content.

url: ResponseUrl
body: HttpResponseBody
status: Optional[int]
headers: HttpResponseHeaders

property text: str

Content of the HTTP body, converted to unicode using the detected encoding of the response, according to the web browser rules (respecting Content-Type header, etc.)

property encoding

Encoding of the response

json()[source]

Deserialize a JSON document to a Python object.

urljoin(url: Union[str, RequestUrl, ResponseUrl]) RequestUrl[source]

Return url as an absolute URL.

If url is relative, it is made absolute relative to the base URL of self.

css(query) SelectorList

A shortcut to .selector.css().

property selector: Selector

Cached instance of parsel.selector.Selector.

xpath(query, **kwargs) SelectorList

A shortcut to .selector.xpath().

class web_poet.page_inputs.page_params.PageParams[source]

Bases: dict

Container class that could contain any arbitrary data to be passed into a Page Object.

Note that this is simply a subclass of Python’s dict.

class web_poet.page_inputs.client.HttpClient(request_downloader: Optional[Callable] = None)[source]

Async HTTP client to be used in Page Objects.

See Additional Requests for the usage information.

HttpClient doesn’t make HTTP requests by itself. It uses either the request function assigned to the web_poet.request_downloader_var contextvar, or a function passed via the request_downloader argument of the __init__() method.

Either way, this function should be an async def function which receives an HttpRequest instance, and either returns an HttpResponse instance or raises a subclass of HttpError. You can read more in the Providing the Downloader documentation.
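The two ways of supplying the download function can be sketched with standard-library stand-ins. Everything named here (SketchClient, sketch_downloader_var, fake_downloader) is hypothetical, plain strings replace HttpRequest and HttpResponse, and the precedence of the explicit argument over the contextvar is an assumption of this sketch rather than documented behavior:

```python
import asyncio
from contextvars import ContextVar
from typing import Awaitable, Callable, Optional

Downloader = Callable[[str], Awaitable[str]]

# Hypothetical stand-in for web_poet.request_downloader_var.
sketch_downloader_var: ContextVar = ContextVar("sketch_downloader")


class SketchClient:
    """Toy client that delegates the actual download to a function."""

    def __init__(self, request_downloader: Optional[Downloader] = None):
        self._request_downloader = request_downloader

    async def fetch(self, url: str) -> str:
        # Assumption: use the explicit function if given, else the contextvar.
        downloader = self._request_downloader or sketch_downloader_var.get()
        return await downloader(url)


async def fake_downloader(url: str) -> str:
    # A real downloader would perform network I/O here.
    return f"body of {url}"


async def main():
    # Option 1: pass the function to the constructor.
    explicit = await SketchClient(fake_downloader).fetch("https://example.com/a")

    # Option 2: set the contextvar before the client is used.
    token = sketch_downloader_var.set(fake_downloader)
    try:
        from_var = await SketchClient().fetch("https://example.com/b")
    finally:
        sketch_downloader_var.reset(token)
    return explicit, from_var


results = asyncio.run(main())
```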

async request(url: Union[str, _Url], *, method: str = 'GET', headers: Optional[Union[Dict[str, str], HttpRequestHeaders]] = None, body: Optional[Union[bytes, HttpRequestBody]] = None, allow_status: Optional[List[Union[str, int]]] = None) HttpResponse[source]

This is a shortcut for creating an HttpRequest instance and executing that request.

HttpRequestError is raised for connection errors, connection and read timeouts, etc.

An HttpResponse instance is returned for successful responses in the 100-3xx status code range.

Otherwise, an exception of type HttpResponseError is raised.

Raising HttpResponseError can be suppressed for certain status codes using the allow_status parameter: a list of status code values for which an HttpResponse should be returned instead of raising HttpResponseError.

There is a special “*” allow_status value which allows any status code.

There is no need to include 100-3xx status codes in allow_status, because HttpResponseError is not raised for them.
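The status rules described above can be summarized in a small decision function. This is a sketch under the assumption that the check compares status codes as strings so that "*" and integer values can share one list; the name should_raise is hypothetical and not part of web-poet:

```python
from typing import List, Optional, Union


def should_raise(
    status: int,
    allow_status: Optional[List[Union[str, int]]] = None,
) -> bool:
    """Return True when a response with this status would count as an error."""
    allowed = [str(value) for value in (allow_status or [])]
    # "*" allows any status code; listed codes are allowed individually.
    if "*" in allowed or str(status) in allowed:
        return False
    # Statuses in the 100-3xx range never count as errors.
    return status >= 400


outcomes = {
    "ok": should_raise(200),
    "not_found": should_raise(404),
    "not_found_allowed": should_raise(404, allow_status=[404]),
    "any_allowed": should_raise(500, allow_status=["*"]),
}
```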

async get(url: Union[str, _Url], *, headers: Optional[Union[Dict[str, str], HttpRequestHeaders]] = None, allow_status: Optional[List[Union[str, int]]] = None) HttpResponse[source]

Similar to request() but performing a GET request.

async post(url: Union[str, _Url], *, headers: Optional[Union[Dict[str, str], HttpRequestHeaders]] = None, body: Optional[Union[bytes, HttpRequestBody]] = None, allow_status: Optional[List[Union[str, int]]] = None) HttpResponse[source]

Similar to request() but performing a POST request.

async execute(request: HttpRequest, *, allow_status: Optional[List[Union[str, int]]] = None) HttpResponse[source]

Execute the specified HttpRequest instance using the request implementation configured in the HttpClient instance.

HttpRequestError is raised for connection errors, connection and read timeouts, etc.

An HttpResponse instance is returned for successful responses in the 100-3xx status code range.

Otherwise, an exception of type HttpResponseError is raised.

Raising HttpResponseError can be suppressed for certain status codes using the allow_status parameter: a list of status code values for which an HttpResponse should be returned instead of raising HttpResponseError.

There is a special “*” allow_status value which allows any status code.

There is no need to include 100-3xx status codes in allow_status, because HttpResponseError is not raised for them.

async batch_execute(*requests: HttpRequest, return_exceptions: bool = False, allow_status: Optional[List[Union[str, int]]] = None) List[Union[HttpResponse, Exception]][source]

Similar to execute() but accepts a collection of HttpRequest instances that would be batch executed.

The order of the returned HttpResponse instances corresponds to the order of the HttpRequest instances passed.

If any of the HttpRequest instances raises an exception upon execution, the exception is raised.

To prevent this, the actual exception can be returned alongside any successful HttpResponse. This enables salvaging any usable responses despite any possible failures. This can be done by setting the return_exceptions parameter to True.

Like execute(), HttpResponseError will be raised for responses with status codes in the 400-5xx range. The allow_status parameter could be used the same way here to prevent these exceptions from being raised.

You can omit allow_status="*" if you’re passing return_exceptions=True. However, the result would then contain HttpResponseError instances instead of HttpResponse instances for the affected responses.

Lastly, an HttpRequestError may be raised in cases like connection errors, connection and read timeouts, etc.
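The effect of return_exceptions=True resembles that of the same flag in the standard library's asyncio.gather, which the sketch below uses directly. The fake_execute coroutine and the URLs are hypothetical, and no claim is made here about how batch_execute() is implemented:

```python
import asyncio


async def fake_execute(url: str) -> str:
    # Hypothetical stand-in for executing one request.
    if "broken" in url:
        raise ConnectionError(f"cannot reach {url}")
    return f"response for {url}"


async def main():
    urls = [
        "https://example.com/1",
        "https://example.com/broken",
        "https://example.com/3",
    ]
    # With return_exceptions=True, failures are returned in place instead of
    # being raised, and results keep the order of the inputs.
    return await asyncio.gather(
        *(fake_execute(url) for url in urls),
        return_exceptions=True,
    )


results = asyncio.run(main())
usable = [result for result in results if not isinstance(result, Exception)]
```

The usable list keeps the two successful results, while the failure stays available in results for inspection.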

Pages

class web_poet.pages.Injectable[source]

Base Page Object class, which all Page Objects should inherit from (probably through Injectable subclasses).

Frameworks which use web-poet Page Objects should use the is_injectable() function to detect whether an object is an Injectable and, if it is, allow building it automatically through dependency injection, using the https://github.com/scrapinghub/andi library.

Instead of inheriting you can also use Injectable.register(MyWebPage). Injectable.register can also be used as a decorator.

web_poet.pages.is_injectable(cls: Any) bool[source]

Return True if cls is a class which inherits from Injectable.
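Registration as an alternative to inheritance can be sketched with a plain abc.ABC. SketchInjectable and sketch_is_injectable below are hypothetical stand-ins written for illustration, assuming the base class behaves like a standard abstract base class with a register() method:

```python
import abc
import inspect
from typing import Any


class SketchInjectable(abc.ABC):
    """Hypothetical stand-in for web_poet.pages.Injectable."""


def sketch_is_injectable(cls: Any) -> bool:
    # Only classes qualify; instances and other objects do not.
    return inspect.isclass(cls) and issubclass(cls, SketchInjectable)


class InheritedPage(SketchInjectable):
    pass


# ABC.register returns the class it receives, so it also works as a decorator.
@SketchInjectable.register
class RegisteredPage:
    pass


class PlainClass:
    pass


checks = (
    sketch_is_injectable(InheritedPage),
    sketch_is_injectable(RegisteredPage),
    sketch_is_injectable(PlainClass),
    sketch_is_injectable(InheritedPage()),
)
```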

class web_poet.pages.ItemPage[source]

Bases: Injectable, ABC

Base Page Object with a required to_item() method. Make sure you’re creating Page Objects with to_item methods if their main goal is to extract a single data record from a web page.

abstract to_item()[source]

Extract an item from a web page

class web_poet.pages.WebPage(response: HttpResponse)[source]

Bases: Injectable, ResponseShortcutsMixin

Base Page Object which requires HttpResponse and provides XPath / CSS shortcuts.

Use this class as a base class for Page Objects which work on HTML downloaded using an HTTP client directly.

response: HttpResponse

property base_url: str

Return the base url of the given response

css(query) SelectorList

A shortcut to .selector.css().

property html

Shortcut to HTML Response’s content.

property selector: Selector

Cached instance of parsel.selector.Selector.

property url

Shortcut to HTML Response’s URL, as a string.

urljoin(url: str) str

Convert url to an absolute URL, taking into account the URL and base URL of the response.

xpath(query, **kwargs) SelectorList

A shortcut to .selector.xpath().

class web_poet.pages.ItemWebPage(response: HttpResponse)[source]

Bases: WebPage, ItemPage

WebPage that requires the to_item() method to be implemented.

Mixins

class web_poet.mixins.ResponseShortcutsMixin[source]

Common shortcut methods for working with HTML responses. This mixin could be used with Page Object base classes.

It requires a “response” attribute to be present.

property url

Shortcut to HTML Response’s URL, as a string.

property html

Shortcut to HTML Response’s content.

property base_url: str

Return the base url of the given response

urljoin(url: str) str[source]

Convert url to an absolute URL, taking into account the URL and base URL of the response.

Requests

web_poet.requests.request_downloader_var: ContextVar = <ContextVar name='request_downloader'>

Frameworks that want to support additional requests in web-poet should set the appropriate implementation of request_downloader_var for requesting data.

Exceptions

Core Exceptions

These exceptions are tied to how web-poet operates.

exception web_poet.exceptions.core.RequestDownloaderVarError[source]

The contents of web_poet.request_downloader_var were accessed, but no value had been set at the time the requests were executed.

See the documentation section about setting up the contextvars to learn more about this.

exception web_poet.exceptions.core.Retry[source]

The page object found that the input data is partial or empty, and a request retry may provide better input.

See Retries.
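The retry pattern can be sketched as follows. SketchRetry, to_item, and extract_with_retries are hypothetical names, and how a real framework reacts to Retry (attempt limits, delays) is framework-specific and not covered by this sketch:

```python
from typing import Callable


class SketchRetry(Exception):
    """Hypothetical stand-in for web_poet.exceptions.core.Retry."""


def to_item(html: str) -> dict:
    # The page object signals that its input is empty and worth re-fetching.
    if not html.strip():
        raise SketchRetry("empty input")
    return {"length": len(html)}


def extract_with_retries(fetch: Callable[[], str], max_attempts: int = 3) -> dict:
    # Framework side: fetch the input again whenever the page object asks.
    for attempt in range(1, max_attempts + 1):
        try:
            return to_item(fetch())
        except SketchRetry:
            if attempt == max_attempts:
                raise


# Two unusable inputs followed by a usable one.
responses = iter(["", "   ", "<html>ok</html>"])
item = extract_with_retries(lambda: next(responses))
```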

HTTP Exceptions

These are exceptions pertaining to common issues faced when executing HTTP operations.

exception web_poet.exceptions.http.HttpError(msg: Optional[str] = None, request: Optional[HttpRequest] = None)[source]

Bases: OSError

Indicates that an exception has occurred when handling an HTTP operation.

This is used as a base class for more specific errors and could be vague since it could denote problems either in the HTTP Request or Response.

For more specific errors, it would be better to use HttpRequestError and HttpResponseError.

Parameters

request (HttpRequest) – The HttpRequest instance that was used.

exception web_poet.exceptions.http.HttpRequestError(msg: Optional[str] = None, request: Optional[HttpRequest] = None)[source]

Bases: HttpError

Indicates that an exception has occurred when the HTTP Request was being handled.

Parameters

request (HttpRequest) – The HttpRequest instance that was used.

exception web_poet.exceptions.http.HttpResponseError(msg: Optional[str] = None, response: Optional[HttpResponse] = None, request: Optional[HttpRequest] = None)[source]

Bases: HttpError

Indicates that an exception has occurred when the HTTP Response was received.

For responses that are in the status code 100-3xx range, this exception shouldn’t be raised at all. However, for responses in the 400-5xx range, this will be raised by web-poet.

Note

Frameworks implementing web-poet should NOT raise this exception.

This exception is raised by web-poet itself, based on allow_status parameter found in the methods of HttpClient.

Parameters

Overrides

See the tutorial section on Overrides for more context about its use cases and some examples.

web_poet.handle_urls(include: Union[str, Iterable[str]], *, overrides: Callable, exclude: Optional[Union[str, Iterable[str]]] = None, priority: int = 500, **kwargs)

Class decorator that indicates that the decorated Page Object should be used instead of the overridden one for a particular set of URLs.

The Page Object that is overridden is declared using the overrides parameter.

The override mechanism only works on certain URLs that match the include and exclude parameters. See the documentation of the url-matcher package for more information about them.

Any extra parameters are stored as meta information that can be later used.

Parameters
  • include – The URLs that should be handled by the decorated Page Object.

  • overrides – The Page Object that should be replaced.

  • exclude – The URLs over which the override should not happen.

  • priority – The resolution priority in case of conflicting rules. A conflict happens when the include, overrides, and exclude parameters are the same. If so, the rule with the highest priority is chosen.
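The conflict rule for priority can be sketched with plain data classes. SketchRule and resolve are hypothetical and simplify the parameters to strings; the actual resolution is performed by web-poet together with the url-matcher package and may differ in detail:

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class SketchRule:
    include: str
    overrides: str
    exclude: str
    use: str
    priority: int = 500


def resolve(rules: List[SketchRule]) -> Dict[Tuple[str, str, str], SketchRule]:
    """Keep the highest-priority rule for each conflicting group."""
    winners: Dict[Tuple[str, str, str], SketchRule] = {}
    for rule in rules:
        # Rules conflict when include, overrides, and exclude are all equal.
        key = (rule.include, rule.overrides, rule.exclude)
        current = winners.get(key)
        if current is None or rule.priority > current.priority:
            winners[key] = rule
    return winners


rules = [
    SketchRule("example.com", "GenericProductPage", "", "DefaultSitePage"),
    SketchRule("example.com", "GenericProductPage", "", "TunedSitePage", priority=600),
    SketchRule("other.example", "GenericProductPage", "", "OtherSitePage"),
]
winners = resolve(rules)
chosen = winners[("example.com", "GenericProductPage", "")]
```

In this sketch the rule with priority 600 wins over the one using the default priority of 500.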

class web_poet.overrides.OverrideRule(for_patterns: ~url_matcher.matcher.Patterns, use: ~typing.Callable, instead_of: ~typing.Callable, meta: ~typing.Dict[str, ~typing.Any] = <factory>)[source]

A single override rule that specifies when a Page Object should be used in lieu of another.

This is instantiated when using the web_poet.handle_urls() decorator. It is also returned in a List[OverrideRule] when calling the get_overrides() method of web_poet.default_registry.

You can access any of its attributes:

  • for_patterns - contains the list of URL patterns associated with this rule. You can read the API documentation of the url-matcher package for more information about the patterns.

  • use - The Page Object that will be used.

  • instead_of - The Page Object that will be replaced.

  • meta - Any other information you may want to store. This doesn’t do anything for now but may be useful for future API updates.

Tip

The OverrideRule is also hashable. This makes it easy to store unique rules and identify any duplicates.
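Why hashability helps with duplicates can be shown with a frozen data class, which is hashable by value. SketchOverrideRule is a hypothetical simplification with string fields and no meta attribute; it does not reproduce how OverrideRule itself computes its hash:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SketchOverrideRule:
    for_patterns: str
    use: str
    instead_of: str


rules = [
    SketchOverrideRule("example.com", "ExamplePage", "GenericPage"),
    SketchOverrideRule("example.com", "ExamplePage", "GenericPage"),  # duplicate
    SketchOverrideRule("other.example", "OtherPage", "GenericPage"),
]

# Equal rules hash to the same value, so a set keeps one copy of each.
unique_rules = set(rules)
has_duplicates = len(unique_rules) != len(rules)
```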

class web_poet.overrides.PageObjectRegistry[source]

This contains the mapping rules that associate the Page Objects available for a given URL matching rule.

Note that it’s simply a dict subclass with added functionality for storing, retrieving, and searching for the OverrideRule instances. The value represents the OverrideRule instance from which the Page Object in the key is allowed to be used. Since it’s essentially a dict, you can use any dict operations with it.

web-poet already provides a default Registry named default_registry for convenience. It can be directly accessed via:

from web_poet import handle_urls, default_registry, ItemWebPage

@handle_urls("example.com", overrides=ProductPageObject)
class ExampleComProductPage(ItemWebPage):
    ...

override_rules = default_registry.get_overrides()

Notice that the @handle_urls that we’re using is a part of the default_registry. This provides a shorter and quicker way to interact with the built-in default PageObjectRegistry instead of writing the longer @default_registry.handle_urls.

Note

It is encouraged to simply use and import the already existing registry via from web_poet import default_registry instead of creating your own PageObjectRegistry instance. Using multiple registries would be unwieldy in most cases.

However, it might be applicable in certain scenarios like storing custom rules to separate them from the default_registry. This example from the tutorial section may provide some context.

classmethod from_override_rules(rules: List[OverrideRule]) PageObjectRegistryTV[source]

An alternative constructor for creating a PageObjectRegistry instance by accepting a list of OverrideRule.

This is useful in cases wherein you need to store some selected rules from multiple external packages.

get_overrides() List[OverrideRule][source]

Returns all of the OverrideRule instances that were declared using the @handle_urls annotation.

Warning

Remember to consider calling consume_modules() beforehand to recursively import all submodules which contain the @handle_urls annotations from external Page Objects.

search_overrides(**kwargs) List[OverrideRule][source]

Returns every OverrideRule in the registry whose attributes match the given keyword arguments.

Sample usage:

rules = registry.search_overrides(use=ProductPO, instead_of=GenericPO)
print(len(rules))  # 1

web_poet.overrides.consume_modules(*modules: str) None[source]

This recursively imports all packages/modules so that the @handle_urls annotations are properly discovered and imported.

Let’s take a look at an example:

# FILE: my_page_obj_project/load_rules.py

from web_poet import default_registry, consume_modules

consume_modules("other_external_pkg.po", "another_pkg.lib")
rules = default_registry.get_overrides()

In this case, the OverrideRule instances come from:

  • my_page_obj_project (since it’s the same module as the file above)

  • other_external_pkg.po

  • another_pkg.lib

  • any other modules that were imported in the same process inside the packages/modules above.

If the default_registry had other @handle_urls annotations outside of the packages/modules listed above, then the corresponding OverrideRule instances won’t be returned, unless they were recursively imported in some way similar to consume_modules().