API reference¶
Page Inputs¶
- class web_poet.page_inputs.browser.BrowserHtml[source]¶
Bases:
SelectableMixin
,str
HTML returned by a web browser, i.e. snapshot of the DOM tree in HTML format.
- css(query) SelectorList ¶
A shortcut to
.selector.css()
.
- jmespath(query: str, **kwargs) SelectorList ¶
A shortcut to
.selector.jmespath()
.
- property selector: Selector¶
Cached instance of
parsel.selector.Selector
.
- xpath(query, **kwargs) SelectorList ¶
A shortcut to
.selector.xpath()
.
- class web_poet.page_inputs.browser.BrowserResponse(url: Union[str, _Url], html, *, status: Optional[int] = None)[source]¶
Bases:
SelectableMixin
,UrlShortcutsMixin
Browser response: url, HTML and status code.
url should be the browser’s window.location, not the URL of the request, if possible.
html contains the HTML returned by the browser, i.e. a snapshot of the DOM tree in HTML format.
The following is optional, since its availability depends on the source of the BrowserResponse:
status should represent the int status code of the HTTP response.
- url: ResponseUrl¶
- html: BrowserHtml¶
- css(query) SelectorList ¶
A shortcut to
.selector.css()
.
- jmespath(query: str, **kwargs) SelectorList ¶
A shortcut to
.selector.jmespath()
.
- property selector: Selector¶
Cached instance of
parsel.selector.Selector
.
- urljoin(url: Union[str, RequestUrl, ResponseUrl]) RequestUrl ¶
Return url as an absolute URL.
If url is relative, it is made absolute relative to the base URL of self.
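As a rough illustration of the relative-to-absolute resolution described above, here is a sketch that uses only the standard library's urllib.parse.urljoin. It assumes comparable resolution rules; the page_url value is hypothetical, and unlike the method above this sketch ignores any base URL declared inside the HTML.

```python
from urllib.parse import urljoin

# Hypothetical URL of the response, used as the base for resolution.
page_url = "https://example.com/catalog/page.html"

# A root-relative link replaces the whole path.
absolute = urljoin(page_url, "/about")

# A plain relative link is resolved against the directory of the page.
sibling = urljoin(page_url, "item.html")

# An already absolute URL is returned unchanged.
unchanged = urljoin(page_url, "https://other.example/x")
```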
- xpath(query, **kwargs) SelectorList ¶
A shortcut to
.selector.xpath()
.
- class web_poet.page_inputs.client.HttpClient(request_downloader: Optional[Callable] = None, *, save_responses: bool = False, return_only_saved_responses: bool = False, responses: Optional[Iterable[_SavedResponseData]] = None)[source]¶
Async HTTP client to be used in Page Objects.
See Additional requests for the usage information.
HttpClient doesn’t make HTTP requests by itself. It uses either the request function assigned to the web_poet.request_downloader_var contextvar, or a function passed via the request_downloader argument of the __init__() method.
Either way, this function should be an async def function which receives an HttpRequest instance, and either returns an HttpResponse instance or raises a subclass of HttpError. You can read more in the Providing the Downloader documentation.
- async request(url: Union[str, _Url], *, method: str = 'GET', headers: Optional[Union[Dict[str, str], HttpRequestHeaders]] = None, body: Optional[Union[bytes, HttpRequestBody]] = None, allow_status: Optional[Union[str, int, List[Union[str, int]]]] = None) HttpResponse [source]¶
This is a shortcut for creating an HttpRequest instance and executing that request.
HttpRequestError is raised for connection errors, connection and read timeouts, etc.
An HttpResponse instance is returned for successful responses in the 100-3xx status code range.
Otherwise, an exception of type HttpResponseError is raised.
Raising HttpResponseError can be suppressed for certain status codes using the allow_status param: it is a list of status code values for which HttpResponse should be returned instead of raising HttpResponseError.
There is a special “*” allow_status value which allows any status code.
There is no need to include 100-3xx status codes in allow_status, because HttpResponseError is not raised for them.
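The allow_status rules above can be summarized as a small decision function. This is a stdlib-only sketch, not web-poet's implementation: the name should_raise is hypothetical, and treating an unknown (None) status as non-error is an assumption.

```python
from typing import List, Optional, Union

AllowStatus = Optional[Union[str, int, List[Union[str, int]]]]


def should_raise(status: Optional[int], allow_status: AllowStatus = None) -> bool:
    """Return True if a response with this status would be treated as an error."""
    # Normalize allow_status to a list of strings, so 404 and "404" are equal.
    if allow_status is None:
        allowed = []
    elif isinstance(allow_status, (str, int)):
        allowed = [str(allow_status)]
    else:
        allowed = [str(s) for s in allow_status]

    if "*" in allowed:  # the special value allows any status code
        return False
    if status is None or status < 400:  # 100-3xx never raises (None: assumption)
        return False
    return str(status) not in allowed
```

With this sketch, should_raise(404) is True, while should_raise(404, [404]) and should_raise(500, "*") are both False.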
- async get(url: Union[str, _Url], *, headers: Optional[Union[Dict[str, str], HttpRequestHeaders]] = None, allow_status: Optional[Union[str, int, List[Union[str, int]]]] = None) HttpResponse [source]¶
Similar to request() but performing a GET request.
- async post(url: Union[str, _Url], *, headers: Optional[Union[Dict[str, str], HttpRequestHeaders]] = None, body: Optional[Union[bytes, HttpRequestBody]] = None, allow_status: Optional[Union[str, int, List[Union[str, int]]]] = None) HttpResponse [source]¶
Similar to request() but performing a POST request.
- async execute(request: HttpRequest, *, allow_status: Optional[Union[str, int, List[Union[str, int]]]] = None) HttpResponse [source]¶
Execute the specified HttpRequest instance using the request implementation configured in the HttpClient instance.
HttpRequestError is raised for connection errors, connection and read timeouts, etc.
An HttpResponse instance is returned for successful responses in the 100-3xx status code range.
Otherwise, an exception of type HttpResponseError is raised.
Raising HttpResponseError can be suppressed for certain status codes using the allow_status param: it is a list of status code values for which HttpResponse should be returned instead of raising HttpResponseError.
There is a special “*” allow_status value which allows any status code.
There is no need to include 100-3xx status codes in allow_status, because HttpResponseError is not raised for them.
- async batch_execute(*requests: HttpRequest, return_exceptions: bool = False, allow_status: Optional[Union[str, int, List[Union[str, int]]]] = None) List[Union[HttpResponse, HttpResponseError]] [source]¶
Similar to execute() but accepts a collection of HttpRequest instances that would be batch executed.
The order of the HttpResponses would correspond to the order of HttpRequest passed.
If any of the HttpRequest raises an exception upon execution, the exception is raised.
To prevent this, the actual exception can be returned alongside any successful HttpResponse. This enables salvaging any usable responses despite any possible failures. This can be done by setting the return_exceptions parameter to True.
Like execute(), HttpResponseError will be raised for responses with status codes in the 400-5xx range. The allow_status parameter could be used the same way here to prevent these exceptions from being raised.
You can omit allow_status="*" if you’re passing return_exceptions=True. However, it would be returning HttpResponseError instead of HttpResponse.
Lastly, an HttpRequestError may be raised in cases like connection errors, connection and read timeouts, etc.
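The ordering and return_exceptions behavior described above resembles that of asyncio.gather. The following stdlib-only sketch illustrates those semantics with hypothetical stand-ins (fake_execute, fake_batch_execute, FakeResponseError); it makes no claim about how web-poet implements batch execution.

```python
import asyncio


class FakeResponseError(Exception):
    """Hypothetical stand-in for an error raised while executing a request."""


async def fake_execute(url: str) -> str:
    # Pretend that URLs containing "bad" fail and everything else succeeds.
    await asyncio.sleep(0)
    if "bad" in url:
        raise FakeResponseError(url)
    return f"response for {url}"


async def fake_batch_execute(*urls: str, return_exceptions: bool = False):
    # asyncio.gather keeps results in the same order as its arguments.
    return await asyncio.gather(
        *(fake_execute(u) for u in urls), return_exceptions=return_exceptions
    )


# With return_exceptions=True the failure is returned in place, so the
# successful results around it can still be used.
results = asyncio.run(fake_batch_execute("a", "bad", "c", return_exceptions=True))
```

Without return_exceptions=True, the same call would raise FakeResponseError instead of returning a list.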
- class web_poet.page_inputs.http.RequestUrl(*args, **kwargs)¶
Bases:
RequestUrl
- class web_poet.page_inputs.http.ResponseUrl(*args, **kwargs)¶
Bases:
ResponseUrl
- class web_poet.page_inputs.http.HttpRequestBody[source]¶
Bases:
bytes
A container for holding the raw HTTP request body in bytes format.
- class web_poet.page_inputs.http.HttpResponseBody[source]¶
Bases:
bytes
A container for holding the raw HTTP response body in bytes format.
- class web_poet.page_inputs.http.HttpRequestHeaders[source]¶
Bases:
_HttpHeaders
A container for holding the HTTP request headers.
It can be instantiated from an Iterable of Tuples:
>>> pairs = [("Content-Encoding", "gzip"), ("content-length", "648")]
>>> HttpRequestHeaders(pairs)
<HttpRequestHeaders('Content-Encoding': 'gzip', 'content-length': '648')>
It also accepts a mapping of key-value pairs:
>>> pairs = {"Content-Encoding": "gzip", "content-length": "648"}
>>> headers = HttpRequestHeaders(pairs)
>>> headers
<HttpRequestHeaders('Content-Encoding': 'gzip', 'content-length': '648')>
Note that this also supports case-insensitive header-key lookups:
>>> headers.get("content-encoding")
'gzip'
>>> headers.get("Content-Length")
'648'
These are just a few of the functionalities it inherits from multidict.CIMultiDict. For more info on its other features, read the API spec of multidict.CIMultiDict.
- copy()¶
Return a copy of itself.
- classmethod from_bytes_dict(arg: Dict[AnyStr, Union[AnyStr, List, Tuple[AnyStr, ...]]], encoding: str = 'utf-8') T_headers ¶
An alternative constructor for instantiation where the header-value pairs could be in raw bytes form.
This supports multiple header values in the form of List[bytes] and Tuple[bytes] alongside a plain bytes value. A value in str also works and wouldn’t break the decoding process at all.
By default, it converts the bytes value using “utf-8”. However, this can easily be overridden using the encoding parameter.
>>> raw_values = {
...     b"Content-Encoding": [b"gzip", b"br"],
...     b"Content-Type": [b"text/html"],
...     b"content-length": b"648",
... }
>>> headers = _HttpHeaders.from_bytes_dict(raw_values)
>>> headers
<_HttpHeaders('Content-Encoding': 'gzip', 'Content-Encoding': 'br', 'Content-Type': 'text/html', 'content-length': '648')>
- classmethod from_name_value_pairs(arg: List[Dict]) T_headers ¶
An alternative constructor for instantiation using a List[Dict] where the ‘key’ is the header name while the ‘value’ is the header value.
>>> pairs = [
...     {"name": "Content-Encoding", "value": "gzip"},
...     {"name": "content-length", "value": "648"}
... ]
>>> headers = _HttpHeaders.from_name_value_pairs(pairs)
>>> headers
<_HttpHeaders('Content-Encoding': 'gzip', 'content-length': '648')>
- class web_poet.page_inputs.http.HttpResponseHeaders[source]¶
Bases:
_HttpHeaders
A container for holding the HTTP response headers.
It can be instantiated from an Iterable of Tuples:
>>> pairs = [("Content-Encoding", "gzip"), ("content-length", "648")]
>>> HttpResponseHeaders(pairs)
<HttpResponseHeaders('Content-Encoding': 'gzip', 'content-length': '648')>
It also accepts a mapping of key-value pairs:
>>> pairs = {"Content-Encoding": "gzip", "content-length": "648"}
>>> headers = HttpResponseHeaders(pairs)
>>> headers
<HttpResponseHeaders('Content-Encoding': 'gzip', 'content-length': '648')>
Note that this also supports case-insensitive header-key lookups:
>>> headers.get("content-encoding")
'gzip'
>>> headers.get("Content-Length")
'648'
These are just a few of the functionalities it inherits from multidict.CIMultiDict. For more info on its other features, read the API spec of multidict.CIMultiDict.
- declared_encoding() Optional[str] [source]¶
Return encoding detected from the Content-Type header, or None if encoding is not found
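To make the idea of an encoding declared in Content-Type concrete, here is a stdlib-only approximation. The helper name charset_from_content_type is hypothetical, and the actual method may normalize or validate encoding names differently.

```python
from email.message import Message
from typing import Optional


def charset_from_content_type(value: str) -> Optional[str]:
    """Extract the charset parameter from a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = value
    # get_content_charset() lowercases the value and returns None when
    # the header carries no charset parameter.
    return msg.get_content_charset()


declared = charset_from_content_type("text/html; charset=UTF-8")
missing = charset_from_content_type("text/html")
```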
- copy()¶
Return a copy of itself.
- classmethod from_bytes_dict(arg: Dict[AnyStr, Union[AnyStr, List, Tuple[AnyStr, ...]]], encoding: str = 'utf-8') T_headers ¶
An alternative constructor for instantiation where the header-value pairs could be in raw bytes form.
This supports multiple header values in the form of List[bytes] and Tuple[bytes] alongside a plain bytes value. A value in str also works and wouldn’t break the decoding process at all.
By default, it converts the bytes value using “utf-8”. However, this can easily be overridden using the encoding parameter.
>>> raw_values = {
...     b"Content-Encoding": [b"gzip", b"br"],
...     b"Content-Type": [b"text/html"],
...     b"content-length": b"648",
... }
>>> headers = _HttpHeaders.from_bytes_dict(raw_values)
>>> headers
<_HttpHeaders('Content-Encoding': 'gzip', 'Content-Encoding': 'br', 'Content-Type': 'text/html', 'content-length': '648')>
- classmethod from_name_value_pairs(arg: List[Dict]) T_headers ¶
An alternative constructor for instantiation using a List[Dict] where the ‘key’ is the header name while the ‘value’ is the header value.
>>> pairs = [
...     {"name": "Content-Encoding", "value": "gzip"},
...     {"name": "content-length", "value": "648"}
... ]
>>> headers = _HttpHeaders.from_name_value_pairs(pairs)
>>> headers
<_HttpHeaders('Content-Encoding': 'gzip', 'content-length': '648')>
- class web_poet.page_inputs.http.HttpRequest(url: Union[str, _Url], *, method: str = 'GET', headers=_Nothing.NOTHING, body=_Nothing.NOTHING)[source]¶
Bases:
object
Represents a generic HTTP request used by other functionalities in web-poet like HttpClient.
Tip
To build a request to submit an HTML form, use the form2request library, which provides integration with web-poet.
- url: RequestUrl¶
- headers: HttpRequestHeaders¶
- body: HttpRequestBody¶
- class web_poet.page_inputs.http.HttpResponse(url: Union[str, _Url], body, *, status: Optional[int] = None, headers=_Nothing.NOTHING, encoding: Optional[str] = None)[source]¶
Bases:
SelectableMixin
,UrlShortcutsMixin
A container for the contents of a response, downloaded directly using an HTTP client.
url should be the URL of the response (after all redirects), not the URL of the request, if possible.
body contains the raw HTTP response body.
The following are optional, since their availability depends on the source of the HttpResponse. For example, the responses could simply come from a local HTML file, which doesn’t contain headers and status.
status should represent the int status code of the HTTP response.
headers should contain the HTTP response headers.
encoding is the encoding of the response. If None (default), the encoding is auto-detected from headers and body content.
- url: ResponseUrl¶
- body: HttpResponseBody¶
- headers: HttpResponseHeaders¶
- property text: str¶
Content of the HTTP body, converted to unicode using the detected encoding of the response, according to the web browser rules (respecting Content-Type header, etc.)
- css(query) SelectorList ¶
A shortcut to
.selector.css()
.
- jmespath(query: str, **kwargs) SelectorList ¶
A shortcut to
.selector.jmespath()
.
- property selector: Selector¶
Cached instance of
parsel.selector.Selector
.
- urljoin(url: Union[str, RequestUrl, ResponseUrl]) RequestUrl ¶
Return url as an absolute URL.
If url is relative, it is made absolute relative to the base URL of self.
- xpath(query, **kwargs) SelectorList ¶
A shortcut to
.selector.xpath()
.
- web_poet.page_inputs.http.request_fingerprint(req: HttpRequest) str [source]¶
Return the fingerprint of the request.
- class web_poet.page_inputs.response.AnyResponse(response: Union[BrowserResponse, HttpResponse])[source]¶
Bases:
SelectableMixin
,UrlShortcutsMixin
A container that holds either BrowserResponse or HttpResponse.
- response: Union[BrowserResponse, HttpResponse]¶
- property url: ResponseUrl¶
URL of the response.
- css(query) SelectorList ¶
A shortcut to
.selector.css()
.
- jmespath(query: str, **kwargs) SelectorList ¶
A shortcut to
.selector.jmespath()
.
- property selector: Selector¶
Cached instance of
parsel.selector.Selector
.
- urljoin(url: Union[str, RequestUrl, ResponseUrl]) RequestUrl ¶
Return url as an absolute URL.
If url is relative, it is made absolute relative to the base URL of self.
- xpath(query, **kwargs) SelectorList ¶
A shortcut to
.selector.xpath()
.
- class web_poet.page_inputs.page_params.PageParams[source]¶
Bases:
dict
Container class that could contain any arbitrary data to be passed into a Page Object.
Note that this is simply a subclass of Python’s
dict
.
- class web_poet.page_inputs.stats.StatCollector[source]¶
Bases:
ABC
Base class for web-poet to implement the storing of data written through
Stats
.
- class web_poet.page_inputs.stats.DummyStatCollector[source]¶
Bases:
StatCollector
StatCollector
implementation that does not persist stats. It is used when running automatic tests, where stat storage is not necessary.
- class web_poet.page_inputs.stats.Stats(stat_collector=None)[source]¶
Bases:
object
Page input class to write key-value data pairs during parsing that you can inspect later. See Stats.
Stats can be set to a fixed value or, if numeric, incremented.
Stats are write-only.
Storage and read access of stats depends on the web-poet framework that you are using. Check the documentation of your web-poet framework to find out if it supports stats, and if so, how to read stored stats.
Pages¶
- class web_poet.pages.Injectable[source]¶
Bases:
ABC
,FieldsMixin
Base Page Object class, which all Page Objects should inherit from (probably through Injectable subclasses).
Frameworks which are using web-poet Page Objects should use the is_injectable() function to detect if an object is an Injectable, and if an object is injectable, allow building it automatically through dependency injection, using the https://github.com/scrapinghub/andi library.
Instead of inheriting you can also use Injectable.register(MyWebPage). Injectable.register can also be used as a decorator.
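The register mechanism mentioned above is the standard abc.ABC virtual-subclass feature. The sketch below demonstrates it with a hypothetical stand-in class (InjectableSketch) rather than the real Injectable, and is_injectable_sketch is only an assumption about what such a check could look like.

```python
from abc import ABC
from typing import Any


class InjectableSketch(ABC):
    """Hypothetical stand-in for the Injectable base class."""


class InheritingPage(InjectableSketch):
    """Becomes injectable through regular inheritance."""


@InjectableSketch.register  # register() returns the class, so it works as a decorator
class RegisteredPage:
    """Becomes injectable without inheriting."""


class PlainClass:
    """Neither inherits nor is registered."""


def is_injectable_sketch(cls: Any) -> bool:
    # Only classes can qualify; instances and other objects do not.
    return isinstance(cls, type) and issubclass(cls, InjectableSketch)
```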
- web_poet.pages.is_injectable(cls: Any) bool [source]¶
Return True if cls is a class which inherits from Injectable.
- class web_poet.pages.ItemPage[source]¶
Bases:
Extractor
[ItemT
],Injectable
Base class for page objects.
- class web_poet.pages.WebPage(response: HttpResponse)[source]¶
Bases:
ItemPage
[ItemT
],ResponseShortcutsMixin
Base Page Object which requires HttpResponse and provides XPath / CSS shortcuts.
- response: HttpResponse¶
- css(query) SelectorList ¶
A shortcut to
.selector.css()
.
- jmespath(query: str, **kwargs) SelectorList ¶
A shortcut to
.selector.jmespath()
.
- property selector: Selector¶
Cached instance of
parsel.selector.Selector
.
- async to_item() ItemT ¶
Extract an item from a web page
- xpath(query, **kwargs) SelectorList ¶
A shortcut to
.selector.xpath()
.
- class web_poet.pages.Returns[source]¶
Bases:
Generic
[ItemT
]
Inherit from this generic mixin to change the item class used by ItemPage.
- class web_poet.pages.Extractor[source]¶
Bases:
Returns
[ItemT
],FieldsMixin
Base class for field support.
Mixins¶
- class web_poet.mixins.ResponseShortcutsMixin(*args, **kwargs)[source]¶
Common shortcut methods for working with HTML responses. This mixin could be used with Page Object base classes.
It requires “response” attribute to be present.
- urljoin(url: str) str [source]¶
Convert url to an absolute URL, taking into account the url and baseurl of the response.
- css(query) SelectorList ¶
A shortcut to
.selector.css()
.
- jmespath(query: str, **kwargs) SelectorList ¶
A shortcut to
.selector.jmespath()
.
- property selector: Selector¶
Cached instance of
parsel.selector.Selector
.
- xpath(query, **kwargs) SelectorList ¶
A shortcut to
.selector.xpath()
.
Requests¶
- web_poet.requests.request_downloader_var: ContextVar = <ContextVar name='request_downloader'>¶
Frameworks that want to support additional requests in web-poet should set the appropriate implementation of request_downloader_var for requesting data.
Exceptions¶
Core Exceptions¶
These exceptions are tied to how web-poet operates.
- exception web_poet.exceptions.core.RequestDownloaderVarError[source]¶
The web_poet.request_downloader_var had its contents accessed, but no value was set at the time the requests were executed.
See the documentation section about setting up the contextvars to learn more about this.
- exception web_poet.exceptions.core.PageObjectAction[source]¶
Base class for exceptions that can be raised from a page object to indicate something to be done about that page object.
- exception web_poet.exceptions.core.Retry[source]¶
The page object found that the input data is partial or empty, and a request retry may provide better input.
- exception web_poet.exceptions.core.UseFallback[source]¶
The page object cannot extract data from the input, but the input seems valid, so an alternative data extraction implementation for the same item type may succeed.
- exception web_poet.exceptions.core.NoSavedHttpResponse(msg: Optional[str] = None, request: Optional[HttpRequest] = None)[source]¶
Indicates that there is no saved response for this request.
Can only be raised when an HttpClient instance is used to get saved responses.
- Parameters
request (HttpRequest) – The HttpRequest instance that was used.
HTTP Exceptions¶
These are exceptions pertaining to common issues faced when executing HTTP operations.
- exception web_poet.exceptions.http.HttpError(msg: Optional[str] = None, request: Optional[HttpRequest] = None)[source]¶
Bases:
OSError
Indicates that an exception has occurred when handling an HTTP operation.
This is used as a base class for more specific errors and could be vague since it could denote problems either in the HTTP Request or Response.
For more specific errors, it would be better to use HttpRequestError and HttpResponseError.
- Parameters
request (HttpRequest) – Request that triggered the exception.
- request: Optional[HttpRequest]¶
Request that triggered the exception.
- exception web_poet.exceptions.http.HttpRequestError(msg: Optional[str] = None, request: Optional[HttpRequest] = None)[source]¶
Bases:
HttpError
Indicates that an exception has occurred when the HTTP Request was being handled.
- Parameters
request (HttpRequest) – The
HttpRequest
instance that was used.
- exception web_poet.exceptions.http.HttpResponseError(msg: Optional[str] = None, response: Optional[HttpResponse] = None, request: Optional[HttpRequest] = None)[source]¶
Bases:
HttpError
Indicates that an exception has occurred when the HTTP Response was received.
For responses in the 100-3xx status code range, this exception shouldn’t be raised at all. However, for responses in the 400-5xx range, it will be raised by web-poet.
Note
Frameworks implementing web-poet should NOT raise this exception.
This exception is raised by web-poet itself, based on the allow_status parameter found in the methods of HttpClient.
- Parameters
request (HttpRequest) – Request that got the response that triggered the exception.
response (HttpResponse) – Response that triggered the exception.
- response: Optional[HttpResponse]¶
Response that triggered the exception.
Apply Rules¶
See Rules for more context about its use cases and some examples.
- web_poet.handle_urls(include: Union[str, Iterable[str]], *, overrides: Optional[Type[ItemPage]] = None, instead_of: Optional[Type[ItemPage]] = None, to_return: Optional[Type] = None, exclude: Optional[Union[str, Iterable[str]]] = None, priority: int = 500, **kwargs)¶
Class decorator that indicates that the decorated Page Object should work for the given URL patterns.
The URL patterns are matched using the include and exclude parameters while priority breaks any ties. See the documentation of the url-matcher package for more information about them.
This decorator is able to derive the item class returned by the Page Object. This is important since it marks what type of item the Page Object is capable of returning for the given URL patterns. For certain advanced cases, you can pass a to_return parameter which replaces any derived values (though this isn’t generally recommended).
Passing another Page Object into the instead_of parameter indicates that the decorated Page Object will be used instead of that for the given set of URL patterns. See Rule precedence.
Any extra parameters are stored as meta information that can be later used.
- Parameters
include – The URLs that should be handled by the decorated Page Object.
instead_of – The Page Object that should be replaced.
to_return – The item class holding the data returned by the Page Object. This could be omitted as it could be derived from the Returns[ItemClass] or ItemPage[ItemClass] declaration of the Page Object. See Items section.
exclude – The URLs for which the Page Object should not be applied.
priority – The resolution priority in case of conflicting rules. A conflict happens when the include, override, and exclude parameters are the same. If so, the highest priority will be chosen.
- class web_poet.rules.ApplyRule(for_patterns: Union[str, Patterns], *, use: Type[ItemPage], instead_of: Optional[Type[ItemPage]] = None, to_return: Optional[Type[Any]] = None, meta: Dict[str, Any] = _Nothing.NOTHING)[source]¶
A rule that primarily applies Page Object and Item overrides for a given URL pattern.
This is instantiated when using the web_poet.handle_urls() decorator. It’s also returned as a List[ApplyRule] when calling the web_poet.default_registry’s get_rules() method.
You can access any of its attributes:
for_patterns - contains the list of URL patterns associated with this rule. You can read the API documentation of the url-matcher package for more information about the patterns.
use - The Page Object that will be used in cases where the URL pattern represented by the for_patterns attribute is matched.
instead_of - (optional) The Page Object that will be replaced with the Page Object specified via the use parameter.
to_return - (optional) The item class that the Page Object specified in use is capable of returning.
meta - (optional) Any other information you may want to store. This doesn’t do anything for now but may be useful for future API updates.
The main functionality of this class lies in the instead_of and to_return parameters. Should both of these be omitted, then ApplyRule simply tags which URL patterns the given Page Object defined in use is expected to be used on.
When to_return is not None (e.g. to_return=MyItem), the Page Object in use is declared as capable of returning a certain item class (i.e. MyItem).
When instead_of is not None (e.g. instead_of=ReplacedPageObject), the rule adds an expectation that the ReplacedPageObject wouldn’t be used for the URLs matching for_patterns, since the Page Object in use will replace it.
If there are multiple rules which match a certain URL, the rule to apply is picked based on the priorities set in for_patterns.
More information regarding its usage in Rules.
Tip
The ApplyRule is also hashable. This makes it easy to store unique rules and identify any duplicates.
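To show why hashability helps with duplicates, the sketch below uses a frozen dataclass as a simplified, hypothetical stand-in for a rule (RuleSketch, with made-up page classes). It is not the real ApplyRule and omits attributes such as meta and to_return.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)  # frozen plus the default eq=True generates __hash__
class RuleSketch:
    """Hypothetical, simplified stand-in for a rule; not the real ApplyRule."""

    for_patterns: str
    use: type
    instead_of: Optional[type] = None


class ProductPage:
    """Hypothetical page object class."""


class GenericPage:
    """Hypothetical page object class being replaced."""


rules = [
    RuleSketch("example.com", ProductPage, instead_of=GenericPage),
    RuleSketch("example.com", ProductPage, instead_of=GenericPage),  # duplicate
    RuleSketch("example.org", ProductPage),
]

# Equal rules hash to the same value, so a set keeps only one copy of each.
unique_rules = set(rules)
```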
- class web_poet.rules.RulesRegistry(*, rules: Optional[Iterable[ApplyRule]] = None)[source]¶
RulesRegistry provides features for storing, retrieving, and searching for the ApplyRule instances.
web-poet provides a default Registry named default_registry for convenience. It can be accessed this way:
    from web_poet import handle_urls, default_registry, WebPage
    from my_items import Product

    @handle_urls("example.com")
    class ExampleComProductPage(WebPage[Product]):
        ...

    rules = default_registry.get_rules()
The @handle_urls decorator exposed as web_poet.handle_urls is a shortcut for default_registry.handle_urls.
Note
It is encouraged to use the web_poet.default_registry instead of creating your own RulesRegistry instance. Using multiple registries would be unwieldy in most cases.
However, it might be applicable in certain scenarios like storing custom rules to separate them from the default_registry.
- add_rule(rule: ApplyRule) None [source]¶
Registers a web_poet.rules.ApplyRule instance.
- classmethod from_override_rules(rules: List[ApplyRule]) RulesRegistryTV [source]¶
Deprecated. Use
RulesRegistry(rules=...)
instead.
- get_rules() List[ApplyRule] [source]¶
Return all the ApplyRule instances that were declared using the @handle_urls decorator.
Note
Remember to consider calling consume_modules() beforehand to recursively import all submodules which contain the @handle_urls decorators from external Page Objects.
- get_overrides() List[ApplyRule] [source]¶
Deprecated, use
get_rules()
instead.
- search(**kwargs) List[ApplyRule] [source]¶
Return any ApplyRule from the registry that matches all the provided attributes.
Sample usage:
    rules = registry.search(use=ProductPO, instead_of=GenericPO)
    print(len(rules))  # 1
    print(rules[0].use)  # ProductPO
    print(rules[0].instead_of)  # GenericPO
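Matching on all the provided attributes can be pictured as a simple filter over the stored rules. The sketch below is a stdlib-only illustration with hypothetical names (StoredRule, search_sketch, and the page classes); it is an assumption about the matching logic, not the registry's actual code.

```python
from dataclasses import dataclass
from typing import Any, List, Optional


@dataclass(frozen=True)
class StoredRule:
    """Hypothetical, simplified rule holding only two attributes."""

    use: type
    instead_of: Optional[type] = None


class ProductPO:
    """Hypothetical page object."""


class GenericPO:
    """Hypothetical page object."""


def search_sketch(rules: List[StoredRule], **kwargs: Any) -> List[StoredRule]:
    # A rule matches only if every given attribute equals the requested value.
    return [
        rule
        for rule in rules
        if all(getattr(rule, name) == value for name, value in kwargs.items())
    ]


stored = [StoredRule(ProductPO, GenericPO), StoredRule(GenericPO)]
found = search_sketch(stored, use=ProductPO, instead_of=GenericPO)
```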
- web_poet.rules.consume_modules(*modules: str) None [source]¶
This recursively imports all packages/modules so that the @handle_urls decorators are properly discovered and imported.
Let’s take a look at an example:
    # FILE: my_page_obj_project/load_rules.py
    from web_poet import default_registry, consume_modules

    consume_modules("other_external_pkg.po", "another_pkg.lib")
    rules = default_registry.get_rules()
For this case, the ApplyRule instances are coming from:
my_page_obj_project (since it’s the same module as the file above)
other_external_pkg.po
another_pkg.lib
any other modules that were imported in the same process inside the packages/modules above.
If the default_registry had other @handle_urls decorators outside of the packages/modules listed above, then the corresponding ApplyRule won’t be returned, unless they were recursively imported in some way similar to consume_modules().
- class web_poet.rules.OverrideRule(*args, **kwargs)¶
- class web_poet.rules.PageObjectRegistry(*args, **kwargs)¶
Fields¶
web_poet.fields
is a module with helpers for putting extraction logic
into separate Page Object methods / properties.
- class web_poet.fields.FieldInfo(name: str, meta: Optional[dict] = None, out: Optional[List[Callable]] = None)[source]¶
Information about a field
- web_poet.fields.field(method=None, *, cached: bool = False, meta: Optional[dict] = None, out: Optional[List[Callable]] = None)[source]¶
A Page Object method decorated with the @field decorator becomes a property, which is then used by ItemPage’s to_item() method to populate a corresponding item attribute.
By default, the value is computed on each property access. Use @field(cached=True) to cache the property value.
The meta parameter allows storing arbitrary information for the field, e.g. @field(meta={"expensive": True}). This information can be later retrieved for all fields using the get_fields_dict() function.
The out parameter is an optional list of field processors, which are functions applied to the value of the field before returning it.
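The difference between recomputing on each access and caching can be illustrated with the standard library's property and functools.cached_property. This is an analogy only: PageSketch and apply_processors are hypothetical names, and applying processors in list order is an assumption about how an out list behaves.

```python
from functools import cached_property
from typing import Any, Callable, List


class PageSketch:
    """Hypothetical page class counting how often each value is computed."""

    def __init__(self) -> None:
        self.calls = {"name": 0, "price": 0}

    @property
    def name(self) -> str:  # recomputed on every access
        self.calls["name"] += 1
        return "Widget"

    @cached_property
    def price(self) -> float:  # computed once, then stored on the instance
        self.calls["price"] += 1
        return 9.99


def apply_processors(value: Any, out: List[Callable[[Any], Any]]) -> Any:
    # Feed the value through each processor in turn (assumed order: as listed).
    for processor in out:
        value = processor(value)
    return value


page = PageSketch()
for _ in range(3):
    page.name
    page.price

cleaned = apply_processors("  Widget ", [str.strip, str.lower])
```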
- web_poet.fields.get_fields_dict(cls_or_instance) Dict[str, FieldInfo] [source]¶
Return a dictionary with information about the fields defined for the class: keys are field names, and values are
web_poet.fields.FieldInfo
instances.
- async web_poet.fields.item_from_fields(obj, item_cls: ~typing.Type[~web_poet.fields.T] = <class 'dict'>, *, skip_nonitem_fields: bool = False) T [source]¶
Return an item of item_cls type, with its attributes populated from the obj methods decorated with the field decorator.
If skip_nonitem_fields is True, @fields whose names are not among item_cls field names are not passed to item_cls.__init__.
When skip_nonitem_fields is False (default), all @fields are passed to item_cls.__init__, possibly causing exceptions if item_cls.__init__ doesn’t support them.
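The effect of skip_nonitem_fields can be sketched with a plain dataclass item. The names below (ProductSketch, build_item, field_values) are hypothetical, and the sketch starts from already computed field values instead of reading them from a page object.

```python
from dataclasses import dataclass, fields
from typing import Any, Dict, Type


@dataclass
class ProductSketch:
    """Hypothetical item class with two fields."""

    name: str
    price: float


def build_item(
    values: Dict[str, Any], item_cls: Type, *, skip_nonitem_fields: bool = False
) -> Any:
    if skip_nonitem_fields:
        # Keep only the values whose names are fields of the item class.
        known = {f.name for f in fields(item_cls)}
        values = {k: v for k, v in values.items() if k in known}
    return item_cls(**values)


# "debug_html_size" is not a ProductSketch field.
field_values = {"name": "Widget", "price": 9.99, "debug_html_size": 1024}
item = build_item(field_values, ProductSketch, skip_nonitem_fields=True)
```

Calling build_item(field_values, ProductSketch) without skip_nonitem_fields=True raises TypeError, because the dataclass __init__ does not accept the extra keyword.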
- web_poet.fields.item_from_fields_sync(obj, item_cls: ~typing.Type[~web_poet.fields.T] = <class 'dict'>, *, skip_nonitem_fields: bool = False) T [source]¶
Synchronous version of
item_from_fields()
.
typing.Annotated support¶
- class web_poet.annotated.AnnotatedInstance(result: Any, metadata: Tuple[Any, ...])[source]¶
Wrapper for instances of annotated dependencies.
It is used when both the dependency value and the dependency annotation are needed.
- Parameters
result (Any) – The wrapped dependency instance.
metadata (Tuple[Any, ...]) – The copy of the annotation.
Utils¶
- web_poet.utils.get_fq_class_name(cls: type) str [source]¶
Return the fully qualified name for a type.
>>> from web_poet import Injectable
>>> get_fq_class_name(Injectable)
'web_poet.pages.Injectable'
>>> from decimal import Decimal
>>> get_fq_class_name(Decimal)
'decimal.Decimal'
- web_poet.utils.memoizemethod_noargs(method: CallableT) CallableT [source]¶
Decorator to cache the result of a method (without arguments) using a weak reference to its object.
It is faster than cached_method(), and doesn’t add new attributes to the instance, but it doesn’t work if objects are unhashable.
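A weak-reference cache for argument-less methods can be sketched with weakref.WeakKeyDictionary. The decorator name memoize_noargs_sketch and the Counter class are hypothetical; this shows the general technique, not necessarily the exact implementation.

```python
import weakref
from functools import wraps


def memoize_noargs_sketch(method):
    # Keys are held weakly, so an entry disappears once its object is collected.
    cache = weakref.WeakKeyDictionary()

    @wraps(method)
    def wrapper(self):
        if self not in cache:  # requires self to be hashable
            cache[self] = method(self)
        return cache[self]

    return wrapper


class Counter:
    calls = 0

    @memoize_noargs_sketch
    def value(self) -> int:
        Counter.calls += 1
        return 42


c = Counter()
first, second = c.value(), c.value()
```

Because the object itself is the dictionary key, instances of classes that are not hashable cannot be cached this way, which matches the limitation noted above.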
- web_poet.utils.cached_method(method: CallableT) CallableT [source]¶
A decorator to cache method or coroutine method results, so that if it’s called multiple times for the same instance, computation is only done once.
The cache is unbounded, but it’s tied to the instance lifetime.
Note
cached_method() is needed because functools.lru_cache() doesn’t work well on methods: self is used as a cache key, so a reference to an instance is kept in the cache, and this prevents deallocation of instances.
This decorator adds a new private attribute to the instance named _cached_method_{decorated_method_name}; make sure the class doesn’t define an attribute of the same name.
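The deallocation problem with functools.lru_cache() on methods can be observed directly. The sketch below compares it with a simplified per-instance cache; cached_method_sketch is a hypothetical reduction that handles only synchronous methods without arguments, unlike the decorator documented above.

```python
import functools
import gc
import weakref


class WithLruCache:
    @functools.lru_cache(maxsize=None)
    def compute(self) -> int:
        return 42


def cached_method_sketch(method):
    # Store the result on the instance under a private attribute name.
    attr = f"_cached_method_{method.__name__}"

    @functools.wraps(method)
    def wrapper(self):
        if not hasattr(self, attr):
            setattr(self, attr, method(self))
        return getattr(self, attr)

    return wrapper


class WithInstanceCache:
    @cached_method_sketch
    def compute(self) -> int:
        return 42


def is_freed_after_use(cls) -> bool:
    obj = cls()
    obj.compute()
    ref = weakref.ref(obj)
    del obj
    gc.collect()
    # If the weak reference is dead, nothing kept the instance alive.
    return ref() is None


lru_freed = is_freed_after_use(WithLruCache)  # cache still references self
instance_cache_freed = is_freed_after_use(WithInstanceCache)
```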
- web_poet.utils.as_list(value: Optional[Any]) List[Any] [source]¶
Normalizes the value input as a list.
>>> as_list(None)
[]
>>> as_list("foo")
['foo']
>>> as_list(123)
[123]
>>> as_list(["foo", "bar", 123])
['foo', 'bar', 123]
>>> as_list(("foo", "bar", 123))
['foo', 'bar', 123]
>>> as_list(range(5))
[0, 1, 2, 3, 4]
>>> def gen():
...     yield 1
...     yield 2
>>> as_list(gen())
[1, 2]
Example framework¶
The web_poet.example
module is a simplified, incomplete example of a
web-poet framework, written as support material for the tutorial.
No part of the web_poet.example
module is intended for production use,
and it may change in a backward-incompatible way at any point in the future.
- web_poet.example.get_item(url: str, item_cls: Type, *, page_params: Optional[Dict[Any, Any]] = None) Any [source]¶
Returns an item built from the specified URL using a page object class from the default registry.
This function is an example of a minimal, incomplete web-poet framework implementation, intended for use in the web-poet tutorial.