Inputs
Page object classes must define, in their __init__ method, input parameters with type hints pointing to input classes.
Those input classes may be:
Other page object classes.
Item classes, when using a framework that can provide item classes.
Any other class that subclasses Injectable, or that is registered with or decorated with Injectable.register.
Based on the target URL and parameter type hints, frameworks automatically build the required objects at run time, and pass them to the __init__ method of the corresponding page object class.
For example, if a page object class has an __init__ parameter of type HttpResponse, and the target URL is https://example.com, your framework would send an HTTP request to https://example.com, download the response, build an HttpResponse object with the response data, and pass it to the __init__ method of the page object class being used.
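The mechanism described above can be illustrated with a toy, standard-library-only sketch. It is not how web-poet or any particular framework is implemented: FakeHttpResponse, fake_download, PROVIDERS and build are hypothetical names made up for this example, and no real HTTP request is sent.

```python
from dataclasses import dataclass
from typing import get_type_hints


@dataclass
class FakeHttpResponse:
    """Hypothetical stand-in for a response input; not web-poet's HttpResponse."""
    url: str
    body: bytes


def fake_download(url: str) -> FakeHttpResponse:
    # A real framework would send an HTTP request here; this returns canned data.
    return FakeHttpResponse(url=url, body=b"<h1>Title</h1>")


# Hypothetical registry: maps each leaf input class to a builder that takes the URL.
PROVIDERS = {FakeHttpResponse: fake_download}


class TitlePage:
    """Toy page object that asks for a response through its __init__ type hint."""

    def __init__(self, response: FakeHttpResponse):
        self.response = response


class ProductPage:
    """Toy page object whose input is another page object class."""

    def __init__(self, title_page: TitlePage):
        self.title_page = title_page


def build(cls, url: str):
    # Leaf inputs come from the registry; anything else is treated as a
    # page object and built recursively from its own __init__ type hints.
    if cls in PROVIDERS:
        return PROVIDERS[cls](url)
    hints = get_type_hints(cls.__init__)
    hints.pop("return", None)
    return cls(**{name: build(hint, url) for name, hint in hints.items()})


page = build(ProductPage, "https://example.com")
```

Here build plays the role of the framework: it reads the type hints, creates each required object for the target URL, and passes it to __init__.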
Built-in input classes
Warning
Not all frameworks support all web-poet built-in input classes.
The web_poet.page_inputs module defines multiple classes that you can define as inputs for a page object class, including:
HttpResponse, a complete HTTP response, including URL, headers, and body. This is the most common input for a page object class. See Working with HttpResponse.
HttpClient, to send additional requests.
RequestUrl, the target URL before following redirects. Useful, for example, to skip the target URL download, and instead use HttpClient to send a custom request based on parts of the target URL.
PageParams, to receive data from the crawling code.
Stats, to write key-value data pairs during parsing that you can inspect later, e.g. for debugging purposes.
BrowserResponse, which includes URL, status code and BrowserHtml of a rendered web page.
AnyResponse, which holds either BrowserResponse or HttpResponse as the .response instance, depending on which one is available or is more appropriate.
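As a rough illustration of the RequestUrl use case above, the following standard-library sketch derives a custom request URL from parts of a target URL. The /products/ID page layout, the /api/products/ID endpoint and the api_url_for helper are all assumptions invented for this example. In a page object, the resulting URL would be the one passed to HttpClient.

```python
from urllib.parse import urlsplit, urlunsplit


def api_url_for(target_url: str) -> str:
    """Build a hypothetical JSON API URL from a product page URL."""
    parts = urlsplit(target_url)
    # Assumption: the last path segment of the target URL is the product ID.
    product_id = parts.path.rstrip("/").rsplit("/", 1)[-1]
    # Keep scheme and host, swap in the (made-up) API path, drop query and fragment.
    return urlunsplit(
        (parts.scheme, parts.netloc, f"/api/products/{product_id}", "", "")
    )


custom_url = api_url_for("https://example.com/products/123")
```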
Working with HttpResponse
HttpResponse has many attributes and methods.
To get the entire response body, you can use body for the raw bytes, text for the str (decoded with the detected encoding), or json() to load a JSON response as a Python data structure:
>>> response.body
b'{"foo": "bar"}'
>>> response.text
'{"foo": "bar"}'
>>> response.json()
{'foo': 'bar'}
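The relationship between these three views can be approximated with the standard library alone. This is only a sketch: it hard-codes UTF-8, whereas HttpResponse detects the encoding, and it does not use any web-poet class.

```python
import json

# Raw bytes, as body would expose them.
body = b'{"foo": "bar"}'

# Decoded text; UTF-8 is assumed here instead of being detected.
text = body.decode("utf-8")

# Parsed Python data structure, comparable to what json() returns.
data = json.loads(text)
```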
There are also methods to select content from responses: jmespath() for JSON, and css() and xpath() for HTML and XML:
>>> response.jmespath("foo")
[<Selector query='foo' data='bar'>]
>>> response.css("h1::text")
[<Selector query='descendant-or-self::h1/text()' data='Title'>]
>>> response.xpath("//h1/text()")
[<Selector query='//h1/text()' data='Title'>]
Custom input classes
You may define your own input classes if you are using a framework that supports it.
However, note that custom input classes may make your page object classes less portable across frameworks.