Inputs

The __init__ method of a page object class must define its input parameters with type hints that point to input classes.

Those input classes may be built-in input classes or custom input classes.

Based on the target URL and parameter type hints, frameworks automatically build the required objects at run time, and pass them to the __init__ method of the corresponding page object class.

For example, if a page object class has an __init__ parameter of type HttpResponse, and the target URL is https://example.com, your framework would send an HTTP request to https://example.com, download the response, build an HttpResponse object with the response data, and pass it to the __init__ method of the page object class being used.
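The mechanism described above can be sketched in plain Python. This is only a minimal illustration under stated assumptions, not how any particular framework is implemented: StubHttpResponse, fake_download, BUILDERS, TitlePage and build_page_object are hypothetical names, and the download is faked so that the example runs offline.

```python
from dataclasses import dataclass
from typing import get_type_hints


@dataclass
class StubHttpResponse:
    """Hypothetical stand-in for an input class such as HttpResponse."""

    url: str
    body: bytes


def fake_download(url: str) -> StubHttpResponse:
    """Pretend to fetch the URL; a real framework would send an HTTP request."""
    return StubHttpResponse(url=url, body=b"<h1>Title</h1>")


# Maps each supported input class to a function that builds it from a URL.
BUILDERS = {StubHttpResponse: fake_download}


class TitlePage:
    """Page object class that declares its input through an __init__ type hint."""

    def __init__(self, response: StubHttpResponse):
        self.response = response


def build_page_object(page_cls, url: str):
    """Read the __init__ type hints, build each input, and instantiate the class."""
    hints = get_type_hints(page_cls.__init__)
    hints.pop("return", None)  # ignore a return annotation, if any
    inputs = {name: BUILDERS[input_cls](url) for name, input_cls in hints.items()}
    return page_cls(**inputs)


page = build_page_object(TitlePage, "https://example.com")
print(page.response.url)  # prints "https://example.com"
```

The page object class never builds its own inputs: it only names the types it needs, and the calling code decides how to produce them.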

Built-in input classes

Warning

Not all frameworks support all web-poet built-in input classes.

The web_poet.page_inputs module defines several classes that you can use as inputs for a page object class, including:

  • HttpResponse, a complete HTTP response, including URL, headers, and body. This is the most common input for a page object class. See Working with HttpResponse.

  • HttpClient, to send additional requests.

  • RequestUrl, the target URL before following redirects. Useful, for example, to skip the target URL download, and instead use HttpClient to send a custom request based on parts of the target URL.

  • PageParams, to receive data from the crawling code.

  • Stats, to write key-value data pairs during parsing that you can inspect later, e.g. for debugging purposes.

  • BrowserResponse, which includes URL, status code and BrowserHtml of a rendered web page.

  • AnyResponse, which holds either a BrowserResponse or an HttpResponse in its .response attribute, depending on which one is available or more appropriate.
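As an illustration of the RequestUrl use case, a page object could derive a custom request URL from parts of the target URL and then send that request with HttpClient. The sketch below covers only the URL derivation and uses the standard library alone. The /product/<id> path layout and the /api/products/<id>.json endpoint are made-up assumptions, and api_url_for is a hypothetical helper, not part of web-poet.

```python
from urllib.parse import urlsplit, urlunsplit


def api_url_for(target_url: str) -> str:
    """Build a hypothetical JSON API URL from the last path segment of a URL."""
    parts = urlsplit(target_url)
    # Assumption: the product ID is the last segment of the path.
    product_id = parts.path.rstrip("/").rsplit("/", 1)[-1]
    api_path = f"/api/products/{product_id}.json"
    # Keep scheme and host; drop the original query string and fragment.
    return urlunsplit((parts.scheme, parts.netloc, api_path, "", ""))


print(api_url_for("https://example.com/product/123?ref=home"))
# prints "https://example.com/api/products/123.json"
```

With this approach the target page itself is never downloaded; only the derived URL is requested.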

Working with HttpResponse

HttpResponse has many attributes and methods.

To get the entire response body, you can use body for the raw bytes, text for the str (decoded with the detected encoding), or json() to load a JSON response as a Python data structure:

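>>> from web_poet import HttpResponse
>>> response = HttpResponse("https://example.com", body=b'{"foo": "bar"}')
>>> response.body
b'{"foo": "bar"}'
>>> response.text
'{"foo": "bar"}'
>>> response.json()
{'foo': 'bar'}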

There are also methods to select content from responses: jmespath() for JSON and css() and xpath() for HTML and XML:

>>> # With a JSON body such as b'{"foo": "bar"}':
>>> response.jmespath("foo")
[<Selector query='foo' data='bar'>]
>>> # With an HTML body such as b'<h1>Title</h1>':
>>> response.css("h1::text")
[<Selector query='descendant-or-self::h1/text()' data='Title'>]
>>> response.xpath("//h1/text()")
[<Selector query='//h1/text()' data='Title'>]

Custom input classes

You may define your own input classes if you are using a framework that supports it.

However, note that custom input classes may make your page object classes less portable across frameworks.
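The portability concern can be illustrated with a toy example in plain Python. Everything here is hypothetical: Geolocation is a made-up custom input class, StorePage a made-up page object class, and make_framework a stand-in for real frameworks, each of which has its own way of registering how inputs are built.

```python
from dataclasses import dataclass
from typing import get_type_hints


@dataclass
class Geolocation:
    """Hypothetical custom input class."""

    country: str


class StorePage:
    """Page object class that depends on the custom input class."""

    def __init__(self, geo: Geolocation):
        self.geo = geo


def make_framework(builders):
    """Return a toy framework that can build only the input classes it knows."""

    def build(page_cls, url):
        hints = get_type_hints(page_cls.__init__)
        hints.pop("return", None)  # ignore a return annotation, if any
        inputs = {}
        for name, input_cls in hints.items():
            if input_cls not in builders:
                raise TypeError(f"unsupported input class: {input_cls.__name__}")
            inputs[name] = builders[input_cls](url)
        return page_cls(**inputs)

    return build


# Framework A was taught how to build Geolocation; framework B was not.
framework_a = make_framework({Geolocation: lambda url: Geolocation(country="US")})
framework_b = make_framework({})

page = framework_a(StorePage, "https://example.com")
print(page.geo.country)  # prints "US"

try:
    framework_b(StorePage, "https://example.com")
except TypeError as error:
    print(error)  # prints "unsupported input class: Geolocation"
```

The same page object class works in one toy framework and fails in the other, which is the trade-off to weigh before introducing a custom input class.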