Additional requests
Websites nowadays need many page interactions to display or load key information. In most cases, these interactions are performed via AJAX requests. Some examples are:
Clicking a button on a page to reveal other similar products.
Clicking the “Load More” button to retrieve more images of a given item.
Scrolling to the bottom of the page to load more items (i.e. infinite scrolling).
Hovering on a certain webpage element that reveals a tool-tip containing additional page info.
As such, performing additional requests inside Page Objects is inevitable to properly extract data from some websites.
Warning
Additional requests made inside a Page Object aren’t meant to represent the Crawling Logic at all. They are simply a low-level way to interact with today’s websites, which rely on many page interactions to display their content.
HttpRequest
Additional requests are defined using a simple data container that represents a generic HTTP request: HttpRequest. Here’s an example:
import json
import web_poet

# Example values; in a Page Object these would come from the page itself.
page_num = 1
product_id = 123

request = web_poet.HttpRequest(
    url="https://www.api.example.com/product-pagination/",
    method="POST",
    headers={
        "Content-Type": "application/json;charset=UTF-8"
    },
    body=json.dumps(
        {
            "Page": page_num,
            "ProductID": product_id,
        }
    ).encode("utf-8"),
)
print(request.url) # https://www.api.example.com/product-pagination/
print(type(request.url)) # <class 'web_poet.page_inputs.http.RequestUrl'>
print(request.method) # POST
print(type(request.headers)) # <class 'web_poet.page_inputs.HttpRequestHeaders'>
print(request.headers) # <HttpRequestHeaders('Content-Type': 'application/json;charset=UTF-8')>
print(request.headers.get("content-type")) # application/json;charset=UTF-8
print(request.headers.get("does-not-exist")) # None
print(type(request.body)) # <class 'web_poet.page_inputs.HttpRequestBody'>
print(request.body) # b'{"Page": 1, "ProductID": 123}'
There are a few things to take note of here:
method is simply a string.
url is represented by the RequestUrl class.
headers is represented by the HttpRequestHeaders class, which provides a dict-like interface. It supports case-insensitive header-key lookups as well as multi-key storage. See multidict.CIMultiDict for the full set of features, since HttpRequestHeaders simply inherits from it.
body is represented by the HttpRequestBody class, which is simply a subclass of the bytes class. The body param of HttpRequest must be given as bytes. In our code example, we’ve converted it from str to bytes using the encode() string method.
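As a small illustration of the last point, the sketch below uses only the standard library to show the str-to-bytes conversion that the body param expects, and how the payload can be decoded back. The payload values are made up for the example.

```python
import json

# Hypothetical payload values, used only for illustration.
payload = {"Page": 1, "ProductID": 123}

# json.dumps() returns a str; encode() turns it into the bytes that a
# request body is expected to hold.
body = json.dumps(payload).encode("utf-8")

# Decoding reverses the conversion and recovers the original payload.
decoded = json.loads(body.decode("utf-8"))
print(type(body).__name__, decoded == payload)
```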
Most of the time though, what you’ll be defining would be GET requests. Thus, it’s perfectly fine to define them as:
import web_poet
request = web_poet.HttpRequest("https://api.example.com/product-info?id=123")
print(request.url) # https://api.example.com/product-info?id=123
print(type(request.url)) # <class 'web_poet.page_inputs.http.RequestUrl'>
print(request.method) # GET
print(type(request.headers)) # <class 'web_poet.page_inputs.HttpRequestHeaders'>
print(request.headers) # <HttpRequestHeaders()>
print(request.headers.get("content-type")) # None
print(request.headers.get("does-not-exist")) # None
print(type(request.body)) # <class 'web_poet.page_inputs.HttpRequestBody'>
print(request.body) # b''
The key takeaways are:
The default value of method is GET.
headers still holds an HttpRequestHeaders instance, which is empty.
The same is true for body, which holds an empty HttpRequestBody.
Now that we know how HttpRequest is structured, note that defining one doesn’t execute the actual request at all. To execute it, we’ll need to feed it into the HttpClient, which is described in a later section (see the HttpClient tutorial section).
HttpResponse
HttpResponse is what comes after an HttpRequest has been executed. It’s typically returned by the methods of HttpClient (see the HttpClient tutorial section) and holds the information regarding the response.
HttpResponse can also be used as a Page Object dependency, e.g. WebPage uses it.
Note
The additional requests are expected to follow redirections except when the method is HEAD. This means that the HttpResponse that you’ll be receiving is already the end of the redirection trail.
Let’s check out an example to see its internals:
import web_poet
response = web_poet.HttpResponse(
    url="https://www.api.example.com/product-pagination/",
    body='{"data": "value 👍"}'.encode("utf-8"),
    status=200,
    headers={"Content-Type": "application/json;charset=UTF-8"},
)
print(response.url) # https://www.api.example.com/product-pagination/
print(type(response.url)) # <class 'web_poet.page_inputs.http.ResponseUrl'>
print(response.body) # b'{"data": "value \xf0\x9f\x91\x8d"}'
print(type(response.body)) # <class 'web_poet.page_inputs.HttpResponseBody'>
print(response.status) # 200
print(type(response.status)) # <class 'int'>
print(response.headers) # <HttpResponseHeaders('Content-Type': 'application/json;charset=UTF-8')>
print(type(response.headers)) # <class 'web_poet.page_inputs.HttpResponseHeaders'>
print(response.headers.get("content-type")) # application/json;charset=UTF-8
print(response.headers.get("does-not-exist")) # None
# These methods are also available:
print(response.body.declared_encoding()) # None
print(response.body.json()) # {'data': 'value 👍'}
print(response.headers.declared_encoding()) # utf-8
print(response.encoding) # utf-8
print(response.text) # {"data": "value 👍"}
print(response.json()) # {'data': 'value 👍'}
Despite what the example above showcases, you typically won’t be defining HttpResponse yourself, as that’s the responsibility of the implementing framework (see Supporting additional requests). Nonetheless, it’s important to understand its underlying structure in order to make better use of its methods.
Here are the key takeaways from the example above:
status is simply an int.
url is represented by the ResponseUrl class.
headers is represented by the HttpResponseHeaders class. Like HttpRequestHeaders, it inherits from multidict.CIMultiDict, granting it case-insensitive header-key lookups as well as multi-key storage. The encoding can be derived using the declared_encoding() method. In this example, it was retrieved from the Content-Type header.
body is represented by the HttpResponseBody class, which is simply a subclass of the bytes class. The body param of HttpResponse must be given as bytes. In our code example, we’ve converted it from str to bytes using the encode() string method. Similar to the headers, the encoding can be derived using the declared_encoding() method. In this case, it returned None since no encoding can be derived from the response body. A json() method is also available to conveniently access the decoded contents of JSON responses. It uses the derived encoding to properly decode contents like the 👍 emoji.
The HttpResponse class itself also has these convenient members:
The encoding property returns the proper encoding of the response based on this hierarchy:
user-specified encoding (using the _encoding attribute)
BOM from the body
header encodings
body encodings
The text property can be used instead of accessing the raw bytes values (which don’t represent the underlying content properly, like the 👍 emoji); it takes the derived encoding into account when decoding the bytes value.
The json() method is available as a shortcut to HttpResponseBody’s json() method.
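The encoding hierarchy above can be sketched with the standard library alone. The block below is a simplified illustration, not web-poet’s implementation: resolve_encoding(), its parameters, the short BOM table, and the utf-8 fallback are all assumptions made for the example.

```python
import codecs
import re
from typing import Optional

# Simplified illustration only: this is NOT web-poet's implementation, and
# resolve_encoding() and its parameters are hypothetical names.
_BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]
_CHARSET_RE = re.compile(r"charset=[\"']?([\w-]+)", re.IGNORECASE)


def _charset_from(text: str) -> Optional[str]:
    match = _CHARSET_RE.search(text)
    return match.group(1).lower() if match else None


def resolve_encoding(
    body: bytes,
    content_type: str = "",
    user_encoding: Optional[str] = None,
    default: str = "utf-8",
) -> str:
    # 1. An encoding given explicitly by the user always wins.
    if user_encoding:
        return user_encoding
    # 2. A byte order mark at the start of the body.
    for bom, name in _BOMS:
        if body.startswith(bom):
            return name
    # 3. The charset declared in the Content-Type header, then
    # 4. the charset declared inside the body (e.g. a <meta> tag).
    return (
        _charset_from(content_type)
        or _charset_from(body[:1024].decode("ascii", errors="ignore"))
        or default  # assumed fallback for this sketch
    )


body = '{"data": "value 👍"}'.encode("utf-8")
header_encoding = resolve_encoding(body, "application/json;charset=UTF-8")
text = body.decode(header_encoding)
print(header_encoding, text)
```

Decoding with the resolved encoding is what makes the emoji readable again, which is the role the text property plays on a real response.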
We’ve only explored a JSON response as a result from an additional request. Let’s take a look at another example having an HTML response:
import web_poet
response = web_poet.HttpResponse(
    url="https://www.api.example.com/product-pagination/",
    body=(
        '<html>'
        ' <head>'
        ' <title>Some page</title>'
        ' <meta http-equiv="Content-Type" content="text/html; charset=utf-8">'
        ' </head>'
        ' <body>Sample content 💯</body>'
        '</html>'
    ).encode("utf-8"),
    status=200,
    headers={},
)
print(response.headers.declared_encoding()) # None
print(response.body.declared_encoding()) # utf-8
print(response.encoding) # utf-8
print(response.body.json()) # JSONDecodeError
print(response.json()) # JSONDecodeError
print(type(response.selector)) # <class 'parsel.selector.Selector'>
print(response.selector.css("body ::text").get()) # Sample content 💯
print(response.css("body ::text").get()) # Sample content 💯
print(response.selector.xpath("//body/text()").get()) # Sample content 💯
print(response.xpath("//body/text()").get()) # Sample content 💯
The key takeaways for this example are:
The encoding is derived from the meta tags inside the body, since headers is empty in this example.
Since we now have an HTML response, using the json() method raises a JSONDecodeError, as a JSON document cannot be parsed from it.
The selector property is an instance of parsel.selector.Selector; there are also css() and xpath() methods. Usually there’s no need to use selector, as css() and xpath() are available.
HttpClient
The main interface for executing additional requests is HttpClient. It also has full support for asyncio, enabling developers to perform additional requests asynchronously using asyncio.gather(), asyncio.wait(), etc. This means that asyncio can be used anywhere inside the Page Object, including the to_item() method.
In an earlier section, we’ve explored how HttpRequest is defined. Let’s look at a few quick examples of how to execute additional requests using the HttpClient.
Executing an HttpRequest instance
import attrs
import web_poet
from web_poet import validates_input

@attrs.define
class ProductPage(web_poet.WebPage):
    http: web_poet.HttpClient

    @validates_input
    async def to_item(self):
        item = {
            "url": self.url,
            "name": self.css("#main h3.name ::text").get(),
            "product_id": self.css("#product ::attr(product-id)").get(),
        }
        # Simulate clicking on a button that says "View All Images"
        request = web_poet.HttpRequest(f"https://api.example.com/v2/images?id={item['product_id']}")
        response: web_poet.HttpResponse = await self.http.execute(request)
        item["images"] = response.css(".product-images img::attr(src)").getall()
        return item
As the example suggests, we’re performing an additional request that allows us to extract more images from a product page, which might not otherwise be possible. This is because an additional button needs to be clicked to fetch the complete set of product images via AJAX.
There are a few things to take note of in this example:
Recall from the HttpRequest tutorial section that the default method is GET. Thus, the method parameter can be omitted for simple GET requests.
We’re now using the async/await syntax inside the to_item() method.
The response from the additional request is of type HttpResponse.
Tip
Check out the Batch requests tutorial section to see how to execute a group of HttpRequest instances in batch.
Fortunately, there are already some quick shortcuts for performing single additional requests: the request(), get(), and post() methods of HttpClient. These define the HttpRequest and execute it as well.
A simple GET request
Let’s take the example from the previous section and use the get() method in it.
import attrs
import web_poet
from web_poet import validates_input

@attrs.define
class ProductPage(web_poet.WebPage):
    http: web_poet.HttpClient

    @validates_input
    async def to_item(self):
        item = {
            "url": self.url,
            "name": self.css("#main h3.name ::text").get(),
            "product_id": self.css("#product ::attr(product-id)").get(),
        }
        # Simulates clicking on a button that says "View All Images"
        response: web_poet.HttpResponse = await self.http.get(
            f"https://api.example.com/v2/images?id={item['product_id']}"
        )
        item["images"] = response.css(".product-images img::attr(src)").getall()
        return item
There are a few things to take note of in this example:
A GET request can be done via HttpClient’s get() method.
There is no need to create an instance of HttpRequest when get() is used.
A POST request with header and body
Let’s see another example which needs headers and body data to process additional requests.
In this example, we’ll paginate related items in a carousel. These are usually lazily loaded by the website to reduce the amount of information rendered in the DOM that might not otherwise be viewed by all users anyway.
Thus, additional requests inside the Page Object are typically needed for it:
import json
from typing import List

import attrs
import web_poet
from web_poet import validates_input

@attrs.define
class ProductPage(web_poet.WebPage):
    http: web_poet.HttpClient

    @validates_input
    async def to_item(self):
        item = {
            "url": self.url,
            "name": self.css("#main h3.name ::text").get(),
            "product_id": self.css("#product ::attr(product-id)").get(),
            "related_product_ids": self.parse_related_product_ids(self),
        }
        # Simulates "scrolling" through a carousel that loads related product items
        response: web_poet.HttpResponse = await self.http.post(
            url="https://www.api.example.com/related-products/",
            headers={
                "Content-Type": "application/json;charset=UTF-8"
            },
            body=json.dumps(
                {
                    "Page": 2,
                    "ProductID": item["product_id"],
                }
            ).encode("utf-8"),
        )
        item["related_product_ids"].extend(self.parse_related_product_ids(response))
        return item

    @staticmethod
    def parse_related_product_ids(response_page) -> List[str]:
        return response_page.css("#main .related-products ::attr(product-id)").getall()
Here’s the key takeaway in this example:
Similar to HttpClient’s get() method, a post() method is also available. It is often used to submit forms.
Other Single Requests
The get() and post() methods are merely quick shortcuts for request():
client = HttpClient()
url = "https://api.example.com/v1/data"
headers = {"Content-Type": "application/json;charset=UTF-8"}
body = b'{"data": "value"}'
# These are the same:
response = await client.get(url)
response = await client.request(url, method="GET")
# The same goes for these:
response = await client.post(url, headers=headers, body=body)
response = await client.request(url, method="POST", headers=headers, body=body)
Thus, for HTTP methods other than the common GET and POST (e.g. HEAD, PUT, DELETE, etc.), you can use request().
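The delegation described above can be sketched without any dependency. MiniClient below is a hypothetical stand-in, not web-poet’s HttpClient; it only echoes what it was asked to send, to show that get() and post() are thin wrappers around request() and that other methods go through request() directly.

```python
import asyncio
from typing import Dict, Optional


class MiniClient:
    """Hypothetical stand-in used only to illustrate the delegation pattern."""

    async def request(
        self,
        url: str,
        *,
        method: str = "GET",
        headers: Optional[Dict[str, str]] = None,
        body: bytes = b"",
    ) -> dict:
        # A real client would perform the download here; this sketch just
        # echoes back what it was asked to send.
        return {"url": url, "method": method, "headers": headers or {}, "body": body}

    async def get(self, url: str, *, headers=None) -> dict:
        return await self.request(url, method="GET", headers=headers)

    async def post(self, url: str, *, headers=None, body: bytes = b"") -> dict:
        return await self.request(url, method="POST", headers=headers, body=body)


async def main() -> list:
    client = MiniClient()
    url = "https://api.example.com/v1/data"
    return [
        await client.get(url),
        await client.request(url, method="GET"),
        await client.request(url, method="DELETE"),
    ]


via_get, via_request, via_delete = asyncio.run(main())
print(via_get == via_request, via_delete["method"])
```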
Batch requests
We can also choose to process requests in batch instead of sequentially, one by one (e.g. using execute()). The batch_execute() method can be used for this; it accepts an arbitrary number of HttpRequest instances.
Let’s modify the POST example from the earlier section to see how it can be done.
The difference between this code example and the earlier one is that, instead of requesting only the 2nd page, we now paginate from the 2nd page up to a configurable limit. Instead of calling a single post() method, we’re creating a list of HttpRequest instances to be executed in batch using the batch_execute() method.
import json
from typing import List

import attrs
import web_poet
from web_poet import validates_input

@attrs.define
class ProductPage(web_poet.WebPage):
    http: web_poet.HttpClient
    default_pagination_limit = 10

    @validates_input
    async def to_item(self):
        item = {
            "url": self.url,
            "name": self.css("#main h3.name ::text").get(),
            "product_id": self.css("#product ::attr(product-id)").get(),
            "related_product_ids": self.parse_related_product_ids(self),
        }
        requests: List[web_poet.HttpRequest] = [
            self.create_request(item["product_id"], page_num=page_num)
            for page_num in range(2, self.default_pagination_limit)
        ]
        responses: List[web_poet.HttpResponse] = await self.http.batch_execute(*requests)
        # parse_related_product_ids() already returns a flat list of IDs
        related_product_ids = [
            id_
            for response in responses
            for id_ in self.parse_related_product_ids(response)
        ]
        item["related_product_ids"].extend(related_product_ids)
        return item

    def create_request(self, product_id, page_num=2):
        # Simulates "scrolling" through a carousel that loads related product items
        return web_poet.HttpRequest(
            url="https://www.api.example.com/product-pagination/",
            method="POST",
            headers={
                "Content-Type": "application/json;charset=UTF-8"
            },
            body=json.dumps(
                {
                    "Page": page_num,
                    "ProductID": product_id,
                }
            ).encode("utf-8"),
        )

    @staticmethod
    def parse_related_product_ids(response_page) -> List[str]:
        return response_page.css("#main .related-products ::attr(product-id)").getall()
The key takeaways for this example are:
An HttpRequest can be instantiated to represent a generic HTTP request. It only contains the HTTP request information and isn’t executed yet. This is useful for creating factory methods that build requests without performing any download.
HttpClient has a batch_execute() method that can process a list of HttpRequest instances asynchronously together.
Tip
The batch_execute() method can execute multiple HttpRequest instances. For example, it could be a mixture of GET and POST requests, or even requests for various parts of the page altogether.
Processing the additional requests in batch is useful since it takes advantage of async execution which could be faster in certain cases (assuming you’re allowed to perform HTTP requests in parallel).
Nonetheless, you can still use the batch_execute() method to execute a single HttpRequest instance.
Note
The batch_execute() method is a simple wrapper over asyncio.gather(). Developers are free to use other functionalities available inside asyncio to handle multiple requests.
For example, asyncio.as_completed() can be used to process the first response from a group of requests as early as possible. However, the responses may then arrive in a different order from the requests.
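The ordering difference between the two asyncio helpers can be seen in a small standard-library sketch. fake_fetch() and its delays are hypothetical stand-ins for real downloads; no HTTP request is made.

```python
import asyncio


async def fake_fetch(page: int, delay: float) -> str:
    # Hypothetical stand-in for an HTTP download; the sleep simulates latency.
    await asyncio.sleep(delay)
    return f"page-{page}"


async def main():
    # Later pages are made to "respond" sooner than earlier ones.
    delays = {2: 0.15, 3: 0.10, 4: 0.05}

    # asyncio.gather() returns results in the order the awaitables were passed.
    gathered = await asyncio.gather(*(fake_fetch(p, d) for p, d in delays.items()))

    # asyncio.as_completed() yields results as soon as each one finishes.
    completed = []
    for future in asyncio.as_completed([fake_fetch(p, d) for p, d in delays.items()]):
        completed.append(await future)
    return gathered, completed


gathered, completed = asyncio.run(main())
print(gathered)
print(completed)
```

With gather(), a response can be matched to its request by position. With as_completed(), that mapping has to be carried in the result itself if it matters.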
Handling Exceptions in Page Objects
Let’s have a look at how we could handle exceptions when performing additional requests inside Page Objects. For this example, let’s improve the code snippet from the previous subsection named: A simple GET request.
import logging

import attrs
import web_poet
from web_poet import validates_input

logger = logging.getLogger(__name__)

@attrs.define
class ProductPage(web_poet.WebPage):
    http: web_poet.HttpClient

    @validates_input
    async def to_item(self):
        item = {
            "url": self.url,
            "name": self.css("#main h3.name ::text").get(),
            "product_id": self.css("#product ::attr(product-id)").get(),
        }
        try:
            # Simulates clicking on a button that says "View All Images"
            response: web_poet.HttpResponse = await self.http.get(
                f"https://api.example.com/v2/images?id={item['product_id']}"
            )
        except web_poet.exceptions.HttpRequestError as err:
            logger.warning(
                f"Unable to request images for product ID '{item['product_id']}' "
                f"using this request: {err.request}"
            )
        except web_poet.exceptions.HttpResponseError as err:
            logger.warning(
                f"Received a {err.response.status} response status for product ID "
                f"'{item['product_id']}' from this URL: {err.request.url}"
            )
        else:
            item["images"] = response.css(".product-images img::attr(src)").getall()
        return item
In this code example, the code has become more resilient in cases where it isn’t possible to retrieve more images using the website’s public API. The failure could be due to anything like SSL errors, connection errors, page not found, etc.
Using HttpClient to execute requests raises exceptions with the base class web_poet.exceptions.http.HttpError, regardless of how the HTTP Downloader is implemented. In our example above, we could’ve simply caught the web_poet.exceptions.http.HttpError base error. However, it’s ambiguous in the sense that the error could originate either during the HTTP Request execution or when receiving the HTTP Response.
A more specific web_poet.exceptions.http.HttpRequestError exception is raised when a problem occurs while the HttpRequest is being handled, while web_poet.exceptions.http.HttpResponseError is raised when receiving a response with an HTTP error. Notice from the example that the exceptions have attributes like request and response, which are instances of HttpRequest and HttpResponse respectively. Accessing them is useful for debugging and logging problems.
Note that web_poet.exceptions.http.HttpResponseError only occurs when receiving responses with status codes in the 400-5xx range. However, this behavior can be altered by using the allow_status param in the methods of HttpClient.
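The relationship between the base error, its two subclasses, and an allowed-status override can be sketched as follows. The classes and check_status() are hypothetical stand-ins written for this illustration; they mirror only the behavior described above and are not web-poet’s classes, and the form of the allow_status argument here (an iterable of ints) is an assumption.

```python
from typing import Iterable


# Hypothetical stand-ins that mirror only the relationships described above;
# they are not web-poet's classes.
class HttpError(Exception):
    """Base class for errors raised while executing a request."""


class HttpRequestError(HttpError):
    """The request could not be completed (e.g. a connection error)."""


class HttpResponseError(HttpError):
    """A response was received, but with an error status code."""

    def __init__(self, status: int):
        super().__init__(f"Unexpected status {status}")
        self.status = status


def check_status(status: int, allow_status: Iterable[int] = ()) -> int:
    # Statuses from 400 to 599 are errors unless explicitly allowed.
    if 400 <= status < 600 and status not in set(allow_status):
        raise HttpResponseError(status)
    return status


outcomes = []
for status, allowed in [(200, ()), (404, ()), (404, (404,))]:
    try:
        outcomes.append(check_status(status, allowed))
    except HttpError as err:  # the base class also catches the subclasses
        outcomes.append(type(err).__name__)
print(outcomes)
```

Catching the base class handles every failure in one place, while catching the subclasses separately keeps request failures and error responses distinguishable.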
Note
In the future, more specific exceptions which inherit from the base web_poet.exceptions.http.HttpError exception will be available. This should allow developers writing Page Objects to properly identify what went wrong and act specifically based on the problem.
Let’s look at another example, this time executing requests in batch as opposed to using single requests via these methods of the HttpClient: request(), get(), and post().
For this example, let’s improve the code snippet from the previous subsection named: Batch requests.
import json
import logging
from typing import List, Union

import attrs
import web_poet
from web_poet import validates_input

logger = logging.getLogger(__name__)

@attrs.define
class ProductPage(web_poet.WebPage):
    http: web_poet.HttpClient
    default_pagination_limit = 10

    @validates_input
    async def to_item(self):
        item = {
            "url": self.url,
            "name": self.css("#main h3.name ::text").get(),
            "product_id": self.css("#product ::attr(product-id)").get(),
            "related_product_ids": self.parse_related_product_ids(self),
        }
        requests: List[web_poet.HttpRequest] = [
            self.create_request(item["product_id"], page_num=page_num)
            for page_num in range(2, self.default_pagination_limit)
        ]
        try:
            responses: List[web_poet.HttpResponse] = await self.http.batch_execute(*requests)
        except web_poet.exceptions.HttpError:
            logger.warning(
                f"Unable to request for more related products for product ID: {item['product_id']}"
            )
        else:
            related_product_ids = []
            for response in responses:
                # parse_related_product_ids() already returns a flat list of IDs
                related_product_ids.extend(self.parse_related_product_ids(response))
            item["related_product_ids"].extend(related_product_ids)
        return item

    def create_request(self, product_id, page_num=2):
        # Simulates "scrolling" through a carousel that loads related product items
        return web_poet.HttpRequest(
            url="https://www.api.example.com/product-pagination/",
            method="POST",
            headers={
                "Content-Type": "application/json;charset=UTF-8"
            },
            body=json.dumps(
                {
                    "Page": page_num,
                    "ProductID": product_id,
                }
            ).encode("utf-8"),
        )

    @staticmethod
    def parse_related_product_ids(response_page) -> List[str]:
        return response_page.css("#main .related-products ::attr(product-id)").getall()
Handling exceptions when using batch_execute() remains largely the same. However, the main difference is that you may be wasting perfectly good responses just because a single request from the batch failed. Notice that we’re using the base exception class web_poet.exceptions.http.HttpError to account for any type of error, both during the HTTP Request execution and when receiving the response.
An alternative approach is to salvage the good responses. For example, say you’ve sent out 10 HttpRequest instances and only 1 of them raised an exception during processing. You can still get the data from the other 9 HttpResponse instances by passing the parameter return_exceptions=True to batch_execute().
This means that any exceptions raised during the HTTP execution are returned alongside the successful responses. The return value of batch_execute() could then be a mixture of HttpResponse and web_poet.exceptions.http.HttpError (and its exception subclasses).
Here’s an example:
# Revised code snippet from the to_item() method
requests: List[web_poet.HttpRequest] = [
    self.create_request(item["product_id"], page_num=page_num)
    for page_num in range(2, self.default_pagination_limit)
]
responses: List[Union[web_poet.HttpResponse, web_poet.exceptions.HttpError]] = (
    await self.http.batch_execute(*requests, return_exceptions=True)
)
related_product_ids = []
for i, response in enumerate(responses):
    if isinstance(response, web_poet.exceptions.HttpError):
        logger.warning(
            f"Unable to request related products for product ID '{item['product_id']}' "
            f"using this request: {requests[i]}. Reason: {response}."
        )
        continue
    # parse_related_product_ids() already returns a flat list of IDs
    related_product_ids.extend(self.parse_related_product_ids(response))
item["related_product_ids"].extend(related_product_ids)
return item
In the example above, we’re now checking the list of responses to see if any exceptions are included in it. If so, we simply log them and skip them. This way, perfectly good responses can still be processed.
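Since batch_execute() is described earlier as a wrapper over asyncio.gather(), the same salvage pattern can be tried out with the standard library alone. FetchError and fake_fetch() are hypothetical stand-ins for a failed request and a download.

```python
import asyncio


class FetchError(Exception):
    """Hypothetical error type standing in for a failed request."""


async def fake_fetch(page: int) -> str:
    # Hypothetical download: page 3 always fails, the others succeed.
    await asyncio.sleep(0)
    if page == 3:
        raise FetchError(f"page {page} failed")
    return f"page-{page}"


async def main():
    results = await asyncio.gather(
        *(fake_fetch(page) for page in range(2, 6)),
        return_exceptions=True,
    )
    # Exceptions are returned in place of results, so the list keeps the same
    # order and length as the submitted awaitables.
    good = [r for r in results if not isinstance(r, Exception)]
    failed = [r for r in results if isinstance(r, Exception)]
    return good, failed


good, failed = asyncio.run(main())
print(good, [str(err) for err in failed])
```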
Retrying Additional Requests
When bad response data comes from additional requests, you must handle retries on your own.
The page object code is responsible for retrying additional requests until good response data is received, or until some maximum number of retries is exceeded.
It is up to you to decide what the maximum number of retries should be for a given additional request, based on your experience with the target website.
It is also up to you to decide how to implement retries of additional requests.
One option would be tenacity. For example, to try an additional request 3 times before giving up:
import attrs
from tenacity import retry, stop_after_attempt
from web_poet import HttpClient, WebPage, validates_input

@attrs.define
class MyPage(WebPage):
    http: HttpClient

    @retry(stop=stop_after_attempt(3))
    async def get_data(self):
        response = await self.http.get("https://toscrape.com/")
        if not response.css(".expected"):
            raise ValueError
        return response.css(".data").get()

    @validates_input
    async def to_item(self) -> dict:
        try:
            data = await self.get_data()
        except ValueError:
            return {}
        return {"data": data}
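If adding a dependency is not desirable, a plain loop is another option. The sketch below is one possible shape for it, using only the standard library; retry_async() and flaky_get_data() are hypothetical names, and the simulated failure stands in for bad response data.

```python
import asyncio


async def retry_async(func, attempts: int = 3, retry_on=(ValueError,)):
    # Hypothetical helper: call func() until it succeeds or the attempts run
    # out, re-raising the last error in the latter case.
    for attempt in range(1, attempts + 1):
        try:
            return await func()
        except retry_on:
            if attempt == attempts:
                raise


calls = 0


async def flaky_get_data() -> str:
    # Simulates bad response data on the first two calls.
    global calls
    calls += 1
    if calls < 3:
        raise ValueError("unexpected response")
    return "data"


result = asyncio.run(retry_async(flaky_get_data))
print(result, calls)
```

This sketch retries immediately; a delay between attempts (for example with asyncio.sleep()) may be preferable against a real website.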
If the reason your additional request fails is outdated or missing data from page object input, do not try to reproduce the request for that input as an additional request. Request fresh input instead.