.. _framework: ================== Built-in framework ================== :mod:`web_poet.framework` is a built-in :ref:`web-poet framework ` for simple use cases. It is designed to be easy to use for quick proof-of-concepts, simple scripts, and for generating test fixtures. It can also serve as a reference implementation for framework authors. Limitations =========== The main limitation of the built-in framework is that it is not a complete scraping framework like :doc:`Scrapy `, which can support web-poet thanks to :doc:`scrapy-poet `. As a web-poet framework, the built-in framework also lacks support for :ref:`custom input classes `, :exc:`~web_poet.exceptions.Retry` and :exc:`~web_poet.exceptions.UseFallback`. Also, :ref:`browser inputs ` only support plain GET requests. Requests with a non-GET method, headers or a body raise :exc:`~web_poet.exceptions.http.HttpRequestError`. Installation ============ To use :mod:`web_poet.framework`, install the ``framework`` extra: .. code-block:: bash pip install web-poet[framework] For :ref:`browser support `, you also need to `install at least 1 browser with Playwright `__. For example, to install the main browsers: .. code-block:: bash playwright install Basic use ========= .. code-block:: python from dataclasses import dataclass from web_poet import WebPage from web_poet.framework import Framework from web_poet.utils import ensure_awaitable @dataclass class Book: title: str class BookPage(WebPage[Book]): @field def title(self) -> str: return self.response.css("h1::text").get() framework = Framework() item = await framework.get_item("https://books.example.com/book/1", BookPage) # Or, if you prefer, get a page object instance first. page = await framework.get_page("https://books.example.com/book/1", BookPage) item = await ensure_awaitable(page.to_item()) Choosing a page object class automatically ========================================== If you decorate your page object classes with :func:`~web_poet.handle_urls` and make sure they are imported, e.g. with :func:`~web_poet.consume_modules`, you can pass :meth:`~web_poet.framework.Framework.get_item` an item class, and let it determine which page object class to use: .. code-block:: python from dataclasses import dataclass from web_poet import WebPage, handle_urls from web_poet.framework import Framework @dataclass class Book: title: str @handle_urls("books.example.com") class BookPage(WebPage[Book]): @field def title(self) -> str: return self.response.css("h1::text").get() framework = Framework() item = await framework.get_item("https://books.example.com/book/1", Book) .. _framework-browser: Browser ======= The built-in framework can use `Playwright `_ to resolve browser dependencies like :class:`~web_poet.page_inputs.browser.BrowserHtml` or :class:`~web_poet.page_inputs.browser.BrowserResponse`. Chromium is used by default. You can override that by passing ``default_playwright_engine`` to :class:`~web_poet.framework.Framework`. Page objects can also annotate their Playwright engine dependencies with :func:`~web_poet.framework.playwright_engine` to specify which engine they require. For example: .. code-block:: python from typing import Annotated from web_poet import WebPage, Item from web_poet.page_inputs.browser import BrowserResponse from web_poet.framework import playwright_engine class MyPageObject(WebPage[Item]): response = Annotated[BrowserResponse, playwright_engine("firefox")] Stats ===== The built-in framework supports :class:`~web_poet.page_inputs.stats.Stats`. By default, :class:`~web_poet.framework.Framework` creates a :class:`~web_poet.page_inputs.stats.DictStatCollector` object, exposes it to any page object that requests :class:`~web_poet.page_inputs.stats.Stats`, and exposes that object as the :data:`stats ` attribute of the framework: .. code-block:: python from web_poet.framework import Framework framework = Framework() item1 = await framework.get_item("http://example.com/book/1", BookPage) item2 = await framework.get_item("http://example.com/book/2", BookPage) all_stats = framework.stats :class:`~web_poet.framework.Framework` also supports passing a custom stats collector: .. code-block:: python from web_poet.page_inputs.stats import StatCollector class MyStatCollector(StatCollector): ... framework = Framework(stats=MyStatCollector())