Skip to content

Commit

Permalink
Add session management (#193)
Browse files Browse the repository at this point in the history
  • Loading branch information
Gallaecio authored Jun 19, 2024
1 parent 919bf42 commit f582076
Show file tree
Hide file tree
Showing 21 changed files with 3,724 additions and 52 deletions.
12 changes: 12 additions & 0 deletions .coveragerc
Original file line number Diff line number Diff line change
@@ -0,0 +1,12 @@
[run]
branch = true
include = scrapy_zyte_api/*
omit =
tests/*
disable_warnings = include-ignored

[report]
# https://github.com/nedbat/coveragepy/issues/831#issuecomment-517778185
exclude_lines =
pragma: no cover
if TYPE_CHECKING:
5 changes: 3 additions & 2 deletions .github/workflows/test.yml
Original file line number Diff line number Diff line change
Expand Up @@ -58,8 +58,9 @@ jobs:
run: |
tox -e ${{ matrix.toxenv || 'py' }}
- name: coverage
if: ${{ success() }}
run: bash <(curl -s https://codecov.io/bash)
uses: codecov/codecov-action@v4
with:
token: ${{ secrets.CODECOV_TOKEN }}

check:
runs-on: ubuntu-latest
Expand Down
11 changes: 10 additions & 1 deletion CHANGES.rst
Original file line number Diff line number Diff line change
@@ -1,6 +1,15 @@
Changes
=======

N.N.N (YYYY-MM-DD)
------------------

* The recommended position for ``ScrapyZyteAPIDownloaderMiddleware`` changed
from 1000 to 633, to accommodate for the new
``ScrapyZyteAPISessionDownloaderMiddleware``, which needs to be after
``ScrapyZyteAPIDownloaderMiddleware`` and before the Scrapy cookie downloader
middleware (700).

0.18.4 (2024-06-10)
-------------------

Expand Down Expand Up @@ -396,7 +405,7 @@ When upgrading, you should set the following in your Scrapy settings:
.. code-block:: python
DOWNLOADER_MIDDLEWARES = {
"scrapy_zyte_api.ScrapyZyteAPIDownloaderMiddleware": 1000,
"scrapy_zyte_api.ScrapyZyteAPIDownloaderMiddleware": 633,
}
# only applicable for Scrapy 2.7+
REQUEST_FINGERPRINTER_CLASS = "scrapy_zyte_api.ScrapyZyteAPIRequestFingerprinter"
Expand Down
4 changes: 4 additions & 0 deletions docs/conf.py
Original file line number Diff line number Diff line change
Expand Up @@ -56,4 +56,8 @@
"https://zyte-common-items.readthedocs.io/en/latest",
None,
),
"zyte-spider-templates": (
"https://zyte-spider-templates.readthedocs.io/en/latest",
None,
),
}
1 change: 1 addition & 0 deletions docs/index.rst
Original file line number Diff line number Diff line change
Expand Up @@ -26,6 +26,7 @@ either :ref:`globally <transparent>` or :ref:`per request <automap>`, or
usage/default
usage/retry
usage/scrapy-poet
usage/session
usage/stats
usage/fingerprint
usage/proxy
Expand Down
43 changes: 43 additions & 0 deletions docs/reference/meta.rst
Original file line number Diff line number Diff line change
Expand Up @@ -86,3 +86,46 @@ string.
<https://github.com/jd/tenacity/issues/147>`_.

See :ref:`retry`.


.. reqmeta:: zyte_api_session_enabled

zyte_api_session_enabled
=========================

Default: :setting:`ZYTE_API_SESSION_ENABLED`

Whether to use :ref:`scrapy-zyte-api session management <session>` for the
request (``True``) or not (``False``).


.. reqmeta:: zyte_api_session_location

zyte_api_session_location
=========================

Default: ``{}``

Address for ``setLocation``-based session initialization. See
:setting:`ZYTE_API_SESSION_LOCATION` for details.

This request metadata key, if not empty, takes precedence over the
:setting:`ZYTE_API_SESSION_LOCATION` setting, the
:setting:`ZYTE_API_SESSION_PARAMS` setting, and the
:reqmeta:`zyte_api_session_location` request metadata key.


.. reqmeta:: zyte_api_session_params

zyte_api_session_params
=======================

Default: ``{}``

Parameters to use for session initialization. See
:setting:`ZYTE_API_SESSION_PARAMS` for details.

This request metadata key, if not empty, takes precedence over the
:setting:`ZYTE_API_SESSION_PARAMS` setting, but it can be overridden
by the :setting:`ZYTE_API_SESSION_LOCATION` setting or the
:reqmeta:`zyte_api_session_location` request metadata key.
2 changes: 1 addition & 1 deletion docs/reference/request.rst
Original file line number Diff line number Diff line change
Expand Up @@ -173,7 +173,7 @@ They will be mapped even if defined with their default value.
Headers will also be mapped if set to a non-default value elsewhere, e.g. in a
custom downloader middleware, as long as it is done before the scrapy-zyte-api
downloader middleware, which is responsible for the mapping, processes the
request. Here “before” means a lower value than ``1000`` in the
request. Here “before” means a lower value than ``633`` in the
:setting:`DOWNLOADER_MIDDLEWARES <scrapy:DOWNLOADER_MIDDLEWARES>` setting.

Similarly, you can add any of those headers to the
Expand Down
258 changes: 257 additions & 1 deletion docs/reference/settings.rst
Original file line number Diff line number Diff line change
Expand Up @@ -198,7 +198,8 @@ ZYTE_API_MAX_REQUESTS
Default: ``None``

When set to an integer value > 0, the spider will close when the number of Zyte
API requests reaches it.
API requests reaches it, with ``closespider_max_zapi_requests`` as the close
reason.

Note that requests with error responses that cannot be retried or exceed their
retry limit also count here.
Expand Down Expand Up @@ -246,6 +247,261 @@ subclass.
See :ref:`retry`.


.. setting:: ZYTE_API_SESSION_CHECKER

ZYTE_API_SESSION_CHECKER
========================

Default: ``None``

A :ref:`Scrapy component <topics-components>` (or its import path as a string)
that defines a ``check`` method.

If ``check`` returns ``True``, the response session is considered valid; if
``check`` returns ``False``, the response session is considered invalid, and
will be discarded. ``check`` can also raise a
:exc:`~scrapy.exceptions.CloseSpider` exception to close the spider.

If defined, the ``check`` method is called on every response that is using a
:ref:`session managed by scrapy-zyte-api <session>`. If not defined, the
default implementation checks the outcome of the ``setLocation`` action if
session initialization was location-based, as described in
:ref:`session-check`.

Example:

.. code-block:: python
:caption: settings.py
from scrapy import Request
from scrapy.http.response import Response
class MySessionChecker:
def check(self, request: Request, response: Response) -> bool:
return bool(response.css(".is_valid"))
ZYTE_API_SESSION_CHECKER = MySessionChecker
Because the session checker is a Scrapy component, you can access the crawler
object, for example to read settings:

.. code-block:: python
:caption: settings.py
from scrapy import Request
from scrapy.http.response import Response
class MySessionChecker:
@classmethod
def from_crawler(cls, crawler):
return cls(crawler)
def __init__(self, crawler):
location = crawler.settings["ZYTE_API_SESSION_LOCATION"]
self.postal_code = location["postalCode"]
def check(self, request: Request, response: Response) -> bool:
return response.css(".postal_code::text").get() == self.postal_code
ZYTE_API_SESSION_CHECKER = MySessionChecker
.. setting:: ZYTE_API_SESSION_ENABLED

ZYTE_API_SESSION_ENABLED
========================

Default: ``False``

Enables :ref:`scrapy-zyte-api session management <session>`.


.. setting:: ZYTE_API_SESSION_LOCATION

ZYTE_API_SESSION_LOCATION
=========================

Default: ``{}``

If defined, sessions are initialized using the ``setLocation``
:http:`action <request:actions>`, and the value of this setting must be the
target address :class:`dict`. For example:

.. code-block:: python
:caption: settings.py
ZYTE_API_SESSION_LOCATION = {"postalCode": "10001"}
If the :setting:`ZYTE_API_SESSION_PARAMS` setting or the
:reqmeta:`zyte_api_session_params` request metadata key set a ``"url"``, it
will be used for session initialization as well. Otherwise, the URL of the
request for which the session is being initialized will be used instead.

This setting, if not empty, takes precedence over the
:setting:`ZYTE_API_SESSION_PARAMS` setting and the
:reqmeta:`zyte_api_session_params` request metadata key, but it can be
overridden by the :reqmeta:`zyte_api_session_location` request metadata key.

To disable the :setting:`ZYTE_API_SESSION_LOCATION` setting on a specific
request, e.g. to use the :setting:`ZYTE_API_SESSION_PARAMS` setting or the
:reqmeta:`zyte_api_session_params` request metadata key instead, set
the :reqmeta:`zyte_api_session_location` request metadata key to ``{}``.


.. setting:: ZYTE_API_SESSION_MAX_BAD_INITS

ZYTE_API_SESSION_MAX_BAD_INITS
==============================

Default: ``8``

The maximum number of :ref:`scrapy-zyte-api sessions <session>` per pool that
are allowed to fail their session check right after creation in a row. If the
maximum is reached, the spider closes with ``bad_session_inits`` as the close
reason.

To override this value for specific pools, use
:setting:`ZYTE_API_SESSION_MAX_BAD_INITS_PER_POOL`.


.. setting:: ZYTE_API_SESSION_MAX_BAD_INITS_PER_POOL

ZYTE_API_SESSION_MAX_BAD_INITS_PER_POOL
=======================================

Default: ``{}``

:class:`dict` where keys are :ref:`pool <session-pools>` IDs and values are
overrides of :setting:`ZYTE_API_SESSION_POOL_SIZE` for those pools.


.. setting:: ZYTE_API_SESSION_MAX_ERRORS

ZYTE_API_SESSION_MAX_ERRORS
===========================

Default: ``1``

Maximum number of :ref:`unsuccessful responses
<zyte-api-unsuccessful-responses>` allowed for any given session before
discarding the session.

You might want to increase this number if you find that a session may continue
to work even after an unsuccessful response. See :ref:`optimize-sessions`.

.. note:: This setting does not affect session checks
(:setting:`ZYTE_API_SESSION_CHECKER`). A session is always discarded the
first time it fails its session check.


.. setting:: ZYTE_API_SESSION_PARAMS

ZYTE_API_SESSION_PARAMS
=======================

Default: ``{"browserHtml": True}``

Parameters to use for session initialization.

It works similarly to :http:`request:sessionContextParams` from
:ref:`server-managed sessions <zyte-api-session-contexts>`, but it supports
arbitrary Zyte API parameters instead of a specific subset.

If it does not define a ``"url"``, the URL of the request for which the session
is being initialized will be used.

This setting can be overridden by the :setting:`ZYTE_API_SESSION_LOCATION`
setting, the :reqmeta:`zyte_api_session_location` request metadata key, or the
:reqmeta:`zyte_api_session_params` request metadata key.

Example:

.. code-block:: python
:caption: settings.py
ZYTE_API_SESSION_PARAMS = {
"browserHtml": True,
"actions": [
{
"action": "setLocation",
"address": {"postalCode": "10001"},
}
],
}
.. tip:: The example above is equivalent to setting
:setting:`ZYTE_API_SESSION_LOCATION` to ``{"postalCode": "10001"}``.


.. setting:: ZYTE_API_SESSION_POOL_SIZE

ZYTE_API_SESSION_POOL_SIZE
==========================

Default: ``8``

The maximum number of active :ref:`scrapy-zyte-api sessions <session>` to keep
per :ref:`pool <session-pools>`.

To override this value for specific pools, use
:setting:`ZYTE_API_SESSION_POOL_SIZES`.

Increase this number to lower the frequency with which requests are sent
through each session, which on some websites may increase the lifetime of each
session. See :ref:`optimize-sessions`.


.. setting:: ZYTE_API_SESSION_POOL_SIZES

ZYTE_API_SESSION_POOL_SIZES
===========================

Default: ``{}``

:class:`dict` where keys are :ref:`pool <session-pools>` IDs and values are
overrides of :setting:`ZYTE_API_SESSION_POOL_SIZE` for those pools.


.. setting:: ZYTE_API_SESSION_QUEUE_MAX_ATTEMPTS

ZYTE_API_SESSION_QUEUE_MAX_ATTEMPTS
===================================

Default: ``60``

scrapy-zyte-api maintains a rotation queue of ready-to-use sessions per
:ref:`pool <session-pools>`. At some points, the queue might be empty for a
given pool because all its sessions are in the process of being initialized or
refreshed.

If the queue is empty when trying to assign a session to a request,
scrapy-zyte-api will wait some time
(:setting:`ZYTE_API_SESSION_QUEUE_WAIT_TIME`), and then try to get a session
from the queue again.

Use this setting to configure the maximum number of attempts before giving up
and raising a :exc:`RuntimeError` exception.


.. setting:: ZYTE_API_SESSION_QUEUE_WAIT_TIME

ZYTE_API_SESSION_QUEUE_WAIT_TIME
===================================

Default: ``1.0``

Number of seconds to wait between attempts to get a session from a rotation
queue.

See :setting:`ZYTE_API_SESSION_QUEUE_MAX_ATTEMPTS` for details.


.. setting:: ZYTE_API_SKIP_HEADERS

ZYTE_API_SKIP_HEADERS
Expand Down
Loading

0 comments on commit f582076

Please sign in to comment.