Identify scrapy-zyte-api usage via custom user-agent #130

PyExplorer · 2023-09-21T08:57:08Z

Goals:

It should also be possible to change the user agent, e.g. for the zyte-crawlers use case (not implemented yet).
The user agent should keep the information about the python-zyte-api version in use (possible with the fix).

This is one option for setting a custom user agent for scrapy-zyte-api while keeping the one from python-zyte-api. However, it requires changes to client.py in python-zyte-api (https://github.com/zytedata/python-zyte-api/blob/main/zyte_api/aio/client.py#L60), such as:

    async def request_raw(self, query: dict, *,
                          endpoint: str = 'extract',
                          session=None,
                          handle_retries=True,
                          retrying: Optional[AsyncRetrying] = None,
                          ):
        retrying = retrying or self.retrying
        post = _post_func(session)
        auth = aiohttp.BasicAuth(self.api_key)
        headers = {
            'User-Agent': user_agent(aiohttp, extra_user_agent=query.pop("user-agent", None)),
            'Accept-Encoding': 'br'
        }
        response_stats = []
        start_global = time.perf_counter()
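
To make the proposed change concrete, here is a minimal, self-contained sketch of how a helper could combine the default identification with an optional extra token. It is not the python-zyte-api implementation: `build_user_agent`, `build_headers`, the token order, the placeholder version `0.0.0`, and the `fake_http_library` stand-in for `aiohttp` are all assumptions made for illustration.

```python
from types import SimpleNamespace
from typing import Optional

# Hypothetical placeholder; a real client would read its own package version.
CLIENT_VERSION = "0.0.0"

# Stand-in for an HTTP library module such as aiohttp (assumption for this sketch).
fake_http_library = SimpleNamespace(__name__="aiohttp", __version__="3.8.5")


def build_user_agent(library, extra_user_agent: Optional[str] = None) -> str:
    """Return the default user agent, prefixed by an optional extra token."""
    default = (
        f"python-zyte-api/{CLIENT_VERSION} "
        f"{library.__name__}/{library.__version__}"
    )
    if extra_user_agent:
        return f"{extra_user_agent} {default}"
    return default


def build_headers(query: dict, library=fake_http_library) -> dict:
    """Build request headers, removing the custom user agent from the query."""
    # pop() keeps the "user-agent" key from being sent as an API parameter.
    extra = query.pop("user-agent", None)
    return {
        "User-Agent": build_user_agent(library, extra_user_agent=extra),
        "Accept-Encoding": "br",
    }
```

Note that `query.pop` mutates the caller's dict, which is a side effect worth keeping in mind when weighing this option against passing the user agent as a separate parameter.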

Some other options for passing the user agent from zyte-crawlers:
1. Use a setting with the name of the package and apply it when creating the client (requires changes in AsyncClient).
2. Use cb_kwargs or meta, although it looks like they are cleaned during the request process.
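
Option 1 could look roughly like the sketch below. A plain mapping stands in for Scrapy settings, and the setting name `_ZYTE_API_USER_AGENT` is borrowed from a commit message in this PR; `DEFAULT_USER_AGENT`, its value, and `resolve_user_agent` are hypothetical names used only for illustration.

```python
from typing import Mapping, Optional

# Hypothetical default; the real string would carry the actual package version.
DEFAULT_USER_AGENT = "scrapy-zyte-api/0.0.0"


def resolve_user_agent(settings: Mapping[str, Optional[str]]) -> str:
    """Return the custom user agent from settings, or the default one."""
    custom = settings.get("_ZYTE_API_USER_AGENT")
    # An unset or empty setting falls back to the default identification.
    return custom if custom else DEFAULT_USER_AGENT
```

The resolved value would then be handed to the client when it is created, which is why this option needs a matching change in AsyncClient.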

It would be good to discuss all of the above.

Gallaecio

It is probably best to pass the user agent to AsyncClient as a parameter instead, once python-zyte-api supports that.

PyExplorer · 2023-09-21T09:17:30Z

> It is probably best to pass the user agent to AsyncClient as a parameter instead, once python-zyte-api supports that.

@Gallaecio is it OK to change AsyncClient this way?

    def __init__(self, *,
                 api_key=None,
                 api_url=API_URL,
                 n_conn=15,
                 retrying: Optional[AsyncRetrying] = None,
                 custom_user_agent=None
                 ):

Gallaecio · 2023-09-21T09:24:45Z

Sorry, accidentally edited your comment instead of answering 🤦

Answer: I think so, yes. Maybe even call it just user_agent; if the value is not None, then it is custom 🙂.
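
The naming suggestion can be illustrated with a small stand-in class. This is only a sketch: `SketchClient`, `request_headers`, and the `DEFAULT_USER_AGENT` value are hypothetical and do not reproduce the real AsyncClient.

```python
from typing import Optional

# Hypothetical default identification string.
DEFAULT_USER_AGENT = "python-zyte-api/0.0.0"


class SketchClient:
    """Illustrative stand-in for a client; not the real AsyncClient."""

    def __init__(
        self,
        *,
        api_key: Optional[str] = None,
        user_agent: Optional[str] = None,
    ):
        self.api_key = api_key
        # None means "not customized", so fall back to the default.
        self.user_agent = (
            user_agent if user_agent is not None else DEFAULT_USER_AGENT
        )

    def request_headers(self) -> dict:
        """Headers that every request made by this client would carry."""
        return {"User-Agent": self.user_agent, "Accept-Encoding": "br"}
```

With a single `user_agent` parameter, the "custom" case needs no separate flag: any value other than `None` replaces the default.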

PyExplorer · 2023-09-21T09:30:28Z

> It is probably best to pass the user agent to AsyncClient as a parameter instead, once python-zyte-api supports that.

Aha, OK with this. @Gallaecio, how do you think we can send the user agent from zyte-crawlers? I was thinking about adding the user agent for python-zyte-api here, https://github.com/scrapy-plugins/scrapy-zyte-api/blob/main/scrapy_zyte_api/providers.py#L81, through meta (as only spiders with zyte-crawlers go through there), but it is probably better to send it directly from the zyte-crawlers repo through settings or some other way. Any ideas?

PyExplorer · 2023-09-22T16:40:31Z

@Gallaecio could you take a look?
Related PR in python-zyte-api: zytedata/python-zyte-api#50

PyExplorer · 2023-09-25T20:12:51Z

@Gallaecio @kmike could you take a look?

kmike · 2023-09-26T07:54:42Z

scrapy_zyte_api/utils.py

@@ -1,6 +1,10 @@
+from importlib.metadata import version


I'm not sure we can use it, though: https://docs.python.org/3/library/importlib.metadata.html says it's 3.8+, while scrapy-zyte-api declares Python 3.7 support.

Gallaecio · 2023-09-26

> scrapy-zyte-api declares Python 3.7 support

That has an easy fix :)
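
If Python 3.7 had to stay supported, one common pattern is to fall back to the third-party `importlib_metadata` backport. The sketch below is an assumption about how that could be done, not what this PR settled on (the later commits fetch the version from a `__version__.py` module instead); `installed_version` is a hypothetical helper name.

```python
from typing import Optional

try:
    # Part of the standard library on Python 3.8+.
    from importlib.metadata import PackageNotFoundError, version
except ImportError:
    # Python 3.7: requires the third-party importlib_metadata backport.
    from importlib_metadata import PackageNotFoundError, version


def installed_version(package: str, default: Optional[str] = None) -> Optional[str]:
    """Return the installed version of `package`, or `default` if not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return default
```

The backport would have to be declared as a conditional dependency for Python 3.7, which is one reason a plain `__version__.py` file can be the simpler choice.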

PyExplorer · 2023-09-27T06:34:38Z

@kmike @Gallaecio @wRAR @BurnzZ could you take a look at the PR with the latest changes from the discussion about the version?

codecov · 2023-09-27T06:59:11Z

Codecov Report

Merging #130 (502a8f8) into main (207fed4) will increase coverage by 0.00%.
Report is 2 commits behind head on main.
The diff coverage is 100.00%.

❗ Current head 502a8f8 differs from pull request most recent head 7b72bfe. Consider uploading reports for the commit 7b72bfe to get more accurate results

@@           Coverage Diff           @@
##             main     #130   +/-   ##
=======================================
  Coverage   98.81%   98.82%           
=======================================
  Files           9       10    +1     
  Lines         673      678    +5     
=======================================
+ Hits          665      670    +5     
  Misses          8        8

Files                             Coverage Δ
scrapy_zyte_api/__version__.py    100.00% <100.00%> (ø)
scrapy_zyte_api/handler.py        97.95% <100.00%> (+0.01%) ⬆️
scrapy_zyte_api/utils.py          100.00% <100.00%> (ø)

scrapy_zyte_api/__version__.py

Gallaecio · 2023-09-27T07:03:55Z

I think it was OK to leave setup.py the way it was and have bump2version configuration edit an extra file instead of a different file, but I am OK with the new approach as well.

The only thing left is Python 3.7 support, I think.
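
For context, a frequently used way to make a `__version__.py` file the single source of truth is to read it without importing the package, so that a packaging script and the user-agent code can share it. The sketch below is an assumption about that general pattern, not this repository's setup.py; `read_version` and the temporary file are illustrative only.

```python
import tempfile
from pathlib import Path


def read_version(version_file: Path) -> str:
    """Read `__version__` from a file without importing the package."""
    namespace: dict = {}
    # Execute the small version module in an isolated namespace.
    exec(version_file.read_text(encoding="utf-8"), namespace)
    return namespace["__version__"]


# Demonstration with a temporary file standing in for a package's __version__.py.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "__version__.py"
    path.write_text('__version__ = "1.2.3"\n', encoding="utf-8")
    demo_version = read_version(path)
```

With this layout, a version-bumping tool only needs to rewrite the one file that holds `__version__`.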

wRAR · 2023-09-27T07:34:43Z

pinned-provider also failed, with "cannot import name 'USER_AGENT' from 'zyte_api.utils'"

PyExplorer · 2023-09-27T09:22:53Z

> I think it was OK to leave setup.py the way it was and have bump2version configuration edit an extra file instead of a different file, but I am OK with the new approach as well.
>
> The only thing left is Python 3.7 support, I think.

Oh, actually, I did it the same way as in python-zyte-api, just for consistency.

PyExplorer · 2023-09-27T10:04:06Z

> pinned-provider also failed, with "cannot import name 'USER_AGENT' from 'zyte_api.utils'"

@wRAR what do you think would be the best option to fix it?

1. Set "zyte-api>=0.4.7" in setup.py.
2. Fix the test by importing and using only the version from python-zyte-api, like:

    from zyte_api.utils import __version__ as zyte_api_version
    ...
        "user_agent,expected",
        (
            (
                None,
                f"{USER_AGENT} python-zyte-api/{zyte_api_version}",
            ),
            (
                "zyte-crawlers/0.0.1",
                "zyte-crawlers/0.0.1",
            ),
        ),

send new user_agent with api_params

5cebc2c

Gallaecio reviewed Sep 21, 2023

PyExplorer added 7 commits September 22, 2023 15:48

add _user_agent() to utils

75e1114

remove _user_agent() from handler

211a1b5

add user_agent to AsyncClient

31267dc

extract custom ua from settings and send to client

610477a

move package name to function

1b5b14f

add test for _user_agent()

7340a14

formatting

273cc79

PyExplorer requested a review from Gallaecio September 22, 2023 16:40

PyExplorer added 6 commits September 25, 2023 09:30

remove old test for user_agent

f61a714

add test for user_agent (for _build_client)

18f4269

se USER_AGENT as constant

d21bea8

send user_agent to client

b712e6a

change order for user agent in test/formatting

a9a7925

change order and delimeter for user agent in client

e4a6362

kmike reviewed Sep 26, 2023

scrapy_zyte_api/handler.py (outdated)

kmike reviewed Sep 26, 2023

PyExplorer added 7 commits September 27, 2023 08:55

rename _USER_AGENT

1a945cb

_ZYTE_API_USER_AGENT rewrites any other user-agent

2ba5241

add __version__.py

b941f11

fetching version from __version__.py

dd9bddc

formatting

a1c51aa

formatting

d13670c

fix changing version in bump

327e89f

PyExplorer added 3 commits September 27, 2023 09:22

formatting

a928a26

set version to USER_AGENT from __version__.py

43f50ea

Merge branch 'main' into user-agent-for-scrapy-zyte-api

2f03c47

Gallaecio reviewed Sep 27, 2023

scrapy_zyte_api/__version__.py (outdated)

Set new version

32c4b97

Co-authored-by: Adrián Chaves

kmike reviewed Sep 27, 2023

scrapy_zyte_api/utils.py (outdated)

kmike reviewed Sep 27, 2023

scrapy_zyte_api/handler.py (outdated)

PyExplorer added 4 commits September 27, 2023 13:48

fix test to use full user_agent

5e997dd

construct user-agent at once

d3fad24

using USER_AGENT as default user agent

c779662

bump zyte-api version to the latest with user-agent

7b72bfe

Gallaecio approved these changes Sep 27, 2023

wRAR reviewed Sep 27, 2023

.bumpversion.cfg

wRAR approved these changes Sep 27, 2023

PyExplorer requested a review from kmike September 27, 2023 14:17

BurnzZ approved these changes Sep 28, 2023

kmike approved these changes Sep 29, 2023

kmike merged commit d50b14d into scrapy-plugins:main Sep 29, 2023
15 checks passed
