Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Identify scrapy-zyte-api usage via custom user-agent #130

Merged
merged 29 commits into from
Sep 29, 2023

Conversation

PyExplorer
Copy link
Collaborator

Goals:

  • It should also allow to change the user agent, e.g. for zyte-crawlers use case (not implemented yet).
  • In the user agent we should keep the information about python-zyte-api version used (possible with the fix).

This is one of the options to set custom user-agent for scrapy-zyte-api and keep for python-zyte-api. But it requires changes in client.py python-zyte-api https://github.com/zytedata/python-zyte-api/blob/main/zyte_api/aio/client.py#L60 like:

    async def request_raw(self, query: dict, *,
                          endpoint: str = 'extract',
                          session=None,
                          handle_retries=True,
                          retrying: Optional[AsyncRetrying] = None,
                          ):
        retrying = retrying or self.retrying
        post = _post_func(session)
        auth = aiohttp.BasicAuth(self.api_key)
        headers = {
            'User-Agent': user_agent(aiohttp, extra_user_agent=query.pop("user-agent", None)),
            'Accept-Encoding': 'br'
        }
        response_stats = []
        start_global = time.perf_counter()

Some other options to provide the way to send user-agent from zyte-crawlers:
1.Use settings with name of package and set them during creating client (requires changes in AsyncClient)
2. Use cb_kwargs or meta - but looks like they are cleaned during the request process

It's good to discuss all above.

Copy link
Contributor

@Gallaecio Gallaecio left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is probably best to pass the user agent to AsyncClient as a parameter instead, once python-zyte-api supports that.

@PyExplorer
Copy link
Collaborator Author

PyExplorer commented Sep 21, 2023

It is probably best to pass the user agent to AsyncClient as a parameter instead, once python-zyte-api supports that.

@Gallaecio is it ok to change AsyncClient this way?

    def __init__(self, *,
                 api_key=None,
                 api_url=API_URL,
                 n_conn=15,
                 retrying: Optional[AsyncRetrying] = None,
                 custom_user_agent=None
                 ):

@Gallaecio
Copy link
Contributor

Gallaecio commented Sep 21, 2023

Sorry, accidentally edited your comment instead of answering 🤦

Answer: I think so, yes. Maybe even call it just user_agent, if the value is not None then it is custom 🙂 .

@PyExplorer
Copy link
Collaborator Author

It is probably best to pass the user agent to AsyncClient as a parameter instead, once python-zyte-api supports that.

aha, ok, with this. @Gallaecio How do you think we can send user-agent from zyte-crawlers? I was thinking about adding user-agent for python-zyte-api here https://github.com/scrapy-plugins/scrapy-zyte-api/blob/main/scrapy_zyte_api/providers.py#L81 with meta (as only spider with zyte-crawlers go here), but probably it's better to send this right from zyte-crawlers repo with settings or else - any ideas?

@PyExplorer
Copy link
Collaborator Author

@Gallaecio could you take a look?
Related PR in python-zyte-api
zytedata/python-zyte-api#50

@PyExplorer
Copy link
Collaborator Author

@Gallaecio @kmike could you take a look?

scrapy_zyte_api/handler.py Outdated Show resolved Hide resolved
@@ -1,6 +1,10 @@
from importlib.metadata import version
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TIL

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure we can use it though, https://docs.python.org/3/library/importlib.metadata.html says it's 3.8+, while scrapy-zyte-api declares Python 3.7 support.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

scrapy-zyte-api declares Python 3.7 support

That has an easy fix :)

@PyExplorer
Copy link
Collaborator Author

@kmike @Gallaecio @wRAR @BurnzZ could you take a look at the PR with last changes from the discussion about version?

@codecov
Copy link

codecov bot commented Sep 27, 2023

Codecov Report

Merging #130 (502a8f8) into main (207fed4) will increase coverage by 0.00%.
Report is 2 commits behind head on main.
The diff coverage is 100.00%.

❗ Current head 502a8f8 differs from pull request most recent head 7b72bfe. Consider uploading reports for the commit 7b72bfe to get more accurate results

@@           Coverage Diff           @@
##             main     #130   +/-   ##
=======================================
  Coverage   98.81%   98.82%           
=======================================
  Files           9       10    +1     
  Lines         673      678    +5     
=======================================
+ Hits          665      670    +5     
  Misses          8        8           
Files Coverage Δ
scrapy_zyte_api/__version__.py 100.00% <100.00%> (ø)
scrapy_zyte_api/handler.py 97.95% <100.00%> (+0.01%) ⬆️
scrapy_zyte_api/utils.py 100.00% <100.00%> (ø)

@Gallaecio
Copy link
Contributor

I think it was OK to leave setup.py the way it was and have bump2version configuration edit an extra file instead of a different file, but I am OK with the new approach as well.

The only thing left is Python 3.7 support, I think.

@wRAR
Copy link
Contributor

wRAR commented Sep 27, 2023

pinned-provider also failed, with "cannot import name 'USER_AGENT' from 'zyte_api.utils'"

Co-authored-by: Adrián Chaves <[email protected]>
@PyExplorer
Copy link
Collaborator Author

I think it was OK to leave setup.py the way it was and have bump2version configuration edit an extra file instead of a different file, but I am OK with the new approach as well.

The only thing left is Python 3.7 support, I think.

Oh, actually, I've made it as in python-zyte-api just for consistency

@PyExplorer
Copy link
Collaborator Author

pinned-provider also failed, with "cannot import name 'USER_AGENT' from 'zyte_api.utils'"

@wRAR what do you think will be the best option to fix it:

  • set "zyte-api>=0.4.7" in setup.py
  • fix test by importing and using only version from python-zyte-api like
from zyte_api.utils import __version__ as zyte_api_version
...
   "user_agent,expected",
    (
        (
            None,
            f"{USER_AGENT} python-zyte-api/{zyte_api_version}",
        ),
        (
            "zyte-crawlers/0.0.1",
            "zyte-crawlers/0.0.1",
        ),
    ),

scrapy_zyte_api/utils.py Outdated Show resolved Hide resolved
scrapy_zyte_api/handler.py Outdated Show resolved Hide resolved
.bumpversion.cfg Show resolved Hide resolved
@kmike kmike merged commit d50b14d into scrapy-plugins:main Sep 29, 2023
15 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants