A proxy created with param proxy_urls crashes PlaywrightCrawler. #887

Open
matecsaj opened this issue Jan 9, 2025 · 3 comments · May be fixed by #889
matecsaj commented Jan 9, 2025

Test program.

import asyncio

from crawlee.crawlers import PlaywrightCrawler, PlaywrightCrawlingContext
from crawlee.proxy_configuration import ProxyConfiguration

# If these go out of service then replace them with your own.
proxies = ['http://178.48.68.61:18080', 'http://198.245.60.202:3128', 'http://15.204.240.177:3128',]

proxy_configuration_fails = ProxyConfiguration(proxy_urls=proxies)

proxy_configuration_succeeds = ProxyConfiguration(
        tiered_proxy_urls=[
            # No proxy tier. (Not needed, but optional in case you do not want to use any proxy on lowest tier.)
            [None],
            # lower tier, cheaper, preferred as long as they work
            proxies,
            # higher tier, more expensive, used as a fallback
        ]
    )

async def main() -> None:
    crawler = PlaywrightCrawler(
        max_requests_per_crawl=5,  # Limit the crawl to 5 requests.
        headless=False,  # Show the browser window.
        browser_type='firefox',  # Use the Firefox browser.
        proxy_configuration=proxy_configuration_fails,
        # proxy_configuration=proxy_configuration_succeeds,
    )

    # Define the default request handler, which will be called for every request.
    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url} ...')

        # Enqueue all links found on the page.
        await context.enqueue_links()

        # Extract data from the page using Playwright API.
        data = {
            'url': context.request.url,
            'title': await context.page.title(),
            'content': (await context.page.content())[:100],
        }

        # Push the extracted data to the default dataset.
        await context.push_data(data)

    # Run the crawler with the initial list of URLs.
    await crawler.run(['https://crawlee.dev'])

    # Export the entire dataset to a JSON file.
    await crawler.export_data('results.json')

    # Or work with the data directly.
    data = await crawler.get_data()
    crawler.log.info(f'Extracted data: {data.items}')

if __name__ == '__main__':
    asyncio.run(main())

Terminal output.

/Users/matecsaj/PycharmProjects/wat-crawlee/venv/bin/python /Users/matecsaj/Library/Application Support/JetBrains/PyCharm2024.3/scratches/scratch_4.py 
[crawlee._autoscaling.snapshotter] INFO  Setting max_memory_size of this run to 8.00 GB.
[crawlee.crawlers._playwright._playwright_crawler] INFO  Current request statistics:
┌───────────────────────────────┬──────────┐
│ requests_finished             │ 0        │
│ requests_failed               │ 0        │
│ retry_histogram               │ [0]      │
│ request_avg_failed_duration   │ None     │
│ request_avg_finished_duration │ None     │
│ requests_finished_per_minute  │ 0        │
│ requests_failed_per_minute    │ 0        │
│ request_total_duration        │ 0.0      │
│ requests_total                │ 0        │
│ crawler_runtime               │ 0.038974 │
└───────────────────────────────┴──────────┘
[crawlee._autoscaling.autoscaled_pool] INFO  current_concurrency = 0; desired_concurrency = 2; cpu = 0.0; mem = 0.0; event_loop = 0.0; client_info = 0.0
[crawlee.crawlers._playwright._playwright_crawler] ERROR Request failed and reached maximum retries
      Traceback (most recent call last):
        File "/Users/matecsaj/PycharmProjects/wat-crawlee/venv/lib/python3.13/site-packages/crawlee/crawlers/_basic/_context_pipeline.py", line 65, in __call__
          result = await middleware_instance.__anext__()
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "/Users/matecsaj/PycharmProjects/wat-crawlee/venv/lib/python3.13/site-packages/crawlee/crawlers/_playwright/_playwright_crawler.py", line 138, in _open_page
          crawlee_page = await self._browser_pool.new_page(proxy_info=context.proxy_info)
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "/Users/matecsaj/PycharmProjects/wat-crawlee/venv/lib/python3.13/site-packages/crawlee/_utils/context.py", line 38, in async_wrapper
          return await method(self, *args, **kwargs)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "/Users/matecsaj/PycharmProjects/wat-crawlee/venv/lib/python3.13/site-packages/crawlee/browsers/_browser_pool.py", line 241, in new_page
          return await self._get_new_page(page_id, plugin, proxy_info)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "/Users/matecsaj/PycharmProjects/wat-crawlee/venv/lib/python3.13/site-packages/crawlee/browsers/_browser_pool.py", line 270, in _get_new_page
          page = await asyncio.wait_for(
                 ^^^^^^^^^^^^^^^^^^^^^^^
          ...<5 lines>...
          )
          ^
        File "/Library/Frameworks/Python.framework/Versions/3.13/lib/python3.13/asyncio/tasks.py", line 507, in wait_for
          return await fut
                 ^^^^^^^^^
        File "/Users/matecsaj/PycharmProjects/wat-crawlee/venv/lib/python3.13/site-packages/crawlee/browsers/_playwright_browser_controller.py", line 119, in new_page
          self._browser_context = await self._create_browser_context(browser_new_context_options, proxy_info)
                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "/Users/matecsaj/PycharmProjects/wat-crawlee/venv/lib/python3.13/site-packages/crawlee/browsers/_playwright_browser_controller.py", line 174, in _create_browser_context
          if browser_new_context_options['proxy']:
             ~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^
      KeyError: 'proxy'
[crawlee._autoscaling.autoscaled_pool] INFO  Waiting for remaining tasks to finish
[crawlee.crawlers._playwright._playwright_crawler] INFO  Error analysis: total_errors=3 unique_errors=1
[crawlee.crawlers._playwright._playwright_crawler] INFO  Final request statistics:
┌───────────────────────────────┬───────────┐
│ requests_finished             │ 0         │
│ requests_failed               │ 1         │
│ retry_histogram               │ [0, 0, 1] │
│ request_avg_failed_duration   │ 0.025703  │
│ request_avg_finished_duration │ None      │
│ requests_finished_per_minute  │ 0         │
│ requests_failed_per_minute    │ 14        │
│ request_total_duration        │ 0.025703  │
│ requests_total                │ 1         │
│ crawler_runtime               │ 4.189647  │
└───────────────────────────────┴───────────┘
[crawlee.storages._dataset] WARN  Attempting to export an empty dataset - no file will be created
[crawlee.crawlers._playwright._playwright_crawler] INFO  Extracted data: []

Process finished with exit code 0
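A note on the traceback: the immediate cause is the direct indexing browser_new_context_options['proxy'] in _create_browser_context, which raises KeyError whenever no 'proxy' key was set beforehand. Below is a minimal sketch of the kind of guard that avoids this, assuming nothing else about the actual patch (#889 is linked above as the candidate fix):

# Sketch only, not the merged fix: a .get() lookup returns None instead of
# raising, so the check works even when no 'proxy' key was ever set.
browser_new_context_options: dict = {}  # no 'proxy' key, as in the failing run
if browser_new_context_options.get('proxy'):  # the ['proxy'] access here is what raised KeyError
    print("a proxy was already configured via browser_new_context_options")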
github-actions bot added the t-tooling label Jan 9, 2025
Mantisus self-assigned this Jan 9, 2025
vdusek added the bug label Jan 9, 2025
vdusek added this to the 105th sprint - Tooling team milestone Jan 9, 2025
matecsaj (Contributor, Author) commented
I retested with version 0.5.1 of Crawlee, and the issue with proxy_urls persists.

While testing the tiered_proxy_urls parameter, I noticed a new behavior: even though I supplied a proxy, none was used, and the output showed proxy info as 'None.' I'm unsure if this is related to the same underlying issue or a separate problem.

I hope a new release with functioning proxy support is on the horizon, as my project is currently blocked without it.

Updated test code:

import asyncio

from crawlee.crawlers import PlaywrightCrawler, PlaywrightCrawlingContext
from crawlee.proxy_configuration import ProxyConfiguration

# If these go out of service then replace them with your own.
proxies = [
    'http://20.121.139.25:3128',
    'http://67.43.227.230:30693',
    'http://175.158.57.136:7788',
    'http://112.198.178.35:8080',
]

proxy_configuration_fails = ProxyConfiguration(proxy_urls=proxies)

proxy_configuration_succeeds = ProxyConfiguration(
        tiered_proxy_urls=[
            # No proxy tier. (Not needed, but optional in case you do not want to use any proxy on lowest tier.)
            [None],
            # lower tier, cheaper, preferred as long as they work
            proxies,
            # higher tier, more expensive, used as a fallback
        ]
    )

async def main() -> None:
    crawler = PlaywrightCrawler(
        max_requests_per_crawl=5,  # Limit the crawl to 5 requests.
        headless=True,  # Run without showing the browser window.
        browser_type='firefox',  # Use the Firefox browser.
        proxy_configuration=proxy_configuration_succeeds,
    )

    # Define the default request handler, which will be called for every request.
    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url} with proxy {context.proxy_info}')

        # Enqueue all links found on the page.
        await context.enqueue_links()


    # Run the crawler with the initial list of URLs.
    await crawler.run(['https://crawlee.dev'])

if __name__ == '__main__':
    asyncio.run(main())

Updated output:

/Users/matecsaj/PycharmProjects/wat-crawlee/venv/bin/python /Users/matecsaj/Library/Application Support/JetBrains/PyCharm2024.3/scratches/scratch_11.py 
[crawlee._autoscaling.snapshotter] INFO  Setting max_memory_size of this run to 8.00 GB.
[crawlee.crawlers._playwright._playwright_crawler] INFO  Current request statistics:
┌───────────────────────────────┬──────────┐
│ requests_finished             │ 0        │
│ requests_failed               │ 0        │
│ retry_histogram               │ [0]      │
│ request_avg_failed_duration   │ None     │
│ request_avg_finished_duration │ None     │
│ requests_finished_per_minute  │ 0        │
│ requests_failed_per_minute    │ 0        │
│ request_total_duration        │ 0.0      │
│ requests_total                │ 0        │
│ crawler_runtime               │ 0.040465 │
└───────────────────────────────┴──────────┘
[crawlee._autoscaling.autoscaled_pool] INFO  current_concurrency = 0; desired_concurrency = 2; cpu = 0.0; mem = 0.0; event_loop = 0.0; client_info = 0.0
[crawlee.crawlers._playwright._playwright_crawler] INFO  Processing https://crawlee.dev with proxy None
[crawlee.crawlers._playwright._playwright_crawler] INFO  Processing https://crawlee.dev/python/ with proxy None
[crawlee.crawlers._playwright._playwright_crawler] INFO  Processing https://crawlee.dev/docs/quick-start with proxy None
[crawlee.crawlers._playwright._playwright_crawler] INFO  Processing https://crawlee.dev/docs/examples with proxy None
[crawlee.crawlers._playwright._playwright_crawler] INFO  Processing https://crawlee.dev/api/core with proxy None
[crawlee.crawlers._playwright._playwright_crawler] INFO  Processing https://crawlee.dev/api/core/changelog with proxy None
[crawlee.crawlers._playwright._playwright_crawler] INFO  Processing https://crawlee.dev/blog with proxy None
[crawlee.crawlers._playwright._playwright_crawler] INFO  Pausing... Press CTRL+C again to force exit.
[crawlee._autoscaling.autoscaled_pool] INFO  Waiting for remaining tasks to finish

Mantisus (Collaborator) commented
Hi.

The PR with the required fix has not been merged yet. Until the necessary release, you can make a custom PlaywrightBrowserPlugin.

It would look similar to the following:

import asyncio
from logging import getLogger

from playwright.async_api import ProxySettings

from crawlee.browsers import BrowserPool, PlaywrightBrowserController, PlaywrightBrowserPlugin
from crawlee.crawlers import PlaywrightCrawler, PlaywrightCrawlingContext
from crawlee.proxy_configuration import ProxyConfiguration

logger = getLogger(__name__)

class CustomPlaywrightBrowserController(PlaywrightBrowserController):
    async def _create_browser_context(self, browser_new_context_options = None, proxy_info = None):
        if self._header_generator:
            common_headers = self._header_generator.get_common_headers()
            sec_ch_ua_headers = self._header_generator.get_sec_ch_ua_headers(browser_type=self.browser_type)
            user_agent_header = self._header_generator.get_user_agent_header(browser_type=self.browser_type)
            extra_http_headers = dict(common_headers | sec_ch_ua_headers | user_agent_header)
        else:
            extra_http_headers = None

        browser_new_context_options = dict(browser_new_context_options) if browser_new_context_options else {}
        browser_new_context_options['extra_http_headers'] = browser_new_context_options.get(
            'extra_http_headers', extra_http_headers
        )

        if proxy_info:
            if browser_new_context_options.get('proxy'):
                logger.warning("browser_new_context_options['proxy'] overriden by explicit `proxy_info` argument.")

            browser_new_context_options['proxy'] = ProxySettings(
                server=f'{proxy_info.scheme}://{proxy_info.hostname}:{proxy_info.port}',
                username=proxy_info.username,
                password=proxy_info.password,
            )

        return await self._browser.new_context(**browser_new_context_options)


class CustomPlaywrightBrowserPlugin(PlaywrightBrowserPlugin):
    async def new_browser(self) -> CustomPlaywrightBrowserController:
        if not self._playwright:
            raise RuntimeError('Playwright browser plugin is not initialized.')

        if self._browser_type == 'chromium':
            browser = await self._playwright.chromium.launch(**self._browser_launch_options)
        elif self._browser_type == 'firefox':
            browser = await self._playwright.firefox.launch(**self._browser_launch_options)
        elif self._browser_type == 'webkit':
            browser = await self._playwright.webkit.launch(**self._browser_launch_options)
        else:
            raise ValueError(f'Invalid browser type: {self._browser_type}')

        return CustomPlaywrightBrowserController(
            browser,
            max_open_pages_per_browser=self._max_open_pages_per_browser,
        )


async def main() -> None:
    proxies = []  # Replace with your own proxy URLs.
    proxy_configurations = ProxyConfiguration(proxy_urls=proxies)
    crawler = PlaywrightCrawler(
        browser_pool=BrowserPool(plugins=[CustomPlaywrightBrowserPlugin()]),
        max_requests_per_crawl=5,
        proxy_configuration=proxy_configurations,
    )
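    # --- Assumed completion (not part of the original comment) ---
    # main() above stops after constructing the crawler; a request handler and
    # a crawler.run(...) call, mirroring the reproductions earlier in the
    # thread, are still needed to actually exercise the workaround.
    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url} with proxy {context.proxy_info}')

    await crawler.run(['https://crawlee.dev'])


if __name__ == '__main__':
    asyncio.run(main())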

> While testing the tiered_proxy_urls parameter, I noticed a new behavior: even though I supplied a proxy, none was used, and the output showed proxy info as 'None.' I'm unsure if this is related to the same underlying issue or a separate problem.

If you don't get blocked by the server, then this behavior is fully compliant with the documentation: with tiered proxies, the crawler starts at the lowest tier ([None], i.e. no proxy) and only escalates to higher tiers when requests fail.
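If the goal is to confirm that a proxy is actually applied from the first request, one option (an editorial suggestion, not something stated in the thread) is to drop the [None] tier so that the lowest tier already consists of proxies:

# Sketch: without a [None] tier, the lowest tier is the proxy list itself,
# so the crawler should pick one of these proxies before any block occurs.
# Assumes the `proxies` list defined in the earlier snippets.
proxy_configuration = ProxyConfiguration(
    tiered_proxy_urls=[
        proxies,  # lowest (and only) tier
    ],
)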

matecsaj (Contributor, Author) commented
Something else requires my attention at the moment. When I return to this, if the current version of Crawlee is still not functioning as expected, I’ll try the workaround kindly provided by @Mantisus.
