A proxy created with param proxy_urls crashes PlaywrightCrawler. #887

Open
matecsaj opened this issue Jan 9, 2025 · 3 comments · May be fixed by #889
matecsaj commented Jan 9, 2025

Test program.

import asyncio

from crawlee.crawlers import PlaywrightCrawler, PlaywrightCrawlingContext
from crawlee.proxy_configuration import ProxyConfiguration

# If these go out of service then replace them with your own.
proxies = ['http://178.48.68.61:18080', 'http://198.245.60.202:3128', 'http://15.204.240.177:3128',]

proxy_configuration_fails = ProxyConfiguration(proxy_urls=proxies)

proxy_configuration_succeeds = ProxyConfiguration(
        tiered_proxy_urls=[
            # No proxy tier. (Not needed, but optional in case you do not want to use any proxy on lowest tier.)
            [None],
            # lower tier, cheaper, preferred as long as they work
            proxies,
            # higher tier, more expensive, used as a fallback
        ]
    )

async def main() -> None:
    crawler = PlaywrightCrawler(
        max_requests_per_crawl=5,  # Limit the crawl to 5 requests.
        headless=False,  # Show the browser window.
        browser_type='firefox',  # Use the Firefox browser.
        proxy_configuration=proxy_configuration_fails,
        # proxy_configuration=proxy_configuration_succeeds,
    )

    # Define the default request handler, which will be called for every request.
    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url} ...')

        # Enqueue all links found on the page.
        await context.enqueue_links()

        # Extract data from the page using Playwright API.
        data = {
            'url': context.request.url,
            'title': await context.page.title(),
            'content': (await context.page.content())[:100],
        }

        # Push the extracted data to the default dataset.
        await context.push_data(data)

    # Run the crawler with the initial list of URLs.
    await crawler.run(['https://crawlee.dev'])

    # Export the entire dataset to a JSON file.
    await crawler.export_data('results.json')

    # Or work with the data directly.
    data = await crawler.get_data()
    crawler.log.info(f'Extracted data: {data.items}')

if __name__ == '__main__':
    asyncio.run(main())

Terminal output.

/Users/matecsaj/PycharmProjects/wat-crawlee/venv/bin/python /Users/matecsaj/Library/Application Support/JetBrains/PyCharm2024.3/scratches/scratch_4.py 
[crawlee._autoscaling.snapshotter] INFO  Setting max_memory_size of this run to 8.00 GB.
[crawlee.crawlers._playwright._playwright_crawler] INFO  Current request statistics:
┌───────────────────────────────┬──────────┐
│ requests_finished             │ 0        │
│ requests_failed               │ 0        │
│ retry_histogram               │ [0]      │
│ request_avg_failed_duration   │ None     │
│ request_avg_finished_duration │ None     │
│ requests_finished_per_minute  │ 0        │
│ requests_failed_per_minute    │ 0        │
│ request_total_duration        │ 0.0      │
│ requests_total                │ 0        │
│ crawler_runtime               │ 0.038974 │
└───────────────────────────────┴──────────┘
[crawlee._autoscaling.autoscaled_pool] INFO  current_concurrency = 0; desired_concurrency = 2; cpu = 0.0; mem = 0.0; event_loop = 0.0; client_info = 0.0
[crawlee.crawlers._playwright._playwright_crawler] ERROR Request failed and reached maximum retries
      Traceback (most recent call last):
        File "/Users/matecsaj/PycharmProjects/wat-crawlee/venv/lib/python3.13/site-packages/crawlee/crawlers/_basic/_context_pipeline.py", line 65, in __call__
          result = await middleware_instance.__anext__()
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "/Users/matecsaj/PycharmProjects/wat-crawlee/venv/lib/python3.13/site-packages/crawlee/crawlers/_playwright/_playwright_crawler.py", line 138, in _open_page
          crawlee_page = await self._browser_pool.new_page(proxy_info=context.proxy_info)
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "/Users/matecsaj/PycharmProjects/wat-crawlee/venv/lib/python3.13/site-packages/crawlee/_utils/context.py", line 38, in async_wrapper
          return await method(self, *args, **kwargs)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "/Users/matecsaj/PycharmProjects/wat-crawlee/venv/lib/python3.13/site-packages/crawlee/browsers/_browser_pool.py", line 241, in new_page
          return await self._get_new_page(page_id, plugin, proxy_info)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "/Users/matecsaj/PycharmProjects/wat-crawlee/venv/lib/python3.13/site-packages/crawlee/browsers/_browser_pool.py", line 270, in _get_new_page
          page = await asyncio.wait_for(
                 ^^^^^^^^^^^^^^^^^^^^^^^
          ...<5 lines>...
          )
          ^
        File "/Library/Frameworks/Python.framework/Versions/3.13/lib/python3.13/asyncio/tasks.py", line 507, in wait_for
          return await fut
                 ^^^^^^^^^
        File "/Users/matecsaj/PycharmProjects/wat-crawlee/venv/lib/python3.13/site-packages/crawlee/browsers/_playwright_browser_controller.py", line 119, in new_page
          self._browser_context = await self._create_browser_context(browser_new_context_options, proxy_info)
                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "/Users/matecsaj/PycharmProjects/wat-crawlee/venv/lib/python3.13/site-packages/crawlee/browsers/_playwright_browser_controller.py", line 174, in _create_browser_context
          if browser_new_context_options['proxy']:
             ~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^
      KeyError: 'proxy'
[crawlee._autoscaling.autoscaled_pool] INFO  Waiting for remaining tasks to finish
[crawlee.crawlers._playwright._playwright_crawler] INFO  Error analysis: total_errors=3 unique_errors=1
[crawlee.crawlers._playwright._playwright_crawler] INFO  Final request statistics:
┌───────────────────────────────┬───────────┐
│ requests_finished             │ 0         │
│ requests_failed               │ 1         │
│ retry_histogram               │ [0, 0, 1] │
│ request_avg_failed_duration   │ 0.025703  │
│ request_avg_finished_duration │ None      │
│ requests_finished_per_minute  │ 0         │
│ requests_failed_per_minute    │ 14        │
│ request_total_duration        │ 0.025703  │
│ requests_total                │ 1         │
│ crawler_runtime               │ 4.189647  │
└───────────────────────────────┴───────────┘
[crawlee.storages._dataset] WARN  Attempting to export an empty dataset - no file will be created
[crawlee.crawlers._playwright._playwright_crawler] INFO  Extracted data: []

Process finished with exit code 0
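A note on the traceback: the immediate cause is the direct indexing browser_new_context_options['proxy'] in _create_browser_context, which raises KeyError whenever no 'proxy' key was set beforehand. Below is a minimal sketch of the kind of guard that avoids this, assuming nothing else about the actual patch (#889 is linked above as the candidate fix):

# Sketch only, not the merged fix: a .get() lookup returns None instead of
# raising, so the check works even when no 'proxy' key was ever set.
browser_new_context_options: dict = {}  # no 'proxy' key, as in the failing run
if browser_new_context_options.get('proxy'):  # the ['proxy'] access here is what raised KeyError
    print("a proxy was already configured via browser_new_context_options")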
github-actions bot added the t-tooling label Jan 9, 2025
Mantisus self-assigned this Jan 9, 2025
vdusek added the bug label Jan 9, 2025
vdusek added this to the 105th sprint - Tooling team milestone Jan 9, 2025
matecsaj (Contributor, Author) commented
I retested with version 0.5.1 of Crawlee, and the issue with proxy_urls persists.

While testing the tiered_proxy_urls parameter, I noticed a new behavior: even though I supplied a proxy, none was used, and the output showed proxy info as 'None.' I'm unsure if this is related to the same underlying issue or a separate problem.

I hope a new release with functioning proxy support is on the horizon, as my project is currently blocked without it.

Updated test code:

import asyncio

from crawlee.crawlers import PlaywrightCrawler, PlaywrightCrawlingContext
from crawlee.proxy_configuration import ProxyConfiguration

# If these go out of service then replace them with your own.
proxies = [
    'http://20.121.139.25:3128',
    'http://67.43.227.230:30693',
    'http://175.158.57.136:7788',
    'http://112.198.178.35:8080',
]

proxy_configuration_fails = ProxyConfiguration(proxy_urls=proxies)

proxy_configuration_succeeds = ProxyConfiguration(
        tiered_proxy_urls=[
            # No proxy tier. (Not needed, but optional in case you do not want to use any proxy on lowest tier.)
            [None],
            # lower tier, cheaper, preferred as long as they work
            proxies,
            # higher tier, more expensive, used as a fallback
        ]
    )

async def main() -> None:
    crawler = PlaywrightCrawler(
        max_requests_per_crawl=5,  # Limit the crawl to 5 requests.
        headless=True,  # Run without showing the browser window.
        browser_type='firefox',  # Use the Firefox browser.
        proxy_configuration=proxy_configuration_succeeds,
    )

    # Define the default request handler, which will be called for every request.
    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url} with proxy {context.proxy_info}')

        # Enqueue all links found on the page.
        await context.enqueue_links()


    # Run the crawler with the initial list of URLs.
    await crawler.run(['https://crawlee.dev'])

if __name__ == '__main__':
    asyncio.run(main())

Updated output:

/Users/matecsaj/PycharmProjects/wat-crawlee/venv/bin/python /Users/matecsaj/Library/Application Support/JetBrains/PyCharm2024.3/scratches/scratch_11.py 
[crawlee._autoscaling.snapshotter] INFO  Setting max_memory_size of this run to 8.00 GB.
[crawlee.crawlers._playwright._playwright_crawler] INFO  Current request statistics:
┌───────────────────────────────┬──────────┐
│ requests_finished             │ 0        │
│ requests_failed               │ 0        │
│ retry_histogram               │ [0]      │
│ request_avg_failed_duration   │ None     │
│ request_avg_finished_duration │ None     │
│ requests_finished_per_minute  │ 0        │
│ requests_failed_per_minute    │ 0        │
│ request_total_duration        │ 0.0      │
│ requests_total                │ 0        │
│ crawler_runtime               │ 0.040465 │
└───────────────────────────────┴──────────┘
[crawlee._autoscaling.autoscaled_pool] INFO  current_concurrency = 0; desired_concurrency = 2; cpu = 0.0; mem = 0.0; event_loop = 0.0; client_info = 0.0
[crawlee.crawlers._playwright._playwright_crawler] INFO  Processing https://crawlee.dev with proxy None
[crawlee.crawlers._playwright._playwright_crawler] INFO  Processing https://crawlee.dev/python/ with proxy None
[crawlee.crawlers._playwright._playwright_crawler] INFO  Processing https://crawlee.dev/docs/quick-start with proxy None
[crawlee.crawlers._playwright._playwright_crawler] INFO  Processing https://crawlee.dev/docs/examples with proxy None
[crawlee.crawlers._playwright._playwright_crawler] INFO  Processing https://crawlee.dev/api/core with proxy None
[crawlee.crawlers._playwright._playwright_crawler] INFO  Processing https://crawlee.dev/api/core/changelog with proxy None
[crawlee.crawlers._playwright._playwright_crawler] INFO  Processing https://crawlee.dev/blog with proxy None
[crawlee.crawlers._playwright._playwright_crawler] INFO  Pausing... Press CTRL+C again to force exit.
[crawlee._autoscaling.autoscaled_pool] INFO  Waiting for remaining tasks to finish

Mantisus (Collaborator) commented
Hi.

The PR with the required fix has not been merged yet. Until the necessary release, you can make a custom PlaywrightBrowserPlugin.

It would look similar to the following:

import asyncio
from logging import getLogger

from playwright.async_api import ProxySettings

from crawlee.browsers import BrowserPool, PlaywrightBrowserController, PlaywrightBrowserPlugin
from crawlee.crawlers import PlaywrightCrawler, PlaywrightCrawlingContext
from crawlee.proxy_configuration import ProxyConfiguration

logger = getLogger(__name__)

class CustomPlaywrightBrowserController(PlaywrightBrowserController):
    async def _create_browser_context(self, browser_new_context_options = None, proxy_info = None):
        if self._header_generator:
            common_headers = self._header_generator.get_common_headers()
            sec_ch_ua_headers = self._header_generator.get_sec_ch_ua_headers(browser_type=self.browser_type)
            user_agent_header = self._header_generator.get_user_agent_header(browser_type=self.browser_type)
            extra_http_headers = dict(common_headers | sec_ch_ua_headers | user_agent_header)
        else:
            extra_http_headers = None

        browser_new_context_options = dict(browser_new_context_options) if browser_new_context_options else {}
        browser_new_context_options['extra_http_headers'] = browser_new_context_options.get(
            'extra_http_headers', extra_http_headers
        )

        if proxy_info:
            if browser_new_context_options.get('proxy'):
                logger.warning("browser_new_context_options['proxy'] overriden by explicit `proxy_info` argument.")

            browser_new_context_options['proxy'] = ProxySettings(
                server=f'{proxy_info.scheme}://{proxy_info.hostname}:{proxy_info.port}',
                username=proxy_info.username,
                password=proxy_info.password,
            )

        return await self._browser.new_context(**browser_new_context_options)


class CustomPlaywrightBrowserPlugin(PlaywrightBrowserPlugin):
    async def new_browser(self) -> CustomPlaywrightBrowserController:
        if not self._playwright:
            raise RuntimeError('Playwright browser plugin is not initialized.')

        if self._browser_type == 'chromium':
            browser = await self._playwright.chromium.launch(**self._browser_launch_options)
        elif self._browser_type == 'firefox':
            browser = await self._playwright.firefox.launch(**self._browser_launch_options)
        elif self._browser_type == 'webkit':
            browser = await self._playwright.webkit.launch(**self._browser_launch_options)
        else:
            raise ValueError(f'Invalid browser type: {self._browser_type}')

        return CustomPlaywrightBrowserController(
            browser,
            max_open_pages_per_browser=self._max_open_pages_per_browser,
        )


async def main() -> None:
    proxies = []  # Replace with your own proxy URLs.
    proxy_configurations = ProxyConfiguration(proxy_urls=proxies)
    crawler = PlaywrightCrawler(
        browser_pool=BrowserPool(plugins=[CustomPlaywrightBrowserPlugin()]),
        max_requests_per_crawl=5,
        proxy_configuration=proxy_configurations,
    )
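    # --- Assumed completion (not part of the original comment) ---
    # main() above stops after constructing the crawler; a request handler and
    # a crawler.run(...) call, mirroring the reproductions earlier in the
    # thread, are still needed to actually exercise the workaround.
    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url} with proxy {context.proxy_info}')

    await crawler.run(['https://crawlee.dev'])


if __name__ == '__main__':
    asyncio.run(main())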

> While testing the tiered_proxy_urls parameter, I noticed a new behavior: even though I supplied a proxy, none was used, and the output showed proxy info as 'None.' I'm unsure if this is related to the same underlying issue or a separate problem.

If you don't get blocked by the server, then this behavior is fully compliant with the documentation: with tiered proxies, the crawler starts at the lowest tier ([None], i.e. no proxy) and only escalates to higher tiers when requests fail.
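If the goal is to confirm that a proxy is actually applied from the first request, one option (an editorial suggestion, not something stated in the thread) is to drop the [None] tier so that the lowest tier already consists of proxies:

# Sketch: without a [None] tier, the lowest tier is the proxy list itself,
# so the crawler should pick one of these proxies before any block occurs.
# Assumes the `proxies` list defined in the earlier snippets.
proxy_configuration = ProxyConfiguration(
    tiered_proxy_urls=[
        proxies,  # lowest (and only) tier
    ],
)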

matecsaj (Contributor, Author) commented
Something else requires my attention at the moment. When I return to this, if the current version of Crawlee is still not functioning as expected, I’ll try the workaround kindly provided by @Mantisus.
