The Chrome variant of PlaywrightCrawler does not follow JavaScript redirects. Firefox does. #877

Open
matecsaj opened this issue Jan 6, 2025 · 4 comments
Labels
t-tooling Issues with this label are in the ownership of the tooling team.

Comments

@matecsaj
Contributor

matecsaj commented Jan 6, 2025

Run this, then switch the browser to Firefox and run it again. Chrome does not follow the JavaScript redirect as it should. The target website is bot-sensitive; you may need to add a proxy.

import asyncio
from crawlee.crawlers import PlaywrightCrawler, PlaywrightCrawlingContext   # V0.5.0

async def main() -> None:
    crawler = PlaywrightCrawler(
        browser_type='chromium',    # fails - it does not follow the JavaScript redirect
        # browser_type='firefox',   # works
    )

    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:
        context.log.info(f'Request URL: {context.request.url}')
        context.log.info('Response URL Expected: https://pinside.com/pinball/machine/addams-family')
        context.log.info(f'Response URL Actual: {context.response.url}')
        context.log.info(f'HTML {await context.page.content()}')

    await crawler.run(['https://pinside.com/pinball/machine/2'])

if __name__ == '__main__':
    asyncio.run(main())

I have a vague memory of solving this exact problem on an old project; here is a snippet of that code.

async with async_playwright() as p:
    browser = await p.chromium.launch(
        args=["--disable-blink-features=AutomationControlled"],
    )
    context = await browser.new_context(
        viewport={"width": 1920, "height": 1080},
        user_agent=self.user_agent,   # attributes of the old project's class
        proxy=self.proxy,
    )
    page = await context.new_page()
    await page.goto(url, wait_until='networkidle')   # This might be what you need to fix the problem.
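
If that flag is what solved it back then, the equivalent can presumably be passed through the crawler. A rough, untested sketch, assuming PlaywrightCrawler forwards launch and context options via browser_launch_options / browser_new_context_options (parameter names worth double-checking for your Crawlee version):

crawler = PlaywrightCrawler(
    browser_type='chromium',
    # Assumed to be forwarded to Playwright's launch():
    browser_launch_options={'args': ['--disable-blink-features=AutomationControlled']},
    # Assumed to be forwarded to Playwright's new_context():
    browser_new_context_options={'viewport': {'width': 1920, 'height': 1080}},
)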
@github-actions github-actions bot added the t-tooling label on Jan 7, 2025
@janbuchar
Collaborator

Hi @matecsaj! Did you try this with headless=False to see what is actually going on? Or investigate what kind of redirect mechanism they use?

Asking just in case, otherwise we can do that as the next step when investigating this 🙂
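
For reference, running headed should just be a constructor flag (an untested sketch; I believe the crawler forwards it to Playwright's launch, but double-check the parameter name):

crawler = PlaywrightCrawler(
    browser_type='chromium',
    headless=False,  # open a visible browser window so you can watch the redirect happen
)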

@matecsaj
Contributor Author

matecsaj commented Jan 7, 2025

Yes, I did.

When not running headless, something flashes on the screen. It was hard to read because it only appeared for a moment, so I used context.log.info(f'HTML {await context.page.content()}') to get a long look at the page. From the HTML dump, I determined that JavaScript likely performs a redirect.
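
In case it helps, this is roughly the kind of scan I did over the dump (the marker list is my guess at common client-side redirect mechanisms, not exhaustive):

# Rough scan of the dumped HTML for common client-side redirect mechanisms.
html = await context.page.content()
redirect_markers = [
    'window.location',
    'location.href',
    'location.replace',
    'http-equiv="refresh"',
]
for marker in redirect_markers:
    if marker in html:
        context.log.info(f'Possible client-side redirect mechanism: {marker}')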

@B4nan
Member

B4nan commented Jan 7, 2025

Waiting for networkidle is something you should be doing as part of the request handler. Not everyone needs to wait for that, and it significantly slows down the page processing, which is why this cannot be done as a default.

I think you should be able to do await context.page.wait_for_load_state('networkidle') as the first thing in your handler to achieve that.
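
Roughly like this (an untested sketch using only what is already in your snippet):

@crawler.router.default_handler
async def request_handler(context: PlaywrightCrawlingContext) -> None:
    # Let pending network activity settle so a JS-driven redirect has a chance to fire.
    await context.page.wait_for_load_state('networkidle')
    context.log.info(f'URL after settling: {context.page.url}')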

@matecsaj
Contributor Author

matecsaj commented Jan 9, 2025

The target website is rejecting Chromium today, so I haven’t been able to determine whether your recommendation is effective when using Chromium. I’ve attempted multiple times throughout the day with different proxies, but the issue persists.

I also ran a larger test using Firefox and Camoufox, and found that they do trigger the redirect, though not consistently. I've reduced the code to the essentials to clearly demonstrate the problem while incorporating your recommendation.

It’s possible that Crawlee is working as intended and the website’s anti-bot protection is employing clever tactics to discourage my attempts. Since I don’t specifically need to use Chromium, I’m content to let this go. Would you prefer to close this issue, or continue troubleshooting together?

import asyncio

from playwright.async_api import TimeoutError as PlaywrightTimeoutError

from crawlee.crawlers import PlaywrightCrawler, PlaywrightCrawlingContext   # V0.5.0
from crawlee.proxy_configuration import ProxyConfiguration

# If these go out of service, replace them with your own.
proxies = ['http://178.48.68.61:18080', 'http://198.245.60.202:3128', 'http://15.204.240.177:3128']

proxy_configuration = ProxyConfiguration(
    tiered_proxy_urls=[
        # No-proxy tier (optional, in case you do not want any proxy on the lowest tier).
        [None],
        # Lower tier: cheaper, preferred as long as they work.
        proxies,
        # A higher, more expensive tier used as a fallback would go here.
    ]
)


async def main() -> None:
    crawler = PlaywrightCrawler(
        proxy_configuration=proxy_configuration,
        browser_type='chromium',    # fails - it does not follow the JavaScript redirect
        # browser_type='firefox',   # works
    )

    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:
        requested_url = context.request.url

        # Wait for the page to load completely.
        try:
            await context.page.wait_for_load_state('networkidle')
            await context.page.wait_for_load_state('domcontentloaded')
            # await context.page.wait_for_selector('div#someFinalContent')
        except PlaywrightTimeoutError as e:
            context.log.error(f'Timeout waiting for the page {requested_url} to load: {e}')
            return
        else:
            await asyncio.sleep(5)  # Wait an additional five seconds for good measure.

        # Redirect check.
        loaded_url = context.response.url
        if requested_url == loaded_url:
            context.log.error(f'Redirect failed on {requested_url}')
        else:
            context.log.info(f'Redirect succeeded on {requested_url} to {loaded_url}')

    await crawler.run(['https://pinside.com/pinball/machine/2'])


if __name__ == '__main__':
    asyncio.run(main())

Output when using Chromium.

/Users/matecsaj/PycharmProjects/wat-crawlee/venv/bin/python /Users/matecsaj/Library/Application Support/JetBrains/PyCharm2024.3/scratches/scratch_6.py 
[crawlee._autoscaling.snapshotter] INFO  Setting max_memory_size of this run to 8.00 GB.
[crawlee.crawlers._playwright._playwright_crawler] INFO  Current request statistics:
┌───────────────────────────────┬──────────┐
│ requests_finished             │ 0        │
│ requests_failed               │ 0        │
│ retry_histogram               │ [0]      │
│ request_avg_failed_duration   │ None     │
│ request_avg_finished_duration │ None     │
│ requests_finished_per_minute  │ 0        │
│ requests_failed_per_minute    │ 0        │
│ request_total_duration        │ 0.0      │
│ requests_total                │ 0        │
│ crawler_runtime               │ 0.054043 │
└───────────────────────────────┴──────────┘
[crawlee._autoscaling.autoscaled_pool] INFO  current_concurrency = 0; desired_concurrency = 2; cpu = 0.0; mem = 0.0; event_loop = 0.0; client_info = 0.0
[crawlee.crawlers._playwright._playwright_crawler] WARN  Encountered a session error, rotating session and retrying
[crawlee.crawlers._playwright._playwright_crawler] WARN  Encountered a session error, rotating session and retrying
[crawlee.crawlers._playwright._playwright_crawler] WARN  Encountered a session error, rotating session and retrying
[crawlee.crawlers._playwright._playwright_crawler] WARN  Encountered a session error, rotating session and retrying
[crawlee.crawlers._playwright._playwright_crawler] WARN  Encountered a session error, rotating session and retrying
[crawlee.crawlers._playwright._playwright_crawler] WARN  Encountered a session error, rotating session and retrying
[crawlee.crawlers._playwright._playwright_crawler] WARN  Encountered a session error, rotating session and retrying
[crawlee.crawlers._playwright._playwright_crawler] WARN  Encountered a session error, rotating session and retrying
[crawlee.crawlers._playwright._playwright_crawler] WARN  Encountered a session error, rotating session and retrying
[crawlee.crawlers._playwright._playwright_crawler] ERROR Request failed and reached maximum retries
      Traceback (most recent call last):
        File "/Users/matecsaj/PycharmProjects/wat-crawlee/venv/lib/python3.13/site-packages/crawlee/crawlers/_basic/_basic_crawler.py", line 1007, in __run_task_function
          await wait_for(
          ...<5 lines>...
          )
        File "/Users/matecsaj/PycharmProjects/wat-crawlee/venv/lib/python3.13/site-packages/crawlee/_utils/wait.py", line 37, in wait_for
          return await asyncio.wait_for(operation(), timeout.total_seconds())
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "/Library/Frameworks/Python.framework/Versions/3.13/lib/python3.13/asyncio/tasks.py", line 507, in wait_for
          return await fut
                 ^^^^^^^^^
        File "/Users/matecsaj/PycharmProjects/wat-crawlee/venv/lib/python3.13/site-packages/crawlee/crawlers/_basic/_basic_crawler.py", line 1105, in __run_request_handler
          await self._context_pipeline(context, self.router)
        File "/Users/matecsaj/PycharmProjects/wat-crawlee/venv/lib/python3.13/site-packages/crawlee/crawlers/_basic/_context_pipeline.py", line 65, in __call__
          result = await middleware_instance.__anext__()
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "/Users/matecsaj/PycharmProjects/wat-crawlee/venv/lib/python3.13/site-packages/crawlee/crawlers/_playwright/_playwright_crawler.py", line 266, in _handle_blocked_request
          raise SessionError(f'Assuming the session is blocked based on HTTP status code {status_code}')
      crawlee.errors.SessionError: Assuming the session is blocked based on HTTP status code 403
[crawlee._autoscaling.autoscaled_pool] INFO  Waiting for remaining tasks to finish
[crawlee.crawlers._playwright._playwright_crawler] INFO  Error analysis: total_errors=1 unique_errors=1
[crawlee.crawlers._playwright._playwright_crawler] INFO  Final request statistics:
┌─────────────────────────────┬────────────────────────────┐
│ requests_finished           │ 0                          │
│ requests_failed             │ 1                          │
│ retry_histogram             │ [0, 0, 0, 0, 0, 0, 0, 0,   │
│                             │ 0, 1]                      │
│ request_avg_failed_duration │ 0.746194                   │
│ request_avg_finished_durat… │ None                       │
│ requests_finished_per_minu… │ 0                          │
│ requests_failed_per_minute  │ 6                          │
│ request_total_duration      │ 0.746194                   │
│ requests_total              │ 1                          │
│ crawler_runtime             │ 9.300798                   │
└─────────────────────────────┴────────────────────────────┘

Process finished with exit code 0

Output when using Firefox.

/Users/matecsaj/PycharmProjects/wat-crawlee/venv/bin/python /Users/matecsaj/Library/Application Support/JetBrains/PyCharm2024.3/scratches/scratch_6.py 
[crawlee._autoscaling.snapshotter] INFO  Setting max_memory_size of this run to 8.00 GB.
[crawlee.crawlers._playwright._playwright_crawler] INFO  Current request statistics:
┌───────────────────────────────┬──────────┐
│ requests_finished             │ 0        │
│ requests_failed               │ 0        │
│ retry_histogram               │ [0]      │
│ request_avg_failed_duration   │ None     │
│ request_avg_finished_duration │ None     │
│ requests_finished_per_minute  │ 0        │
│ requests_failed_per_minute    │ 0        │
│ request_total_duration        │ 0.0      │
│ requests_total                │ 0        │
│ crawler_runtime               │ 0.044338 │
└───────────────────────────────┴──────────┘
[crawlee._autoscaling.autoscaled_pool] INFO  current_concurrency = 0; desired_concurrency = 2; cpu = 0.0; mem = 0.0; event_loop = 0.0; client_info = 0.0
[crawlee.crawlers._playwright._playwright_crawler] INFO  Redirect succeeded on https://pinside.com/pinball/machine/2 to https://pinside.com/pinball/machine/addams-family
[crawlee._autoscaling.autoscaled_pool] INFO  Waiting for remaining tasks to finish
[crawlee.crawlers._playwright._playwright_crawler] INFO  Final request statistics:
┌───────────────────────────────┬───────────┐
│ requests_finished             │ 1         │
│ requests_failed               │ 0         │
│ retry_histogram               │ [1]       │
│ request_avg_failed_duration   │ None      │
│ request_avg_finished_duration │ 11.993283 │
│ requests_finished_per_minute  │ 5         │
│ requests_failed_per_minute    │ 0         │
│ request_total_duration        │ 11.993283 │
│ requests_total                │ 1         │
│ crawler_runtime               │ 13.239656 │
└───────────────────────────────┴───────────┘

Process finished with exit code 0
