The Chrome variant of PlaywrightCrawler does not follow JavaScript redirects. Firefox does. #877

Open
matecsaj opened this issue Jan 6, 2025 · 4 comments
Labels
t-tooling Issues with this label are in the ownership of the tooling team.

Comments

@matecsaj
Contributor

matecsaj commented Jan 6, 2025

Run this, then switch the browser to Firefox and run it again. Chrome does not follow the JavaScript redirect as it should. The target website is bot-sensitive; you may need to add a proxy.

import asyncio
from crawlee.crawlers import PlaywrightCrawler, PlaywrightCrawlingContext   # V0.5.0

async def main() -> None:
    crawler = PlaywrightCrawler(
        browser_type='chromium',    # fails - it does not follow the JavaScript redirect
        # browser_type='firefox',   # works
    )

    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:
        context.log.info(f'Request URL: {context.request.url}')
        context.log.info('Response URL Expected: https://pinside.com/pinball/machine/addams-family')
        context.log.info(f'Response URL Actual: {context.response.url}')
        context.log.info(f'HTML {await context.page.content()}')

    await crawler.run(['https://pinside.com/pinball/machine/2'])

if __name__ == '__main__':
    asyncio.run(main())

I have a vague memory of solving this exact problem on an old project; here is a snippet of that code.

async with async_playwright() as p:
    browser = await p.chromium.launch(
        args=["--disable-blink-features=AutomationControlled"],
    )
    context = await browser.new_context(
        viewport={"width": 1920, "height": 1080},
        user_agent=self.user_agent,   # attributes of the old project's class
        proxy=self.proxy,
    )
    page = await context.new_page()
    await page.goto(url, wait_until='networkidle')   # This might be what you need to fix the problem.
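
If that flag is what solved it back then, the equivalent can presumably be passed through the crawler. A rough, untested sketch, assuming PlaywrightCrawler forwards launch and context options via browser_launch_options / browser_new_context_options (parameter names worth double-checking for your Crawlee version):

crawler = PlaywrightCrawler(
    browser_type='chromium',
    # Assumed to be forwarded to Playwright's launch():
    browser_launch_options={'args': ['--disable-blink-features=AutomationControlled']},
    # Assumed to be forwarded to Playwright's new_context():
    browser_new_context_options={'viewport': {'width': 1920, 'height': 1080}},
)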
@github-actions github-actions bot added the t-tooling label on Jan 7, 2025
@janbuchar
Collaborator

Hi @matecsaj! Did you try this with headless=False to see what is actually going on? Or investigate what kind of redirect mechanism they use?

Asking just in case, otherwise we can do that as the next step when investigating this 🙂
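
For reference, running headed should just be a constructor flag (an untested sketch; I believe the crawler forwards it to Playwright's launch, but double-check the parameter name):

crawler = PlaywrightCrawler(
    browser_type='chromium',
    headless=False,  # open a visible browser window so you can watch the redirect happen
)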

@matecsaj
Contributor Author

matecsaj commented Jan 7, 2025

Yes, I did.

When not running headless, something flashes on the screen. It was hard to read because it only appeared for a moment, so I used context.log.info(f'HTML {await context.page.content()}') to get a long look at the page. From the HTML dump, I determined that JavaScript likely performs a redirect.
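
In case it helps, this is roughly the kind of scan I did over the dump (the marker list is my guess at common client-side redirect mechanisms, not exhaustive):

# Rough scan of the dumped HTML for common client-side redirect mechanisms.
html = await context.page.content()
redirect_markers = [
    'window.location',
    'location.href',
    'location.replace',
    'http-equiv="refresh"',
]
for marker in redirect_markers:
    if marker in html:
        context.log.info(f'Possible client-side redirect mechanism: {marker}')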

@B4nan
Member

B4nan commented Jan 7, 2025

Waiting for networkidle is something you should be doing as part of the request handler. Not everyone needs to wait for that, and it significantly slows down the page processing, which is why this cannot be done as a default.

I think you should be able to do await context.page.wait_for_load_state('networkidle') as the first thing in your handler to achieve that.
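
Roughly like this (an untested sketch using only what is already in your snippet):

@crawler.router.default_handler
async def request_handler(context: PlaywrightCrawlingContext) -> None:
    # Let pending network activity settle so a JS-driven redirect has a chance to fire.
    await context.page.wait_for_load_state('networkidle')
    context.log.info(f'URL after settling: {context.page.url}')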

@matecsaj
Contributor Author

matecsaj commented Jan 9, 2025

The target website is rejecting Chromium today, so I haven’t been able to determine whether your recommendation is effective when using Chromium. I’ve attempted multiple times throughout the day with different proxies, but the issue persists.

I also ran a larger test using Firefox and Camoufox, and found that they do trigger the redirect, though not consistently. I've reduced the code to the essentials to clearly demonstrate the problem while incorporating your recommendation.

It’s possible that Crawlee is working as intended and the website’s anti-bot protection is employing clever tactics to discourage my attempts. Since I don’t specifically need to use Chromium, I’m content to let this go. Would you prefer to close this issue, or continue troubleshooting together?

import asyncio

from playwright.async_api import TimeoutError as PlaywrightTimeoutError

from crawlee.crawlers import PlaywrightCrawler, PlaywrightCrawlingContext   # V0.5.0
from crawlee.proxy_configuration import ProxyConfiguration

# If these go out of service, replace them with your own.
proxies = ['http://178.48.68.61:18080', 'http://198.245.60.202:3128', 'http://15.204.240.177:3128']

proxy_configuration = ProxyConfiguration(
    tiered_proxy_urls=[
        # No-proxy tier (optional, in case you do not want any proxy on the lowest tier).
        [None],
        # Lower tier: cheaper, preferred as long as they work.
        proxies,
        # A higher, more expensive tier used as a fallback would go here.
    ]
)


async def main() -> None:
    crawler = PlaywrightCrawler(
        proxy_configuration=proxy_configuration,
        browser_type='chromium',    # fails - it does not follow the JavaScript redirect
        # browser_type='firefox',   # works
    )

    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:
        requested_url = context.request.url

        # Wait for the page to load completely.
        try:
            await context.page.wait_for_load_state('networkidle')
            await context.page.wait_for_load_state('domcontentloaded')
            # await context.page.wait_for_selector('div#someFinalContent')
        except PlaywrightTimeoutError as e:
            context.log.error(f'Timeout waiting for the page {requested_url} to load: {e}')
            return
        else:
            await asyncio.sleep(5)  # Wait an additional five seconds for good measure.

        # Redirect check.
        loaded_url = context.response.url
        if requested_url == loaded_url:
            context.log.error(f'Redirect failed on {requested_url}')
        else:
            context.log.info(f'Redirect succeeded on {requested_url} to {loaded_url}')

    await crawler.run(['https://pinside.com/pinball/machine/2'])


if __name__ == '__main__':
    asyncio.run(main())

Output when using Chromium.

/Users/matecsaj/PycharmProjects/wat-crawlee/venv/bin/python /Users/matecsaj/Library/Application Support/JetBrains/PyCharm2024.3/scratches/scratch_6.py 
[crawlee._autoscaling.snapshotter] INFO  Setting max_memory_size of this run to 8.00 GB.
[crawlee.crawlers._playwright._playwright_crawler] INFO  Current request statistics:
┌───────────────────────────────┬──────────┐
│ requests_finished             │ 0        │
│ requests_failed               │ 0        │
│ retry_histogram               │ [0]      │
│ request_avg_failed_duration   │ None     │
│ request_avg_finished_duration │ None     │
│ requests_finished_per_minute  │ 0        │
│ requests_failed_per_minute    │ 0        │
│ request_total_duration        │ 0.0      │
│ requests_total                │ 0        │
│ crawler_runtime               │ 0.054043 │
└───────────────────────────────┴──────────┘
[crawlee._autoscaling.autoscaled_pool] INFO  current_concurrency = 0; desired_concurrency = 2; cpu = 0.0; mem = 0.0; event_loop = 0.0; client_info = 0.0
[crawlee.crawlers._playwright._playwright_crawler] WARN  Encountered a session error, rotating session and retrying
[crawlee.crawlers._playwright._playwright_crawler] WARN  Encountered a session error, rotating session and retrying
[crawlee.crawlers._playwright._playwright_crawler] WARN  Encountered a session error, rotating session and retrying
[crawlee.crawlers._playwright._playwright_crawler] WARN  Encountered a session error, rotating session and retrying
[crawlee.crawlers._playwright._playwright_crawler] WARN  Encountered a session error, rotating session and retrying
[crawlee.crawlers._playwright._playwright_crawler] WARN  Encountered a session error, rotating session and retrying
[crawlee.crawlers._playwright._playwright_crawler] WARN  Encountered a session error, rotating session and retrying
[crawlee.crawlers._playwright._playwright_crawler] WARN  Encountered a session error, rotating session and retrying
[crawlee.crawlers._playwright._playwright_crawler] WARN  Encountered a session error, rotating session and retrying
[crawlee.crawlers._playwright._playwright_crawler] ERROR Request failed and reached maximum retries
      Traceback (most recent call last):
        File "/Users/matecsaj/PycharmProjects/wat-crawlee/venv/lib/python3.13/site-packages/crawlee/crawlers/_basic/_basic_crawler.py", line 1007, in __run_task_function
          await wait_for(
          ...<5 lines>...
          )
        File "/Users/matecsaj/PycharmProjects/wat-crawlee/venv/lib/python3.13/site-packages/crawlee/_utils/wait.py", line 37, in wait_for
          return await asyncio.wait_for(operation(), timeout.total_seconds())
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "/Library/Frameworks/Python.framework/Versions/3.13/lib/python3.13/asyncio/tasks.py", line 507, in wait_for
          return await fut
                 ^^^^^^^^^
        File "/Users/matecsaj/PycharmProjects/wat-crawlee/venv/lib/python3.13/site-packages/crawlee/crawlers/_basic/_basic_crawler.py", line 1105, in __run_request_handler
          await self._context_pipeline(context, self.router)
        File "/Users/matecsaj/PycharmProjects/wat-crawlee/venv/lib/python3.13/site-packages/crawlee/crawlers/_basic/_context_pipeline.py", line 65, in __call__
          result = await middleware_instance.__anext__()
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "/Users/matecsaj/PycharmProjects/wat-crawlee/venv/lib/python3.13/site-packages/crawlee/crawlers/_playwright/_playwright_crawler.py", line 266, in _handle_blocked_request
          raise SessionError(f'Assuming the session is blocked based on HTTP status code {status_code}')
      crawlee.errors.SessionError: Assuming the session is blocked based on HTTP status code 403
[crawlee._autoscaling.autoscaled_pool] INFO  Waiting for remaining tasks to finish
[crawlee.crawlers._playwright._playwright_crawler] INFO  Error analysis: total_errors=1 unique_errors=1
[crawlee.crawlers._playwright._playwright_crawler] INFO  Final request statistics:
┌─────────────────────────────┬────────────────────────────┐
│ requests_finished           │ 0                          │
│ requests_failed             │ 1                          │
│ retry_histogram             │ [0, 0, 0, 0, 0, 0, 0, 0,   │
│                             │ 0, 1]                      │
│ request_avg_failed_duration │ 0.746194                   │
│ request_avg_finished_durat… │ None                       │
│ requests_finished_per_minu… │ 0                          │
│ requests_failed_per_minute  │ 6                          │
│ request_total_duration      │ 0.746194                   │
│ requests_total              │ 1                          │
│ crawler_runtime             │ 9.300798                   │
└─────────────────────────────┴────────────────────────────┘

Process finished with exit code 0

Output when using Firefox.

/Users/matecsaj/PycharmProjects/wat-crawlee/venv/bin/python /Users/matecsaj/Library/Application Support/JetBrains/PyCharm2024.3/scratches/scratch_6.py 
[crawlee._autoscaling.snapshotter] INFO  Setting max_memory_size of this run to 8.00 GB.
[crawlee.crawlers._playwright._playwright_crawler] INFO  Current request statistics:
┌───────────────────────────────┬──────────┐
│ requests_finished             │ 0        │
│ requests_failed               │ 0        │
│ retry_histogram               │ [0]      │
│ request_avg_failed_duration   │ None     │
│ request_avg_finished_duration │ None     │
│ requests_finished_per_minute  │ 0        │
│ requests_failed_per_minute    │ 0        │
│ request_total_duration        │ 0.0      │
│ requests_total                │ 0        │
│ crawler_runtime               │ 0.044338 │
└───────────────────────────────┴──────────┘
[crawlee._autoscaling.autoscaled_pool] INFO  current_concurrency = 0; desired_concurrency = 2; cpu = 0.0; mem = 0.0; event_loop = 0.0; client_info = 0.0
[crawlee.crawlers._playwright._playwright_crawler] INFO  Redirect succeeded on https://pinside.com/pinball/machine/2 to https://pinside.com/pinball/machine/addams-family
[crawlee._autoscaling.autoscaled_pool] INFO  Waiting for remaining tasks to finish
[crawlee.crawlers._playwright._playwright_crawler] INFO  Final request statistics:
┌───────────────────────────────┬───────────┐
│ requests_finished             │ 1         │
│ requests_failed               │ 0         │
│ retry_histogram               │ [1]       │
│ request_avg_failed_duration   │ None      │
│ request_avg_finished_duration │ 11.993283 │
│ requests_finished_per_minute  │ 5         │
│ requests_failed_per_minute    │ 0         │
│ request_total_duration        │ 11.993283 │
│ requests_total                │ 1         │
│ crawler_runtime               │ 13.239656 │
└───────────────────────────────┴───────────┘

Process finished with exit code 0
