The Chrome variant of PlaywrightCrawler does not follow JavaScript redirects. Firefox does. #877
Hi @matecsaj! Did you try this with Asking just in case, otherwise we can do that as the next step when investigating this 🙂
Yes, I did. When not running headless, something flashes on the screen. It appeared only for a moment and was hard to read, so I used `context.log.info(f'HTML {await context.page.content()}')` to get a long look at the page. From the HTML dump, I determined that JavaScript likely performs a redirect.
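Reading a large HTML dump by eye is error-prone, so a small scan for common client-side redirect patterns can help confirm the suspicion. The following is a minimal sketch, not part of Crawlee or Playwright: the function name `find_redirect_hints` and the patterns are hypothetical, heuristic, and far from exhaustive (obfuscated or dynamically built redirects will not match).

```python
import re

# Hypothetical, heuristic patterns for common client-side redirects.
_REDIRECT_PATTERNS = {
    # <meta http-equiv="refresh" content="0; url=...">
    "meta refresh": re.compile(r"<meta[^>]+http-equiv=[\"']?refresh", re.IGNORECASE),
    # window.location = ..., location.href = ... (but not == comparisons)
    "location assignment": re.compile(
        r"\b(?:window\.|document\.)?location(?:\.href)?\s*=[^=]", re.IGNORECASE
    ),
    # location.replace(...), location.assign(...)
    "location method": re.compile(r"\blocation\.(?:replace|assign)\s*\(", re.IGNORECASE),
}


def find_redirect_hints(html: str) -> list[str]:
    """Return the names of the redirect patterns found in the HTML."""
    return [name for name, pattern in _REDIRECT_PATTERNS.items() if pattern.search(html)]
```

Inside a request handler this could be called as `find_redirect_hints(await context.page.content())` and the result logged instead of the full HTML.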
Waiting for I think you should be able to do
The target website is rejecting Chromium today, so I haven't been able to determine whether your recommendation is effective when using Chromium. I tried multiple times throughout the day with different proxies, but the issue persists. I also ran a large test using Firefox and Camoufox, and found that they do trigger the redirect, though not consistently.

I've reduced the code to the essentials to clearly demonstrate the problem while incorporating your recommendation. It's possible that Crawlee is working as intended and the website's anti-bot protection is employing clever tactics to discourage my attempts. Since I don't specifically need to use Chromium, I'm content to let this go. Would you prefer to close this issue, or continue troubleshooting together?

```python
import asyncio

from crawlee.crawlers import PlaywrightCrawler, PlaywrightCrawlingContext  # V0.5.0
from crawlee.proxy_configuration import ProxyConfiguration
from playwright.async_api import TimeoutError as PlaywrightTimeoutError

# If these go out of service then replace them with your own.
proxies = [
    'http://178.48.68.61:18080',
    'http://198.245.60.202:3128',
    'http://15.204.240.177:3128',
]

proxy_configuration = ProxyConfiguration(
    tiered_proxy_urls=[
        # No proxy tier. (Not needed, but optional in case you do not want to use any proxy on lowest tier.)
        [None],
        # lower tier, cheaper, preferred as long as they work
        proxies,
        # higher tier, more expensive, used as a fallback
    ]
)


async def main() -> None:
    crawler = PlaywrightCrawler(
        proxy_configuration=proxy_configuration,
        browser_type='chromium',  # fails - it does not follow the JavaScript redirect
        # browser_type='firefox',  # works
    )

    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:
        requested_url = context.request.url

        # wait for the page to completely load
        try:
            await context.page.wait_for_load_state("networkidle")
            await context.page.wait_for_load_state("domcontentloaded")
            # await context.page.wait_for_selector("div#someFinalContent")
        except PlaywrightTimeoutError as e:
            # Playwright raises its own TimeoutError, not the built-in one.
            context.log.error(
                f"Timeout waiting for the page {requested_url} to load: {e}"
            )
            return
        else:
            await asyncio.sleep(5)  # Wait an additional five seconds for good measure.

        # redirect check
        loaded_url = context.response.url
        if requested_url == loaded_url:
            context.log.error(f"Redirect failed on {context.request.url}")
        else:
            context.log.info(f'Redirect succeeded on {context.request.url} to {loaded_url}')

    await crawler.run(['https://pinside.com/pinball/machine/2'])


if __name__ == '__main__':
    asyncio.run(main())
```

Output when using Chromium.
Output when using Firefox.
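One thing that may be worth ruling out: the redirect check in the script compares against `context.response.url`, which describes the response of the initial navigation, whereas a redirect triggered by JavaScript after load would show up in `context.page.url`. The sketch below polls the page's current URL instead of sleeping for a fixed time. It is only an illustration under that assumption; `wait_for_url_change` is a hypothetical helper, not a Crawlee or Playwright API, and it relies solely on the page object exposing a `url` attribute, as Playwright's `Page` does.

```python
import asyncio
import time
from typing import Optional


async def wait_for_url_change(
    page,
    original_url: str,
    timeout: float = 10.0,
    interval: float = 0.25,
) -> Optional[str]:
    """Poll page.url until it differs from original_url.

    Returns the new URL, or None if it did not change within `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        current = page.url
        if current != original_url:
            return current
        if time.monotonic() >= deadline:
            return None
        await asyncio.sleep(interval)
```

In the handler this could replace the fixed sleep, for example `final_url = await wait_for_url_change(context.page, context.request.url)`, logging a failure when the result is `None`. Note that a plain string comparison can report a change when the browser merely normalizes the URL (for instance by adding a trailing slash), so the result still deserves a manual look.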
Run this, switch to Firefox, and run again. Chrome does not follow the JavaScript redirect as it should. The target website is sensitive to bots; you might find it necessary to add a proxy.
I have a vague memory of solving this exact problem on an old project; here is a snippet of the code.