
Handling lost connection with Playwright process that makes the scrape hang in error #331

Open
milan-cp-dev opened this issue Dec 23, 2024 · 1 comment


@milan-cp-dev

I run long scrapes with long sequences of actions that need to be taken during the Playwright scrape. I have handled most of the problems with the scrape and established long runs. I am now facing issues with a lost connection to the Playwright process. Sure, we can't do much about a process that died, but please help me ensure that such a request ends up in the errback, so we can handle it properly and continue scraping.

The minimal spider setup is described in minimal_spider_setyp.txt.

Root cause of the error is:

```
/opt/scrapy_enviroment/lib/python3.11/site-packages/playwright/driver/playwright.sh: line 6: 2323044 Hangup "$PLAYWRIGHT_NODEJS_PATH" "$SCRIPT_PATH/package/lib/cli/cli.js" "$@"
/opt/scrapy_enviroment/lib/python3.11/site-packages/playwright/driver/playwright.sh: line 6: 2323042 Hangup "$PLAYWRIGHT_NODEJS_PATH" "$SCRIPT_PATH/package/lib/cli/cli.js" "$@"
```

To be able to raise awareness of this, we used ScrapyPlaywrightMemoryUsageExtension and caught the error as shown in inital_error.txt.

We have extended ScrapyPlaywrightMemoryUsageExtension to be able to try/except such exceptions. We have attempted to raise a known scrapy-playwright error so that it is routed back to the errback function, which should handle the rest and proceed with the scrape.

Can you please evaluate our CustomScrapyPlaywrightMemoryUsageExtension, advise whether IgnoreRequest is a suitable exception, and suggest what we can do moving forward? We are still debugging the current solution as I report this.

minimal_spider_setyp.txt
inital_error.txt
custom_memusage_extension.txt

@milan-cp-dev
Author

It seems that the above handles such problems and the scrape can continue.
