A proxy created with param proxy_urls crashes PlaywrightCrawler. #887
Comments
I retested with version 0.5.1 of Crawlee, and the issue with `proxy_urls` persists. While testing the `tiered_proxy_urls` parameter, I noticed a new behavior: even though I supplied a proxy, none was used, and the output showed the proxy info as `None`. I'm unsure whether this is related to the same underlying issue or a separate problem. I hope a new release with functioning proxy support is on the horizon, as my project is currently blocked without it.

Updated test code:

```python
import asyncio

from crawlee.crawlers import PlaywrightCrawler, PlaywrightCrawlingContext
from crawlee.proxy_configuration import ProxyConfiguration

# If these go out of service, replace them with your own.
proxies = [
    'http://20.121.139.25:3128',
    'http://67.43.227.230:30693',
    'http://175.158.57.136:7788',
    'http://112.198.178.35:8080',
]

proxy_configuration_fails = ProxyConfiguration(proxy_urls=proxies)

proxy_configuration_succeeds = ProxyConfiguration(
    tiered_proxy_urls=[
        # No-proxy tier. (Optional, in case you do not want to use any proxy on the lowest tier.)
        [None],
        # Lower tier, cheaper, preferred as long as it works.
        proxies,
        # A higher, more expensive tier would go here as a fallback.
    ]
)


async def main() -> None:
    crawler = PlaywrightCrawler(
        max_requests_per_crawl=5,  # Limit the crawl to 5 requests.
        headless=True,  # Run the browser without a visible window.
        browser_type='firefox',  # Use the Firefox browser.
        proxy_configuration=proxy_configuration_succeeds,
    )

    # Define the default request handler, which will be called for every request.
    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url} with proxy {context.proxy_info}')
        # Enqueue all links found on the page.
        await context.enqueue_links()

    # Run the crawler with the initial list of URLs.
    await crawler.run(['https://crawlee.dev'])


if __name__ == '__main__':
    asyncio.run(main())
```

Updated output:
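As a side check, the rotation logic of `ProxyConfiguration` can be exercised directly, without starting a crawler. A minimal sketch, assuming `new_url()` can be awaited without arguments:

```python
import asyncio

from crawlee.proxy_configuration import ProxyConfiguration


async def check_rotation() -> None:
    config = ProxyConfiguration(proxy_urls=[
        'http://20.121.139.25:3128',
        'http://67.43.227.230:30693',
    ])
    # Each call should yield a proxy URL from the configured list.
    for _ in range(4):
        print(await config.new_url())


if __name__ == '__main__':
    asyncio.run(check_rotation())
```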
Hi. The PR with the required fix has not been merged yet. Until the necessary release, you can make a custom `PlaywrightBrowserController` that passes the proxy settings to the browser context itself. This would be similar to:

```python
import asyncio
from logging import getLogger

from playwright.async_api import ProxySettings

from crawlee.browsers import BrowserPool, PlaywrightBrowserController, PlaywrightBrowserPlugin
from crawlee.crawlers import PlaywrightCrawler, PlaywrightCrawlingContext
from crawlee.proxy_configuration import ProxyConfiguration

logger = getLogger(__name__)


class CustomPlaywrightBrowserController(PlaywrightBrowserController):
    async def _create_browser_context(self, browser_new_context_options=None, proxy_info=None):
        # Build the extra HTTP headers from the header generator, if one is configured.
        if self._header_generator:
            common_headers = self._header_generator.get_common_headers()
            sec_ch_ua_headers = self._header_generator.get_sec_ch_ua_headers(browser_type=self.browser_type)
            user_agent_header = self._header_generator.get_user_agent_header(browser_type=self.browser_type)
            extra_http_headers = dict(common_headers | sec_ch_ua_headers | user_agent_header)
        else:
            extra_http_headers = None

        browser_new_context_options = dict(browser_new_context_options) if browser_new_context_options else {}
        browser_new_context_options['extra_http_headers'] = browser_new_context_options.get(
            'extra_http_headers', extra_http_headers
        )

        # Hand the proxy to Playwright via the context options.
        if proxy_info:
            if browser_new_context_options.get('proxy'):
                logger.warning("browser_new_context_options['proxy'] overridden by explicit `proxy_info` argument.")
            browser_new_context_options['proxy'] = ProxySettings(
                server=f'{proxy_info.scheme}://{proxy_info.hostname}:{proxy_info.port}',
                username=proxy_info.username,
                password=proxy_info.password,
            )

        return await self._browser.new_context(**browser_new_context_options)


class CustomPlaywrightBrowserPlugin(PlaywrightBrowserPlugin):
    async def new_browser(self) -> CustomPlaywrightBrowserController:
        if not self._playwright:
            raise RuntimeError('Playwright browser plugin is not initialized.')

        if self._browser_type == 'chromium':
            browser = await self._playwright.chromium.launch(**self._browser_launch_options)
        elif self._browser_type == 'firefox':
            browser = await self._playwright.firefox.launch(**self._browser_launch_options)
        elif self._browser_type == 'webkit':
            browser = await self._playwright.webkit.launch(**self._browser_launch_options)
        else:
            raise ValueError(f'Invalid browser type: {self._browser_type}')

        return CustomPlaywrightBrowserController(
            browser,
            max_open_pages_per_browser=self._max_open_pages_per_browser,
        )


async def main() -> None:
    proxies = []  # Fill in your own proxy URLs.
    proxy_configuration = ProxyConfiguration(proxy_urls=proxies)

    crawler = PlaywrightCrawler(
        browser_pool=BrowserPool(plugins=[CustomPlaywrightBrowserPlugin()]),
        max_requests_per_crawl=5,
        proxy_configuration=proxy_configuration,
    )
```
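The crux of the workaround is the `proxy` option handed to `browser.new_context()`. Stripped of the crawlee machinery, it corresponds to plain Playwright usage like this (a minimal sketch; the proxy address is a hypothetical placeholder):

```python
import asyncio

from playwright.async_api import ProxySettings, async_playwright


async def main() -> None:
    async with async_playwright() as pw:
        browser = await pw.firefox.launch(headless=True)
        # Per-context proxy, equivalent to what the custom controller
        # builds from crawlee's ProxyInfo.
        context = await browser.new_context(
            proxy=ProxySettings(server='http://proxy.example.com:8080'),
        )
        page = await context.new_page()
        await page.goto('https://crawlee.dev')
        print(await page.title())
        await browser.close()


if __name__ == '__main__':
    asyncio.run(main())
```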
If you don't get a block from the server, then this behavior is fully compliant with the documentation.
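That is, with `[None]` as the lowest tier the crawler runs without a proxy until the server starts blocking it, and only then escalates to the next tier. If a proxy should be used from the very first request, the no-proxy tier can simply be dropped. A minimal sketch, reusing the proxies from the test code above:

```python
from crawlee.proxy_configuration import ProxyConfiguration

# A single tier containing only real proxies: the crawler starts on it
# immediately instead of first trying a no-proxy tier.
proxy_configuration = ProxyConfiguration(
    tiered_proxy_urls=[
        [
            'http://20.121.139.25:3128',
            'http://67.43.227.230:30693',
        ],
    ]
)
```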
Something else requires my attention at the moment. When I return to this, if the current version of Crawlee is still not functioning as expected, I'll try the workaround kindly provided by @Mantisus.
Test program.
Terminal output.