scrapy-playwright cannot start working if the reactor is already installed #131
Hi, could you provide a minimal, reproducible example? I'm able to run a spider using the following script:

```python
import scrapy
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from twisted.internet.asyncioreactor import install as install_asyncio_reactor


class TestSpider(scrapy.Spider):
    name = "example"
    custom_settings = {
        "DOWNLOAD_HANDLERS": {
            "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
        },
    }

    def start_requests(self):
        yield scrapy.Request(url="https://example.org", meta={"playwright": True})

    def parse(self, response):
        yield {"url": response.url}


if __name__ == "__main__":
    install_asyncio_reactor()
    from twisted.internet import reactor

    configure_logging({"LOG_FORMAT": "%(levelname)s: %(message)s"})
    runner = CrawlerRunner()
    d = runner.crawl(TestSpider)
    d.addBoth(lambda _: reactor.stop())
    reactor.run()  # the script will block here until the crawling is finished
```
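(For what it's worth, the key detail in this script is that `install_asyncio_reactor()` runs before `twisted.internet.reactor` is imported, so the asyncio reactor is the one that ends up installed and used by Scrapy.)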
I created a Django project "channels-scrapy" with two applications.

`myapp/management/commands/crawl.py`:

```python
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from scrapy.utils.project import get_project_settings
from django.core.management.base import BaseCommand


class Command(BaseCommand):
    help = 'Runs the specified spider'

    def add_arguments(self, parser):
        parser.add_argument('spider', type=str, help="The spider name to be located, instantiated, and crawled.")

    def handle(self, *args, **options):
        # An asyncio Twisted reactor has already been installed (an AsyncioSelectorReactor object)
        from twisted.internet import reactor
        configure_logging()
        runner = CrawlerRunner(settings=get_project_settings())
        d = runner.crawl(options['spider'])
        d.addBoth(lambda _: reactor.stop())
        reactor.run()  # the script will block here until the crawling is finished
```
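With this command in place, the spider would presumably be started with something like `python manage.py crawl example` (using the command and spider names from these snippets).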
`scrapy_app/spiders.py`:

```python
import scrapy


class TestSpider(scrapy.Spider):
    name = "example"
    # If you comment these settings out, then no problem appears.
    custom_settings = {
        "DOWNLOAD_HANDLERS": {
            "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
        },
    }

    def start_requests(self):
        yield scrapy.Request(url="https://example.org", meta={"playwright": True})

    def parse(self, response, **kwargs):
        yield {"url": response.url}
```

`scrapy_app/settings.py`:

```python
BOT_NAME = 'scrapy_app'

SPIDER_MODULES = ['scrapy_app.spiders']
NEWSPIDER_MODULE = 'scrapy_app.spiders'

ROBOTSTXT_OBEY = True

# No need for this setting. The reactor will already be installed from outside.
# TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
```

The INSTALLED_APPS list includes the "daphne" app, as mentioned in the channels documentation, and "myapp":

```python
INSTALLED_APPS = [
    "daphne",
    "django.contrib.admin",
    "django.contrib.auth",
    "django.contrib.contenttypes",
    "django.contrib.sessions",
    "django.contrib.messages",
    "django.contrib.staticfiles",
    "myapp",
]
```

In general, the Django "channels-scrapy" project looks like this: *(screenshot of the project tree)*
Now, when I run the spider `example`, the application freezes and does not continue to work (note the line `[asyncio] DEBUG: Using selector: KqueueSelector` in the logs). But if I turn playwright off, everything works fine and that line disappears.
Yes, (unlike CrawlerProcess) CrawlerRunner does not install a reactor.
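For context, a minimal sketch of the contrast, assuming current Scrapy behavior (`CrawlerProcess` installs and runs the reactor itself, which is exactly what has to be avoided here):

```python
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# CrawlerProcess manages the reactor on its own: it installs the reactor
# (per the TWISTED_REACTOR setting, or the default) and runs it in start().
# With a reactor already installed by daphne this path would conflict,
# which is why the management command above uses CrawlerRunner instead.
process = CrawlerProcess(settings=get_project_settings())
process.crawl("example")  # spider name from the snippets above
process.start()  # starts the reactor and blocks until the crawl finishes
```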
Are you filtering some logs out? I see some DEBUG messages in your post, but Scrapy also logs the reactor (and the event loop, if present) at the beginning of the crawl.
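With the asyncio reactor those lines look roughly like this (exact wording depends on the Scrapy version and the configured LOG_FORMAT; the event-loop class varies by platform):

```
[scrapy.utils.log] DEBUG: Using reactor: twisted.internet.asyncioreactor.AsyncioSelectorReactor
[scrapy.utils.log] DEBUG: Using asyncio event loop: asyncio.unix_events._UnixSelectorEventLoop
```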
Scrapy logs the reactor only if the `TWISTED_REACTOR` setting is set. I only filtered out the `[py.warnings]` messages.
What do you think about this DEBUG message: `[asyncio] DEBUG: Using selector: KqueueSelector`? If I disable playwright, then this message disappears.
That's from the "Overridden settings" line, not the one from `scrapy.utils.log`.

I don't understand where this reactor is installed. I'm not that familiar with django-channels.
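As far as I can tell, that `Using selector: ...` message comes from asyncio itself whenever a selector event loop is created with DEBUG logging enabled, so it hints that a new event loop is being created somewhere. A tiny standalone check (hypothetical, outside Scrapy and playwright entirely):

```python
import asyncio
import logging

# With DEBUG logging enabled, creating a selector event loop logs which
# selector it picked, e.g. "Using selector: KqueueSelector" on macOS
# (EpollSelector on Linux).
logging.basicConfig(level=logging.DEBUG)
loop = asyncio.new_event_loop()
loop.close()
```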
I checked as follows:

```python
import sys

from twisted.internet import asyncioreactor

# (other imports as in crawl.py above)


def handle(self, *args, **options):
    current_reactor = sys.modules.get("twisted.internet.reactor", None)
    print(isinstance(current_reactor, asyncioreactor.AsyncioSelectorReactor))  # True
    print(current_reactor.running)  # False
    # An asyncio Twisted reactor has already been installed (an AsyncioSelectorReactor object)
    from twisted.internet import reactor
    configure_logging()
    runner = CrawlerRunner(settings=get_project_settings())
    d = runner.crawl(options['spider'])
    d.addBoth(lambda _: reactor.stop())
    reactor.run()  # the script will block here until the crawling is finished
```
How is it being installed? Where in the code is there something like the following?

```python
from twisted.internet.asyncioreactor import install

install()
```
It is being installed in the `daphne/server.py` module, which is imported in `daphne/apps.py` (the Django app configuration module).

`daphne/server.py`:

```python
# This has to be done first as Twisted is import-order-sensitive with reactors
import asyncio  # isort:skip
import os  # isort:skip
import sys  # isort:skip
import warnings  # isort:skip
from concurrent.futures import ThreadPoolExecutor  # isort:skip
from twisted.internet import asyncioreactor  # isort:skip

twisted_loop = asyncio.new_event_loop()
if "ASGI_THREADS" in os.environ:
    twisted_loop.set_default_executor(
        ThreadPoolExecutor(max_workers=int(os.environ["ASGI_THREADS"]))
    )

current_reactor = sys.modules.get("twisted.internet.reactor", None)
if current_reactor is not None:
    if not isinstance(current_reactor, asyncioreactor.AsyncioSelectorReactor):
        warnings.warn(
            "Something has already installed a non-asyncio Twisted reactor. Attempting to uninstall it; "
            + "you can fix this warning by importing daphne.server early in your codebase or "
            + "finding the package that imports Twisted and importing it later on.",
            UserWarning,
        )
        del sys.modules["twisted.internet.reactor"]
        asyncioreactor.install(twisted_loop)
else:
    asyncioreactor.install(twisted_loop)
```
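Worth noting: daphne binds the reactor to a loop it creates with `asyncio.new_event_loop()`, which is not necessarily the loop other code in the process later obtains from asyncio. A quick hypothetical check (`_asyncioEventloop` is Twisted's internal reference to the reactor's loop, so this is diagnostic only):

```python
import asyncio

from twisted.internet import reactor

# Compare the loop the installed reactor is bound to with the loop the
# current thread would hand out via asyncio.
reactor_loop = getattr(reactor, "_asyncioEventloop", None)
current_loop = asyncio.get_event_loop()
print(reactor_loop is current_loop)  # False means two separate loops
```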
My settings: *(screenshot)*

My Scrapy app is running under another app (django-channels) that runs a `twisted.internet.asyncioreactor.AsyncioSelectorReactor` reactor in the process. Therefore, to run spiders from my custom Django management command, I use CrawlerRunner, so as not to install a reactor when one is already installed.

But in this case, scrapy-playwright cannot start working. There is no line in the logs like:

In order for scrapy-playwright to start working properly, I have to:

Is there any way to keep using the already installed reactor?
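One direction that might be worth trying (a sketch only, not verified against this setup; the helper names are hypothetical): run the crawl in a child process, so Scrapy and scrapy-playwright get a fresh interpreter with no reactor installed yet:

```python
import multiprocessing as mp

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings


def _run_spider(spider_name: str) -> None:
    # Runs in a freshly spawned interpreter, so no reactor is installed yet
    # and CrawlerProcess can install the one named by TWISTED_REACTOR
    # (which scrapy-playwright requires to be the asyncio reactor).
    process = CrawlerProcess(settings=get_project_settings())
    process.crawl(spider_name)
    process.start()


def crawl_in_subprocess(spider_name: str) -> None:
    # "spawn" gives a clean interpreter; "fork" would inherit the reactor
    # already installed by daphne in the parent process.
    ctx = mp.get_context("spawn")
    p = ctx.Process(target=_run_spider, args=(spider_name,))
    p.start()
    p.join()
```

This sidesteps, rather than answers, the question of reusing the already installed reactor.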