Web Scraping with Django and Scrapy via the Zyte API

I want to use Scrapy via the Zyte API with Django. I run the scraper with CrawlerProcess from views.py:

from django.http import HttpResponse
from myscrapyproject.spiders.dummy_spider import DummySpider
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

def scrape_dummy(request):

    settings = get_project_settings()

    process = CrawlerProcess(settings)
    process.crawl(DummySpider)
    process.start()

    return HttpResponse("Finished.")

I took this approach from the "Common Practices" page of the Scrapy 2.11.2 documentation.

Without the Zyte API, everything works fine. Then I add the following lines to Scrapy's settings.py:

ADDONS = {
    "scrapy_zyte_api.Addon": 500,
}
ZYTE_API_KEY = "*********"

Now, when I run Django and open the URL that leads to the view in my browser, the following happens:

% python manage.py runserver
Watching for file changes with StatReloader
Performing system checks...

System check identified no issues (0 silenced).
July 01, 2024 - 09:11:01
Django version 5.0.6, using settings 'myproject.settings'
Starting development server at http://127.0.0.1:8000/
Quit the server with CONTROL-C.

[01/Jul/2024 09:11:06] "GET /dummy HTTP/1.1" 301 0
2024-07-01 09:11:06 [scrapy.utils.log] INFO: Scrapy 2.11.2 started (bot: myscrapyproject)
2024-07-01 09:11:06 [scrapy.utils.log] INFO: Versions: lxml 5.2.2.0, libxml2 2.12.6, cssselect 1.2.0, parsel 1.9.1, w3lib 2.2.1, Twisted 24.3.0, Python 3.12.4 (main, Jun 14 2024, 08:47:19) [Clang 15.0.0 (clang-1500.3.9.4)], pyOpenSSL 24.1.0 (OpenSSL 3.2.2 4 Jun 2024), cryptography 42.0.8, Platform macOS-14.5-arm64-arm-64bit
2024-07-01 09:11:06 [scrapy.addons] INFO: Enabled addons:
[<scrapy_zyte_api.addon.Addon object at 0x10864bcb0>]
2024-07-01 09:11:06 [asyncio] DEBUG: Using selector: KqueueSelector
2024-07-01 09:11:06 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.asyncioreactor.AsyncioSelectorReactor
2024-07-01 09:11:06 [scrapy.utils.log] DEBUG: Using asyncio event loop: asyncio.unix_events._UnixSelectorEventLoop
2024-07-01 09:11:06 [scrapy.extensions.telnet] INFO: Telnet Password: 1fad741635ec628f
2024-07-01 09:11:06 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.logstats.LogStats']
2024-07-01 09:11:06 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'myscrapyproject',
 'FEED_EXPORT_ENCODING': 'utf-8',
 'NEWSPIDER_MODULE': 'myscrapyproject.spiders',
 'REQUEST_FINGERPRINTER_CLASS': 'scrapy_zyte_api.ScrapyZyteAPIRequestFingerprinter',
 'REQUEST_FINGERPRINTER_IMPLEMENTATION': '2.7',
 'ROBOTSTXT_OBEY': True,
 'SPIDER_MODULES': ['myscrapyproject.spiders'],
 'TWISTED_REACTOR': 'twisted.internet.asyncioreactor.AsyncioSelectorReactor'}
2024-07-01 09:11:06 [scrapy_zyte_api.handler] INFO: Using a Zyte API key starting with '49aadba'
2024-07-01 09:11:06 [scrapy_zyte_api.handler] INFO: Using a Zyte API key starting with '49aadba'
2024-07-01 09:11:06 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 <class 'scrapy_zyte_api._middlewares.ScrapyZyteAPIDownloaderMiddleware'>,
 <class 'scrapy_zyte_api._session.ScrapyZyteAPISessionDownloaderMiddleware'>,
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2024-07-01 09:11:06 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 <class 'scrapy_zyte_api._middlewares.ScrapyZyteAPISpiderMiddleware'>,
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2024-07-01 09:11:06 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2024-07-01 09:11:06 [scrapy.core.engine] INFO: Spider opened
2024-07-01 09:11:06 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2024-07-01 09:11:06 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
Unhandled Error
Traceback (most recent call last):
  File "/Users/ralf/.pyenv/versions/django-scrapy_3124/lib/python3.12/site-packages/twisted/internet/base.py", line 504, in fireEvent
    DeferredList(beforeResults).addCallback(self._continueFiring)
  File "/Users/ralf/.pyenv/versions/django-scrapy_3124/lib/python3.12/site-packages/twisted/internet/defer.py", line 623, in addCallback
    return self.addCallbacks(callback, callbackArgs=args, callbackKeywords=kwargs)
  File "/Users/ralf/.pyenv/versions/django-scrapy_3124/lib/python3.12/site-packages/twisted/internet/defer.py", line 536, in addCallbacks
    self._runCallbacks()
  File "/Users/ralf/.pyenv/versions/django-scrapy_3124/lib/python3.12/site-packages/twisted/internet/defer.py", line 1078, in _runCallbacks
    current.result = callback(  # type: ignore[misc]
--- <exception caught here> ---
  File "/Users/ralf/.pyenv/versions/django-scrapy_3124/lib/python3.12/site-packages/twisted/internet/base.py", line 516, in _continueFiring
    callable(*args, **kwargs)
  File "/Users/ralf/.pyenv/versions/django-scrapy_3124/lib/python3.12/site-packages/twisted/internet/base.py", line 955, in _reallyStartRunning
    self._signals.install()
  File "/Users/ralf/.pyenv/versions/django-scrapy_3124/lib/python3.12/site-packages/twisted/internet/_signals.py", line 190, in install
    d.install()
  File "/Users/ralf/.pyenv/versions/django-scrapy_3124/lib/python3.12/site-packages/twisted/internet/_signals.py", line 149, in install
    signal.signal(signal.SIGINT, self._sigInt)
  File "/Users/ralf/.pyenv/versions/3.12.4/lib/python3.12/signal.py", line 58, in signal
    handler = _signal.signal(_enum_to_int(signalnum), _enum_to_int(handler))
builtins.ValueError: signal only works in main thread of the main interpreter

2024-07-01 09:11:06 [twisted] CRITICAL: Unhandled Error
Traceback (most recent call last):
  File "/Users/ralf/.pyenv/versions/django-scrapy_3124/lib/python3.12/site-packages/twisted/internet/base.py", line 504, in fireEvent
    DeferredList(beforeResults).addCallback(self._continueFiring)
  File "/Users/ralf/.pyenv/versions/django-scrapy_3124/lib/python3.12/site-packages/twisted/internet/defer.py", line 623, in addCallback
    return self.addCallbacks(callback, callbackArgs=args, callbackKeywords=kwargs)
  File "/Users/ralf/.pyenv/versions/django-scrapy_3124/lib/python3.12/site-packages/twisted/internet/defer.py", line 536, in addCallbacks
    self._runCallbacks()
  File "/Users/ralf/.pyenv/versions/django-scrapy_3124/lib/python3.12/site-packages/twisted/internet/defer.py", line 1078, in _runCallbacks
    current.result = callback(  # type: ignore[misc]
--- <exception caught here> ---
  File "/Users/ralf/.pyenv/versions/django-scrapy_3124/lib/python3.12/site-packages/twisted/internet/base.py", line 516, in _continueFiring
    callable(*args, **kwargs)
  File "/Users/ralf/.pyenv/versions/django-scrapy_3124/lib/python3.12/site-packages/twisted/internet/base.py", line 955, in _reallyStartRunning
    self._signals.install()
  File "/Users/ralf/.pyenv/versions/django-scrapy_3124/lib/python3.12/site-packages/twisted/internet/_signals.py", line 190, in install
    d.install()
  File "/Users/ralf/.pyenv/versions/django-scrapy_3124/lib/python3.12/site-packages/twisted/internet/_signals.py", line 149, in install
    signal.signal(signal.SIGINT, self._sigInt)
  File "/Users/ralf/.pyenv/versions/3.12.4/lib/python3.12/signal.py", line 58, in signal
    handler = _signal.signal(_enum_to_int(signalnum), _enum_to_int(handler))
builtins.ValueError: signal only works in main thread of the main interpreter

Unhandled Error
Traceback (most recent call last):
  File "/Users/ralf/.pyenv/versions/django-scrapy_3124/lib/python3.12/site-packages/twisted/internet/base.py", line 504, in fireEvent
    DeferredList(beforeResults).addCallback(self._continueFiring)
  File "/Users/ralf/.pyenv/versions/django-scrapy_3124/lib/python3.12/site-packages/twisted/internet/defer.py", line 623, in addCallback
    return self.addCallbacks(callback, callbackArgs=args, callbackKeywords=kwargs)
  File "/Users/ralf/.pyenv/versions/django-scrapy_3124/lib/python3.12/site-packages/twisted/internet/defer.py", line 536, in addCallbacks
    self._runCallbacks()
  File "/Users/ralf/.pyenv/versions/django-scrapy_3124/lib/python3.12/site-packages/twisted/internet/defer.py", line 1078, in _runCallbacks
    current.result = callback(  # type: ignore[misc]
--- <exception caught here> ---
  File "/Users/ralf/.pyenv/versions/django-scrapy_3124/lib/python3.12/site-packages/twisted/internet/base.py", line 516, in _continueFiring
    callable(*args, **kwargs)
  File "/Users/ralf/.pyenv/versions/django-scrapy_3124/lib/python3.12/site-packages/scrapy/utils/ossignal.py", line 26, in install_shutdown_handlers
    signal.signal(signal.SIGTERM, function)
  File "/Users/ralf/.pyenv/versions/3.12.4/lib/python3.12/signal.py", line 58, in signal
    handler = _signal.signal(_enum_to_int(signalnum), _enum_to_int(handler))
builtins.ValueError: signal only works in main thread of the main interpreter

2024-07-01 09:11:06 [twisted] CRITICAL: Unhandled Error
Traceback (most recent call last):
  File "/Users/ralf/.pyenv/versions/django-scrapy_3124/lib/python3.12/site-packages/twisted/internet/base.py", line 504, in fireEvent
    DeferredList(beforeResults).addCallback(self._continueFiring)
  File "/Users/ralf/.pyenv/versions/django-scrapy_3124/lib/python3.12/site-packages/twisted/internet/defer.py", line 623, in addCallback
    return self.addCallbacks(callback, callbackArgs=args, callbackKeywords=kwargs)
  File "/Users/ralf/.pyenv/versions/django-scrapy_3124/lib/python3.12/site-packages/twisted/internet/defer.py", line 536, in addCallbacks
    self._runCallbacks()
  File "/Users/ralf/.pyenv/versions/django-scrapy_3124/lib/python3.12/site-packages/twisted/internet/defer.py", line 1078, in _runCallbacks
    current.result = callback(  # type: ignore[misc]
--- <exception caught here> ---
  File "/Users/ralf/.pyenv/versions/django-scrapy_3124/lib/python3.12/site-packages/twisted/internet/base.py", line 516, in _continueFiring
    callable(*args, **kwargs)
  File "/Users/ralf/.pyenv/versions/django-scrapy_3124/lib/python3.12/site-packages/scrapy/utils/ossignal.py", line 26, in install_shutdown_handlers
    signal.signal(signal.SIGTERM, function)
  File "/Users/ralf/.pyenv/versions/3.12.4/lib/python3.12/signal.py", line 58, in signal
    handler = _signal.signal(_enum_to_int(signalnum), _enum_to_int(handler))
builtins.ValueError: signal only works in main thread of the main interpreter

2024-07-01 09:11:11 [zyte_api._retry] DEBUG: Starting call to 'zyte_api._async.AsyncZyteAPI.get.<locals>.request', this is the 1st time calling it.
2024-07-01 09:11:12 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://forum.djangoproject.com/robots.txt> (referer: None) ['zyte-api']
2024-07-01 09:11:12 [zyte_api._retry] DEBUG: Starting call to 'zyte_api._async.AsyncZyteAPI.get.<locals>.request', this is the 1st time calling it.
2024-07-01 09:11:14 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://forum.djangoproject.com> (referer: None) ['zyte-api']
2024-07-01 09:11:14 [scrapy.core.engine] INFO: Closing spider (finished)
2024-07-01 09:11:14 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 466,
 'downloader/request_count': 2,
 'downloader/request_method_count/GET': 2,
 'downloader/response_bytes': 159369,
 'downloader/response_count': 2,
 'downloader/response_status_count/200': 2,
 'elapsed_time_seconds': 7.554092,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2024, 7, 1, 9, 11, 14, 204464, tzinfo=datetime.timezone.utc),
 'log_count/CRITICAL': 2,
 'log_count/DEBUG': 7,
 'log_count/INFO': 12,
 'memusage/max': 92815360,
 'memusage/startup': 92815360,
 'response_received_count': 2,
 'robotstxt/request_count': 1,
 'robotstxt/response_count': 1,
 'robotstxt/response_status_count/200': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'scrapy-zyte-api/429': 0,
 'scrapy-zyte-api/attempts': 2,
 'scrapy-zyte-api/error_ratio': 0.0,
 'scrapy-zyte-api/errors': 0,
 'scrapy-zyte-api/fatal_errors': 0,
 'scrapy-zyte-api/mean_connection_seconds': 1.0916275415220298,
 'scrapy-zyte-api/mean_response_seconds': 1.2125371670117602,
 'scrapy-zyte-api/processed': 2,
 'scrapy-zyte-api/request_args/httpResponseBody': 2,
 'scrapy-zyte-api/request_args/httpResponseHeaders': 2,
 'scrapy-zyte-api/request_args/url': 2,
 'scrapy-zyte-api/sessions/use/disabled': 2,
 'scrapy-zyte-api/status_codes/200': 2,
 'scrapy-zyte-api/success': 2,
 'scrapy-zyte-api/success_ratio': 1.0,
 'scrapy-zyte-api/throttle_ratio': 0.0,
 'start_time': datetime.datetime(2024, 7, 1, 9, 11, 6, 650372, tzinfo=datetime.timezone.utc)}
2024-07-01 09:11:14 [scrapy.core.engine] INFO: Spider closed (finished)
[01/Jul/2024 09:11:14] "GET /dummy/ HTTP/1.1" 200 9

The browser shows “Finished.” and the Zyte API seems to work. I just wonder about the unhandled error “signal only works in main thread of the main interpreter”.
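As far as I understand the mechanism: Django's development server handles each request in a worker thread, and signal.signal() raises exactly this ValueError when called from any thread other than the main one, which is what Twisted's reactor attempts on startup. A minimal reproduction, independent of Scrapy and Django:

```python
import signal
import threading

def try_install_sigint_handler():
    """Mimic what Twisted's reactor does on startup: install a SIGINT handler."""
    try:
        signal.signal(signal.SIGINT, signal.default_int_handler)
        return "installed"
    except ValueError as exc:
        # In a non-main thread this raises ValueError, complaining that
        # signal only works in the main thread.
        return str(exc)

# From the main thread this would succeed; from a worker thread (like a
# Django request-handling thread) it fails.
outcome = []
worker = threading.Thread(target=lambda: outcome.append(try_install_sigint_handler()))
worker.start()
worker.join()
print(outcome[0])
```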

I get the same error (but only once), and the browser hangs, when I instead run the scraper from views.py with CrawlerRunner:

from django.http import HttpResponse
from myscrapyproject.spiders.dummy_spider import DummySpider
from scrapy.crawler import CrawlerRunner
from scrapy.utils.project import get_project_settings
from twisted.internet import reactor


def scrape_dummy(request):

    settings = get_project_settings()

    runner = CrawlerRunner(settings)
    deferred = runner.crawl(DummySpider)
    deferred.addBoth(lambda _: reactor.stop())
    reactor.run()

    return HttpResponse("Finished.")

When I change the last line to

reactor.run(installSignalHandlers=False)

the browser simply hangs without any error message. (I found the installSignalHandlers argument mentioned somewhere, but I do not remember where.) Whether I run the server from VS Code (in debug mode) or simply with python manage.py runserver makes no difference, by the way. For the sake of completeness, this is my spider:

from scrapy import Request, Spider


class DummySpider(Spider):
    name = "dummy_spider"

    def start_requests(self):
        url = "https://forum.djangoproject.com"
        yield Request(url, callback=self.parse)

    def parse(self, response):
        pass
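One workaround I am considering (not yet tried) is to launch the crawl in a child process, so that Scrapy's reactor and its signal handlers live in that process's main thread. A minimal sketch of the mechanism; the real invocation would be something like scrapy crawl dummy_spider run from the Scrapy project directory, but the demo below uses a harmless stand-in command instead:

```python
import subprocess
import sys

def run_in_child(argv):
    """Run a command in a child process and return (exit code, stdout)."""
    result = subprocess.run(argv, capture_output=True, text=True)
    return result.returncode, result.stdout

# In the Django view this would be something like (paths hypothetical):
#   run_in_child([sys.executable, "-m", "scrapy", "crawl", "dummy_spider"])
# Stand-in command to show the pattern:
code, out = run_in_child([sys.executable, "-c", "print('spider done')"])
```

The view would then block until the child process exits, the same as with CrawlerProcess, but without touching signal handlers in the request thread.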

I found this and this post dealing with Scrapy and Django, but only without the Zyte API. Can anybody help me make this work with the Zyte API as well?