How to use Django together with a continuously running Scrapy script

I am currently building a Django project that involves a web crawling script running in the background. The crawl happens once per day, which is why the crawler script needs to run continuously.

I could not figure out a way to run scripts on startup, so my current method is to call the script from the `activate` view in views.py, which runs when a specific URL is visited:

```python
def activate(request):
    mainFunction()  # web crawling script
```

This solution works but is far from ideal. If someone visited the URL multiple times, multiple instances of the script could start running in parallel, or other undesirable behavior could occur.
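For what it's worth, one stopgap against parallel runs from repeated URL hits would be a process-local lock. This is only a sketch (`run_exclusive` is a hypothetical helper, not part of the original project) and it only guards within a single process, not across multiple workers:

```python
import threading

# Hypothetical guard: only one crawl may run at a time within this process.
_crawl_lock = threading.Lock()

def run_exclusive(task):
    """Run `task` only if no other run is in progress.

    Returns True if the task ran, False if the lock was already held.
    """
    if not _crawl_lock.acquire(blocking=False):
        return False  # another run is already in progress
    try:
        task()
        return True
    finally:
        _crawl_lock.release()
```

The view would then call `run_exclusive(mainFunction)` and could return an error response when it gets `False`.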

Below is my code for mainFunction() for reference (the continuously running web crawling script):

```python
import csv
import datetime

from scrapy.crawler import CrawlerProcess
from twisted.internet import reactor
from twisted.internet.task import deferLater

from .models import Products            # adjust to your project layout
from .spiders import AmazonPriceSpider  # adjust to your project layout


def sleep(result, *args, seconds):
    """Non-blocking sleep callback."""
    return deferLater(reactor, seconds, lambda: None)

def postASINs(result=None):
    # Write all ASINs from the database to the spider's input file.
    # (Accepts `result` so it can also be used as a Deferred callback.)
    ASINs = []
    for ASIN in Products.objects.all():
        ASINs.append(ASIN)
    filenameInput = 'AmazonPriceTracker/crawler/input.csv'
    with open(filenameInput, 'w', newline='') as file:
        writer = csv.writer(file)
        writer.writerow(ASINs)

def mainFunction():
    now = datetime.datetime.now()
    filenameOutput = 'AmazonPriceTracker/crawler/output.json'
    open(filenameOutput, 'w').close()  # truncate the output file
    process = CrawlerProcess(settings={
        "FEEDS": {
            filenameOutput: {"format": "json"},
        },
    })

    def _crawl(result, spider):
        open(filenameOutput, 'w').close()  # truncate before each run
        deferred = process.crawl(spider)
        deferred.addCallback(lambda results: print('waiting 10 seconds before restart...'))
        deferred.addCallback(sleep, seconds=10)
        deferred.addCallback(postASINs)
        deferred.addCallback(_crawl, spider)
        return deferred

    postASINs()
    _crawl(None, AmazonPriceSpider)
    process.start()
```

So what would be the best way to implement this functionality?
Your help will be greatly appreciated!

Actually, it doesn't need to run continuously if it only needs to run once per day.

You also don’t want to try and tie this in any way to your main Django process, especially if you’re looking to run this in a “production quality” deployment.

Take a look at Celery and Celery Beat for managing periodic and background tasks.
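To give a flavor of what that looks like, here is a minimal Celery Beat sketch. All names here are illustrative assumptions (the broker URL, the task name, the import path to the poster's crawler), and it requires a running broker such as Redis plus a worker and beat process:

```python
# Hypothetical celery.py for the project; names and paths are assumptions.
from celery import Celery
from celery.schedules import crontab

app = Celery("AmazonPriceTracker", broker="redis://localhost:6379/0")

@app.task
def crawl_amazon_prices():
    # Import inside the task so Django/Scrapy setup happens in the worker.
    # (Module path is an assumption, not from the original post.)
    from AmazonPriceTracker.crawler import mainFunction
    mainFunction()

# Schedule the task to run once per day at 03:00.
app.conf.beat_schedule = {
    "daily-price-crawl": {
        "task": crawl_amazon_prices.name,
        "schedule": crontab(hour=3, minute=0),
    },
}
```

You would then run `celery -A proj worker` and `celery -A proj beat` alongside your Django server.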

There are other queues and background task managers available, and multiple ways of doing this. But in general, this (or another method like it) is the direction you want to go for the most stable and consistent environment.

Thanks for your reply, Ken!

I had looked into Celery, but it looks like it's going to be hard work setting it up with Django, in addition to setting up its broker and backend. Is it all worth it for one simple script? Is there an easier and more noob-friendly way of doing this?

Django itself is fundamentally built around the request-response cycle. Anything outside of that really isn’t a good fit. You’re going to want to do this outside the normal server environment.

Like I said, there are other tools available.

If the only use of this script is as a daily task, you could probably do this quite easily by setting your script up as a custom admin script, and then running it once a day as a cron job.

Depending upon what the level of integration is between this script and the rest of your Django environment, you might not even need to make it an admin script. That’s really only necessary if you want it to have access to your Django models and/or other code within your project.
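For reference, a custom admin (management) command is just a module placed under `yourapp/management/commands/`. A minimal sketch (the app name, file name, and import path are assumptions, and this only runs inside a configured Django project):

```python
# yourapp/management/commands/runcrawl.py -- hypothetical location and name
from django.core.management.base import BaseCommand

class Command(BaseCommand):
    help = "Run the daily price-crawling script"

    def handle(self, *args, **options):
        # Import here so Django is fully configured before the crawler
        # touches any models (module path is an assumption).
        from AmazonPriceTracker.crawler import mainFunction
        mainFunction()
```

You could then invoke it as `python manage.py runcrawl`, including from a cron job, with full access to your models.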

If it really is just a daily script without any real connection to your project, then a regular Python script running as a cron job may be all you need.
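Either way, the scheduling side is a one-line crontab entry. This example is illustrative (the interpreter and script paths are assumptions); you would add it with `crontab -e`:

```shell
# Run the crawl script every day at 03:00, appending output to a log file.
0 3 * * * /usr/bin/python3 /path/to/daily_crawl.py >> /var/log/daily_crawl.log 2>&1
```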

I will look into the custom admin scripts. Thanks for the link.

Cheers!