I am currently building a Django project that involves a web crawling script running in the background. The crawl happens once per day, which is why the crawler script needs to run continuously.
I could not figure out a way to run scripts on startup, so my current method is to call the script in views.py from an activate function, which is triggered by visiting a specific URL:
def activate(request):
    mainFunction()  # web crawling script
This solution works but is far from ideal. If someone visited the URL multiple times, multiple instances of the script could start running in parallel, or other undesirable behavior could occur.
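A simple guard would at least stop repeat visits from spawning duplicate instances. Here is a rough sketch of what I mean (the flag is process-local, so it would not help across multiple worker processes, and the view still blocks once the crawl loop starts):

import threading

from django.http import HttpResponse

_start_lock = threading.Lock()
_crawler_started = False

def activate(request):
    global _crawler_started
    with _start_lock:
        if _crawler_started:
            # Repeat visits hit this branch instead of starting a new crawl
            return HttpResponse('Crawler already running.')
        _crawler_started = True
    mainFunction()  # blocks this request thread indefinitely
    return HttpResponse('Crawler stopped.')  # effectively never reached

Even with a guard like that, kicking the crawler off from a view still feels like a workaround.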
Below is my code for mainFunction() for reference (the continuously running web crawling script):
import csv
import datetime

from scrapy.crawler import CrawlerProcess
from twisted.internet import reactor
from twisted.internet.task import deferLater

# Products (the Django model) and AmazonPriceSpider (the Scrapy spider)
# are imported from elsewhere in the project.


def sleep(*args, seconds):
    """Non-blocking sleep callback."""
    return deferLater(reactor, seconds, lambda: None)


def postASINs():
    # Dump the tracked ASINs to the CSV file the spider reads as input.
    ASINs = []
    for ASIN in Products.objects.all():
        ASINs.append(ASIN)
    filenameInput = 'AmazonPriceTracker/crawler/input.csv'
    with open(filenameInput, 'w') as file:
        writer = csv.writer(file)
        writer.writerow(ASINs)


def mainFunction():
    now = datetime.datetime.now()
    filenameOutput = 'AmazonPriceTracker/crawler/output.json'
    # Truncate the output file so each run starts from an empty feed.
    file = open(filenameOutput, 'w')
    file.close()

    process = CrawlerProcess(settings={
        "FEEDS": {
            filenameOutput: {"format": "json"},
        },
    })

    def _crawl(result, spider):
        # Truncate the output file, run the spider, wait 10 seconds,
        # refresh the input CSV, then schedule the next run.
        file = open(filenameOutput, 'w')
        file.close()
        deferred = process.crawl(spider)
        deferred.addCallback(lambda results: print('waiting 10 seconds before restart...'))
        deferred.addCallback(sleep, seconds=10)
        deferred.addCallback(lambda _: postASINs())
        deferred.addCallback(_crawl, spider)
        return deferred

    postASINs()
    _crawl(None, AmazonPriceSpider)
    process.start()
So what would be the best way to implement this functionality?
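For example, would launching the crawler from an AppConfig.ready() hook in a background thread be the right direction? A rough sketch of what I have in mind (MyAppConfig and the app name 'myapp' are placeholders):

import os
import threading

from django.apps import AppConfig


class MyAppConfig(AppConfig):
    name = 'myapp'

    def ready(self):
        # Under runserver's autoreloader, ready() runs in both the watcher
        # process and the reloaded child; RUN_MAIN is 'true' only in the
        # child. ready() also fires for management commands like migrate.
        if os.environ.get('RUN_MAIN') != 'true':
            return
        # Caveat: CrawlerProcess.start() installs signal handlers, which
        # only works in the main thread; in a background thread I would
        # presumably need CrawlerRunner plus
        # reactor.run(installSignalHandlers=False) instead.
        threading.Thread(target=mainFunction, daemon=True).start()

I am not sure whether this is safe, or whether there is a more standard approach.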
Your help will be greatly appreciated.