Hello,
To run background jobs in Django, you would normally use an asynchronous task queue like Celery or RQ. I would like to ask about an "in-the-background" use case that cannot be solved by a task queue, but may be solved by spawning additional threads.
My Django server implements a data-analytics application that reads files from disk and computes statistics over them. Because the server accesses these files frequently, I keep a singleton object in the Django process that caches the contents of the files:
```python
CACHE = CacheSingleton()

@api_view(["GET"])
def view(request):
    ...
    # calls view_handler

def view_handler(file_name, operation):
    content = CACHE.get_content(file_name)
    if not content:
        content = read_file_from_disk(file_name)  # expensive
        CACHE.set_content(file_name, content)     # cache for later
    return calculate(content, operation)
```
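To make the question self-contained, here is roughly what the cache class looks like — a plain dict wrapper with no locking (whether that is safe alongside a background thread is part of my question):

```python
class CacheSingleton:
    """In-process cache mapping file names to file contents.

    Note: plain dict, no locking -- whether that is safe once a
    background thread also writes to it is part of my question.
    """

    def __init__(self):
        self._data = {}

    def get_content(self, file_name):
        # Returns None on a cache miss
        return self._data.get(file_name)

    def set_content(self, file_name, content):
        self._data[file_name] = content

    def __contains__(self, file_name):
        # Supports the `"a2" not in CACHE` check used below
        return file_name in self._data


CACHE = CacheSingleton()
```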
The users access these files in a sequence: if a user requests file `a1`, there is a 90% chance they will also want file `a2`.

Is it possible to launch a non-blocking background thread to pre-cache `a2` when the user accesses `a1`? Something like this answer from Stack Overflow:
```python
import threading

CACHE = CacheSingleton()

def precache(file_name):
    """Call this function in a background thread."""
    content = read_file_from_disk(file_name)
    CACHE.set_content(file_name, content)

def view_handler(file_name, operation):
    content = CACHE.get_content(file_name)
    if not content:
        content = read_file_from_disk(file_name)
        CACHE.set_content(file_name, content)
    if file_name == "a1" and "a2" not in CACHE:
        t = threading.Thread(target=precache, args=("a2",))
        t.daemon = True  # setDaemon() is deprecated since Python 3.10
        t.start()
    return calculate(content, operation)
```
What are the footguns here? Let's assume I am running this on a synchronous gunicorn with just 1 worker.
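For comparison, I also considered a module-level `concurrent.futures.ThreadPoolExecutor` instead of spawning a raw thread per request. Here is a runnable sketch with a plain dict and stub functions standing in for my real cache and file reader:

```python
from concurrent.futures import ThreadPoolExecutor

# Stubs standing in for the real cache and file reader from the question.
CACHE = {}

def read_file_from_disk(file_name):
    return f"contents of {file_name}"

def precache(file_name):
    CACHE[file_name] = read_file_from_disk(file_name)

# Module-level pool: bounds the number of concurrent pre-cache reads
# instead of creating an unbounded number of daemon threads.
PRECACHE_POOL = ThreadPoolExecutor(max_workers=2)

def view_handler(file_name):
    if file_name not in CACHE:
        CACHE[file_name] = read_file_from_disk(file_name)
    if file_name == "a1" and "a2" not in CACHE:
        PRECACHE_POOL.submit(precache, "a2")  # returns immediately
    return CACHE[file_name]
```

The pool at least caps how many threads can pile up, but I assume it shares the same cache-consistency questions as the raw-thread version.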
PS
I think I may be misusing gunicorn/Django here, and the real solution is to not cache data inside Django at all. The nature of the files (binary NumPy files) makes it difficult to use a database, so I want to see how far I can push my naive caching solution.