Using threads for pre-caching data "in the background"

Hello,

To do background jobs in Django, you normally use an asynchronous task queue like Celery or RQ. I would like to ask about an “in-the-background” use case that cannot be solved by a task queue, but may be solved by spawning additional threads.

My Django server implements a data analytics application that reads files from disk and computes statistics over them. Because the server accesses these files frequently, I keep a singleton object in my Django process that caches the contents of the files:

CACHE = CacheSingleton()

@api_view(["GET"])
def view(request):
    ...
    # parses the request and calls view_handler

def view_handler(file_name, operation):
    content = CACHE.get_content(file_name)
    if not content:
        content = read_file_from_disk(file_name)  # expensive
        CACHE.set_content(file_name, content)  # cache for later

    return calculate(content, operation)
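
For context, CacheSingleton is nothing sophisticated; it is essentially a thin wrapper around an in-process dict, roughly like this (a simplified sketch, not the exact implementation):

class CacheSingleton:
    """Naive in-process cache keyed by file name."""

    def __init__(self):
        self._store = {}

    def get_content(self, file_name):
        # Returns None on a cache miss.
        return self._store.get(file_name)

    def set_content(self, file_name, content):
        self._store[file_name] = content

    def __contains__(self, file_name):
        # Supports the `"a2" not in CACHE` check used further down.
        return file_name in self._store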

Users access these files in a series: if a user requests file a1, there is a 90% chance they will also want file a2.

Is it possible to launch a non-blocking thread in the background to pre-cache a2 when a user accesses a1? Something like this answer from Stack Overflow:

import threading

CACHE = CacheSingleton()

def precache(file_name):
    """Call this function in a background thread."""
    content = read_file_from_disk(file_name)
    CACHE.set_content(file_name, content)


def view_handler(file_name, operation):
    content = CACHE.get_content(file_name)
    if not content:
        content = read_file_from_disk(file_name)
        CACHE.set_content(file_name, content)

    # Kick off a daemon thread so the request itself is not blocked.
    if file_name == "a1" and "a2" not in CACHE:
        t = threading.Thread(target=precache, args=("a2",), daemon=True)
        t.start()

    return calculate(content, operation)

What are the footguns here? Let’s assume I am running this under synchronous Gunicorn (the default sync worker class) with just one worker.

PS

I think I may be misusing Gunicorn/Django here; the proper solution is probably to not cache data inside Django at all. However, the nature of the files (binary NumPy files) makes it difficult to use a database, so I want to see how far I can push my naive caching solution.

You don’t want to do this. Search this forum for the multiple topics explaining why trying to manage background threads from within a Django project is a bad idea.

To get started, see Django can't start multiple Threads. (There are other forum threads along this line; this is just the first one I found.)

For a reliable solution, you want to run this outside the context of the Django project. Either Celery or an external service process would be better.
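
For example, the pre-caching step maps fairly naturally onto a Celery task. Here is a minimal sketch, assuming you also move from the in-process singleton to a shared cache backend (e.g. Redis or Memcached via Django's cache framework), because a Celery worker runs in a separate process and cannot see objects living inside your Gunicorn worker:

# tasks.py -- illustrative sketch only
from celery import shared_task
from django.core.cache import cache  # assumes a shared backend such as Redis

from myapp.files import read_file_from_disk  # your existing helper (import path is illustrative)

@shared_task
def precache(file_name):
    # Skip the expensive read if another request already warmed this entry.
    if cache.get(file_name) is None:
        content = read_file_from_disk(file_name)
        cache.set(file_name, content, timeout=60 * 60)

In the view, instead of starting a thread you would enqueue the task:

if file_name == "a1" and cache.get("a2") is None:
    precache.delay("a2")

One caveat: the cached value is pickled by the cache backend, so whether this pays off depends on how large the NumPy contents are once serialized.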
