Apparent inexplicable database cache culling

I happened to discover that entries in our database cache, which uses no expiration timeouts (all invalidation is handled in our code), appear to be getting culled, and I cannot figure out what's causing it.

I first noticed it because some newly loaded data triggered a ZeroDivisionError during an export operation. I had cleared the caches before the export, so the export itself was creating cache entries, and I had a console print showing the values as they were set in the cache table.

But when I restarted the export from the beginning the next day (in the Django shell), I saw the same cached values being set again immediately! I tested all of our caching code and could not reproduce the issue, so I started querying the size of the cache table periodically to monitor its growth. MAX_ENTRIES is set to 1,500,000, but I occasionally observed drops in the table size:

In [1]: from django.db import connections
   ...: from django.conf import settings
   ...: 
   ...: cache_settings = settings.CACHES.get("default", {})
   ...: 
   ...: table_name = cache_settings.get('LOCATION', 'django_cache')
   ...: 
   ...: db_alias = cache_settings.get('DATABASE', 'default')
   ...: 
   ...: connection = connections[db_alias]
   ...: 
   ...: with connection.cursor() as cursor:
   ...:     cursor.execute(f"SELECT COUNT(*) FROM {table_name}")
   ...:     row = cursor.fetchone()
   ...:     count = row[0]
   ...: 

In [2]: count
Out[2]: 111269

...

In [6]: count
Out[6]: 107421

I saw it happen a few times (checking infrequently). At one point, it dropped down to 95K.

I'm using None for the timeout when I set values, so each cache table entry has an expires value of '9999-12-31 23:59:59+00:00'.
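
To double-check that nothing has a near-term expiry, a quick query like this (a rough sketch reusing the connection and table_name from the shell session above) should show the earliest expires value in the table:

with connection.cursor() as cursor:
    # Sketch: the minimum "expires" value in the cache table; if it's the
    # far-future sentinel, nothing can be disappearing due to expiration.
    cursor.execute(f"SELECT MIN(expires) FROM {table_name}")
    print(cursor.fetchone()[0])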

This is the setting for CACHES:

In [1]: from django.conf import settings

In [2]: settings.CACHES
Out[2]: 
{'default': {'BACKEND': 'django.core.cache.backends.db.DatabaseCache',
  'LOCATION': 'tracebase_cache_table',
  'TIMEOUT': None,
  'OPTIONS': {'MAX_ENTRIES': 1500000},
  'KEY_PREFIX': 'PROD'}}
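
For reference, the database cache backend also supports a CULL_FREQUENCY option: when the entry count reaches MAX_ENTRIES, roughly 1/CULL_FREQUENCY of the entries are culled (Django's default is 3). We don't set it, so if built-in culling were somehow the culprit, it would be using that default. Spelled out hypothetically, the setting would look like this:

CACHES = {
    "default": {
        "BACKEND": "django.core.cache.backends.db.DatabaseCache",
        "LOCATION": "tracebase_cache_table",
        "TIMEOUT": None,
        "OPTIONS": {
            "MAX_ENTRIES": 1500000,
            # Django's default; cull ~1/3 of entries when MAX_ENTRIES is reached
            "CULL_FREQUENCY": 3,
        },
        "KEY_PREFIX": "PROD",
    }
}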

This is how we’re setting the cache values:

        cachekey = get_cache_key(rec, cache_func_name)
        cache.set(cachekey, value, timeout=None, version=1)
        if settings.DEBUG:
            print(f"Setting cache {cachekey} to {value}")

This is how we’re retrieving cached values:

        good_cache = True
        uncached = object()
        cachekey = get_cache_key(rec, cache_func_name)
        result = cache.get(cachekey, uncached)
        if result is uncached:
            result = None
            good_cache = False

And whenever we delete any caches, there’s a debug print, but I never see it during my export:

        delete_keys = []
        # For every cached property, delete the cache value
        for cached_function in self.get_my_cached_method_names():
            cache_key = get_cache_key(self, cached_function)
            if settings.DEBUG:
                print(f"Deleting cache {cache_key}")
            delete_keys.append(cache_key)
        if len(delete_keys) > 0:
            cache.delete_many(delete_keys)

I’ve asked our DB admin to see if they applied a culling mechanism without my knowledge.

This is an example of the console output from my cache set method that clued me in that caches that should have already existed were being set again:

Setting cache PeakGroupLabel.802620.enrichment_fraction to 0.0
Setting cache PeakGroupLabel.802620.enrichment_abundance to 0.0
Setting cache PeakGroupLabel.802620.normalized_labeling to None
Setting cache PeakGroupLabel.802621.enrichment_fraction to 0.0144748856194831
Setting cache PeakGroupLabel.802621.enrichment_abundance to 27084.94942029584
Setting cache PeakGroupLabel.802621.normalized_labeling to None
Setting cache PeakGroupLabel.802622.enrichment_fraction to 0.20433205129860907
Setting cache PeakGroupLabel.802622.enrichment_abundance to 4704198.803759875
Setting cache PeakGroupLabel.802622.normalized_labeling to None

I’m running out of ideas to figure out why my caches aren’t persisting. Do you guys have any suggestions? Do you think that somehow culling is happening? How can I confirm that or rule it out?

We're on Django 4.2.27.

Our DB admin said they don't have a separate culling operation, so I'm still at a loss. I figure that in order to rule out a Django bug, I would have to retry the export with a debug print in _cull in the Django code, because we have only one explicit cache-removal operation that doesn't completely wipe out the cache table…

I thought for sure my debugging would reveal how records were being deleted from our cache table, but it has only deepened the mystery. To get at the source, I extended DatabaseCache to try to catch and report on delete/cull events:

from django.conf import settings
from django.core.cache.backends.db import DatabaseCache
from django.db import connections


def get_num_cache_rows():
    cache_settings = settings.CACHES.get("default", {})
    table_name = cache_settings.get("LOCATION", "django_cache")
    db_alias = cache_settings.get("DATABASE", "default")
    connection = connections[db_alias]
    with connection.cursor() as cursor:
        cursor.execute(f"SELECT COUNT(*) FROM {table_name}")
        row = cursor.fetchone()
        count = row[0]
    return count


class DebugDatabaseCache(DatabaseCache):
    """Override of DatabaseCache methods that delete cache entries to see why caches are apparently being culled way
    before max entries (1500000) is hit."""

    def _cull(self, db, cursor, now, num):
        from DataRepo.utils.exceptions import AggregatedErrors

        print(AggregatedErrors.get_trace())
        print("cache._cull CALLED!  See above trace")
        print(
            "CACHE DEBUG:\n"
            f"\t_cull(self, db={db}, cursor={cursor}, now={now}, num={num})\n"
            f"\tself._max_entries: {self.max_entries}\n"
            f"\tsettings.CACHES: {settings.caches}\n"
            f"Count BEFORE _cull: {get_num_cache_rows()}"
        )
        super()._cull(db, cursor, now, num)
        print(f"\tCount AFTER _cull: {get_num_cache_rows()}")

    def delete(self, key, version=None):
        from DataRepo.utils.exceptions import AggregatedErrors

        print(AggregatedErrors.get_trace())
        print("cache.delete CALLED!  See above trace")
        return super().delete(key, version=version)

    def delete_many(self, keys, version=None):
        from DataRepo.utils.exceptions import AggregatedErrors

        print(AggregatedErrors.get_trace())
        print("cache.delete_many CALLED!  See above trace")
        super().delete_many(keys, version=version)

    def clear(self):
        from DataRepo.utils.exceptions import AggregatedErrors

        print(AggregatedErrors.get_trace())
        print("cache.clear CALLED!  See above trace")
        super().clear()

    def set(self, *args, **kwargs):
        print("SANITY CHECK: cache.set CALLED", end="\r")
        super().set(*args, **kwargs)

I updated the default cache BACKEND in settings to DataRepo.models.hier_cached_model.DebugDatabaseCache. Note that I extended the set method only as a sanity check, because my initial run showed none of the other debug output even while I observed the row count in the cache table drop.
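
In other words, the settings now look something like this (the only change from the original CACHES shown above is the BACKEND path):

CACHES = {
    "default": {
        # Debug subclass, so the overridden delete/cull methods are used everywhere
        "BACKEND": "DataRepo.models.hier_cached_model.DebugDatabaseCache",
        "LOCATION": "tracebase_cache_table",
        "TIMEOUT": None,
        "OPTIONS": {"MAX_ENTRIES": 1500000},
        "KEY_PREFIX": "PROD",
    }
}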

I ran my export code again in one shell and added an additional check in another shell:

In [1]: from django.db import connections
   ...: from django.conf import settings
   ...: import time
   ...: 
   ...: cache_settings = settings.CACHES.get("default", {})
   ...: table_name = cache_settings.get('LOCATION', 'django_cache')
   ...: db_alias = cache_settings.get('DATABASE', 'default')
   ...: connection = connections[db_alias]
   ...: 
   ...: def get_cache_table_size():
   ...:     with connection.cursor() as cursor:
   ...:         cursor.execute(f"SELECT COUNT(*) FROM {table_name}")
   ...:         row = cursor.fetchone()
   ...:         return row[0]
   ...: 
   ...: try:
   ...:     last_size = -1
   ...:     while True:
   ...:         size = get_cache_table_size()
   ...:         print(f"{time.strftime('%Y-%m-%d %H:%M:%S')}: {size}")
   ...:         if size < last_size:
   ...:             print("^^CULL EVENT!")
   ...:         last_size = size
   ...:         time.sleep(10)
   ...: except KeyboardInterrupt:
   ...:     print("\nLoop stopped by user.")
   ...: 

After a few tries, I got fewer and fewer cache updates (and incidentally, I had put end="\r" in my set_cache debug print so it wouldn't overwhelm my console).

The confusing thing is that while I still see absolutely zero debug prints from the extended methods (aside from set), the other shell session is showing drops in the cache table size:

2026-01-30 11:33:57: 64692
2026-01-30 11:34:07: 64692
2026-01-30 11:34:17: 64692
2026-01-30 11:34:27: 64692
2026-01-30 11:34:37: 64692
2026-01-30 11:34:47: 64692
2026-01-30 11:34:57: 65009
2026-01-30 11:35:07: 65543
2026-01-30 11:35:17: 66114
2026-01-30 11:35:27: 66683
2026-01-30 11:35:37: 67245
2026-01-30 11:35:47: 67391
2026-01-30 11:35:57: 67391
2026-01-30 11:36:07: 67391
2026-01-30 11:36:17: 67391
2026-01-30 11:36:27: 67391
2026-01-30 11:36:37: 67391
2026-01-30 11:36:47: 67391
2026-01-30 11:36:57: 67391
2026-01-30 11:37:07: 67391
2026-01-30 11:37:17: 67391
2026-01-30 11:37:27: 67391
2026-01-30 11:37:37: 67391
2026-01-30 11:37:47: 67391
2026-01-30 11:37:57: 67391
2026-01-30 11:38:07: 67391
2026-01-30 11:38:17: 67391
2026-01-30 11:38:27: 67391
2026-01-30 11:38:38: 67391
2026-01-30 11:38:48: 67391
2026-01-30 11:38:58: 67391
2026-01-30 11:39:08: 67391
2026-01-30 11:39:18: 67391
2026-01-30 11:39:28: 67391
2026-01-30 11:39:38: 67391
2026-01-30 11:39:48: 67391
2026-01-30 11:39:58: 67391
2026-01-30 11:40:08: 67391
2026-01-30 11:40:18: 67391
2026-01-30 11:40:28: 67391
2026-01-30 11:40:38: 67391
2026-01-30 11:40:48: 67391
2026-01-30 11:40:58: 67391
2026-01-30 11:41:08: 67391
2026-01-30 11:41:18: 67391
2026-01-30 11:41:28: 67391
2026-01-30 11:41:38: 67391
2026-01-30 11:41:48: 67391
2026-01-30 11:41:58: 67391
2026-01-30 11:42:08: 66593
^^CULL EVENT!
2026-01-30 11:42:18: 61406
^^CULL EVENT!
2026-01-30 11:42:28: 55770
^^CULL EVENT!
2026-01-30 11:42:38: 51944
^^CULL EVENT!
2026-01-30 11:42:48: 51889
^^CULL EVENT!
2026-01-30 11:42:58: 51895
2026-01-30 11:43:08: 51895
2026-01-30 11:43:18: 51899
2026-01-30 11:43:28: 51899
2026-01-30 11:43:38: 51899
2026-01-30 11:43:48: 51899
2026-01-30 11:43:58: 51899
2026-01-30 11:44:08: 51899

I went and checked the dev server's logs, and no one has been using the site today; the tracebase log files were last modified yesterday:

[rleach@tb9-dev httpd]$ ls -halFt
total 37M
-rw-r--r--.  1 root root 823K Jan 30 11:14 access_log
-rw-r--r--.  1 root root 5.1K Jan 29 15:28 tracebase-access_log
-rw-r--r--.  1 root root  30K Jan 29 15:28 tracebase-zabbix_log
-rw-r--r--.  1 root root  57K Jan 29 15:14 tracebase-error_log
...

(The access log is all filled with zabbix activity.)

Is there any other method I should be extending to find out what could be deleting cache table rows?

OK, I found where the cache entry deletion appears to (maybe?) be happening. I added more method extensions and found that _base_delete_many is being called (just apparently not through either delete or delete_many, because I don't get those debug outputs).

This is the pseudo-trace that accompanies it:

  File "/var/www/tracebase/manage.py", line 22, in <module>                                                                                                                                                                                                                                                                 
    main()                                                                                                                                                                                                                                                                                                                  
  File "/var/www/tracebase/manage.py", line 18, in main                                                                                                                                                                                                                                                                     
    execute_from_command_line(sys.argv)                                                                                                                                                                                                                                                                                     
  File "<ipython-input-1-9cb78f3ff64b>", line 24, in <module>                                                                                                                                                                                                                                                               
    se.export()                                                                                                                                                                                                                                                                                                             
  File "/var/www/tracebase/DataRepo/utils/studies_exporter.py", line 110, in export                                                                                                                                                                                                                                         
    for line in AdvancedSearchDownloadView.tsv_template_iterator(                                                                                                                                                                                                                                                           
  File "/var/www/tracebase/DataRepo/views/search/download.py", line 161, in tsv_template_iterator                                                                                                                                                                                                                           
    yield rowtmplt.render({"qry": qry, "row": row})                                                                                                                                                                                                                                                                         
  File "/var/www/tracebase/DataRepo/models/hier_cached_model.py", line 128, in get_result                                                                                                                                                                                                                                   
    result, is_cache_good = get_cache(self, f.__name__)                                                                                                                                                                                                                                                                     
  File "/var/www/tracebase/DataRepo/models/hier_cached_model.py", line 154, in get_cache                                                                                                                                                                                                                                    
    result = cache.get(cachekey, uncached)                                                                                                                                                                                                                                                                                  
  File "/var/www/tracebase/DataRepo/models/hier_cached_model.py", line 85, in _base_delete_many                                                                                                                                                                                                                             
    print(AggregatedErrors.get_trace())                                                                                                                                                                                                                                                                                     
  File "/var/www/tracebase/DataRepo/utils/exceptions.py", line 2315, in get_trace                                                                                                                                                                                                                                           
    return cls.get_buffered_traceback_string()                                                                                                                                                                                                                                                                              
  File "/var/www/tracebase/DataRepo/utils/exceptions.py", line 2332, in get_buffered_traceback_string                                                                                                                                                                                                                       
    for step in traceback.format_stack()

This is the debug data that I print below the pseudo-trace:

cache._base_delete_many CALLED!  See above trace.  DEBUG INFO:                                                                                                                                                                                                                                                              
        _base_delete_many(self, keys=[])                                                                                                                                                                                                                                                                                    
        self._max_entries: 1500000                                                                                                                                                                                                                                                                                          
        settings.CACHES: {'default': {'BACKEND': 'DataRepo.models.hier_cached_model.DebugDatabaseCache', 'LOCATION': 'tracebase_cache_table', 'TIMEOUT': None, 'OPTIONS': {'MAX_ENTRIES': 1500000}, 'KEY_PREFIX': 'PROD'}}                                                                                                  
        Count BEFORE _base_delete_many: 10151                                                                                                                                                                                                                                                                               
        Count AFTER _base_delete_many: 10148                                                                                                                                                                                                                                                                                

When this starts happening, it happens over and over, consecutively. Here's the immediate next one, but note that its "before" cache table row count is lower than the "after" count from the previous message:

cache._base_delete_many CALLED!  See above trace.  DEBUG INFO:
        _base_delete_many(self, keys=[])
        self._max_entries: 1500000
        settings.CACHES: {'default': {'BACKEND': 'DataRepo.models.hier_cached_model.DebugDatabaseCache', 'LOCATION': 'tracebase_cache_table', 'TIMEOUT': None, 'OPTIONS': {'MAX_ENTRIES': 1500000}, 'KEY_PREFIX': 'PROD'}}
        Count BEFORE _base_delete_many: 10139
        Count AFTER _base_delete_many: 10136

So maybe the culling is happening concurrently/asynchronously?

This is my extension of the _base_delete_many method:

    def _base_delete_many(self, keys):
        from DataRepo.utils.exceptions import AggregatedErrors

        before = get_num_cache_rows()
        result = super()._base_delete_many(keys)
        after = get_num_cache_rows()

        if before > after:
            print(AggregatedErrors.get_trace())
            print(
                "cache._base_delete_many CALLED!  See above trace.  DEBUG INFO:\n"
                f"\t_base_delete_many(self, keys={keys})\n"
                f"\tself._max_entries: {self._max_entries}\n"
                f"\tsettings.CACHES: {settings.CACHES}\n"
                f"\tCount BEFORE _base_delete_many: {before}\n"
                f"\tCount AFTER _base_delete_many: {after}"
            )

        return result

I'm going to see if I can log the key and result of the cache.get() call that ends up triggering _base_delete_many, to see whether that is significant.
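
Roughly, I'm thinking of a get() override along these lines (just a sketch; I haven't run it yet):

    def get(self, key, default=None, version=None):
        # Sketch: log which key is being fetched, so that if this get() ends up
        # triggering _base_delete_many, the two can be correlated in the console.
        print(f"cache.get called for key {key}", end="\r")
        return super().get(key, default=default, version=version)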

Also, I realized that my pseudo-trace (by default) prunes the trace at the point where site-packages comes in. I’m going to turn that pruning off so I can see the full pseudo-trace.

Oh, and this point in the trace is in this method:

        good_cache = True
        uncached = object()
        cachekey = get_cache_key(rec, cache_func_name)
        result = cache.get(cachekey, uncached)  # <------- HERE
        if result is uncached:
            result = None
            good_cache = False

OK. Something is happening OUTSIDE this code: I stopped my export but hadn't stopped the loop monitoring the cache table size, and it continued to detect culling events, so I don't think my extended pseudo-trace will matter. The deletions HAVE to be coming from other running code (as I suspected above when I noted that the counts were discontiguous). Time to check the log file again… Maybe even check running background processes…
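
If the database backend is PostgreSQL, something like this run through a Django cursor should show what other sessions are connected to the same database and what they are running (sketch only):

from django.db import connections

# Sketch (assumes PostgreSQL): list other active sessions on this database.
# A long-running job deleting cache rows should show up in the query column.
with connections["default"].cursor() as cursor:
    cursor.execute(
        """
        SELECT pid, state, query_start, left(query, 120)
        FROM pg_stat_activity
        WHERE datname = current_database()
          AND pid <> pg_backend_pid()
        """
    )
    for row in cursor.fetchall():
        print(row)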

OMG. I found a tmux session doing a really long running inefficient method that involves occasional cache deletion. :man_facepalming: A refactor of that method has been on my todo list for a very long time.

Well, at least I learned a lot more about caching. FML
