Invalidating the cache doesn't happen automatically

A cached page should be invalidated when something changes in the underlying resource. I probably don’t understand enough about caching, so please help me out here.

Django implements “site-wide” caching, and I used this guide to set up my caching infrastructure (locally for now, to test).

    CACHES = {
        'default': {
            'BACKEND': 'django.core.cache.backends.locmem.LocMemCache',
        }
    }
    SESSION_ENGINE = "django.contrib.sessions.backends.cache"
    CACHE_TTL = 60 * 60 * 1  # one hour
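For context: per-site caching also requires the two cache middlewares, ordered correctly. A sketch of the relevant settings (values illustrative, setting names from the Django docs):

```python
# settings.py (sketch) — the per-site cache needs both cache middlewares,
# with UpdateCacheMiddleware first and FetchFromCacheMiddleware last.
MIDDLEWARE = [
    "django.middleware.cache.UpdateCacheMiddleware",
    "django.middleware.common.CommonMiddleware",
    "django.middleware.cache.FetchFromCacheMiddleware",
]

# Settings read by the cache middleware:
CACHE_MIDDLEWARE_ALIAS = "default"   # which cache from CACHES to use
CACHE_MIDDLEWARE_SECONDS = 60 * 60   # how long to cache each page
CACHE_MIDDLEWARE_KEY_PREFIX = ""     # useful when several sites share one cache
```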

When I run my tests now, it works too well: every test that changes information in the database fails. Do I have to manually add a flag for cache invalidation? Should this even work (or are my settings wrong)? I saw this question, but in my case I don’t use any decorators, only the site-wide caching. Nevertheless, I tested using vary_on_headers('Authorization'), which didn’t change the test results.

I would disable all caches for testing.

From the docs:

FetchFromCacheMiddleware caches GET and HEAD responses with status 200, where the request and response headers allow.

This means that subsequent requests to the same URL are going to return the same response until the cache expires. Your view doesn’t even execute in those cases.

You could probably change your tests to ensure that they don’t retrieve cached versions, but I’m not sure it’s worth the effort in that direction.

(See the source code for the middleware in django.middleware.cache for more detailed information.)

Okay, I see the point, but wouldn’t this happen in the live version as well? If someone sends a request, the underlying data changes, and they then resend the same request, do they get the same answer?

Also, shouldn’t a cache be invalidated automatically if the DB / underlying data changes (no matter the expiry date)?

Also, here is the list of my middleware; maybe there is something wrong here:


Correct, for the life of the cache.

Not necessarily - it really depends upon the application. Not all applications require “up-to-the-second” page accuracy.

(There’s also the issue with trying to determine which pages are affected by underlying data changes. If you’re going to invalidate the entire cache based upon any data change, it’s probably not worth caching at all. But, if you want to go that route, you can always implement the per-view cache.)

Cache-control is a hard problem, and one not suited for a one-size-fits-all answer.


What I imagined was a method where I could manually invalidate the cache for a certain page/object, because my data needs to be correct all the time.

I see that this is a hard problem, but I imagine a lot of people have a problem similar to mine, where a database call is often repeated by the user (e.g. by refreshing a page) and would be served much faster from the cache. I suppose the completely manual caching from here would be the correct interface for this.

Is there no interface for invalidating a certain dataset/page/model in the cache?

Then don’t cache. Those two requirements are in direct opposition to each other.

Except the cache needs to know when the data has changed, which means the database needs to be queried anyway.

Keep in mind that PostgreSQL is already going to cache as much data as it physically can. Your queries do not necessarily create IO activity by accessing physical disk.

One of my major systems has a 4 GB database - I’ve got the database memory allocated to the level that everything remains memory resident. The level of read requests on the drives is remarkably low.

You can build your own cache manager on top of the lower-level API. <opinion> Personally, I doubt the value of it. I think your efforts would be better spent tuning other parts of your environment. </opinion>

The site-cache is a page-level cache indexed (effectively) by URL. I’m not sure that a general solution to identifying which URLs are affected by a data change is practical.


Thanks for the detailed answers!

I believe it makes a difference whether the database is queried within the view at request time or whether I add a signal which updates the cache from the DB in the background. I don’t mind the database query once in a while, as the data isn’t changed that often. But when it is changed, it shouldn’t be shown incorrectly afterwards.

This is not really my problem. My problem is that the views/viewsets of my API make a dozen Postgres queries to gather all the info for one request (the N+1 problem). Especially with list views this takes a long time. I tried mitigating this with select_related/prefetch_related, which could help as an alternative, but it is really messy. I just wanted a cleaner option code-wise, as we would otherwise need to maintain the different selection parameters manually from now on.

I beg to differ here; I believe this depends on the strategy. As far as I understand, a write-through cache, for example, is supposed to solve exactly this problem.

Fixing that problem is going to pay a lot higher dividends than trying to engineer a caching solution.

The problem here is the indirect relationship between a URL and the underlying data. A write-through cache is appropriate at the database layer. But there’s no direct link between a URL and database tables. You cannot definitively identify (intrinsically) what URLs will be affected when any specific instance of a Model is updated.

The old joke:

There are 2 hard problems in computer science: cache invalidation, naming things, and off-by-1 errors.

has more truth to it than not.

Side note: To go back to your original question regarding usage for testing, the docs go on to say:

Responses to requests for the same URL with different query parameters are considered to be unique pages and are cached separately.

This means that you could append an arbitrary query variable to your URL, ensuring that the test won’t retrieve a cached version.
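For example, with a small helper (the name is made up):

```python
import uuid

def cache_busted(url):
    """Append a unique query parameter so the cache middleware treats
    the request as a page it has never seen before."""
    sep = "&" if "?" in url else "?"
    return f"{url}{sep}cachebust={uuid.uuid4().hex}"

# In a test: self.client.get(cache_busted("/api/products/"))
```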

I suppose this is correct; it wasn’t entirely clear to me, as I expected some kind of connection. Understanding the mechanism now, it is obvious that only this “dumb data” is stored in the cache.

The local test was only to see if there is some kind of performance improvement, and to test out the strategy and find errors like this one. Thanks for all your help!


Hi, depending on your system requirements you can use django-cachalot, which is an awesome library.


Thanks a lot for the tip! I am currently testing it, and at the moment I have to say everything looks very good!