`threading.local()` questions

As at least one user here knows, I have a django-related package that I want to publish on PyPI, so I’ve been going the extra mile to ensure that it will work in environments other than our own. My concern is about a touchy topic: essentially, global variables. I converted my global variables into class attributes, but they are mutable and my goal is to make them thread-safe. I.e. I don’t want one user’s job that changes a global to affect another user’s job.

While I am open to considering alternate design suggestions that would avoid using what are essentially mutable global variables, I think I have a very strong case for this design pattern, and overall requirements that alternate design patterns (that would avoid globals) simply do not meet. So I would like to focus this discussion on how to (or if I can) use threading.local() to make the usage of my globals/class-attributes safe. If you want to know why I have employed globals, you can read my stack post about it when I considered alternate design patterns and decided there was not another pattern that met my requirements.

My code does not explicitly use threads, nor do I intend for my code to create threads in the future. The threads I’m concerned about are the implicit ones (used either by apache or by django’s core code to service requests). Essentially, any concurrent executions with shared memory.

I don’t know much about threading.local(), so I don’t know if it does what I imagine it does. All I want to do is something like this:

from threading import local
from django.db.models import Model  # assuming a standard Django model base

class MyModelSuperClass(Model):
    data = local()
    data.auto_update_mode = True

    @classmethod
    def set_auto_update_mode(cls, val=True):
        cls.data.auto_update_mode = val

All of the models inherit from MyModelSuperClass. MyModelSuperClass overrides various Model methods, e.g. .save(), to perform auto-updates of various fields in each model, but sometimes some code may want to change its save behavior or buffer the auto-updates and then execute them later.

Then in some 2 concurrently executed views on the django site, I want their data to be independent of one another:

class EditView(...):
    save_model_changes_with_default_autoupdate_behavior()

class LoadView(...):
    MyModelSuperClass.set_auto_update_mode(False)
    load_a_bunch_of_models_from_file_and_buffer_autoupdates()
    # Restore the default
    MyModelSuperClass.set_auto_update_mode()
    perform_mass_auto_updates()

i.e. I don’t want the set in the LoadView to change the behavior of saves in the EditView if the apache server threads these executions with shared memory.

Does my example usage of local() in MyModelSuperClass to set MyModelSuperClass.data protect me from the 2 executions interfering with one another?
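Here’s a minimal standalone experiment I could run to test this (no Django involved, names invented for illustration). One subtlety it exposes: an attribute assigned to a local() at class-definition time only exists in the thread that executed the class body, so a worker thread won’t see the default:

```python
import threading

class Holder:
    data = threading.local()
    data.auto_update_mode = True  # set only in the thread running this class body

def check(results):
    # In a worker thread, the attribute set above is absent
    results.append(hasattr(Holder.data, "auto_update_mode"))

results = []
t = threading.Thread(target=check, args=(results,))
t.start()
t.join()

print(hasattr(Holder.data, "auto_update_mode"))  # True in the main thread
print(results[0])                                # False in the worker thread
```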

Sorry if this is a dumb question. I’m not very familiar with threads in django or apache/nginx, etc. My only experience with threads and apache is with perl/mod_perl from a long time ago.

What concerns me the most about any attempt at providing global variables in Django is the multiple process issue.

Does your proposed solution handle the case where request 1 from a user is handled by process 1, and request 2 from that same user is handled by a different process? Or, what is the stability of that variable if process 1 is restarted between request n and request n+1?

Worker processes are not under your control. They’re under the control of whatever wsgi container is being used.

I can’t envision any scenario where I would want to try and implement something like that in Django. At a minimum, I would use something like memcached or redis as a multi-process-safe data store that could be indexed by the session ID or some similar facility.

Currently, everything (using class attributes, and previously global variables) is process-safe. Each process has its own independent copy of the variables, with the same default value.

If they are separate processes, it doesn’t matter if they came from the same user. I feel like that would be a concern with session or user info, which this doesn’t use. So changes to the variable in a process go away when that process ends.

However, if the same process handles requests serially and doesn’t end between requests, there is a concern if the previous execution doesn’t restore the global variable to its default - and this is handled/addressed. For example, when we run our tests, the same process is running each test serially, and some of those tests change the “globals”. Restoration of the default is handled diligently. Our tests wouldn’t pass if the globals aren’t cleaned up in every case. In fact, 2 things:

  • I plan some changes to take user error out of the equation, by wrapping processes that need to change any “global” with decorators that reset the “globals” both when an exception is caught and upon successful completion.
  • I have checks in key methods that make sure the global starting state is as expected, and I raise an exception if, for example, the buffer of autoupdates I maintain is unexpectedly populated.

Yes, and that is one thing I want to make sure I understand. My understanding is that when any code executes, its memory is not shared with any other concurrent process. The only context in which I understand there to be shared memory between concurrent executions is threads. Threads can be configured to share memory (or not), and my goal here is to understand if using threading.local() keeps those variables separate.

memcached or redis are implementation/project-specific. One of my goals is to be agnostic and have minimal dependencies. I don’t want people who use my class to require one of those as a dependency. But let me make sure we’re on the same page… The only reason I’m using these “globals” is because I want a single command that changes the save behaviors of all classes derived from my superclass for the duration of the execution of some code. I don’t want users to have to explicitly implement those different behaviors for every save (and other) call. For example, I don’t want them to have to explicitly add an argument to every save call everywhere in their code. All they should have to do is inherit from my class (and create [decorated] setter methods that generate values for fields in a model), and my code handles all of the auto-updates of those fields that the user methods spit out the values for.

My stack post tries to cover this in more detail, but the only reason I ever want to change the save behavior is because doing a mass load (or validation) of a lot of data into the database can (depending on the complexity of the user’s setter methods) significantly slow down execution. So my solution was to allow the user to decorate their loading methods that can defer all the autoupdates to the end, which cuts out intermediate updates and results in a significant time savings.

And I just want to make sure that that behavior change is restricted to just that execution.

I started to question this suggestion above, because I felt like this thought you’d had, suggested that you misunderstood what I was doing. I’m still not sure, but I just want to make it clear that there is no case where I ever want to store data that persists (or is common) between requests, processes, or threads. If I can make the globals separate in every one of those contexts, then my code is solid.

These are the cases as I understand them:

  • Different concurrent processes
    • They don’t share memory. Each “global” is independently represented in each process’s memory.
    • I think that apache may be able to be configured to share memory between processes, but I think that using threading.local() would split those up.
  • The same process serially handling multiple requests
    • Any change to “globals” in one request remains in effect for the next request.
    • This is the case I have diligently handled. My code catches all exceptions and restores the defaults before re-raising the exception. It also restores them after successful completion, and there are initial checks that the defaults are set when the class is first set up.
  • Requests handled in different threads of the same process
    • They can share memory, in which case changing the global in one thread would change it in every thread of that process.
    • I know that apache can be configured to share memory between threads (or not), but I think that using threading.local() would split those up.
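As a sanity check of the first case, here’s a minimal sketch (assuming a Unix-like platform where the fork start method is available; all names invented) showing that a child process’s change to a module-level “global” never reaches the parent:

```python
import multiprocessing as mp

auto_updates = True  # a module-level "global"

def child_job(q):
    global auto_updates
    auto_updates = False  # change the "global" in the child process only
    q.put(auto_updates)

# fork: the child starts with a copy of the parent's memory (Unix-only)
ctx = mp.get_context("fork")
q = ctx.Queue()
p = ctx.Process(target=child_job, args=(q,))
p.start()
child_value = q.get()  # False: the child saw (and changed) only its own copy
p.join()
print(child_value, auto_updates)  # the parent's copy is still True
```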

So my only question is whether threading.local() behaves as I’ve described, keeping the globals thread- and process-independent.

And of course, I’m still open to alternate suggestions, as long as it can be completely obfuscated from the user using my class. They shouldn’t have to think about it at all.

Thanks for the clarifications - yes, I wasn’t sure that everything was local to one request, or that they would be reset upon the end of the request. (I had gotten the impression it was, but wasn’t sure that it was limited to that.)

I do have a better understanding now of what you’re trying to do here - it is an interesting use-case.

I know you’re addressing the Apache issue here - is using Apache and mod_wsgi considered a requirement for your package? Or is your system intended to be able to be used in other environments such as uwsgi or gunicorn? What about async environments such as Daphne or uvicorn?

(I haven’t deployed to Apache in more than 10 years now - nginx just works so much better for us.)

I couldn’t begin to guess how the different configuration settings for mod_wsgi would be affected by something like this. And async considerations open up a whole different set of possibilities and issues.

<opinion> I’m just not convinced that you’re going to find a robust solution satisfying all your identified requirements and desires that is going to work across all deployment environments. </opinion>

You could rely upon Django’s cache framework then - it would be leveraging whatever caching backend is being used by the system. The biggest risk there is that the system could be deployed where it’s deactivated, which means nothing gets cached.

Ideally, I would like it to work in any environment where Django is used. That’s the reason for my post. We’re actually using nginx, but I don’t handle the web server and my experience with handling webservers and webserver config is clearly very dated. In fact, I’ve been intentionally avoiding developing web interfaces since 2007. I prefer bioinformatics data analysis and scientific algorithm development, but I got pulled into this Django project a few years ago and have been re-learning the ropes. (And I always go all-in, despite my aversion to things like web development.)

Naively, I feel like the question should be simple. “What shared/common memory situations can be encountered and is threading.local() sufficient to separate them?” The reason I feel like it should be simple is because if there were completely unknown ways in which memory of one request “leaked into” or affected another request, I feel like it would be the wild west out there, with all sorts of unpredictable behaviors. But as is often the case, things are usually way more complex than the concepts involved… sigh

I accept that. So perhaps I can retarget to say, 95% of the most popular web server architectures…

So you’ve mentioned “caching” a couple times, and I can’t quite envision how I would employ caching to solve this issue… but perhaps you’re referring to memory caching, i.e. persistent loaded memory to handle requests? Quick to load and service requests?

If that’s the case, then I feel like my solution to “The same process serially handling multiple requests” solves caching concerns. But I could probably add some protections to ensure that requests dictate what the starting state should be (autoupdates=on|off|deferred).

I feel pretty confident that as long as a request has its own independent memory (when employing threading.local), it will solve all of the problems. The only thing I’m worried about is how memory behaves in async configs, and perhaps users intentionally implementing threads in their code…

Ideally, I would hope that async executions of requests (when threading.local is in use) have their own memory. If 2 requests being asynchronously processed proceed with their single-process execution in an interleaved style, is that handled by threading or do they have shared memory regardless?

And if the developer is explicitly implementing threads in their django code (or is using something like celery), does threading.local keep their memory separate? I.e., when they create a thread, is the variable set in a threading.local variable duplicated - one independent copy for each thread?

Maybe a way to approach development is to declare support for an architecture and explicitly state that it hasn’t been tested in other architectures - use at your own risk…

For a concrete example… I decided to create a branch to try implementing with threading.local, to see if I could just get all my tests to simply pass in our environment. This is the crux of the change, using the name of my actual class. (I had to remove the type hints, which surprisingly are not supported by the linters if the variable isn’t a direct member of the class):

class MaintainedModel(Model):
    """
    This class maintains database field values for a django.models.Model class whose values can be derived using a
    function.  If a record changes, the decorated function/class is used to update the field value.  It can also
    propagate changes of records in linked models.  Every function in the derived class decorated with the
    `@MaintainedModel.setter` decorator (defined above, outside this class) will be called and the associated field
    will be updated.  Only methods that take no arguments are supported.  This class overrides the class's save and
    delete methods and uses m2m_changed signals as triggers for the updates.
    """
 
-    # Class attributes
+    # Thread-safe mutable class attributes
+    data = local()
+
     # Track whether the fields from the decorators have been validated
-    maintained_model_initialized: Dict[str, bool] = {}
+    data.maintained_model_initialized = {}
+
+    # This tracks the metadata of each derived model class's setters and relations (established in the decorators)
-    updater_list: Dict[str, List] = defaultdict(list)
+    data.updater_list = defaultdict(list)
+
+    # These are the running modes.  Changing these affects the behavior of every derived class at once.
-    auto_updates = True
+    data.auto_updates = True
-    buffering = True
+    data.buffering = True
-    performing_mass_autoupdates = False
+    data.performing_mass_autoupdates = False
+
+    # This is for buffering a large quantity of auto-updates in order to get speed improvements during loading
-    update_buffer = []
+    data.update_buffer = []
+
+    # These allow the user to turn on or off specific groups of auto-updates.  See init_autoupdate_label_filters.
-    default_label_filters: Optional[List[str]] = None
+    data.default_label_filters = None
-    default_filter_in = True
+    data.default_filter_in = True
-    nondefault_filtering_exists = False
+    data.nondefault_filtering_exists = False

Huh, the editor colored the - lines red and the + lines green, but that doesn’t show in the comment once posted… :confused:

I’ve been grinding on trying to figure out how to do this without mutable class attributes (so that everything is a member/instance variable). I’ve had at least 2 underdeveloped ideas…

Both are based on the fact that every model instance, at the time of creation, should know whether its autoupdate_mode will be to immediately autoupdate, buffer (to update later), or disable the autoupdate of its maintained fields.

idea 1: analogue of @override_settings

One thought I had was of the @override_settings decorator in django.test used in some of our tests. If the default setting could be immediate, buffer, or disable, the analogy would be that the “settings” default would be immediate, and there would need to be an override to buffer, which could be applied to loading methods. And for a method like “validation”, the settings override would be disable.

In the MaintainedModel class, I could use django.conf.settings and set the value using the settings, with a default of immediate if not defined in the settings.

The only problems would be:

  1. I would have to import django.test.override_settings in production code
  2. Load scripts typically have a dryrun mode as an argument to change the behavior (to disable) and I would have to figure out how to accommodate that.

idea 2: Add another level of inheritance to “coordinate” derived classes

The other thought was to somehow add another level of inheritance to have a coordinator class that coordinates the autoupdate_mode among all the derived classes in some way… Like in the load/validate method decorators, I override the default MaintainedModel class with like a different default mode… Factory? template class (like in C++)? Abstract class?.. The overall concept is the same. Somehow swap out the coordinator based on the running mode, e.g. with a decorator or something… That would mean that when the developer inherits from MaintainedModel, under the hood, the code swaps out the default MaintainedModel whose mode is immediate with one that has a different default. I just can’t think of how to technically do this…

Been googling most of the day… what is a context manager? Maybe that’s something I can use…
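From what I’ve googled so far: a context manager is an object usable in a with block that runs setup on entry and guaranteed cleanup on exit, even when the block raises. A toy sketch of how one might toggle a hypothetical mode flag (invented names, not my actual package):

```python
from contextlib import contextmanager

auto_updates = True  # hypothetical module-level mode flag

@contextmanager
def no_auto_updates():
    """Temporarily disable auto-updates, restoring the old value on exit."""
    global auto_updates
    previous = auto_updates
    auto_updates = False
    try:
        yield
    finally:
        auto_updates = previous  # restored even if the block raises

with no_auto_updates():
    inside = auto_updates   # False inside the block

after = auto_updates        # True again after the block
print(inside, after)
```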

You’re right, it won’t, at least not under the situation you’ve identified. To clarify:

I’ve mentioned before about using memcached or redis as an external data store for that setting, but you want to avoid the additional dependency.

The Django caching framework provides a low-level api to the framework for you to be able to store and retrieve arbitrary data - generally using one of those two data stores. In this situation, you’re not creating a dependency, you’re using something already configured into their Django deployment.

However, I do realize my mistake here in that this still doesn’t address your root issue - you need to manage this by individual request and so you’re still stuck with a “chicken and egg” situation. You could store that setting in the cache, but the code that needs to use it doesn’t have a way of finding it.

Django is not yet fully-async capable. (It’s getting there, but it’s not there yet.) In theory, when everything is truly/fully async, all requests will be handled in the same single thread. (That’s why it’s theoretically more scalable than traditional sync Django.)

However, right now, it’s in a kind of mix & match mode. If you read through all the async-related docs here, you’ll find situations and conditions where an async request will cause a separate thread to be initiated for a sync-only function - and, if that’s done multiple times in a view, could be different threads used in each function call. You, as a library author (as opposed to the person configuring the installation), don’t know whether calls to function_a() and function_b() are going to occur in the same thread or in different threads.

Side note: Any Django developer creating a Python thread as a means of improving throughput of CPU-bound processing is going to be disappointed. Python threads are real OS threads, but the GIL means only one of them executes Python bytecode at a time, so CPU-bound work gains nothing from threading.

Side note 2: threading.local - this function creates a thread-local storage object. It does not share any data with any other thread. See https://github.com/python/cpython/blob/3.11/Lib/_threading_local.py
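If you do go this route, the pattern documented in that source for giving every thread its own defaults is to subclass local and set them in __init__, which runs once per thread on first access. A rough sketch (invented names):

```python
import threading

class ModeLocal(threading.local):
    def __init__(self):
        # __init__ runs once per thread, so every thread gets these defaults
        self.auto_updates = True
        self.update_buffer = []

data = ModeLocal()
data.auto_updates = False  # change it in the main thread only

seen = []
def worker():
    # The worker gets the per-thread default, not the main thread's change
    seen.append(data.auto_updates)

t = threading.Thread(target=worker)
t.start()
t.join()
print(data.auto_updates, seen[0])  # False True
```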

Celery involves an external process, not merely a thread. There is no local data shared between a celery task and the main Django process.

Perfect. I’m glad we’re on the same page.

Let me just say… Ken, you have been an invaluable resource, and I truly appreciate your responsiveness and insights.

I was thinking more about alternative methods, and I was reading about context managers. I played around with them in the shell and realized that they won’t really help me solve my problem. I also explored metaclasses to see if I could dynamically change member variables of all instantiated derived classes, and I tried mixing those two thoughts, but could not figure out a way to get it to work the way I want. It was a pipe dream.

So I thought more about existing features that perhaps in some way mimic what I want to do, and one thing that occurred to me was atomic transactions. More specifically, the atomic transaction decorator. If you decorate some high-level function, the database operations underneath it, no matter how deep, behave a little differently, and they do so without any specific knowledge of the behavior difference. In a way, that mimics what I want to do. So I would like to learn more about how atomic transactions work and see if that is a design pattern I could employ in this scenario.

I am assuming that there is some sort of layer into which atomic transactions can inject a change in the functioning of the database operations. Do you know much about the inner workings of atomic transactions? Could you give me a somewhat high-level explanation of how they change the behavior without all of the save calls needing an extra argument to say, “don’t actually commit what you’re doing”?

From a technical level? Not much.

From an operational level, perhaps a bit more.

Briefly, a transaction does save the data - but it does so in a temporary manner that is invisible to all other queries or transactions.

It may help to think of the rows in the table having a flag named something like “committed”, that is not visible to the general user. Every query being run outside of a specific transaction implicitly includes a where clause for “committed=True”.

Now think of rows that are being changed as being duplicated rows in the database with “committed=False”, with an identifier of the transaction in which it’s being changed. Any queries being written within the scope of that transaction would implicitly include a “where committed=False and transaction_id=my_transaction”, so that those queries do include the data “in progress”.

When you’re in a transaction then, everything you’re doing is looking at those duplicated / modified rows instead of the original rows, until you either issue a “COMMIT” command or a “ROLLBACK”.

Committing a transaction then consists of changing the committed flag on the changed rows from false to true and deleting the original rows. That process is done as an “all-or-nothing step” that ensures that all subsequent queries see only the changed versions.

A rollback simply disposes of all those modified rows, leaving the original data unchanged.

(Note: Those more technically knowledgeable on this subject would be quite justified in shredding this analogy. This is my “laymen’s mental model” on the topic and not intended as a technically accurate description of what actually occurs.)
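If you want to see the commit/rollback visibility concretely, the standard library’s sqlite3 module is enough for a minimal sketch: uncommitted changes are visible to the connection making them, and a rollback simply disposes of them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (x INTEGER)")
conn.commit()

# An INSERT implicitly opens a transaction (default sqlite3 behavior)
conn.execute("INSERT INTO t VALUES (1)")
in_txn = conn.execute("SELECT COUNT(*) FROM t").fetchone()[0]  # sees its own pending row

conn.rollback()  # dispose of the uncommitted change
after_rollback = conn.execute("SELECT COUNT(*) FROM t").fetchone()[0]

conn.execute("INSERT INTO t VALUES (2)")
conn.commit()    # make the change permanently visible
after_commit = conn.execute("SELECT COUNT(*) FROM t").fetchone()[0]
print(in_txn, after_rollback, after_commit)  # 1 0 1
```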

That’s helpful. Though if I were to try to use the design pattern employed by atomic transactions, the crucial part would be how that committed value is applied (when inside or outside an atomic transaction context). I.e. how the deep .save() calls know whether to commit immediately or after the parent atomic transaction block exits…

I essentially want to do the same thing: create a “@buffer_autoupdates” decorator to decorate a load method. I would either autoupdate immediately (when the parent block is not decorated) or buffer and wait until the decorated parent block exits to auto-update everything…

Honestly, I wouldn’t be surprised if it used mutable class attributes to do it… I just can’t wrap my head around how to do it with instance variables… I guess I’ll poke around in the atomic transaction code to see if I can identify the strategy, but I expect it’s pretty complex…

It doesn’t - nor does it care.

Everything associated with transactions is happening within the database engine itself. All that work described above is happening external to Django.

The BEGIN command in PostgreSQL starts a transaction. Every SQL command issued after that using that connection is going to be part of that transaction until either a COMMIT or ROLLBACK command is issued - it doesn’t matter what the SQL is.

OK. I perused django.db.transaction and I think I got the gist of how it works. To possibly oversimplify it, it is grabbing the database connection and essentially toggling connection.commit_on_exit (when used as a context manager). That’s how it does it using instance variables. Presumably, all of the .save() calls utilize that connection for their interactions with the database. It is complicated by nested atomic blocks, creating a stack, but if you boil it down, that’s the common point that affects all the saves under the block. It’s pretty cool. That’s the layer.

It behaves a little differently when used as a decorator. It uses an override of __call__, but I have to poke around some more to understand that usage.

Right. Poor wording. I didn’t mean that literally. I just meant that the behavior effectively changes for all of the saves without adding an argument to .save(). And as I discovered from the code, it’s a member of the connection object that changes the behavior.

You may also want to take a look at the Database instrumentation page in the Django documentation for some other ideas.


Oops. I over-interpreted the docstring of class Atomic. It’s not an override of __call__.

I think this is the answer! This should allow me to get away from the mutable class attributes(/global variables)! Excellent! Thanks so much!

Well… it may not be the direct answer… I don’t appear to have access to the model objects this way… but the overall design is how to get away from the use of the class attributes/globals… I’ll have to think a bit more about this to see if there’s a way to take advantage of it using the design choices I’d prefer…