We’ve written a large application (more than 3.5 million lines of Python) that heavily uses Django to drive a PostgreSQL database. One type of operation that has historically caused us difficulty involves transferring large amounts of data to a client application as part of an API call: think of a user waiting to see a grid of 100K+ rows by 200 columns, totaling hundreds of megabytes of memory. Performance of these particular APIs is fairly important, though not critical; what is critical is that they not consume enough resources to impact the performance of other parts of the application. In particular, we don’t want to tie up our web server process threads with long-running synchronous requests, and we don’t want to consume large amounts of memory (saturating the web servers).
For quite some time, our dream for a solution has been to have an async view that reads chunks from a server-side cursor and sends each chunk of output to the client via a websocket. When the LTS version of Django 4.2 arrived, we thought the time was right to implement that solution. It’s been harder than expected, but we believe we finally have it running reasonably well.
Here are some of the potholes and roadblocks we had to overcome.
PLEASE NOTE: This is NOT a complaint about Django features — we know and understand that full async support is being developed over the course of several releases. Django is doing everything that it’s documented as being able to do, and that was almost enough to implement our vision. We’re publishing this just to share some thoughts with those folks that would like to follow a similar route.
Async Top to Bottom (well, almost).
We use Channels for our websocket interface, and (for a completely unrelated purpose) added a light-weight messaging layer on top of the Channels interface — this layer allows us to survive short-term websocket disconnections, caused by things like a client’s internet router or firewall rebooting. We’ve got async code running all the way from the Channels consumer down into our business logic, where we eventually “await” the results of our database operations. Responses streaming back to a client become simple: each outbound chunk of data is formatted into a light-weight message that Channels sends on our behalf, with the client consolidating all of these messages into the complete result. We do acknowledge messages and retransmit messages that aren’t acknowledged in a timely manner. While not as robust as a real message broker, this gives us bidirectional communication that supports parallel requests, with streaming responses that may be interleaved with one another.
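The acknowledge-and-retransmit idea can be sketched as a small outbox that tracks unacknowledged chunks. This is a toy, self-contained illustration under our own invented names (the real layer rides on a Channels consumer, which is not shown here):

```python
import time
from dataclasses import dataclass, field

# Toy sketch of an ack/retransmit layer over an unreliable transport.
# All names here are illustrative, not from Channels or our real code.

@dataclass
class Outbox:
    timeout: float = 5.0                          # seconds before retransmit
    _pending: dict = field(default_factory=dict)  # seq -> (sent_at, payload)
    _next_seq: int = 0

    def send(self, payload, transport):
        seq = self._next_seq
        self._next_seq += 1
        self._pending[seq] = (time.monotonic(), payload)
        transport(seq, payload)                   # first transmission
        return seq

    def acknowledge(self, seq):
        self._pending.pop(seq, None)              # client confirmed receipt

    def retransmit_due(self, transport, now=None):
        now = time.monotonic() if now is None else now
        for seq, (sent_at, payload) in list(self._pending.items()):
            if now - sent_at >= self.timeout:
                self._pending[seq] = (now, payload)
                transport(seq, payload)           # resend unacknowledged chunk
```

In the real system each payload is one chunk of query results, and the client reassembles chunks by sequence number into the complete grid.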
For this portion of the application, we’re not using Django’s HTTP request processing. The large majority of the application still uses traditional HTTP request/response logic, although we intend to move more towards websocket interfaces (or eventually towards HTTP/3 and QUIC).
Under the Covers in Django 4.2
Querysets finally support a number of async methods: we have async iterators to rummage through the contents of a queryset, and each of the “terminal” methods (such as first(), latest(), get(), delete(), etc.) has an async version (afirst(), alatest(), aget(), adelete(), etc.). That’s nice! We can now “await” the results of any queryset operation.
Under the covers, though, these external async methods are simply shells that internally use the “sync_to_async()” wrapper to invoke the synchronous version of the method. We can certainly understand why that was the design choice for this release: backwards-compatibility promises (such as continued support for psycopg2) have required some compromises in the internal implementation. But the true impact of this design wasn’t entirely obvious when we started our implementation.
Quite simply, this implementation results in having the actual database operations (opening cursors and fetching results) execute on a thread that isn’t the thread on which the async business logic is running.
This means that there are now three threads involved in processing requests in a typical Django worker process: the “main thread” that normally handles incoming HTTP requests; the “event loop” thread that manages concurrent async tasks; and a “synchronous database thread” that is used to execute all of the “sync_to_async()” methods.
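The thread hop can be made concrete with a minimal, self-contained sketch (ours, not Django’s actual implementation) of the sync_to_async() pattern: the sync function is shipped off to a dedicated single-thread executor, much like Django’s shared synchronous database thread, while the async business logic stays on the event loop:

```python
import asyncio
import threading
from concurrent.futures import ThreadPoolExecutor

# Stand-in for the single "sync database thread" (illustrative only).
_db_executor = ThreadPoolExecutor(max_workers=1)

def run_query():
    # Pretend this opens a cursor and fetches rows; report which
    # thread actually did the work.
    return threading.get_ident()

async def business_logic():
    loop = asyncio.get_running_loop()
    # A sync_to_async-style wrapper boils down to this: run the sync
    # function on the executor thread and await its result.
    db_thread = await loop.run_in_executor(_db_executor, run_query)
    return threading.get_ident(), db_thread

event_loop_thread, db_thread = asyncio.run(business_logic())
_db_executor.shutdown()
```

The two thread identities differ: the “database work” never runs on the thread where the async logic lives, which is exactly why connection state gets tricky.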
And the most subtle aspect of all this — Django’s “connection” objects are thread-specific. We’ll talk much more about this later on.
The Async QuerySet Iterator doesn’t support prefetch_related().
My understanding is that this was a feature that simply didn’t get implemented in time to be released with 4.1 or 4.2.
This wasn’t too hard to solve. For other reasons, we had previously implemented our own QuerySet and ModelManager classes, derived from the Django base classes. So we were able to read the Django iterator() code to see how it populated the prefetch cache, then graft that logic (wrapped with sync_to_async(), of course) into a modified version of aiterator().
There is a Django fork to add this code into 5.0 — it needs a little work on the unit tests to be accepted for merging into base.
Database Connections are Thread Specific!
This turned out to be really subtle for us. Django’s “connections” object (django.db.connections) looks like a dictionary, but it’s not. Rather than storing a value for each key in some sort of internal dictionary, it actually executes a “setattr()” call on an internal object named “_connections” (using the supplied key as the attribute name), while getting the value for a given key is implemented with a call to getattr() (again, using the key as the name of the attribute to get).
Furthermore, _connections is an instance of a Local() class, which behaves somewhat like a cross between a thread local variable and a Python 3.7 ContextVar. In general, it’s most accurate to think that the values in django.db.connections are unique to each thread, even though the values across threads have the same keys.
Note that each value in Django’s “connections” structure will be an instance of a class named “DatabaseWrapper”. (The class of the backend “engine” is internally stored in a variable named “Database”, and the connections to those Database engines are called DatabaseWrappers. Just think of them as “connections” and it will mostly make sense until you go hunting for the code.)
Why is all of this important? Because calling the close() method on a connection from an async piece of business logic won’t close the actual connection used to talk to the database — since the connection that is really used is unique to the “sync database thread”, not the thread running the async business logic.
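Here is a minimal sketch of that behavior (our own toy classes, not Django’s source): item access on the handler maps onto attributes of a thread-local object, so each thread quietly gets its own wrapper instance per alias, and closing one thread’s connection leaves another thread’s untouched:

```python
import threading

class FakeDatabaseWrapper:
    """Toy stand-in for Django's DatabaseWrapper."""
    def __init__(self, alias):
        self.alias = alias
        self.closed = False

    def close(self):
        self.closed = True

class ConnectionHandler:
    """Dict-like front over a thread-local, as described above."""
    def __init__(self):
        self._connections = threading.local()

    def __getitem__(self, alias):
        # getattr/setattr against the thread-local: each thread lazily
        # creates and sees its own wrapper for the same alias.
        if not hasattr(self._connections, alias):
            setattr(self._connections, alias, FakeDatabaseWrapper(alias))
        return getattr(self._connections, alias)

connections = ConnectionHandler()

results = {}

def worker():
    results["worker"] = connections["default"]

t = threading.Thread(target=worker)
t.start()
t.join()

main_conn = connections["default"]
main_conn.close()  # closes ONLY this thread's "default" connection
```

The worker thread’s “default” connection is a different object from the main thread’s, and it stays open after the main thread’s close().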
Django LOVES to close connections!
This accounted for a large portion of our struggles.
By default, Django connects two handlers (“reset_queries” and “close_old_connections”) to the “request_started” Django signal, and connects “close_old_connections” to the “request_finished” signal. The reset_queries() handler clears the internal log of queries executed on the connection, and close_old_connections() spins through each defined database connection and calls the connection’s “close” method if the connection has been open long enough, or if it has become unusable because of some exception. Since the default “CONN_MAX_AGE” value is 0, this effectively means that by default every open connection is closed both at the beginning and at the end of every inbound HTTP request.
In addition to those signal handlers, the database_sync_to_async() wrapper provided by Channels (in its channels.db module) also calls close_old_connections() before and after each operation it wraps.
These calls to close_old_connections() completely disrupted the implementation of our vision.
In particular, our major goal was to use a server-side cursor to iterate through large data sets without having to load everything into memory up front. In an async approach, this means that we ought to use the “async for row in queryset” syntax. Django’s async iterator method, aiterator(), eventually opens a “chunked cursor” on the database connection, and (if you go deep enough in the code) uses a method wrapped in sync_to_async() to retrieve each chunk in turn from the cursor.
So the aiterator() method runs on the async event loop thread, while the code that actually builds a chunk of data by reading from the database cursor runs on the “sync database thread”. That database cursor persists across all of the chunked reads, as each chunk of rows is handed back to the async business logic. More precisely, the method that returns one chunk of data executes on the “sync database thread”, and it expects the cursor from which it is reading to survive from one invocation to the next, even though the method releases the thread to run other sync methods in between.
But Database Connections are Thread Specific!
If you have more than one async task accessing the database at the same time, each of these async tasks will toss a method over to the “sync thread handler” to execute. If these methods use the same connection alias (the same name defined in settings.DATABASES), then those methods will actually use the same instance of a Django database connection (the instance unique to the thread on which those methods are running). Each method opens its own cursor (or cursors) on the connection… but the connection itself is shared by all of the methods executed one at a time on this single thread.
And when one of these connections is closed, all cursors on that connection are closed with it.
This means that one async task that closes its connection can cause some other async task with a server-side cursor to be interrupted in its work. That’s bad…
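The effect is easy to demonstrate with sqlite3 from the standard library (the mechanics are the same with psycopg/PostgreSQL): one “task” is partway through reading chunks from its cursor when another path closes the shared connection, and the in-flight cursor dies with it:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (n INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)", [(i,) for i in range(100)])

cursor = conn.execute("SELECT n FROM t ORDER BY n")
first_chunk = cursor.fetchmany(10)  # one task is partway through its read

conn.close()                        # another code path closes the shared connection

try:
    cursor.fetchmany(10)            # the in-flight cursor is now dead
    interrupted = False
except sqlite3.ProgrammingError:
    interrupted = True
```

The second fetchmany() raises, even though nothing ever touched that cursor directly — exactly the cross-task interference described above.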
… and Database Transactions are Connection Specific!
Besides the issue with unintended closing of cursors, database transactions turn out to be connection-specific and not cursor-specific. When Django begins an atomic transaction, the connection’s “auto commit” mode is turned off. From that moment on, all operations on that connection are considered part of the same database transaction — even if those operations are launched from completely different async tasks.
In our case, two concurrent async tasks might be operating on data for two completely independent tenants in our database. Having operations from two different tenants managed within a single database transaction is completely unacceptable for our application. In short, Django’s thread-specific connection logic is completely unsuitable for us to use via async requests.
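Again sqlite3 can stand in for PostgreSQL to show the principle: a transaction belongs to the connection, not to any one cursor, so work done through separate cursors — our stand-ins here for separate async tasks serving separate tenants — commits or rolls back together:

```python
import sqlite3

# isolation_level=None puts sqlite3 in autocommit mode so we can
# issue an explicit BEGIN, mimicking Django turning autocommit off.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE tenant_a (v TEXT)")
conn.execute("CREATE TABLE tenant_b (v TEXT)")

conn.execute("BEGIN")                 # one transaction on the connection
cur_a = conn.cursor()                 # "task A" writes tenant A's data
cur_a.execute("INSERT INTO tenant_a VALUES ('a1')")
cur_b = conn.cursor()                 # "task B" writes tenant B's data
cur_b.execute("INSERT INTO tenant_b VALUES ('b1')")
conn.rollback()                       # task A fails... and takes B's work with it

rows_a = conn.execute("SELECT * FROM tenant_a").fetchall()
rows_b = conn.execute("SELECT * FROM tenant_b").fetchall()
```

Both tables come back empty: tenant B’s perfectly good write was discarded because it happened to share a connection-level transaction with tenant A’s failed one.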
How to Solve This? Dynamically Created Connections!
The solution we have chosen to implement involves dynamically creating additional database connection definitions and effectively “reserving” separate Django-level connections for each concurrent async task.
We had already implemented our own subclass of the PostgreSQL DatabaseWrapper class — we use separate schemas for each tenant in the database, and we had added support for getting the current schema from a ContextVar for each request being processed. We added another ContextVar — this one holding a “connection suffix” — a string that would be added to the end of the base connection alias to form a unique name for this connection.
Normal “sync” requests do not obtain a connection suffix — they use the Django connections in the same way they always have.
We implemented a singleton “AsyncConnectionPool” that is really a pool of the available suffixes: each async task obtains a suffix from the pool and stores that suffix in the appropriate ContextVar. If the pool has no available suffixes, we invent a new one of the form “:Async-N”, where N is a sequential counter; its maximum value ends up equaling the peak number of concurrent async tasks we have ever run. As each async task completes, it returns its suffix to the pool for reuse by some later task.
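The pool itself is only a few lines. This is a simplified sketch under our own names (the real version also handles locking and the settings.DATABASES bookkeeping described below):

```python
from contextvars import ContextVar

# The suffix the ORM layer appends to the base alias for this task.
connection_suffix: ContextVar = ContextVar("connection_suffix", default="")

class AsyncConnectionPool:
    """Pool of connection-name suffixes, one checked out per async task."""
    def __init__(self):
        self._free = []        # suffixes returned by finished tasks
        self._counter = 0      # high-water mark of concurrent tasks

    def acquire(self):
        if self._free:
            suffix = self._free.pop()          # reuse a returned suffix
        else:
            self._counter += 1                 # mint a brand-new one
            suffix = f":Async-{self._counter}"
        connection_suffix.set(suffix)
        return suffix

    def release(self, suffix):
        connection_suffix.set("")
        self._free.append(suffix)              # available for a later task

pool = AsyncConnectionPool()
```

A task that acquires “:Async-1” while using the “default” alias ends up reading through a connection named “default:Async-1”, reserved to it until release().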
At the time we create a new suffix, we extend settings.DATABASES, adding the new connection name (including the suffix) as a key, with a value that is a clone of the base database configuration parameters. We also create an instance of our DatabaseWrapper and store it in django.db.connections as needed.
Our custom QuerySet class overrides the “db” property, appending the suffix (obtained from the ContextVar) to the database alias if necessary, so the queryset will automatically use the reserved database connection. In rare cases, such as when we use a connection to execute raw SQL, our async code must specifically supply a fully-qualified connection name through a “using” parameter.
We had to override the close() method in our DatabaseWrapper subclass, adding an optional parameter “really_mean_it” that defaults to False. If close() is called on a dynamically created connection and “really_mean_it” is False, then we simply return without actually closing the connection. This handles the cases where Django’s close_old_connections() method calls close() on our dynamically generated connections from within database_sync_to_async().
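The override looks roughly like this (toy base class standing in for our real DatabaseWrapper subclass; “really_mean_it” is the actual parameter name described above):

```python
class BaseWrapper:
    """Toy stand-in for the stock DatabaseWrapper."""
    def __init__(self, alias, dynamically_created):
        self.alias = alias
        self.dynamically_created = dynamically_created
        self.closed = False

    def close(self):
        self.closed = True

class GuardedWrapper(BaseWrapper):
    def close(self, really_mean_it=False):
        # Ignore Django's housekeeping close() calls on reserved,
        # dynamically created connections; only the task-teardown
        # path passes the flag.
        if self.dynamically_created and not really_mean_it:
            return
        super().close()

conn = GuardedWrapper("default:Async-1", dynamically_created=True)
conn.close()                     # e.g. from close_old_connections(): ignored
survived = not conn.closed
conn.close(really_mean_it=True)  # end of the async task: really closes
```

The connection survives the housekeeping close() mid-task, and only the explicit end-of-task close actually tears it down.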
We did have to add logic to our async message handler to obtain the connection suffix at the beginning of an async request, and to return that suffix and really close the connection at the end of the async task.
It Seems to be Working
We haven’t finished running this through our extensive QA testing yet, but things look like they’re working reasonably well. We can support multiple async tasks each using server-side cursors to stream large amounts of data back to the client, while at the same time processing normal HTTP requests, without any cross-talk between requests. We will, of course, open more actual database connections than we had before implementing the async logic, but one of our reasons for implementing async was to increase parallelism throughout the application by supporting more concurrent requests. More concurrent requests will indeed require more concurrent database connections, so we’re good with that.
Our “suffix pool” appears to be working well to limit the count of additional connections that get created.
Unfortunately, there are still a large number of calls to things like close_old_connections() that sequentially process every defined connection in the list. Since we’re adding more entries to that list, we’re increasing the overhead in that particular area. And it’s somewhat worse than that, since we currently have six different database aliases defined in our settings file:
- The default transactional database,
- A second connection to the transactional database, for logging messages that should not be rolled back if an atomic transaction fails,
- A read replica we can use to offload intensive query work from the main transactional database,
- An independent read-only database updated by a completely different application,
- An obsolete entry for a Celery task database, and
- An obsolete entry for a separate historical archive database.
Obviously, we should be able to get rid of the last two – but still, every time we create a new suffix we create additional entries for every one of these database aliases… so the list grows somewhat quickly.
Ideally, we’d love to cut down on the work done by frequently-called methods like close_old_connections(), perhaps by calling them less frequently or by limiting the scope of the connections they inspect.
What About Connection Pooling?
We use PgBouncer as a session-based connection pool on each of our web and app servers. If we implemented an actual connection pool within our application, we would somewhat increase the number of connections to our database, since we support multiple versions of our application on each server machine. By having the pooling operate at a layer above the version-specific code, we pool actual database connections across all releases of our code.
This also means that an actual database connection (from PgBouncer to our database server) might well be used first by some of our synchronous business logic, then later by some async logic. We don’t believe this matters at all… PgBouncer will keep the connections in its pool straight, and shouldn’t care about subtle differences between the application server to PgBouncer connections.
Psycopg 3 offers a built-in async connection pool that has some attraction, but again that would be version-specific pooling. We think we’re still better off with PgBouncer’s pool spanning versions of code.
All the same, we’re exploring a complete move to Psycopg 3, once we resolve existing dependencies on Psycopg2. We’d like to be using Psycopg 3 by the time Django begins taking advantage of the async interfaces in that driver.
Maybe Django 5.0 will help?
One of the reasons we initially chose Django as a framework seven or eight years ago was that it has been actively updated and enhanced in a professional manner. We started with version 1.4 and have grown with it over the years. Since we believe our use cases are not terribly unusual, we believe that Django will be growing in the direction of our needs.
(Personally, I’ll be retiring from the job next spring and would like to spend some time contributing back to the ORM portion of the framework. I’m also hoping to convince my boss to let me spend more time on this in the months ahead as we transition my architectural role to others.)