We’ve written a large application (more than 3.5 million lines of Python) that heavily uses Django to drive a PostgreSQL database. One type of operation that has historically caused us difficulty involves transferring large amounts of data to a client application as part of an API call: think of a user waiting to see a grid of 100K+ rows by 200 columns, totaling hundreds of megabytes of memory. Performance of these particular APIs is fairly important, though not critical; what is critical is that they not consume enough resources to impact the performance of other parts of the application. In particular, we don’t want to tie up our web server process threads with long-running synchronous requests, and we don’t want to consume large amounts of memory (saturating the web servers).
For quite some time, our dream for a solution has been to have an async view that reads chunks from a server-side cursor and sends each chunk of output to the client via a websocket. When the LTS version of Django 4.2 arrived, we thought the time was right to implement that solution. It’s been harder than expected, but we believe we finally have it running reasonably well.
Here are some of the potholes and roadblocks we had to overcome.
PLEASE NOTE: This is NOT a complaint about Django features — we know and understand that full async support is being developed over the course of several releases. Django is doing everything that it’s documented as being able to do, and that was almost enough to implement our vision. We’re publishing this just to share some thoughts with those folks that would like to follow a similar route.
Async Top to Bottom (well, almost).
We use Channels for our websocket interface, and (for a completely unrelated purpose) added a light-weight messaging layer on top of the Channels interface — this layer allows us to survive short-term websocket disconnections, caused by things like a client’s internet router or firewall rebooting. We’ve got async code running all the way from the Channels consumer down into our business logic, where we eventually “await” the results of our database operations. Responses streaming back to a client become simple: each outbound chunk of data is formatted into a light-weight message that Channels sends on our behalf, with the client consolidating all of these messages into the complete result. We do acknowledge messages and retransmit messages that aren’t acknowledged in a timely manner. While not as robust as a real message broker, this gives us bidirectional communication that supports parallel requests, with streaming responses that may be interleaved with one another.
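The acknowledge-and-retransmit idea can be sketched as a small outbox that tracks unacknowledged chunks. This is a toy, self-contained illustration under our own invented names (the real layer rides on a Channels consumer, which is not shown here):

```python
import time
from dataclasses import dataclass, field

# Toy sketch of an ack/retransmit layer over an unreliable transport.
# All names here are illustrative, not from Channels or our real code.

@dataclass
class Outbox:
    timeout: float = 5.0                          # seconds before retransmit
    _pending: dict = field(default_factory=dict)  # seq -> (sent_at, payload)
    _next_seq: int = 0

    def send(self, payload, transport):
        seq = self._next_seq
        self._next_seq += 1
        self._pending[seq] = (time.monotonic(), payload)
        transport(seq, payload)                   # first transmission
        return seq

    def acknowledge(self, seq):
        self._pending.pop(seq, None)              # client confirmed receipt

    def retransmit_due(self, transport, now=None):
        now = time.monotonic() if now is None else now
        for seq, (sent_at, payload) in list(self._pending.items()):
            if now - sent_at >= self.timeout:
                self._pending[seq] = (now, payload)
                transport(seq, payload)           # resend unacknowledged chunk
```

In the real system each payload is one chunk of query results, and the client reassembles chunks by sequence number into the complete grid.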
For this portion of the application, we’re not using Django’s HTTP request processing. The large majority of the application still uses traditional HTTP request/response logic, although we intend to move more towards websocket interfaces (or eventually towards HTTP/3 and QUIC).
Under the Covers in Django 4.2
Querysets finally support a number of async methods: we have async iterators to rummage through the contents of a queryset, and each of the “terminal” methods (such as first(), latest(), get(), delete(), etc.) has an async version (afirst(), alatest(), aget(), adelete(), etc.). That’s nice! We can now “await” the results of any queryset operation.
Under the covers, though, these external async methods are simply shells that internally use the “sync_to_async()” wrapper to invoke the synchronous version of the method. We can certainly understand why that was the design choice for this release: backwards-compatibility promises (such as continued support for psycopg2) have required some compromises in the internal implementation. But the true impact of this design wasn’t entirely obvious when we started our implementation.
Quite simply, this implementation results in having the actual database operations (opening cursors and fetching results) execute on a thread that isn’t the thread on which the async business logic is running.
This means that there are now three threads involved in processing requests in a typical Django worker process: the “main thread” that normally handles incoming HTTP requests; the “event loop” thread that manages concurrent async tasks; and a “synchronous database thread” that is used to execute all of the “sync_to_async()” methods.
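The thread hop can be made concrete with a minimal, self-contained sketch (ours, not Django’s actual implementation) of the sync_to_async() pattern: the sync function is shipped off to a dedicated single-thread executor, much like Django’s shared synchronous database thread, while the async business logic stays on the event loop:

```python
import asyncio
import threading
from concurrent.futures import ThreadPoolExecutor

# Stand-in for the single "sync database thread" (illustrative only).
_db_executor = ThreadPoolExecutor(max_workers=1)

def run_query():
    # Pretend this opens a cursor and fetches rows; report which
    # thread actually did the work.
    return threading.get_ident()

async def business_logic():
    loop = asyncio.get_running_loop()
    # A sync_to_async-style wrapper boils down to this: run the sync
    # function on the executor thread and await its result.
    db_thread = await loop.run_in_executor(_db_executor, run_query)
    return threading.get_ident(), db_thread

event_loop_thread, db_thread = asyncio.run(business_logic())
_db_executor.shutdown()
```

The two thread identities differ: the “database work” never runs on the thread where the async logic lives, which is exactly why connection state gets tricky.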
And the most subtle aspect of all this — Django’s “connection” objects are thread-specific. We’ll talk much more about this later on.
The Async QuerySet Iterator doesn’t support prefetch_related().
My understanding is that this was a feature that simply didn’t get implemented in time to be released with 4.1 or 4.2.
This wasn’t too hard to solve. For other reasons, we had previously implemented our own QuerySet and ModelManager classes, derived from the Django base classes. So we were able to read the Django iterator() code to see how it populated the prefetch cache, then graft that logic (wrapped with sync_to_async(), of course) into a modified version of aiterator().
There is a Django fork to add this code into 5.0 — it needs a little work on the unit tests to be accepted for merging into base.
Database Connections are Thread Specific!
This turned out to be really subtle for us. Django’s “connections” object (django.db.connections) looks like a dictionary, but it’s not. Rather than storing a value for each key in some sort of internal dictionary, it actually executes a “setattr()” call on an internal object named “_connections” (using the supplied key as the attribute name), while getting the value for a given key is implemented with a call to getattr() (again, using the key as the name of the attribute to get).
Furthermore, _connections is an instance of a Local() class, which behaves somewhat like a cross between a thread local variable and a Python 3.7 ContextVar. In general, it’s most accurate to think that the values in django.db.connections are unique to each thread, even though the values across threads have the same keys.
Note that each value in Django’s “connections” structure will be an instance of a class named “DatabaseWrapper”. (The class of the backend “engine” is internally stored in a variable named “Database”, and the connections to those Database engines are called DatabaseWrappers. Just think of them as “connections” and it will mostly make sense until you go hunting for the code.)
Why is all of this important? Because calling the close() method on a connection from an async piece of business logic won’t close the actual connection used to talk to the database — since the connection that is really used is unique to the “sync database thread”, not the thread running the async business logic.
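Here is a minimal sketch of that behavior (our own toy classes, not Django’s source): item access on the handler maps onto attributes of a thread-local object, so each thread quietly gets its own wrapper instance per alias, and closing one thread’s connection leaves another thread’s untouched:

```python
import threading

class FakeDatabaseWrapper:
    """Toy stand-in for Django's DatabaseWrapper."""
    def __init__(self, alias):
        self.alias = alias
        self.closed = False

    def close(self):
        self.closed = True

class ConnectionHandler:
    """Dict-like front over a thread-local, as described above."""
    def __init__(self):
        self._connections = threading.local()

    def __getitem__(self, alias):
        # getattr/setattr against the thread-local: each thread lazily
        # creates and sees its own wrapper for the same alias.
        if not hasattr(self._connections, alias):
            setattr(self._connections, alias, FakeDatabaseWrapper(alias))
        return getattr(self._connections, alias)

connections = ConnectionHandler()

results = {}

def worker():
    results["worker"] = connections["default"]

t = threading.Thread(target=worker)
t.start()
t.join()

main_conn = connections["default"]
main_conn.close()  # closes ONLY this thread's "default" connection
```

The worker thread’s “default” connection is a different object from the main thread’s, and it stays open after the main thread’s close().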
Django LOVES to close connections!
This accounted for a large portion of our struggles.
By default, Django connects two handlers (“reset_queries” and “close_old_connections”) to the “request_started” Django signal, and connects “close_old_connections” to the “request_finished” signal. The reset_queries() handler clears the internal log of queries executed on the connection, and close_old_connections() spins through each defined database connection and calls the connection’s “close” method if the connection has been open long enough, or if it has become unusable because of some exception. Since the default “CONN_MAX_AGE” value is 0, this effectively means that by default every open connection is closed both at the beginning and at the end of every inbound HTTP request.
In addition to those signal handlers, the database_sync_to_async() wrapper provided by Channels (in its channels.db module) also calls close_old_connections() before and after each operation it wraps.
These calls to close_old_connections() completely disrupted the implementation of our vision.
In particular, our major goal was to use a server-side cursor to iterate through large data sets without having to load everything into memory up front. In an async approach, this means that we ought to use the “async for row in queryset” syntax. Django’s async iterator method, aiterator(), eventually opens a “chunked cursor” on the database connection, and (if you go deep enough in the code) uses a method wrapped in sync_to_async() to retrieve each chunk in turn from the cursor.
So the aiterator() method runs on the async event loop thread, while the code that actually builds a chunk of data by reading from the database cursor runs on the “sync database thread”. That database cursor persists across all of the chunked reads, as each chunk of rows is handed back to the async business logic. More precisely, the method that returns one chunk of data executes on the “sync database thread”, and it expects the cursor from which it is reading to survive from one invocation to the next, even though the method releases the thread to run other sync methods in between.
But Database Connections are Thread Specific!
If you have more than one async task accessing the database at the same time, each of these async tasks will toss a method over to the “sync thread handler” to execute. If these methods use the same connection alias (the same name defined in settings.DATABASES), then those methods will actually use the same instance of a Django database connection (the instance unique to the thread on which those methods are running). Each method opens its own cursor (or cursors) on the connection… but the connection itself is shared by all of the methods executed one at a time on this single thread.
And when one of these connections is closed, all cursors on that connection are closed with it.
This means that one async task that closes its connection can cause some other async task with a server-side cursor to be interrupted in its work. That’s bad…
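The effect is easy to demonstrate with sqlite3 from the standard library (the mechanics are the same with psycopg/PostgreSQL): one “task” is partway through reading chunks from its cursor when another path closes the shared connection, and the in-flight cursor dies with it:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (n INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)", [(i,) for i in range(100)])

cursor = conn.execute("SELECT n FROM t ORDER BY n")
first_chunk = cursor.fetchmany(10)  # one task is partway through its read

conn.close()                        # another code path closes the shared connection

try:
    cursor.fetchmany(10)            # the in-flight cursor is now dead
    interrupted = False
except sqlite3.ProgrammingError:
    interrupted = True
```

The second fetchmany() raises, even though nothing ever touched that cursor directly — exactly the cross-task interference described above.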
… and Database Transactions are Connection Specific!
Besides the issue with unintended closing of cursors, database transactions turn out to be connection-specific and not cursor-specific. When Django begins an atomic transaction, the connection’s “auto commit” mode is turned off. From that moment on, all operations on that connection are considered part of the same database transaction — even if those operations are launched from completely different async tasks.
In our case, two concurrent async tasks might be operating on data for two completely independent tenants in our database. Having operations from two different tenants managed within a single database transaction is completely unacceptable for our application. In short, Django’s thread-specific connection logic is completely unsuitable for us to use via async requests.
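Again sqlite3 can stand in for PostgreSQL to show the principle: a transaction belongs to the connection, not to any one cursor, so work done through separate cursors — our stand-ins here for separate async tasks serving separate tenants — commits or rolls back together:

```python
import sqlite3

# isolation_level=None puts sqlite3 in autocommit mode so we can
# issue an explicit BEGIN, mimicking Django turning autocommit off.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE tenant_a (v TEXT)")
conn.execute("CREATE TABLE tenant_b (v TEXT)")

conn.execute("BEGIN")                 # one transaction on the connection
cur_a = conn.cursor()                 # "task A" writes tenant A's data
cur_a.execute("INSERT INTO tenant_a VALUES ('a1')")
cur_b = conn.cursor()                 # "task B" writes tenant B's data
cur_b.execute("INSERT INTO tenant_b VALUES ('b1')")
conn.rollback()                       # task A fails... and takes B's work with it

rows_a = conn.execute("SELECT * FROM tenant_a").fetchall()
rows_b = conn.execute("SELECT * FROM tenant_b").fetchall()
```

Both tables come back empty: tenant B’s perfectly good write was discarded because it happened to share a connection-level transaction with tenant A’s failed one.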
How to Solve This? Dynamically Created Connections!
The solution we have chosen to implement involves dynamically creating additional database connection definitions and effectively “reserving” separate Django-level connections for each concurrent async task.
We had already implemented our own subclass of the PostgreSQL DatabaseWrapper class — we use separate schemas for each tenant in the database, and we had added support for getting the current schema from a ContextVar for each request being processed. We added another ContextVar — this one holding a “connection suffix” — a string that would be added to the end of the base connection alias to form a unique name for this connection.
Normal “sync” requests do not obtain a connection suffix — they use the Django connections in the same way they always have.
We implemented a singleton “AsyncConnectionPool” that is really a pool of the available suffixes: each async task obtains a suffix from the pool and stores that suffix in the appropriate ContextVar. If the pool has no available suffixes, we invent a new one of the form “:Async-N”, where N is a sequential counter; its maximum value ends up equaling the peak number of concurrent async tasks we have ever run. As each async task completes, it returns its suffix to the pool for reuse by some later task.
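The pool itself is only a few lines. This is a simplified sketch under our own names (the real version also handles locking and the settings.DATABASES bookkeeping described below):

```python
from contextvars import ContextVar

# The suffix the ORM layer appends to the base alias for this task.
connection_suffix: ContextVar = ContextVar("connection_suffix", default="")

class AsyncConnectionPool:
    """Pool of connection-name suffixes, one checked out per async task."""
    def __init__(self):
        self._free = []        # suffixes returned by finished tasks
        self._counter = 0      # high-water mark of concurrent tasks

    def acquire(self):
        if self._free:
            suffix = self._free.pop()          # reuse a returned suffix
        else:
            self._counter += 1                 # mint a brand-new one
            suffix = f":Async-{self._counter}"
        connection_suffix.set(suffix)
        return suffix

    def release(self, suffix):
        connection_suffix.set("")
        self._free.append(suffix)              # available for a later task

pool = AsyncConnectionPool()
```

A task that acquires “:Async-1” while using the “default” alias ends up reading through a connection named “default:Async-1”, reserved to it until release().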
At the time we create a new suffix, we extend settings.DATABASES, adding the new connection name (including the suffix) as a key, with a value that is a clone of the base database configuration parameters. We also create an instance of our DatabaseWrapper and store it in django.db.connections as needed.
Our custom QuerySet class overrides the “db” property, appending the suffix (obtained from the ContextVar) to the database alias if necessary, so the queryset will automatically use the reserved database connection. In rare cases, such as when we use a connection to execute raw SQL, our async code must specifically supply a fully-qualified connection name through a “using” parameter.
We had to override the close() method in our DatabaseWrapper subclass, adding an optional parameter “really_mean_it” that defaults to False. If close() is called on a dynamically created connection and “really_mean_it” is False, then we simply return without actually closing the connection. This handles the cases where Django’s close_old_connections() method calls close() on our dynamically generated connections from within database_sync_to_async().
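The override looks roughly like this (toy base class standing in for our real DatabaseWrapper subclass; “really_mean_it” is the actual parameter name described above):

```python
class BaseWrapper:
    """Toy stand-in for the stock DatabaseWrapper."""
    def __init__(self, alias, dynamically_created):
        self.alias = alias
        self.dynamically_created = dynamically_created
        self.closed = False

    def close(self):
        self.closed = True

class GuardedWrapper(BaseWrapper):
    def close(self, really_mean_it=False):
        # Ignore Django's housekeeping close() calls on reserved,
        # dynamically created connections; only the task-teardown
        # path passes the flag.
        if self.dynamically_created and not really_mean_it:
            return
        super().close()

conn = GuardedWrapper("default:Async-1", dynamically_created=True)
conn.close()                     # e.g. from close_old_connections(): ignored
survived = not conn.closed
conn.close(really_mean_it=True)  # end of the async task: really closes
```

The connection survives the housekeeping close() mid-task, and only the explicit end-of-task close actually tears it down.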
We did have to add logic to our async message handler to obtain the connection suffix at the beginning of an async request, and to return that suffix and really close the connection at the end of the async task.
It Seems to be Working
We haven’t finished running this through our extensive QA testing yet, but things look like they’re working reasonably well. We can support multiple async tasks each using server-side cursors to stream large amounts of data back to the client, while at the same time processing normal HTTP requests, without any cross-talk between requests. We will, of course, open more actual database connections than we had before implementing the async logic, but one of our reasons for implementing async was to increase parallelism throughout the application by supporting more concurrent requests. More concurrent requests will indeed require more concurrent database connections, so we’re good with that.
Our “suffix pool” appears to be working well to limit the count of additional connections that get created.
Unfortunately, there are still a large number of calls to things like close_old_connections() that sequentially process every defined connection in the list. Since we’re adding more entries to that list, we’re increasing the overhead in that particular area. And it’s somewhat worse than that, since we currently have six different database aliases defined in our settings file:
- The default transactional database,
- A second connection to the transactional database, for logging messages that should not be rolled back if an atomic transaction fails,
- A read replica we can use to offload intensive query work from the main transactional database,
- An independent read-only database updated by a completely different application,
- An obsolete entry for a Celery task database, and
- An obsolete entry for a separate historical archive database.
Obviously, we should be able to get rid of the last two – but still, every time we create a new suffix we create additional entries for every one of these database aliases… so the list grows somewhat quickly.
Ideally, we’d love to cut down on the work done by frequently-called methods like close_old_connections(), perhaps by calling them less frequently or by limiting the scope of the connections they inspect.
What About Connection Pooling?
We use PgBouncer as a session-based connection pool on each of our web and app servers. If we implemented an actual connection pool within our application, we would somewhat increase the number of connections to our database, since we support multiple versions of our application on each server machine. By having the pooling operate at a layer above the version-specific code, we pool actual database connections across all releases of our code.
This also means that an actual database connection (from PgBouncer to our database server) might well be used first by some of our synchronous business logic, then later by some async logic. We don’t believe this matters at all… PgBouncer will keep the connections in its pool straight, and shouldn’t care about subtle differences between the application server to PgBouncer connections.
Psycopg 3 offers a built-in async connection pool that has some attraction, but again that would be version-specific pooling. We think we’re still better off with PgBouncer’s pool spanning versions of code.
All the same, we’re exploring a complete move to Psycopg 3, once we resolve existing dependencies on Psycopg2. We’d like to be using Psycopg 3 by the time Django begins taking advantage of the async interfaces in that driver.
Maybe Django 5.0 will help?
One of the reasons we initially chose Django as a framework seven or eight years ago was that it has been actively updated and enhanced in a professional manner. We started with version 1.4 and have grown with it over the years. Since we believe our use cases are not terribly unusual, we believe that Django will be growing in the direction of our needs.
(Personally, I’ll be retiring from the job next spring and would like to spend some time contributing back to the ORM portion of the framework. I’m also hoping to convince my boss to let me spend more time on this in the months ahead as we transition my architectural role to others.)