After upgrading from Django 3.0 to 3.1, a django.db.utils.InterfaceError: connection already closed issue started occurring randomly.
gunicorn with 1 worker, 6 threads
PostgreSQL version 13
Django version 3.1
Using WSGI
Is it possible that the introduction of async support in Django 3.1 is somehow introducing instability in DB connections? I don't see any other feature in Django 3.1 that might cause this issue.
Upgraded Django to versions 3.2 and 4.0, but the issue persists.
Is it possible that we are facing a race condition due to the use of multithreading and the introduction of async in Django 3.1? We are not using async functionality in our code.
Actually, I’d be more likely to want to investigate what your “close_old_connections” handler does, and whether or not its functionality is affected by the upgrade. If that handler does anything with database connections, that’s likely the culprit.
(Or are you talking about the system-provided handler and not your own handler? If so, then that’s a different issue.)
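For reference, a handler like that is typically wired up via the request signals. A minimal sketch, assuming a custom handler (the handler name is made up here; Django itself already connects django.db.close_old_connections to these same signals):

# Hypothetical custom handler registration; django.db already connects
# close_old_connections to request_started and request_finished.
from django.core.signals import request_finished, request_started
from django.db import close_old_connections


def my_connection_cleanup_handler(sender, **kwargs):
    # Anything extra done with connections here runs on every request
    # and is a natural suspect for "connection already closed" errors.
    close_old_connections()


request_started.connect(my_connection_cleanup_handler)
request_finished.connect(my_connection_cleanup_handler)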
The only other comment I would make - and this is drawing from old experience which may no longer be true - is that we never run wsgi processes in a multi-thread worker. We only use the multi-worker environment where each worker runs a single thread. (This goes back to 2014 - Django 1.6 on Python 2, again acknowledging that a lot has changed since then - but this has been our common practice and we’ve never seen a reason to look to change that.)
What I'm seeing in the gunicorn docs is that they recommend either workers or threads be set to "2-4 x $(NUM_CORES)", so neither setting really appears to provide a benefit over the other. And given that these processes are intended to be killed / restarted on a periodic basis, it does seem safer to me to run 1 thread per process.
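For illustration, a single-thread-per-worker setup can be expressed in a gunicorn.conf.py like this (the project name and worker counts are placeholders, not values from this thread):

# gunicorn.conf.py -- illustrative single-thread-per-worker configuration.
import multiprocessing

wsgi_app = "myproject.wsgi:application"    # placeholder project name
workers = multiprocessing.cpu_count() * 2  # in the "2-4 x $(NUM_CORES)" range
threads = 1                                # one thread per worker process
max_requests = 1000                        # recycle each worker periodically
max_requests_jitter = 100                  # stagger worker restarts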
It does seem reasonable to believe that the worker-restart process in gunicorn may not be fully thread-safe and that this issue is revealed by some change in Django 3.1.
The issue was actually related to the use of the smart_open third-party library. The library was used to serve an SPA from Django. As the SPA was also consuming some other endpoints of the same API, somehow the connections were being dropped.
I am here to bring this up and share our experience.
We are experiencing the same error with long multi-threaded processes that we initiate via our custom manage.py command. These threads run for weeks, sometimes even months, but never any longer than a few months as this error knocks them out.
@Simanas what's the correct way to handle this? I am starting a Kafka broker using a manage.py command and the script experiences this error very frequently. How can I overcome this?
If someone in the community identifies an issue that they need fixed, they can file a ticket. But until someone in the community can develop the patch to fix it, it will remain unfixed.
@utkarshpandey12 Thanks for creating it! It got closed up quite quickly with the answer that we have all been looking for! Damn it's good! I have quickly implemented the necessary changes to our routines and now I finally have my good night's sleep back.
Reposting it here, in case somebody finds himself in this thread and wants to know what to do:
Well yes. As per Simon's comment, all you have to do is introduce a periodic cleanup of old connections.
Since connections are shared across threads, I have started a separate thread like so, before launching our main long-running process:
import threading
import time

from django.db import close_old_connections

# These two methods live on our long-running command class:
def periodic_connections_cleanup(self, exit):
    # Runs in a daemon thread; wakes up hourly until the exit event is set.
    while not exit.is_set():
        time.sleep(60 * 60)
        print("Hourly old connections close up!")
        close_old_connections()

def run_longrunner(self):
    threading.Thread(
        target=self.periodic_connections_cleanup, args=(self.exit_event,),
        daemon=True,
    ).start()
    # braaaappapppappapap runs forever from here!
I did some heavy testing, making many different requests to the database while calling close_old_connections() every second. It seems to be working without any issues. Now only time will tell how it really performs over a period of a few months, but I am very optimistic!
What if I do the cleanup in the command thread only, instead of starting a new thread, just like Simon did in the answer, followed by a one-second sleep()? Any drawbacks to this? @Simanas
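What I have in mind is roughly this (a hypothetical sketch of a custom management command; the command class, the work placeholder and the one-second interval are illustrative, not code from this thread):

# Hypothetical sketch of doing the cleanup in the main command thread.
import time

from django.core.management.base import BaseCommand
from django.db import close_old_connections


class Command(BaseCommand):
    help = "Long-running worker that cleans up its own stale DB connections."

    def handle(self, *args, **options):
        while True:
            close_old_connections()     # drop connections past CONN_MAX_AGE or broken ones
            self.do_one_unit_of_work()  # placeholder for the real processing
            time.sleep(1)

    def do_one_unit_of_work(self):
        pass  # replace with the actual work (e.g. polling a queue)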
That's perfectly fine if it is something that you can implement into your main thread. I had to start a separate thread due to complicated things that happen later in my main process.
Those connections are not closed automatically, so they stay around forever in the long-running thread, which means that you can't close those old connections from another thread, as each thread has its own connections and they are not shared between threads.
So I think what happens is that the database closes long-running connections, as this is not how a database connection is meant to be used after all, and we get this error in our processes.
However, I have also found that there is a CONN_HEALTH_CHECKS option that can be enabled per database, which checks connection health before making any requests to the database. That creates some overhead, but if you are not running a site with a million visitors a day it should not be an issue.
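For reference, it is a per-database flag in the DATABASES setting (available since Django 4.1); a sketch with placeholder connection details:

# settings.py (illustrative values; CONN_HEALTH_CHECKS requires Django 4.1+)
DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.postgresql",
        "NAME": "mydb",              # placeholder
        "USER": "myuser",            # placeholder
        "PASSWORD": "secret",        # placeholder
        "HOST": "localhost",
        "PORT": "5432",
        "CONN_MAX_AGE": 600,         # keep connections open for up to 10 minutes
        "CONN_HEALTH_CHECKS": True,  # verify the connection is usable before each request
    }
}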
So I have now enabled it on my project. Note that if you do not hear back from me in this thread for more than 3 months, it means that it worked out beautifully!