django.db.utils.InterfaceError: connection already closed when upgrading from Django 3.0 to 3.1

After upgrading from Django 3.0 to 3.1, the django.db.utils.InterfaceError: connection already closed error started occurring randomly.

  • gunicorn with 1 worker and 6 threads
  • PostgreSQL version 13
  • Django version 3.1
  • Using WSGI

Is it possible that the introduction of async support in Django 3.1 is somehow introducing instability in DB connections? I don’t see any other feature in Django 3.1 that might cause this issue.

I would also verify that the versions of all other packages form the most current and compatible set for this installation.

Side note: I’ve not seen any issues like this with Django 3.2, but we don’t use gunicorn. We run Django in uwsgi behind nginx.


The issue disappears when changing Gunicorn to 1 worker and 1 thread, or if the following signal connections are removed:

signals.request_started.connect(close_old_connections)
signals.request_finished.connect(close_old_connections)

Upgraded Django to versions 3.2 and 4.0, but the issue persists.

Is it possible that we are facing a race condition due to the use of multithreading and the introduction of async in Django 3.1? We are not using any async functionality in our code.

Actually, I’d be more likely to want to investigate what your “close_old_connections” handler does, and whether or not its functionality is affected by the upgrade. If that handler does anything with database connections, that’s likely the culprit.

(Or are you talking about the system-provided handler and not your own handler? If so, then that’s a different issue.)


I’m talking about the system-provided handler. I commented it out, following the lead in this discussion: DB connection closed by uWSGI in the middle of handling a request?! - #2 by richardthegit, and got exactly the same outcome.
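For reference, the system-provided handler is tiny. Simplified (and the exact code varies a bit between Django versions), it does roughly this:

    # Rough approximation of the handler wiring in django/db/__init__.py
    from django.core import signals
    from django.db import connections

    def close_old_connections(**kwargs):
        # Close each of the current thread's connections if it has become
        # unusable or has outlived CONN_MAX_AGE.
        for conn in connections.all():
            conn.close_if_unusable_or_obsolete()

    signals.request_started.connect(close_old_connections)
    signals.request_finished.connect(close_old_connections)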


The only other comment I would make - and this is drawing from old experience which may no longer be true - is that we never run wsgi processes in a multi-thread worker. We only use the multi-worker environment where each worker runs a single thread. (This goes back to 2014 - Django 1.6 on Python 2, again acknowledging that a lot has changed since then - but this has been our common practice and we’ve never seen a reason to look to change that.)

What I’m seeing in the gunicorn docs is that they recommend setting either workers or threads to "2-4 x $(NUM_CORES)", so neither setting really appears to provide a benefit over the other. And given that these processes are intended to be killed/restarted on a periodic basis, it does seem safer to me to run 1 thread per process.
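For illustration, a gunicorn.conf.py along those lines might look like the sketch below (the numbers are placeholders, not a recommendation):

    # gunicorn.conf.py -- illustrative values only; the app module is passed on the command line
    bind = "0.0.0.0:8000"

    workers = 4        # e.g. somewhere in the 2-4 x $(NUM_CORES) range
    threads = 1        # one thread per worker process
    timeout = 60

    # Recycle workers periodically so stale state (including DB connections)
    # does not accumulate indefinitely.
    max_requests = 1000
    max_requests_jitter = 100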

It does seem reasonable to believe that the worker-restart process in gunicorn may not be fully thread-safe and that this issue is revealed by some change in Django 3.1.


The issue was actually related to the use of the smart_open third-party library. The library was used to serve an SPA from Django. As the SPA was also consuming some other endpoints of the same API, somehow the connections were dropped.

I’d like to bring this topic up again and share our experience.

We are experiencing the same error with long multi-threaded processes that we initiate via a custom manage.py command. These threads run for weeks, sometimes even months, but never longer than a few months before this error knocks them out.

@Simanas, what’s the correct way to handle this? I am starting a Kafka broker using a manage.py command and the script experiences this error very frequently. How can I overcome this?

We have introduced daily restarts of these processes as a duct-tape fix… Hopefully somebody from the Django core dev team comes up with something…

Note:

There is no such thing.

Django is a 100% community-driven project.

If someone in the community identifies an issue that they need fixed, they can file a ticket. But until someone in the community can develop the patch to fix it, it will remain unfixed.


@Simanas

But in deployed services, how do you restart a script that is run through a manage.py command? How can that be done?

My application is running inside Kubernetes with multiple pods that scale with incoming requests.

When the app starts after a push, it executes a start.sh shell script from the root, which contains these commands:

start.sh:

    python manage.py collectstatic
    python manage.py kafka.py  # this is where the issue comes up
    gunicorn --bind 0.0.0.0:8000 --timeout 60 --workers 5

@KenWhitesell
I shall raise a ticket regarding this soon and update here. Thanks for your input.


ticket_link

Filed a ticket based on what I have been facing for months now.
@KenWhitesell @Simanas


@utkarshpandey12 Thanks for creating it! It got closed quite quickly with the answer that we have all been looking for! Damn, it’s good! I have quickly implemented the necessary changes to our routines and now I finally have my good night’s sleep back.

Reposting it here, in case somebody finds himself in this thread and wants to know what to do:

How to deal with long running processes initiated via manage.py command?

@Simanas, is the issue resolved in your case? Can you confirm this?

Well, yes. As per Simon’s comment, all you have to do is introduce a periodic cleanup of old connections.

Since connections are shared across threads, I started a separate thread like so, before launching our main long-running process:

    # These methods live on our management command class; they assume
    # `import threading, time` and `from django.db import close_old_connections`
    # at module level, and that self.exit_event is a threading.Event.
    def periodic_connections_cleanup(self, exit):
        # Daemon-thread loop: once an hour, close connections that have become
        # unusable or have outlived CONN_MAX_AGE.
        while not exit.is_set():
            time.sleep(60 * 60)
            print("Hourly old connections close up!")
            close_old_connections()

    def run_longrunner(self):
        threading.Thread(
            target=self.periodic_connections_cleanup, args=(self.exit_event,),
            daemon=True
        ).start()

        # braaaappapppappapap runs forever from here!

I did some heavy testing, making many different requests to the database while calling close_old_connections() every second. It seems to be working without any issues. Now only time will tell how it really performs over a period of a few months, but I am very optimistic! :blush:

What if I do the cleanup in the command thread only, instead of starting a new thread, just like Simon had done in the answer, followed by a one-second sleep()? Are there any drawbacks to this?
@Simanas

That’s perfectly fine if it is something that you can implement into your main thread. I had to start a separate thread due to complicated things that happen later in my main process.
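For anyone landing here later, a minimal sketch of that in-thread variant (the callable passed in is a placeholder for whatever work the command actually does):

    import time

    from django.db import close_old_connections

    def run_forever(process_one_item):
        # Single-threaded long runner: recycle stale or unusable connections
        # on every iteration, right before the next round of database work.
        while True:
            close_old_connections()
            process_one_item()
            time.sleep(1)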


Thanks for this. Appreciate your inputs.
Will test this and see if the issue is fixed after monitoring for a while.

@Simanas @KenWhitesell

I am here to report that my previous solution did not work out. We have run into this error again… so upsetting. :frowning:

So I did some more digging, and figured out that my previous thought was not correct. It turns out that Django creates a new DB connection for every new thread, and those connections are not closed automatically. That means they stay around forever in the long-running thread, and you cannot close those old connections from another thread, since each thread has its own connections and they are not shared between threads.
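Here is a tiny standalone illustration of that per-thread behaviour (it assumes a configured Django project; the function name is just for demonstration):

    import threading

    from django.db import connection

    def show_connection_identity():
        # Each thread that touches the ORM gets its own connection wrapper,
        # so the underlying DB-API connection object differs per thread.
        connection.ensure_connection()
        print(threading.current_thread().name, id(connection.connection))
        connection.close()  # tidy up this thread's own connection

    show_connection_identity()                                  # main thread
    threading.Thread(target=show_connection_identity).start()   # a different connection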

So I think what happens is that the database closes long-running connections, as this is not how database connections are meant to be used after all, and we get this error in our processes.

However, I have also found that there is a CONN_HEALTH_CHECKS option (added in Django 4.1) that can be enabled per database, which health-checks existing connections before reusing them. That creates some overhead, but if you are not running a site with a million visitors a day it should not be an issue.
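For reference, a settings.py sketch of how that could look (the database name and CONN_MAX_AGE value are placeholders):

    # settings.py -- illustrative values; CONN_HEALTH_CHECKS requires Django 4.1+
    DATABASES = {
        "default": {
            "ENGINE": "django.db.backends.postgresql",
            "NAME": "mydb",              # placeholder
            "CONN_MAX_AGE": 600,         # keep connections open for up to 10 minutes
            "CONN_HEALTH_CHECKS": True,  # verify a persistent connection is usable before reusing it
        }
    }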

So I have now enabled it on my project. Note that if you do not hear back from me in this thread for more than 3 months, it means that it worked out beautifully! :slight_smile: