InterfaceError and TransactionManagementError (PostgreSQL)

Greetings dear community.

Calling for your help here.
What we are experiencing a lot of are “InterfaceError: connection already closed” and “TransactionManagementError” exceptions. They do not necessarily accompany each other, but both happen consistently.
PG logs are not showing anything unexpected or suspicious.
Affected endpoints are different, and I see no common behaviour in them. For example, it sometimes fails even in the admin interface during GET requests.
It does not happen locally.

Our setup is pretty standard:

  • Python 3.11, Django 4.1. PostgreSQL 14.5.
  • We are hosted on Heroku; waitress works as the webserver (6 threads), and 2-5 servers (dynos) may be running.
  • On top of that, we run 1-2 dramatiq workers (1 process, 4 threads each).
  • Database statement timeout is set to 10000 ms (10 seconds).
  • And we are using persistent database connections (we tried setting DATABASE_CONN_MAX_AGE between 10 and 600 seconds).
  • We are not using PGBouncer.
  • There are no issues with reaching connection limits, we have plenty of space there.
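For concreteness, the persistent-connection setup looks roughly like this (a sketch only; the environment variable names and defaults are illustrative, and the real settings are parsed from Heroku's DATABASE_URL):

```python
# Hypothetical settings.py fragment reflecting the setup above.
import os

DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.postgresql",
        "NAME": os.environ.get("DATABASE_NAME", "app"),
        # Persistent connections: reuse a connection for up to this many
        # seconds (we experimented with values between 10 and 600).
        "CONN_MAX_AGE": int(os.environ.get("DATABASE_CONN_MAX_AGE", "600")),
        "OPTIONS": {
            # The 10-second statement timeout mentioned above.
            "options": "-c statement_timeout=10000",
        },
    }
}
```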

I am currently out of ideas on how to debug this, and would appreciate your advice or any other help.


Here, I am going to share example tracebacks captured by Sentry:

  1. InterfaceError

The only custom code:

  2. TransactionManagementError:

Django third-party libraries are rarely guaranteed to be thread safe. Depending upon what Django-related packages you have installed, one of them may be the source of the problem.

I always suggest running multiple separate processes for your Django instances with single threads per process.
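A minimal sketch of what that could look like on Heroku, assuming waitress's `--threads` flag (the module path `config.wsgi` is illustrative); concurrency then comes from scaling dynos rather than threads:

```
# Hypothetical Procfile line: one thread per waitress process.
web: waitress-serve --port=$PORT --threads=1 config.wsgi:application
```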

Also, Python 3.11 compatibility isn’t supported in 4.1 before 4.1.3 - make sure your Django is up-to-date.

Multithreading wouldn’t explain dropped connections, though, right? It would only explain stale open connections. However, those wouldn’t be used in a running thread.

Thanks for the answer!

I did skim through our packages, and there is nothing that could mess with database connections.
Also, as @codingjoe mentioned, the symptoms would be different - I do not see a burst of stale connections; rather, the ones I am using are being randomly dropped.

Regarding Python and Django versions - we are running the current latest Django (4.1.3), and this issue was present on Python 3.10.6 / Django 4.0.7 as well.

Those aren’t the symptoms we have seen in that case. The connections can be opened and closed irregularly based upon load and activity.

I don’t have an explanation, merely the observation that our connection-related issues went away (along with a number of other occasional “odd” errors) when we switched to a “process-only” deployment.


Okay, my latest findings show that Django opens a connection and almost instantly closes it.
This happens in this part:

    def close_if_unusable_or_obsolete(self):
        """
        Close the current connection if unrecoverable errors have occurred
        or if it outlived its maximum age.
        """
        if self.connection is not None:
            self.health_check_done = False
            # If the application didn't restore the original autocommit setting,
            # don't take chances, drop the connection.
            if self.get_autocommit() != self.settings_dict["AUTOCOMMIT"]:
                self.close()
                return

I guess it might be due to atomic transactions (the only thing I know of that turns off autocommit mode), but I am not sure how or why. In the case of a GET request to an admin object detail page, it does not make any sense…
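To convince myself that this branch is what drops the connection, here is a standalone toy model of that decision. `FakeWrapper` is illustrative only, not Django's real `BaseDatabaseWrapper`; it just mirrors the check shown above:

```python
# Toy, self-contained model of the close_if_unusable_or_obsolete() decision.
class FakeWrapper:
    def __init__(self, autocommit_setting=True):
        self.connection = object()  # stands in for an open DB connection
        self.settings_dict = {"AUTOCOMMIT": autocommit_setting}
        self.autocommit = autocommit_setting
        self.health_check_done = True

    def get_autocommit(self):
        return self.autocommit

    def close(self):
        self.connection = None

    def close_if_unusable_or_obsolete(self):
        if self.connection is not None:
            self.health_check_done = False
            # An atomic block that never restored autocommit trips this
            # check, and the connection is dropped on the next request.
            if self.get_autocommit() != self.settings_dict["AUTOCOMMIT"]:
                self.close()


w = FakeWrapper()
w.autocommit = False  # e.g. leftover state from an unfinished transaction
w.close_if_unusable_or_obsolete()
print(w.connection is None)  # the connection was dropped -> True
```

So anything that leaves autocommit switched off between requests would make Django drop an otherwise healthy connection here.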

Any luck with this? We’re running into the same issue… it’s completely intermittent. I’ve been burning so much time trying to find out why and how to fix it.

Hey @tsantor! Sadly, there are no positive updates from my end.

We recently updated to Django 4.2 and are now trying to switch to psycopg3.
Sadly, neither change seems to resolve the issue (but I need to test it properly).

Could you please share your stack and the way you host your server? I wonder if this has something to do with Heroku…


Exact same issue for me, no solutions yet, I gave more details on my config here: Seemingly random PostgreSQL connection drops while handling Django requests - #19 by lv16

I am considering this topic resolved, as in our case we were able to pin the issue down to the whitenoise version. Pinning whitenoise==5.3.0 resolved all the problems. Kudos @kgrinberg for the hint!

I also posted the issue to the whitenoise repository for follow-up tracking:

We will try to find out what exactly is going wrong there.

I’m going to share my experience with this. It’s not totally related to your setup, but I think it may help someone in the wild later on (possibly me).


Don’t leave a process that uses the database idle for a long time on AWS RDS. It will mess with the connection.

I was experiencing an issue where, after some time, my celery beat (scheduler) was receiving a
django.db.utils.InterfaceError: cursor already closed, and I was trying to figure out what was causing it in the first place. So before the details, here’s my setup:

Python 3.11

# Python deps

# DB
PostgreSQL 16 (local)
AWS RDS (production)

So, for those that haven’t used tenant-schemas-celery before, basically all we need to know is that for every Celery task that is “beaten” (produced/enqueued), the beat process queries the database for all the tenants in order to produce that task. And this is where the errors were being raised.
After the process had been running for an hour, it received this error. Then the process was restarted and everything worked again. And again, after one hour the same error was raised.
Here’s my beat_schedule config, for reference:

  # ... other configs ...
  "beat_schedule": {
      "discover_communication_accounts_every_day_03_01_utc": {
          "task": "comms.tasks.discover_communication_accounts",
          "schedule": crontab(minute="1", hour="3"),
          "options": {"expires": timedelta(days=1).total_seconds()},
      },
      "niko_niko_check_every_30_minutes": {
          "task": "niko_niko.tasks.niko_niko_send_to_all_employees",
          "schedule": crontab(minute="*/30"),
          "options": {"expires": timedelta(minutes=29).total_seconds()},
      },
      "employee_send_video_exercise_every_30_minutes": {
          "task": "exercises.tasks.exercise_send_to_all_employees",
          "schedule": crontab(minute="*/30"),
          "options": {"expires": timedelta(minutes=29).total_seconds()},
      },
  },

So here’s how I tried to solve this issue:

  • Verified the CONN_MAX_AGE configuration. Initially this value was not set in my configuration file, so I tried some other values directly in production, like 0 (closing the connection after each use; but I guess nothing was actually closing the session, since I monkey-patched the close method of the DB backend and no one was calling it), explicitly setting it to None, and other small integer values. None of these options had any effect; the error still happened after the exact same interval.

  • I tried to reproduce this behavior locally, but couldn’t. So something was messing with the connection on the production database only, and only after some time.

  • I noticed that my beat process was idle most of the time during the one-hour span before the error happened. This is because right now I only have periodic tasks that fire at the “same time” (every half hour, at minutes 0 and 30), and the first execution worked flawlessly. But after that first execution the beat process would not use the database for ~29 minutes, and on the second execution (almost an hour after the process started) the connection was broken, raising the error mentioned. So I changed my beat_schedule configuration a bit, making one of the entries run every minute, and after this change I no longer got these errors after one hour.
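The monkey-patching mentioned in the first bullet can be done generically. Here is a sketch of a wrapper that logs a stack trace on every close() call; in Django you would patch the backend's DatabaseWrapper class (e.g. in an AppConfig.ready()), but for illustration it is applied to a dummy class:

```python
import traceback


def log_close_calls(wrapper_cls):
    """Wrap wrapper_cls.close() to print a stack trace on every call."""
    original_close = wrapper_cls.close

    def logged_close(self, *args, **kwargs):
        print("DB connection close() called from:")
        traceback.print_stack(limit=5)
        return original_close(self, *args, **kwargs)

    wrapper_cls.close = logged_close


# Illustration on a dummy class standing in for the real backend wrapper.
class DummyConn:
    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True


log_close_calls(DummyConn)
conn = DummyConn()
conn.close()  # still closes, but now logs who called it first
```

This is how you can tell whether Django (or anything else in-process) is actually closing the connection, as opposed to the server side silently dropping it.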

So my conclusion is: AWS RDS is closing (or messing with) the connection because it is “idle” for too long. And the solution for me is to have more frequent periodic tasks. I can’t have my beat process idle for that long, otherwise the connection will be terminated or interrupted.
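The workaround could be made explicit with a dedicated keep-alive entry instead of relying on real tasks firing often enough. This is only a sketch: the task path `core.tasks.db_keepalive` (imagined to just run a cheap `SELECT 1`) and the 5-minute interval are made up, not from my actual config:

```python
# Hypothetical beat_schedule entry so the beat process touches the DB
# more often than RDS's idle timeout.
"beat_schedule": {
    "db_keepalive_every_5_minutes": {
        "task": "core.tasks.db_keepalive",  # would execute SELECT 1
        "schedule": crontab(minute="*/5"),
    },
},
```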