InterfaceError and TransactionManagementError (PostgreSQL)

Greetings dear community.

Calling for your help here.
What we are experiencing a lot of are “InterfaceError: connection already closed” and “TransactionManagementError” exceptions. They do not necessarily accompany each other, but both happen consistently.
PG logs are not showing anything unexpected or suspicious.
Affected endpoints are different, and I see no common behaviour in them. For example, it sometimes fails even in the admin interface during GET requests.
It does not happen locally.

Our setup is pretty standard:

  • Python 3.11, Django 4.1. PostgreSQL 14.5.
  • We are hosted on Heroku; waitress works as the webserver (6 threads), and 2-5 servers (dynos) may be running.
  • On top of that, we run 1-2 dramatiq workers (1 process, 4 threads each).
  • Database statement timeout is set to 10000 ms (10 seconds).
  • And we are using persistent database connections (we tried setting DATABASE_CONN_MAX_AGE between 10 and 600 seconds).
  • We are not using PGBouncer.
  • There are no issues with reaching connection limits, we have plenty of space there.
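For concreteness, the persistent-connection setup looks roughly like this (a sketch only; the environment variable names and defaults are illustrative, and the real settings are parsed from Heroku's DATABASE_URL):

```python
# Hypothetical settings.py fragment reflecting the setup above.
import os

DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.postgresql",
        "NAME": os.environ.get("DATABASE_NAME", "app"),
        # Persistent connections: reuse a connection for up to this many
        # seconds (we experimented with values between 10 and 600).
        "CONN_MAX_AGE": int(os.environ.get("DATABASE_CONN_MAX_AGE", "600")),
        "OPTIONS": {
            # The 10-second statement timeout mentioned above.
            "options": "-c statement_timeout=10000",
        },
    }
}
```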

I am currently out of ideas on how to debug this, and would appreciate your advice or any other help.


Here, I am going to share example tracebacks captured by Sentry:

  1. InterfaceError

The only custom code:

  2. TransactionManagementError:

Django third-party libraries are rarely guaranteed to be thread safe. Depending upon what Django-related packages you have installed, one of them may be the source of the problem.

I always suggest running multiple separate processes for your Django instances with single threads per process.
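A minimal sketch of what that could look like on Heroku, assuming waitress's `--threads` flag (the module path `config.wsgi` is illustrative); concurrency then comes from scaling dynos rather than threads:

```
# Hypothetical Procfile line: one thread per waitress process.
web: waitress-serve --port=$PORT --threads=1 config.wsgi:application
```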

Also, Python 3.11 compatibility isn’t supported in 4.1 before 4.1.3 - make sure your Django is up-to-date.

Multithreading wouldn’t explain dropped connections, though, right? It would only explain stale open connections. However, those wouldn’t be used in a running thread.

Thanks for the answer!

I did skim through our packages, and there is nothing that could mess with database connections.
Also, as @codingjoe mentioned, the symptoms would be different - I do not see a burst of stale connections; rather, the ones I am using are being randomly dropped.

Regarding Python and Django versions - we are running the current latest Django (4.1.3), and this issue was present on Python 3.10.6 / Django 4.0.7 as well.

Those aren’t the symptoms we have seen in that case. The connections can be opened and closed irregularly based upon load and activity.

I don’t have an explanation, merely the observation that our connection-related issues went away (along with a number of other occasional “odd” errors) when we switched to a “process-only” deployment.


Okay, my latest findings show that Django opens a connection and almost instantly closes it.
This happens in this part:

    def close_if_unusable_or_obsolete(self):
        """
        Close the current connection if unrecoverable errors have occurred
        or if it outlived its maximum age.
        """
        if self.connection is not None:
            self.health_check_done = False
            # If the application didn't restore the original autocommit setting,
            # don't take chances, drop the connection.
            if self.get_autocommit() != self.settings_dict["AUTOCOMMIT"]:
                self.close()
                return

I guess it might be due to atomic transactions (the only thing I know of that turns off autocommit mode), but I am not sure how or why. In the case of a GET request to an admin object detail page, it does not make any sense…
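To convince myself that this branch is what drops the connection, here is a standalone toy model of that decision. `FakeWrapper` is illustrative only, not Django's real `BaseDatabaseWrapper`; it just mirrors the check shown above:

```python
# Toy, self-contained model of the close_if_unusable_or_obsolete() decision.
class FakeWrapper:
    def __init__(self, autocommit_setting=True):
        self.connection = object()  # stands in for an open DB connection
        self.settings_dict = {"AUTOCOMMIT": autocommit_setting}
        self.autocommit = autocommit_setting
        self.health_check_done = True

    def get_autocommit(self):
        return self.autocommit

    def close(self):
        self.connection = None

    def close_if_unusable_or_obsolete(self):
        if self.connection is not None:
            self.health_check_done = False
            # An atomic block that never restored autocommit trips this
            # check, and the connection is dropped on the next request.
            if self.get_autocommit() != self.settings_dict["AUTOCOMMIT"]:
                self.close()


w = FakeWrapper()
w.autocommit = False  # e.g. leftover state from an unfinished transaction
w.close_if_unusable_or_obsolete()
print(w.connection is None)  # the connection was dropped -> True
```

So anything that leaves autocommit switched off between requests would make Django drop an otherwise healthy connection here.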

Any luck with this? We’re running into the same issue… it’s completely intermittent. I’ve been burning so much time trying to find out why and how to fix it.

Hey @tsantor! Sadly, there are no positive updates from my end.

We recently updated to Django 4.2 and are now trying to switch to psycopg3.
Sadly, neither change seems to resolve the issue (but I need to test it properly).

Could you please share your stack and the way you host your server? I wonder if this has something to do with Heroku…


Exact same issue for me, no solutions yet, I gave more details on my config here: Seemingly random PostgreSQL connection drops while handling Django requests - #19 by lv16

I am considering this topic resolved, as in our case we were able to pin the issue down to the whitenoise version. Pinning whitenoise==5.3.0 resolved all the problems. Kudos @kgrinberg for the hint!

I also posted the issue to the whitenoise repository for follow-up tracking:

We will try to find out what exactly is going wrong there.

I’m going to share my experience with this. It’s not totally related to your setup, but I think it may help someone in the wild later on (possibly me).


Don’t leave a process that uses the database idle for a long time on AWS RDS. It will mess with the connection.

I was experiencing an issue where, after some time, my celery beat (scheduler) was receiving a
django.db.utils.InterfaceError: cursor already closed, and I was trying to figure out what was causing it in the first place. So before the details, here’s my setup:

Python 3.11

# Python deps

# DB
PostgreSQL 16 (local)
AWS RDS (production)

So, for those that haven’t used tenant-schemas-celery before, basically all we need to know is that for every Celery task that is “beaten” (produced/enqueued), the beat process queries the database for all the tenants in order to produce that task. And this is where the errors were being raised.
After the process had been running for an hour, it received this error. Then the process was restarted and everything worked again. And again, after one hour the same error was raised.
Here’s my beat_schedule config, for reference:

  # ... other configs ...
  "beat_schedule": {
      "discover_communication_accounts_every_day_03_01_utc": {
          "task": "comms.tasks.discover_communication_accounts",
          "schedule": crontab(minute="1", hour="3"),
          "options": {"expires": timedelta(days=1).total_seconds()},
      },
      "niko_niko_check_every_30_minutes": {
          "task": "niko_niko.tasks.niko_niko_send_to_all_employees",
          "schedule": crontab(minute="*/30"),
          "options": {"expires": timedelta(minutes=29).total_seconds()},
      },
      "employee_send_video_exercise_every_30_minutes": {
          "task": "exercises.tasks.exercise_send_to_all_employees",
          "schedule": crontab(minute="*/30"),
          "options": {"expires": timedelta(minutes=29).total_seconds()},
      },
  },

So here’s how I tried to solve this issue:

  • Verified the CONN_MAX_AGE configuration. Initially this value was not set in my configuration file, so I tried some other values directly in production, like 0 (closing the connection after each use; but I guess nothing was actually closing the session, since I monkey-patched the close method of the DB backend and no one was calling it), explicitly setting it to None, and other small integer values. None of these options had any effect; the error still happened after the exact same interval.

  • I tried to reproduce this behavior locally, but couldn’t. So something was messing with the connection on the production database only, and only after some time.

  • I noticed that my beat process was idle most of the time during the one-hour span before the error happened. This is because right now I only have periodic tasks that fire at the “same time” (every half hour, at minutes 0 and 30), and the first execution worked flawlessly. But after that first execution the beat process would not use the database for ~29 minutes, and on the second execution (almost an hour after the process started) the connection was broken, raising the error mentioned. So I changed my beat_schedule configuration a bit, making one of the entries run every minute, and after this change I no longer got these errors after one hour.
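The monkey-patching mentioned in the first bullet can be done generically. Here is a sketch of a wrapper that logs a stack trace on every close() call; in Django you would patch the backend's DatabaseWrapper class (e.g. in an AppConfig.ready()), but for illustration it is applied to a dummy class:

```python
import traceback


def log_close_calls(wrapper_cls):
    """Wrap wrapper_cls.close() to print a stack trace on every call."""
    original_close = wrapper_cls.close

    def logged_close(self, *args, **kwargs):
        print("DB connection close() called from:")
        traceback.print_stack(limit=5)
        return original_close(self, *args, **kwargs)

    wrapper_cls.close = logged_close


# Illustration on a dummy class standing in for the real backend wrapper.
class DummyConn:
    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True


log_close_calls(DummyConn)
conn = DummyConn()
conn.close()  # still closes, but now logs who called it first
```

This is how you can tell whether Django (or anything else in-process) is actually closing the connection, as opposed to the server side silently dropping it.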

So my conclusion is: AWS RDS is closing (or messing with) the connection because it is “idle” for too long. And the solution for me is to have more frequent periodic tasks. I can’t have my beat process idle for that long, otherwise the connection will be terminated or interrupted.
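The workaround could be made explicit with a dedicated keep-alive entry instead of relying on real tasks firing often enough. This is only a sketch: the task path `core.tasks.db_keepalive` (imagined to just run a cheap `SELECT 1`) and the 5-minute interval are made up, not from my actual config:

```python
# Hypothetical beat_schedule entry so the beat process touches the DB
# more often than RDS's idle timeout.
"beat_schedule": {
    "db_keepalive_every_5_minutes": {
        "task": "core.tasks.db_keepalive",  # would execute SELECT 1
        "schedule": crontab(minute="*/5"),
    },
},
```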