Hi all,
I’m dealing with an operations problem on the Heroku platform: in one specific app (we have 10 running the same project), a number of dynos, web and worker alike, have been unable to connect to the database. In the past, this happened very infrequently; I’d restart the problematic dyno and things would be good again. Over the past week or so, things have escalated. One of the remediation proposals from Heroku’s support was to retry the connection to the database. I don’t recall, and haven’t been able to find, any documentation on having Django retry a database connection automatically.
Has anyone done this before outside of dropping down to the cursor level?
From the Django “Databases” documentation, it looks to me like Django is going to retry the connection on each page request if there’s no active connection available. If repeated requests to the same URL don’t work, I would guess that the worker process has hit some limit, with its connections currently hung or tied up.
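For code that runs outside the request cycle (a worker dyno, say), an explicit retry could look something like the sketch below. This is not a Django feature, just a generic helper; the name retry and its parameters are made up for illustration, and the Django usage in the trailing comment assumes connection.ensure_connection() raising OperationalError is the failure you want to catch.

```python
import time


def retry(operation, exceptions, attempts=3, base_delay=0.5, sleep=time.sleep):
    """Call operation(); if it raises one of `exceptions`, wait and try again.

    Hypothetical helper: waits base_delay, then 2x, 4x, ... between attempts
    and re-raises the last error once `attempts` calls have failed.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except exceptions:
            if attempt == attempts:
                raise  # out of attempts: let the caller see the original error
            sleep(base_delay * 2 ** (attempt - 1))


# Possible use inside a Django project (assumption, not run here):
#
#   from django.db import connection
#   from django.db.utils import OperationalError
#
#   retry(connection.ensure_connection, (OperationalError,))
```

The sleep argument is only there so the delay can be swapped out in tests. Whether retrying actually helps depends on the failure being transient; if the dyno simply can’t reach the database host, this just delays the error.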
Have you looked at the database to see what connections are being held? Maybe you’ve got something tying up connections?
I’m pretty sure it’s not running out of database connections. The monitor indicates it hovers around 200, when the max available is 500.
Heroku let me know that they don’t see any networking issues on the database server side, so it feels like these dynos for this application simply have networking issues when reaching out. Though I’m not sure how to verify that.
Do you have any kind of shell-level access to them when they’re running? (System shell, not Django shell - although there are things you could try from the Django shell as well.)
If you did, I’d try making some other type of outbound connection from that system - either using the psql client, or some small python program that opens a connection. If those don’t work, then I’d try something like a wget to a known host - or even ssh out to some other server. (The basic idea is to try to isolate the issue as either being on that server side or if the DB server is rejecting the connections for some reason.)
While extremely unlikely, this can also help to determine if you’re running out of available outbound ports. (That is only likely to be an issue if it’s establishing a number of outbound connections to different services - such as using the requests module for something.)
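The “small python program that opens a connection” could be as little as the sketch below. It only checks that a TCP connection can be opened to a host and port; it doesn’t speak the Postgres protocol or authenticate. The function name can_connect is made up, and the host and port in the usage comment are placeholders for whatever your database URL actually contains.

```python
import socket


def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to (host, port) opens within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError as exc:
        # Covers refused connections, timeouts and DNS failures alike.
        print(f"connect to {host}:{port} failed: {exc!r}")
        return False


# Placeholder values, substitute your own:
#   can_connect("your-db-host.example.com", 5432)
```

Running it from a problematic dyno and from a healthy one at the same time would help show whether the failure is specific to that dyno.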
Does Heroku build on a "container"ish architecture? If so, it might be worth trying to find out if your container has network port limits beyond the natural operating system limits.
Those are solid ideas. I’ll give them a whirl the next time the errors start coming. I can access the machines while they’re running. I’m unsure about Heroku’s internal architecture to be honest.
As a follow-up:
Setting the PGCONNECT_TIMEOUT environment variable seems to have resolved the errors. I have a Heroku support ticket open. They said it has something to do with noisy neighbors on the server causing issues. The other proposed resolution of theirs was to run heroku ps:stop <dyno>.<index>, but that wasn’t working so well for the problematic app. If I get different information, I’ll share it here.
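For anyone who would rather keep this in settings than in the environment: PGCONNECT_TIMEOUT is libpq’s environment variable for its connect_timeout parameter, so the same limit can presumably be passed through Django’s database OPTIONS. A rough sketch, where the 10-second fallback and the database name are placeholders rather than values from this thread:

```python
import os

# Placeholder fallback of 10 seconds; not the value used in this thread.
CONNECT_TIMEOUT = int(os.environ.get("PGCONNECT_TIMEOUT", "10"))

DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.postgresql",
        "NAME": "app",  # placeholder database name
        # OPTIONS are handed to the database driver as connection parameters;
        # connect_timeout is the libpq setting, in seconds.
        "OPTIONS": {"connect_timeout": CONNECT_TIMEOUT},
    }
}
```

If the environment variable is already set, libpq picks it up on its own, so this is only an alternative way to express the same thing.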