What connection does a command line script use?

I was reading this article about setting a Postgres statement timeout, and I'm a little fuzzy on how the strategy is claimed to work. It uses a handler registered for the connection-creation signal with @receiver(connection_created), and the claim is that simply putting it in wsgi.py means the statement timeout would only apply to (for example) views.
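
If I understand the pattern, it looks roughly like this (my reconstruction, not the article's exact code; the 30-second value is just a placeholder):

```python
# wsgi.py (reconstruction of the article's idea, not its exact code)
from django.db.backends.signals import connection_created
from django.dispatch import receiver


@receiver(connection_created)
def setup_statement_timeout(sender, connection, **kwargs):
    # Apply a statement-level timeout to every newly created
    # database connection. Only meaningful on PostgreSQL.
    if connection.vendor == "postgresql":
        with connection.cursor() as cursor:
            cursor.execute("SET statement_timeout TO '30s'")
```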

I get that the signal receiver wouldn't be set up if that file isn't imported (for example, in a command-line script). What I'm not clear on is what connection a command-line script gets if you run it on the same server that hosts the web server: does it create its own connection, or could it potentially get an existing connection that the web server is using (and thus be subject to the timeout)?

I consulted the connection management docs and the management commands docs, and I didn’t find an explicit mention of where a management command gets its database connection from…

I feel like it’s safe to assume that it creates a new connection of its own and thus would not be subject to the timeouts defined in wsgi.py, but it would be nice to formally confirm that management commands don’t touch the pool of connections that the web server keeps.

Absolutely correct. Your scripts run in a separate process: a separate and independent instance of the Python runtime. There's no sharing of anything in the Python environment between them. The only point of intersection is the data in the database itself.

Thanks! I have a follow-up question…

I thought the suggestion made in the article referenced above made a lot of sense. I came across it while discussing another topic (one you helped me with before) on the Postgres general mailing list. They pretty much confirmed my assessment that the blocking behavior is unavoidable with concurrent queries that perform inserts/updates.

Even if I relegated long-running loads to off-peak hours to avoid the blocking they create, concurrent validations can still block one another, based on what I've learned this week. The data validation process must involve inserts^, and if two concurrent validations are each doing creates in a transaction and inserting common data (like a new tissue type or treatment, or maybe two researchers working on the same study data), a block is bound to occur.

^ The only way to avoid inserts during validation would be to re-implement the checks the database makes using data structures representing all the inserted data, since it contains interdependent data across multiple tables.

Thus, they suggested that the solution is to separate those validation jobs, i.e. queue them up and execute them serially, e.g. using Celery.
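
Concretely, I'm picturing something like this (hypothetical sketch; the task, queue, and project names are mine, not from that discussion):

```python
# tasks.py - hypothetical sketch of funneling validations into one queue
from celery import shared_task


@shared_task
def run_validation(submission_id):
    # Perform the insert-based validation inside a transaction here.
    ...
```

Every caller would route onto the same dedicated queue, e.g. `run_validation.apply_async(args=[submission_id], queue="validations")`, and a single worker started with `celery -A myproject worker -Q validations --concurrency=1` would execute the queued validations one at a time.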

I intend to implement the Celery job queuing for every validation outright, but I also separately like the idea of applying the statement timeout to all views across the entire site, so that timeouts can be handled gracefully. The problem is that I would not want that statement timeout applied to the Celery-queued jobs, so my question is this:

Assuming we are not using persistent database connections, and the connection-creation signal automatically applies the statement timeout to every connection: if I remove the statement timeout by resetting it in the view where the request comes in, any Celery job I queue up wouldn't be subject to the timeout, right? It's been a while since I played around with Celery. Do jobs launched from Celery get their own database connections?
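
In other words, something like this at the top of the view (a sketch of what I mean, assuming PostgreSQL and a non-persistent connection; the view name is hypothetical):

```python
# views.py - sketch of the per-view reset I have in mind
from django.db import connection


def load_data(request):
    # Disable the statement timeout for this request's connection only;
    # in PostgreSQL, 0 means "no timeout".
    with connection.cursor() as cursor:
        cursor.execute("SET statement_timeout TO 0")
    ...
```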

The other thing, which is just occurring to me now, is that I should probably be using a connection timeout instead of a statement timeout. The statement timeout is applied to each statement individually, so couldn't I still hit a gateway timeout if the statements cumulatively exceed the server's timeout?

However, can that timeout be modified when the request comes in, in the same way the statement timeout can? And would that work on other database architectures?

Does Celery create connections separate from your regular Django project? Yes. Again, it's a separate process from your regular Django project. Whatever configuration settings are used for Django are irrelevant to Celery. (Now, it's not unusual to use the same settings file for your Celery workers as for your Django project, so the settings would be the same, but this is not a fundamental requirement. Your Celery tasks do not need to be part of your Django project.)
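
(For example, the usual, but optional, wiring reuses the Django settings module like this; `myproject` is a placeholder:)

```python
# celery.py - the common pattern of pointing Celery at Django's settings
import os

from celery import Celery

os.environ.setdefault("DJANGO_SETTINGS_MODULE", "myproject.settings")

app = Celery("myproject")
# Read any CELERY_* entries from the same Django settings module.
app.config_from_object("django.conf:settings", namespace="CELERY")
app.autodiscover_tasks()
```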

Again, the connection point between Django and Celery is the message broker. (In my case, it’s Redis.)

Each worker process creates its own connection. (I do not know whether this same connection is kept between individual tasks, or if the connection is closed and reopened for each task or for a bundle of tasks.)

You’re kinda losing me here. What “connection timeout”?

Looking at https://www.postgresql.org/docs/current/runtime-config-client.html, I see:
statement_timeout, transaction_timeout, lock_timeout, idle_in_transaction_session_timeout, and idle_session_timeout.

(I don’t consider Django’s CONN_MAX_AGE to be a “timeout” in that it doesn’t trigger anything to happen when it has passed. It just causes the next request to create a new connection.)
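
(For reference, it's set per database alias in settings; a minimal sketch, with 60 as an arbitrary example value:)

```python
# settings.py - sketch
DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.postgresql",
        "NAME": "mydb",  # placeholder
        "CONN_MAX_AGE": 60,  # reuse a connection for up to 60 seconds
    }
}
```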

Ah yes, you correctly inferred my misinterpretation. This is a new realm of Django for me; I haven't read about CONN_MAX_AGE yet, and I had indeed assumed it was a timeout.

Looking at the list of options you provided, it looks like transaction_timeout should give me almost exactly what I'm looking for.
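
If I adapt the same receiver pattern, I imagine it would look something like this (untested sketch; as far as I can tell, transaction_timeout only exists in recent PostgreSQL releases, 17 and up, so I'd need to check our server version):

```python
# wsgi.py - untested sketch adapting the same connection_created receiver
from django.db.backends.signals import connection_created
from django.dispatch import receiver


@receiver(connection_created)
def setup_transaction_timeout(sender, connection, **kwargs):
    # Caps the whole transaction rather than each individual statement.
    if connection.vendor == "postgresql":
        with connection.cursor() as cursor:
            cursor.execute("SET transaction_timeout TO '30s'")
```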

I’ll start reading up on these.