Thanks! I have a follow-up question…
I thought the suggestion made in that above referenced article made a lot of sense. I came across it because I was discussing another topic (you aided me with before) on the postgres general email list. They pretty much confirmed my assessment that the blocking behavior was unavoidable with concurrent queries that make inserts/updates.
Even if I relegated long running loads to off-peak hours to avoid the blocking behavior that creates, concurrent validations can still block one another, based on what I’ve learned this week. The data validation process must involve inserts^, and if 2 concurrent validations are doing creates in a transaction and each is inserting common data (like a new tissue type or treatment - or maybe 2 researchers working on the same study data), there is bound to be a block.
^ The only way around avoiding inserts for validation would be if I were to re-invent the checks the database makes using data structures representing all the inserted data, since it contains interdependent data in multiple tables.
Thus, they suggested that the solution is to separate those validation jobs, i.e. queue them up and execute them serially, e.g. using celery
.
I intend to straight-up implement the celery job queuing for every validation, but I also separately like the idea of implementing the statement timeout for all views in the entire site, so that timeouts can be handled gracefully. The problem is that I would not want that statement timeout applied to the celery-queued jobs, so my question is this:
Assuming we are not using persistent database connections, and we have the signal for connection creation automatically apply the statement timeout to every connection, if I remove the statement timeout by setting it in the view where the request comes in, any celery job I queue up wouldn’t be subject to the timeout, right? It’s been awhile since I played around with celery. Do jobs launched from celery get their own database connection created?
The other thing, that is just occurring to me now, is that I should probably be using a connection timeout instead of a statement timeout, because the statement timeout is applied to each statement, thus I could still hit a gateway timeout if cumulatively, the statements exceed the server’s timeout?
However, can that timeout be modified when the request comes in, in the same way the statement timeout can? And would that work on other database architectures?