Hi Ken,
Hit a milestone today, which freed me up to start digging into this row(?) locking issue. I performed two tests and confirmed that not only does my data validation page hang during a back-end database load, but also that one validation job will cause a subsequent job to hang.
The hang from concurrent validations is not a huge problem ATM, because most validations take only a handful of seconds, but the largest dataset we have so far can take upwards of 45s to validate. Those large studies are rare, but the loads can take on the order of an hour, so until this problem is mitigated, we will have to schedule downtime for data loads.
After confirming the hang, I tried the query from the postgres page you linked above and was able to see the blocked process when I reran my test cases. Here’s one example:
tracebase=> SELECT blocked_locks.pid ... FROM pg_catalog.pg_locks ... JOIN pg_catalog.pg_stat_activity blocked_activity ON blocked_activity.pid = blocked_locks.pid ... JOIN pg_catalog.pg_locks ... JOIN pg_catalog.pg_stat_activity blocking_activity ON blocking_activity.pid = blocking_locks.pid ... WHERE NOT blocked_locks.granted;
blocked_pid | blocked_user | blocking_pid | blocking_user | blocked_statement | current_statement_in_blocking_process
-------------+--------------+--------------+---------------+-------------------+---------------------------------------
20617 | tracebase | 20599 | tracebase | INSERT INTO "DataRepo_study" ("name", "description") VALUES ('Cold Exposure', 'Double-catheterized mice were placed in 4C or room temperature (~22C) housing without food at ZT2. Length of infusion varied depending on tracer, but all infusions concluded at ZT8 for a total of 6 hours cold exposure. Study performed by Marc and LCMS by Michael.') RETURNING "DataRepo_study"."id" | SELECT 1 AS "a" FROM "DataRepo_animal" WHERE "DataRepo_animal"."id" = 929 LIMIT 1
(1 row)
So that confirms that the hangs I reported were caused by the locks you suggested might be occurring in the postgres database.
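As a side note, while reading about this I saw that postgres 9.6+ also has a pg_blocking_pids() function that makes a quicker check possible. Here’s a minimal, untested sketch of how I might call it from Django with a raw cursor (blocked_queries is just a name I made up):

```python
# SQL using pg_blocking_pids() (postgres 9.6+): list each blocked backend,
# the pids blocking it, and its query text.
BLOCKED_QUERY_SQL = """
    SELECT pid, pg_blocking_pids(pid) AS blocked_by, query
    FROM pg_stat_activity
    WHERE cardinality(pg_blocking_pids(pid)) > 0
"""


def blocked_queries():
    """Return (pid, blocked_by, query) rows for currently blocked queries."""
    # Imported lazily so this sketch can be read/loaded outside Django.
    from django.db import connection

    with connection.cursor() as cur:
        cur.execute(BLOCKED_QUERY_SQL)
        return cur.fetchall()
```

That wouldn’t replace the full query from the wiki page (it doesn’t show lock modes), but it might be handy for a quick “is anything blocked right now?” check.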
I have a few thoughts/ideas/questions I’d like to bounce off you to see which directions are viable here…
In an ideal world (which I don’t expect to be possible, since there’s likely something I’m unaware of that would make this idea cause problems, but I’m conveying it just to orient you to what’s happening in my validation code), I would be able to tell django/postgres that my validation process (wrapped in an atomic transaction) is running in dry-run mode, so it doesn’t care about concurrent database updates. Since I intend to raise a dry-run exception inside the atomic block in any case, it could just proceed without waiting on any existing locks. For any AutoField fields, it could either leave them null or use some discontiguous placeholder value that the process holding the lock would ignore.
Assuming that wouldn’t work, my next idea would be to queue up these validation jobs (e.g. using Celery), or, for a simpler first pass, to check whether there’s a lock that would block a validation and simply report to the user that the server is busy and they should try again later…
I read up on the docs to see how to tell if there’s a lock and discovered:
from django.db import DatabaseError, transaction

qs = Model.objects.select_for_update(nowait=True)
try:
    # Per the Django docs, catch DatabaseError outside the atomic block
    with transaction.atomic():
        qs.get_or_create(**rec_dict)
except DatabaseError:
    # The DB is busy...
    pass
I haven’t tried this yet, so I’m not sure it’s correct. The documentation uses .filter() instead of .get_or_create(), but all our code uses get_or_create(), which is the call that blocked in my tests above: every run hung on Study.objects.get_or_create(**rec_dict) (according to the SQL logging output), the first query in a study load or validation. Will select_for_update(nowait=True) work when doing a get_or_create(), or do I need to do a filter(**rec_dict) first?
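In case a filter-first probe is the answer, here’s the shape I have in mind (untested sketch; study_lock_is_held is a name I made up, and I’m assuming the Study model imports from DataRepo.models):

```python
def study_lock_is_held(rec_dict):
    """Probe: True if matching Study rows are row-locked by another process.

    Evaluates SELECT ... FOR UPDATE NOWAIT in its own atomic block, so the
    expected failure doesn't poison an outer transaction.
    """
    # Imported lazily so this sketch can be read/loaded outside Django.
    from django.db import DatabaseError, transaction
    from DataRepo.models import Study  # assumed import path

    try:
        with transaction.atomic():
            # Evaluating the queryset issues the locking SELECT immediately.
            list(Study.objects.select_for_update(nowait=True).filter(**rec_dict))
    except DatabaseError:
        return True
    return False
```

One thing I’m unsure about: if the filter matches no rows (the usual case before a create), there are no rows to lock, so I don’t know whether this probe would actually detect a concurrent load.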
Thanks in advance for sharing your knowledge on this topic.