Operability during long running loads

hepcat72 · October 21, 2024, 9:01pm

I have noticed that when I’m running a long running load on the command line, certain features on our site that make database changes time out (take >60s).

Could this be due to the load running under an atomic transaction? Any tips on how I can figure out what is causing the hangs during a long running load?

KenWhitesell · October 21, 2024, 10:11pm

That would be my initial guess.

Assuming PostgreSQL, see Lock Monitoring - PostgreSQL wiki
(There are other pages on that wiki with various queries for showing internal information. See Monitoring - PostgreSQL wiki)

If you’re not using PostgreSQL, I’m sure your database engine-of-choice has some similar features. I know that both Oracle and DB2 do. I would be willing to assume something like that would be available for both MSSQL and MySQL/MariaDB.

hepcat72 · November 19, 2024, 12:02am

Hi Ken,

Hit a milestone today, which freed me up to start digging into this row(?) locking issue. I performed 2 tests and confirmed that not only does my data validation page hang during a database load on the back end, but that 1 validation job will cause a subsequent job to hang.

The hang from concurrent validations is not a huge problem ATM, because most validations will take a handful of seconds, but the largest dataset we have thus far can take upwards of 45s to validate. Those large studies are rare so far, but the loads can take on the order of an hour, so until this problem is mitigated, we will have to schedule downtimes for data loads.

After confirming the hang, I started trying the code from the postgres page you linked me to above and was able to see the blocked process when I retried my test cases. Here’s one example:

tracebase=> SELECT blocked_locks.pid ... FROM  pg_catalog.pg_locks ... JOIN pg_catalog.pg_stat_activity blocked_activity  ON blocked_activity.pid = blocked_locks.pid ... JOIN pg_catalog.pg_locks  ... JOIN pg_catalog.pg_stat_activity blocking_activity ON blocking_activity.pid = blocking_locks.pid ... WHERE NOT blocked_locks.granted;

 blocked_pid | blocked_user | blocking_pid | blocking_user |                                                                                                                                                                                                            blocked_statement                                                                                                                                                                                                            |                       current_statement_in_blocking_process                       
-------------+--------------+--------------+---------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------
       20617 | tracebase    |        20599 | tracebase     | INSERT INTO "DataRepo_study" ("name", "description") VALUES ('Cold Exposure', 'Double-catheterized mice were placed in 4C or room temperature (~22C) housing without food at ZT2. Length of infusion varied depending on tracer, but all infusions concluded at ZT8 for a total of 6 hours cold exposure. Study performed by Marc and LCMS by Michael.') RETURNING "DataRepo_study"."id" | SELECT 1 AS "a" FROM "DataRepo_animal" WHERE "DataRepo_animal"."id" = 929 LIMIT 1
(1 row)

So that confirms that the hangs I reported were from the locks you suggested may be occurring in the postgres database.

I have a few thoughts/ideas/questions I’d like to bounce off you to see what directions to go here are viable…

In an ideal world (which I don’t expect to be possible because I’m sure there is likely something I’m unaware of that would make this idea cause problems…, but am conveying just to orient you to what’s happening in my validation code), I would be able to tell django/postgres that my validation process (wrapped in an atomic transaction) is running in dry-run mode, so I don’t care about concurrent database updates, because I intend to raise a dry run exception inside the atomic block in any case - so just go ahead and proceed - i.e., don’t wait for any locks that exist. If there are any AutoField fields, either leave it null or use some discontiguous value that would be ignored by the process that has the lock.

Assuming that wouldn’t work, my next idea would be to queue up these validation jobs (e.g. using Celery), or perhaps for the purposes of a simple example, simply check to see if there’s a lock that would block a validation process and simply report to the user that the server is busy and that they should try again later…

I read up on the docs to see how to tell if there’s a lock and discovered:

qs = Model.objects.select_for_update(nowait=True)
with transaction.atomic():
    try:
        qs.get_or_create(**rec_dict)
    except DatabaseError as de:
        # The DB is busy...

I haven’t tried this yet, so I’m not sure if this is correct. The documentation uses .filter() instead of .get_or_create(), but all our code uses get_or_create(), which is the call that is blocked in my tests above, which always hung on Study.objects.get_or_create(**rec_dict) (according to the SQL logging output) - the first query in a study load or validation. Will select_for_update(nowait=True) work when doing a get_or_create() or do I need to do a filter(**rec_dict) first?

Thanks in advance for sharing your knowledge on this topic.

hepcat72 · November 19, 2024, 3:53pm

Been exploring options. What does transaction.non_atomic_transaction do exactly in the context of parent and/or child transaction.atomic decorators? I wasn’t quite clear from the docs, as it doesn’t say how it works in relation to the other decorator, but reading between the lines, it seems that it is only intended to apply to ATOMIC_REQUESTS. Does that mean that the decorators are mutually exclusive (i.e. you can’t nest them together)? Intuitively, I would have guessed that a parent transaction.atomic decorator would still be in effect and that child transaction.atomic decorators would be ignored…

And if I were to push forward with the celery route, would it be possible to tell users how many blocking processes started by other users are in front of them? I.e. can I say, “Your job has been queued. There are 2 users in front of you.” My guess is that this isn’t a feature of Django and if it were to be implemented, I’d have to make custom SQL queries about the number of blocked processes (with some assumptions/inferences about wait order). [I might be able to at least tell the user that their job is waiting to run by taking advantage of select_for_update…]

KenWhitesell · November 19, 2024, 4:03pm

You’re reaching the edge of my current knowledge here. I understand your questions, but I’m not comfortable offering thoughts or opinions on them yet.

Regarding your first post (access tables despite locks), in general terms I think it’s probably possible by changing the isolation level of the connection. No clue what the implications of that may be in a Django environment. (I haven’t had to do anything like that since my old DB2 days back in the '90s.)

As for the rest, I’d need to do some reading & thinking about it…

hepcat72 · November 19, 2024, 4:47pm

My reading yesterday touched a little bit on the concept of isolation levels. I thought it seemed promising at first, but other than select_for_update, which offers nowait and skip_locked, I couldn’t find any inlets that would allow me the control to say “just ignore DB locks and proceed”. My intuition lead me to take a look at the arguments that could be supplied to atomic(), but I didn’t see anything there.

It seems to me that it would be more complex than just to tell 1 process to ignore locks, because I’m not sure if that would mean that a concurrent load would cause it to incorrectly increment an AutoField or cause some other problem… but then again, I guess that speaks to “isolation levels” that I need to learn more about…

I appreciate your response, BTW. Thanks so much for giving this some thought.

Incidentally, we demoed our latest version to the various lab members that use it yesterday and highlighted the improved validation process, which signified the end of a load process refactor that properly wraps everything in atomic transactions and fixes a bunch of problems with the previous version. It went well. I would have liked every lab member in that workshop to create an example submission on their laptops, but with this concurrency issue, we just couldn’t do it. There are few enough lab folks using the interface (and their data is usually small enough) that timeouts will be rare, but we’re planning to add a new feature that will automate data analysis and increase the load size by up to 2 orders of magnitude, so before that can happen, this concurrency issue will have to be solved.

hepcat72 · November 19, 2024, 5:24pm

OK. It looks like what I want is the Repeatable Read Isolation Level.

However, I don’t think I want that for database load processed and admin page edits. I would still want those to use Read Committed, I think. (Plus, that is postgres-specific.) But I suppose that maybe it could be tolerated site-wide since record edits and loads can only be performed by admins, so we could make sure they happen serially…

From looking at the django documentation, I only see site-wide settings for isolation levels, if I wanted a Repeatable Read isolation level. I don’t see a context manager to be able to set my validation process only to Repeatable Read.

KenWhitesell · November 19, 2024, 5:43pm

Technically, I believe it’s at the connection level and not “site-wide”. I’m reasonably certain that you could create two connections to the same database, but with different isolation levels with each. Then, in your ORM statements, select the “out-of-the-ordinary” level by using the using clause to select the secondary connection.
(I’ve done stuff like this for secondary connections using different user ids and schemas but not isolation levels, so I know it works on general principles. Whether it’s going to have the specific semantics desired in your case is a different issue.)

Unfortunately, I’m truly tied up at work this afternoon and can’t give this the thought / experimentation it deserves (yet). (I’m finding this interesting because it has been a long time since I’ve had to deal with anything of this nature, and it is an intriguing issue.)

hepcat72 · November 19, 2024, 6:34pm

Huh. That’s an interesting idea. Given my experience with the using clause in the past to maintain a separate database just for these validation processes (which I completely ripped out and replaced with atomic transactions), I’m really reluctant to start conditionally changing connections. The code that does the validation is the same code that does the load, and we still have some property methods that make independent queries, whose using argument is hacky to change…

The more I think about it, the more I’m leaning toward actually changing the isolation level site-wide to use the more strict Repeatable Read isolation level. The main reason being, the operations that change the database are admin level features only. Loading, the admin edit interface, the shell, the database prompt - users don’t have access to those. And since the validation interface never commits anything, the only people that could encounter a SerializationError are admins - if for example, we attempt to use the admin edit page during a load. We can just adopt some business logic to avoid that. We could even disable the admin interface during a load…

I’m really starting to get excited by the fact that this is a really simple fix. It’s just a single setting, and if anyone else were to install our software after we publish and they use SQLite for example, which doesn’t support Repeatable Read, the site will still work - they just may run into timeouts during validations (which we can fix by using celery).

hepcat72 · November 19, 2024, 9:58pm

Hmmm… my debugging attempts may have been flawed from the start. I have been completely unable to reproduce the original problem on our development server. This whole thing might be a red herring.

When I couldn’t reproduce the blocking behavior on the dev server with either concurrent validations or with a concurrent load, I began to think that python manage.py runserver may have been an inadequate platform for reproducing this problem. I think I remember reading a long time ago that it only ever uses a single process? So of course it blocks?

I’m not sure that explains the block when I tried a load in one terminal and a validation via python manage.py runserver in the other, but even in that case, I couldn’t reproduce the blocking behavior on the dev site.

So it may be the case that when we encountered the problem a few months ago (and created the github issue), it was related to something else - either a seemingly unrelated bug that was fixed or due to user error (before the validation interface was finalized, it accepted inputs that have too long of a running time - and those inputs are now prohibited). Or maybe there’s a difference in the environment, like a different postgres version…

Regardless, I apologize for wasting your time.

KenWhitesell · November 20, 2024, 4:43am

Thanks for the update, I appreciate you passing along what you have discovered and knowing that there may be other factors involved.

Also, absolutely no apologies necessary - these are the types of issues I really like. They’re close enough to what I do know to be able to follow along, but different enough that I need to learn about some things that I’m not that familiar with, to try to be able to contribute something. (I love learning new things!)

hepcat72 · November 24, 2024, 11:50pm

Well darn it! I ran into the blocked process again, but this time on the dev site (not in my sandbox), so it looks like it DOES get blocked sometimes. I just don’t know how to reliably reproduce it. The dev site has a really long timeout, but the page I was submitted should have definitely taken under a minute. I’ve run this same file multiple times before. When it was taking many minutes, I decided to check for a blocked query/command. Here is what I posted on the issue I created on our github repo for this:

I finally ran into the block on tracebase-dev! I had a load going on the command line for the cold exposure study (running load_peak_annotation_files in dry run mode) to test a bugfix and while that was going, I wanted to run the validation page on the same file to check for the missing sample errors I was seeing (because I wasn’t loading the entire study this time - because it should already be totally loaded - so I wanted to check if the validation said that it would load the samples). When it was going for like 7 minutes, I decided to log into the database and run the command to look for blocked queries. The below is that attempt…

(tracebase) [tracebase@tb9-dev tracebase]$ psql -h REDACTED -d tracebase -U REDACTED
Password for user tracebase: 
psql (13.16)
Type "help" for help.

tracebase=# SELECT blocked_locks.pid     AS blocked_pid,
         blocked_activity.usename  AS blocked_user,
         blocking_locks.pid     AS blocking_pid,
         blocking_activity.usename AS blocking_user,
         blocked_activity.query    AS blocked_statement,
         blocking_activity.query   AS current_statement_in_blocking_process
   FROM  pg_catalog.pg_locks         blocked_locks
    JOIN pg_catalog.pg_stat_activity blocked_activity  ON blocked_activity.pid = blocked_locks.pid
    JOIN pg_catalog.pg_locks         blocking_locks 
        ON blocking_locks.locktype = blocked_locks.locktype
        AND blocking_locks.database IS NOT DISTINCT FROM blocked_locks.database
        AND blocking_locks.relation IS NOT DISTINCT FROM blocked_locks.relation
        AND blocking_locks.page IS NOT DISTINCT FROM blocked_locks.page
        AND blocking_locks.tuple IS NOT DISTINCT FROM blocked_locks.tuple
        AND blocking_locks.virtualxid IS NOT DISTINCT FROM blocked_locks.virtualxid
        AND blocking_locks.transactionid IS NOT DISTINCT FROM blocked_locks.transactionid
        AND blocking_locks.classid IS NOT DISTINCT FROM blocked_locks.classid
        AND blocking_locks.objid IS NOT DISTINCT FROM blocked_locks.objid
        AND blocking_locks.objsubid IS NOT DISTINCT FROM blocked_locks.objsubid
        AND blocking_locks.pid != blocked_locks.pid

    JOIN pg_catalog.pg_stat_activity blocking_activity ON blocking_activity.pid = blocking_locks.pid
   WHERE NOT blocked_locks.granted;
 blocked_pid | blocked_user | blocking_pid | blocking_user |          blocked_statement          |                                                                                                                                                                                                                           
                                                                                                           current_statement_in_blocking_process                                                                                                                                                                             
                                                                                                                                                         
-------------+--------------+--------------+---------------+-------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
---------------------------------------------------------------------------------------------------------------------------------------------------------
      706234 | tracebase    |       704880 | tracebase     | DELETE FROM "tracebase_cache_table" | SELECT "DataRepo_peakdatalabel"."id", "DataRepo_peakdatalabel"."peak_data_id", "DataRepo_peakdatalabel"."element", "DataRepo_peakdatalabel"."count", "DataRepo_peakdatalabel"."mass_number" FROM "DataRepo_peakdatalabel" 
INNER JOIN "DataRepo_peakdata" ON ("DataRepo_peakdatalabel"."peak_data_id" = "DataRepo_peakdata"."id") INNER JOIN "DataRepo_peakgroup" ON ("DataRepo_peakdata"."peak_group_id" = "DataRepo_peakgroup"."id") WHERE "DataRepo_peakdatalabel"."peak_data_id" = 366938 ORDER BY "DataRepo_peakdatalabel"."element" ASC, "DataRepo
_peakdatalabel"."count" ASC, "DataRepo_peakdatalabel"."mass_number" ASC, "DataRepo_peakgroup"."name" ASC, "DataRepo_peakdata"."corrected_abundance" DESC
(1 row)

tracebase=#

I’ll have to look into this more, but it would be nice if I could figure out how to reproduce this issue reliably…

hepcat72 · November 25, 2024, 12:01am

Oh! Now that I actually look at the blocked query, it’s a delete of a cache table record! The script I’m running must not disable cache updates! That’s a whole new aspect for this issue. Now I have something to investigate!

Topic		Replies	Views
Question about transactions and timeouts Using Django	0	59	September 3, 2024
Connection lock hanging after upgrading psycopg Mystery Errors	3	223	May 5, 2024
Test time doubled after Django 3.2->4.2 and Postgres 10->13 update Using the ORM	30	2244	April 5, 2024
Handling migrations that require downtime Using Django	3	2568	January 31, 2021
SQLite and database is locked error ORM	11	5996	January 18, 2024

Operability during long running loads

Related topics