Test time doubled after Django 3.2->4.2 and Postgres 10->13 update

OK. That makes sense. Thanks for explaining that. If that is indeed the cause of the issue, it sounds like this is a problem that is more likely to show up in the transition between test classes than in a production environment, where we're only adding data. The affected test in this case is the first and only test run of a particular test class. The test class run prior to the affected test loads data that should be destroyed. If I'm understanding you correctly, it's the representation of that destroyed data in the statistics tables that is causing the sub-optimal plan.

Also, this seems to be an issue that is specific to postgres 13. So far we have tried to not be tethered to a specific database architecture, but regardless, let me give VACUUM FULL ANALYZE a shot and I will report back.
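Before I do, a quick sanity check of the stale-statistics theory would be to compare the planner's cached row estimates against the live row counts between test classes. A rough sketch (not the project's actual code, and it assumes the PG* environment variables are set so psycopg2.connect() needs no arguments):

    import psycopg2

    # Sketch: compare what the planner thinks is in each table (reltuples)
    # with what is actually there (n_live_tup), and when the statistics
    # were last refreshed.
    conn = psycopg2.connect()
    with conn.cursor() as cursor:
        cursor.execute(
            """
            SELECT c.relname, c.reltuples, s.n_live_tup,
                   s.last_analyze, s.last_autoanalyze
            FROM pg_class c
            JOIN pg_stat_user_tables s ON s.relid = c.oid
            ORDER BY c.relname
            """
        )
        for relname, reltuples, n_live_tup, last_analyze, last_autoanalyze in cursor:
            print(relname, reltuples, n_live_tup, last_analyze, last_autoanalyze)
    conn.close()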

I guess I got ahead of myself. I got the vacuum full analyze working. The one scenario I’ve been testing with ran fast each time. So I put it in the tearDownClass method in our TestCase class, thinking I could broadly apply it across all the test classes and address all of the slow tests:

        @classmethod
        def tearDownClass(cls):
            print("Clearing out postgres statistics tables")
            conn = psycopg2.connect()
            conn.autocommit = True
            with conn.cursor() as cursor:
                cursor.execute("VACUUM FULL ANALYZE")
            conn.close()
            super().setUpClass()

but I ended up with a slew of errors, ranging from assertion failures to exceptions.

I’m going to have to come back to this tomorrow. I have to sort all this out.

So we had a meeting today, and a point was made that the table joins that make no sense (e.g. a join with ArchiveFile) are present in the SQL, which suggests we may have a model issue, possibly related to recent model changes. Perhaps the (somewhat intermittent) postgres 13 issue is a symptom of that? I know you guys suggested it may be worth taking a look at the models, so if I can ferret anything out about that, I'll let you know what I find.

I looked into one of the test failures that was new when I tried adding a vacuum full analyze between test classes. I think it turned out that the place where I was doing the vacuum full analyze was somehow causing the test failures and exceptions. It made no sense to me, honestly. Debug prints of a few queries were showing records in the database that could not possibly have been loaded yet. When I moved the vacuum statements from tearDownClass to setUpClass or setUpTestData, the debug prints made sense and the failures and exceptions disappeared. The only thing I can figure is that the vacuum command executes asynchronously and that the records I was seeing must have been from previous tests in other test classes?
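For reference, the placement that made those failures and exceptions go away looks roughly like this (a sketch, not the exact code):

        @classmethod
        def setUpClass(cls):
            super().setUpClass()
            print("Clearing out postgres statistics tables")
            conn = psycopg2.connect()  # assumes PG* environment variables are set
            conn.autocommit = True  # VACUUM cannot run inside a transaction block
            with conn.cursor() as cursor:
                cursor.execute("VACUUM FULL ANALYZE")
            conn.close()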

But also, when I moved the vacuum command to a place where it doesn’t cause new problems, it turns out to not speed up the queries (whereas having it in tearDownClass sped up that 1 test I was working on).

I also noticed that the unexpected joins… also occur in Django 3.2 and postgres 10, so I no longer think there may be a model issue (though I don’t understand why those joins are made).

I even tried inserting the vacuum command at key points in the tests that run particularly long (after the last load command and before the first query), and it still runs slow (though, for all I know, it may be the execution of the analyze that takes the time).

I think at this point, I’m ready to give up and just get down to shrinking the data in our tests and put up with the doubling of the running times until that’s done. I will however see about submitting an issue to postgres.

It does not. A vacuum full requires an exclusive lock on the table(s) being cleaned. There shouldn't be any other activity going on with the tables while that command is running. (Now, if the command is issued while there is other work in progress - where that work is interrupted - all bets are off.)

Now, having said that, I could envision a situation where an internal cache somewhere (database? psycopg2 or 3? Django?) ends up out-of-sync - but that seems unlikely at best.

That’s a distinction that may be worth determining. (Or not, if you’re moving ahead with a different solution)

:man_facepalming: I had a copy-paste-o in tearDownClass (which you can see in my earlier comment) that appears to have been causing the test failures/exceptions, and which may also explain why the slowness wasn't addressed…

When I fix it, the one slow test I have been working with sometimes yields the following exception from the vacuum command:

======================================================================
ERROR: test_labeled_elements_common_with_compound (DataRepo.tests.loading.test_load_accucor_msruns.IsoCorrDataLoadingTests)
Test to ensure count 0 entries are not created when measured compound doesn't have that element
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/rleach/PROJECT-local/TRACEBASE/tracebase/DataRepo/tests/loading/test_load_accucor_msruns.py", line 950, in test_labeled_elements_common_with_compound
    self.vacuum_postgres_stats_table()
  File "/Users/rleach/PROJECT-local/TRACEBASE/tracebase/DataRepo/tests/tracebase_test_case.py", line 89, in vacuum_postgres_stats_table
    cursor.execute("VACUUM FULL ANALYZE")
psycopg2.errors.DataCorrupted: missing chunk number 0 for toast value 2884552 in pg_toast_2619


======================================================================
ERROR: tearDownClass (DataRepo.tests.loading.test_load_accucor_msruns.IsoCorrDataLoadingTests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/rleach/PROJECT-local/TRACEBASE/tracebase/DataRepo/tests/tracebase_test_case.py", line 59, in tearDownClass
    cls.vacuum_postgres_stats_table()
  File "/Users/rleach/PROJECT-local/TRACEBASE/tracebase/DataRepo/tests/tracebase_test_case.py", line 89, in vacuum_postgres_stats_table
    cursor.execute("VACUUM FULL ANALYZE")
psycopg2.errors.DataCorrupted: missing chunk number 0 for toast value 2884552 in pg_toast_2619

What do you know about this error? And why is it only sometimes happening?

Note, I created a method (vacuum_postgres_stats_table) for vacuuming so I could easily try inserting extra vacuum commands in a few select places after loading data inside the tests themselves, to see if it could speed up those tests, but it either has no effect on speeding up the test, or the call from tearDownClass yields that exception.
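For reference, the helper is essentially just the connect-and-vacuum code from my earlier comment pulled into a classmethod with some timing around it - roughly this (a sketch, not the exact code; it assumes time is imported):

        @classmethod
        def vacuum_postgres_stats_table(cls):
            # Open an autocommit connection (VACUUM cannot run inside a
            # transaction block), vacuum/analyze, and report how long it took.
            start = time.time()
            print("Clearing out postgres statistics tables")
            conn = psycopg2.connect()
            conn.autocommit = True
            with conn.cursor() as cursor:
                cursor.execute("VACUUM FULL ANALYZE")
            conn.close()
            print(f"vacuum time: {time.time() - start:.3f}")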

For example, I reran the case that caused the exception above and debug-printed the times taken for various steps and I got:

Clearing out postgres statistics tables
vacuum time: 0.771
pg query time: 0.001
pg count time: 0.075
carbon count time: 242.402
nitrogen count time: 226.590
TEST TIME: DataRepo.tests.loading.test_load_accucor_msruns.IsoCorrDataLoadingTests.test_labeled_elements_common_with_compound: 747.107
.Clearing out postgres statistics tables
vacuum time: 0.604

Vacuuming is always running fast and I don’t know why it occasionally appears to speed up the test (though lately it seems to consistently have no effect). (Note, normally that one test runs in about 400s, but today I’m running on my laptop, which tends to run slower than my desktop at work.)

Just an update. The postgres folks seem to suspect that the slowness may be related to the timing of autovacuum. They say the plan shows the planner thinks there's 1 record in the PeakData table when there are actually 7896 (which is correct - the test where the slow query runs loads precisely that many records).

Further, they state that to their knowledge, there should be no difference in performance between v10 and v11+, at least WRT this query/issue, so they seem to have some interest in figuring this out. They also suggest that the DataCorrupted exception suggests a bug - and that there was a separate report recently with a corroborating error.
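For reference, the rows=1 estimate they're referring to can be seen by printing the plan from inside the test. A rough sketch with a stand-in query (the actual slow queries are the element count queries); Django querysets also have an .explain() method that does the same thing for the ORM-generated SQL:

    from django.db import connection

    # Sketch (stand-in query): print the plan to see the planner's row
    # estimate for DataRepo_peakdata, which was reported as 1 even though
    # the test had just loaded 7896 rows.
    with connection.cursor() as cursor:
        cursor.execute('EXPLAIN SELECT count(*) FROM "DataRepo_peakdata"')
        for (line,) in cursor.fetchall():
            print(line)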

In the meantime, I have not been able to get the manual application of the vacuum command to produce a performance improvement, and I'm not sure why that is. I think that the times I reported success in attempts to use vacuum, it just randomly ran fast, because I do occasionally see fast performance without changing anything.

They suggested that I may be able to tweak the autovacuum timing settings, but I have not yet looked into that. I may take a quick stab at that today.


OK. Still don’t know why VACUUM FULL wouldn’t work, but I added the following to setUpClass (based on a blog post I found) and the one test query that almost always runs slow actually ran fast:

            # Make autovacuum/autoanalyze on the PeakData table trigger on a
            # fixed row-change count rather than a fraction of the table size,
            # so the planner statistics get refreshed soon after test data is
            # loaded.
            conn = psycopg2.connect(
                database=settings.DATABASES["default"]["NAME"],
                host=settings.DATABASES["default"]["HOST"],
                port=settings.DATABASES["default"]["PORT"],
                user=settings.DATABASES["default"]["USER"],
                password=settings.DATABASES["default"]["PASSWORD"],
            )
            conn.autocommit = True
            with conn.cursor() as cursor:
                # scale_factor = 0.0 drops the size-proportional part of the
                # trigger condition; threshold = 5000 makes autovacuum/analyze
                # kick in once ~5000 rows have been inserted/updated/deleted.
                cursor.execute('ALTER TABLE "DataRepo_peakdata" SET (autovacuum_vacuum_scale_factor = 0.0);')
                cursor.execute('ALTER TABLE "DataRepo_peakdata" SET (autovacuum_vacuum_threshold = 5000);')
                cursor.execute('ALTER TABLE "DataRepo_peakdata" SET (autovacuum_analyze_scale_factor = 0.0);')
                cursor.execute('ALTER TABLE "DataRepo_peakdata" SET (autovacuum_analyze_threshold = 5000);')
            conn.close()

And when I ran the entire test suite on GitHub, it ran in about a half hour (which is on par with before upgrading postgres) instead of an hour.

I will notify the postgres folks.


Did you find a solution? We migrated from 3.2 to 4.2 and our APIs take 2x as long to respond, mostly due to the connection to postgres.

Yes. However, we decided not to use the workaround that made things fast. Instead, we just reduced the data used in our test suite and got it down to 5m instead of the 30m it ran in before the Postgres update, because we did not want to employ a performance heuristic that applied to only one particular database architecture. I cannot recall the details of the solution, but the problem had to do with running queries very soon after inserting data, so it only really applied to tests. If you are having issues in production, it's possible it's not due to this same issue.

I still have the branch with the workaround. If you want, I can look at it and give you a precise answer.

We have the problem everywhere, even in a simple view that basically does an HttpResponse() to return a 200. It seems it may be in some middleware, then.