Flaky cache tests

Hello everyone!

I’ve been keeping an eye on some flaky tests in CI, with the goal of determining whether we can fix them. In my opinion, flaky tests are a distraction and consume contributors’ and fellows’ time: a failed test in a PR can lead to duplicated efforts in debugging and fixing it.

Lately I’ve noticed a pattern of flakiness in the cache tests. There were a few occurrences involving the file-based cache, which I mitigated by proposing PR #17614. More recently I’ve seen two failures related to the memcached backend (see [0] and [1]; I’m including screenshots because the Jenkins run logs get pruned).

For both of them, I think the issue is a slow memcached instance and/or slow network access when writing to and reading from the cache. These are of course transient failures, but a distraction nevertheless. Specifically, these two pieces of test code:

    cache.set("expire1", "very quickly", timeout=1)
    self.assertIs(cache.touch("expire1", timeout=4), True)

and

    cache.set("key5", "belgian fries", timeout=1)
    self.assertIs(cache.touch("key5", timeout=None), True)

assume that setting a key in the cache and then touching it takes less than a second. My guess is that, occasionally, the time between setting the key and executing the touch is a bit more than 1 second, so the key has already expired and the touch operation fails (AFAIU touch needs the key to still be present in the backend; if it has expired, touch returns False).
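
To make the suspected race concrete, here’s a small illustration (not a proposed change, just a sketch of what I think happens on a slow worker; the sleep stands in for the slow memcached/network round trip):

    import time

    cache.set("expire1", "very quickly", timeout=1)
    # If the gap between set() and touch() ends up longer than the 1-second
    # timeout (forced here with a sleep), the key is already gone, so touch()
    # returns False and the assertion in the real test fails.
    time.sleep(1.1)
    self.assertIs(cache.touch("expire1", timeout=4), False)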

How to fix this? Well, we could increase the first set timeout to 2 seconds to account for higher latency/slower access (and adjust subsequent calls accordingly), but I dislike that approach because the cache tests would become even slower than they already are (there are plenty of sleep calls, which already make that part of the suite noticeably slow).
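
Concretely, the adjustment I have in mind (and dislike) would look roughly like this, with the exact values just for illustration:

    # Hypothetical adjustment: give the key more headroom before it expires.
    cache.set("expire1", "very quickly", timeout=2)  # was timeout=1
    self.assertIs(cache.touch("expire1", timeout=4), True)
    # ...and any later sleep() that waits for the key to expire would have to
    # grow to match, which is exactly what makes the suite slower.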

Do you have any other ideas? Thank you!
Natalia.

[0] (screenshot of the first memcached test failure on Jenkins)

[1] (screenshot of the second memcached test failure on Jenkins)

Hey @nessita — good question :thinking:

I’m not a big fan of mocking, but what are we testing here? That memcached works? If we can assume that, then perhaps just checking that cache.set makes the correct call into the underlying library is sufficient?

(TBH I’m not even sure I believe that myself, but… :stuck_out_tongue_winking_eye:)
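
Something along these lines, maybe (take it as a sketch only; the _cache attribute and the fact that touch() delegates to the underlying client are assumptions about the backend internals):

    from unittest import mock

    def test_touch_calls_into_client(self):
        # Sketch: replace the underlying memcached client and check only that
        # touch() forwards to it, instead of relying on real expiry timing.
        with mock.patch.object(cache, "_cache") as client:
            client.touch.return_value = True
            self.assertIs(cache.touch("expire1", timeout=4), True)
            client.touch.assert_called_once()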

@carltongibson - you are spot on, I don’t think we want to test that memcached works. Still, considering how little mocking is used in the Django source code (I love this fact), a change of this sort would feel either inconsistent, if we apply it to a single test, or too invasive, if we apply it across all/most cache tests.

I’ve seen one other cache-related error in a recent main-random run [0], but I don’t think that one is necessarily related to timing issues. I’ll keep an eye on this… thank you for your response!

[0] The traceback:

    ======================================================================
    FAIL [0.004s]: test_zero_cull (cache.tests.LocMemCacheTests.test_zero_cull)
    ----------------------------------------------------------------------
    Traceback (most recent call last):
      File "/home/jenkins/workspace/main-random/database/postgis/label/focal/python/python3.12/tests/cache/tests.py", line 684, in test_zero_cull
        self._perform_cull_test("zero_cull", 50, 19)
      File "/home/jenkins/workspace/main-random/database/postgis/label/focal/python/python3.12/tests/cache/tests.py", line 678, in _perform_cull_test
        self.assertEqual(count, final_count)
    AssertionError: 20 != 19
    ----------------------------------------------------------------------

Yes. Exactly. It wouldn’t be my first choice, certainly.

Ticket #32831 describes the issue and includes past attempts to fix it.
