Flaky cache tests

Hello everyone!

I’ve been keeping an eye on some flaky tests in CI, with the goal of determining whether we can fix them. In my opinion, flaky tests are a distraction and consume contributors’ and fellows’ time: a failed test in a PR can lead to duplicated efforts in debugging and fixing it.

Lately I’ve noticed a pattern of flakiness in the cache tests. There were a few occurrences involving the file-based cache, which I mitigated by proposing PR #17614. More recently I’ve seen two failures related to the memcached backend (see [0] and [1]; I’m including screenshots because the Jenkins run logs get pruned).

For both of them, I think the issue is a slow memcached instance and/or slow network access when writing to and reading from the cache. These are of course transient failures, but a distraction nevertheless. Specifically, these two pieces of test code:

    cache.set("expire1", "very quickly", timeout=1)
    self.assertIs(cache.touch("expire1", timeout=4), True)

and

    cache.set("key5", "belgian fries", timeout=1)
    self.assertIs(cache.touch("key5", timeout=None), True)

assume that setting a key in the cache and then touching it takes less than a second. My guess is that, occasionally, the time between setting the key and executing the touch is a bit more than 1 second, so the key has already expired and the touch operation fails (AFAIU touch needs the key to still be present in the backend; if it has expired, touch returns False).
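
To make the suspected race concrete, here’s a small illustration (not a proposed change, just a sketch of what I think happens on a slow worker; the sleep stands in for the slow memcached/network round trip):

    import time

    cache.set("expire1", "very quickly", timeout=1)
    # If the gap between set() and touch() ends up longer than the 1-second
    # timeout (forced here with a sleep), the key is already gone, so touch()
    # returns False and the assertion in the real test fails.
    time.sleep(1.1)
    self.assertIs(cache.touch("expire1", timeout=4), False)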

How to fix this? Well, we could increase the first set timeout to 2 seconds to account for higher latency/slower access (and adjust subsequent calls accordingly), but I dislike that approach because the cache tests would become even slower than they already are (there are plenty of sleep calls, which already make that part of the suite noticeably slow).
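
Concretely, the adjustment I have in mind (and dislike) would look roughly like this, with the exact values just for illustration:

    # Hypothetical adjustment: give the key more headroom before it expires.
    cache.set("expire1", "very quickly", timeout=2)  # was timeout=1
    self.assertIs(cache.touch("expire1", timeout=4), True)
    # ...and any later sleep() that waits for the key to expire would have to
    # grow to match, which is exactly what makes the suite slower.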

Do you have any other ideas? Thank you!
Natalia.

[0] (screenshot of the first memcached test failure on Jenkins)

[1] (screenshot of the second memcached test failure on Jenkins)

Hey @nessita — good question :thinking:

I’m not a big fan of mocking, but what are we testing here? That memcached works? If we can assume that, then perhaps just checking that cache.set makes the correct call into the underlying library is sufficient?

(TBH I’m not even sure I believe that myself, but… :stuck_out_tongue_winking_eye:)
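
Something along these lines, maybe (take it as a sketch only; the _cache attribute and the fact that touch() delegates to the underlying client are assumptions about the backend internals):

    from unittest import mock

    def test_touch_calls_into_client(self):
        # Sketch: replace the underlying memcached client and check only that
        # touch() forwards to it, instead of relying on real expiry timing.
        with mock.patch.object(cache, "_cache") as client:
            client.touch.return_value = True
            self.assertIs(cache.touch("expire1", timeout=4), True)
            client.touch.assert_called_once()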

@carltongibson - you are spot on, I don’t think we want to test that memcached works. Still, considering how little mocking is used in the Django source code (I love this fact), a change of this sort would feel either inconsistent, if we apply it to a single test, or too invasive, if we apply it across all/most cache tests.

I’ve seen one other cache-related error in a recent main-random run [0], but I don’t think that one is necessarily related to timing issues. I’ll keep an eye on this… thank you for your response!

[0] The traceback:

    ======================================================================
    FAIL [0.004s]: test_zero_cull (cache.tests.LocMemCacheTests.test_zero_cull)
    ----------------------------------------------------------------------
    Traceback (most recent call last):
      File "/home/jenkins/workspace/main-random/database/postgis/label/focal/python/python3.12/tests/cache/tests.py", line 684, in test_zero_cull
        self._perform_cull_test("zero_cull", 50, 19)
      File "/home/jenkins/workspace/main-random/database/postgis/label/focal/python/python3.12/tests/cache/tests.py", line 678, in _perform_cull_test
        self.assertEqual(count, final_count)
    AssertionError: 20 != 19
    ----------------------------------------------------------------------

Yes. Exactly. It wouldn’t be my first choice, certainly.

Ticket #32831 describes the issue and includes past attempts to fix it.
