Hello everyone!
I’ve been paying attention to some flaky tests in CI, with the goal of determining if we can fix them. In my opinion, flaky tests are a distraction and may consume contributors’ and fellows’ time, as failed tests in PRs can lead to potentially duplicate efforts in debugging and fixing them.
Lately I’ve seen a pattern of flakiness in the cache tests. There were a few occurrences involving the filebased cache, which I mitigated by proposing PR #17614. Recently I’ve seen two failures related to the memcached backend (see [0] and [1]; I’m including screenshots because the Jenkins run logs get pruned).
For both of them, I think the issue is related to a slow memcached and/or slow network access when writing to and reading from the cache. These are of course transient failures, but a distraction nevertheless. Specifically, these two pieces of test code:
cache.set("expire1", "very quickly", timeout=1)
self.assertIs(cache.touch("expire1", timeout=4), True)
and
cache.set("key5", "belgian fries", timeout=1)
self.assertIs(cache.touch("key5", timeout=None), True)
assume that setting a key in the cache and then doing a touch would take less than a second. My guess is that the failure happens when setting the key and starting the touch together take a bit more than 1 second, so the touch operation fails (AFAIU touch needs the key to be valid in the backend to touch it, so if it’s expired, it fails).
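To make the suspected race concrete, here is a minimal sketch using a toy in-memory cache and a fake clock. ToyCache, FakeClock and touch_after_delay are hypothetical names made up for this illustration; this is not Django’s or memcached’s actual code, it only assumes (as guessed above) that touch fails on a key that has already expired.

```python
import math


class FakeClock:
    """Manually advanced clock, so no real sleeping is needed."""

    def __init__(self):
        self.now = 0.0

    def __call__(self):
        return self.now

    def advance(self, seconds):
        self.now += seconds


class ToyCache:
    """Toy cache with per-key expiry (illustration only, not a real backend)."""

    def __init__(self, clock):
        self._clock = clock  # callable returning the current time in seconds
        self._data = {}      # key -> (value, expires_at)

    def set(self, key, value, timeout):
        expires_at = math.inf if timeout is None else self._clock() + timeout
        self._data[key] = (value, expires_at)

    def touch(self, key, timeout):
        entry = self._data.get(key)
        if entry is None or entry[1] <= self._clock():
            # Assumption: a missing or already expired key cannot be touched.
            self._data.pop(key, None)
            return False
        expires_at = math.inf if timeout is None else self._clock() + timeout
        self._data[key] = (entry[0], expires_at)
        return True


def touch_after_delay(delay):
    """Set a key with timeout=1, let `delay` simulated seconds pass, then touch."""
    clock = FakeClock()
    cache = ToyCache(clock)
    cache.set("expire1", "very quickly", timeout=1)
    clock.advance(delay)  # stands in for slow memcached/network latency
    return cache.touch("expire1", timeout=4)


print(touch_after_delay(0.05))  # True: the touch arrives before expiry
print(touch_after_delay(1.2))   # False: the key expired before the touch
```

In this model the assertion only holds while the gap between set and touch stays under the 1 second timeout, which matches the transient failures described above.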
How to fix this? Well, we could increase the first set timeout to 2 seconds to account for high latency/slower access (and adjust subsequent calls accordingly), but I dislike that approach because the cache tests would become even slower than they already are (there are plenty of sleep calls, which make that part of the suite noticeably slow).
Do you have any other ideas? Thank you!
Natalia.
[0]
[1]