Work around cache backend flapping mitigation in HealthCheckView.
Review Request #13318 — Created Oct. 9, 2023 and submitted
Some cache backends (pymemcache, notably, but also possibly the newer
Redis backend) employ server outage/flapping mitigation. When there are
issues communicating with the server, the cache backend begins to filter
out a certain number of failures, in case there's only a temporary
glitch in talking to the server. After that point, errors will propagate,
but after a certain amount of time, the backend will be reintroduced,
allowing attempts to be made again.
This is basically the Circuit Breaker pattern, and it's a good one, but
the problem is, this is also what services utilizing health check
endpoints tend to do.
While mitigation is happening in the cache backend, any writes are
dropped and any reads return defaults. This leads the health check to
think that its writes went through and its reads returned a suitable
value. Depending on server load, health checks may end up noticing an
outage, but the cache backend will usually reset state and fake results
before the health checker hits its own max failure count.
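The masking effect can be illustrated with a toy model. This is only a sketch: `FlappingCache` and `naive_check` are hypothetical names, the class is not pymemcache's real logic, and it only imitates the "drop writes, return defaults" behavior described above.

```python
class FlappingCache:
    """Toy stand-in for a cache client with outage mitigation.

    Hypothetical model for illustration only; real backends are more
    involved (retry counts, timers, reintroducing the server).
    """

    def __init__(self, server_up=True, ignore_failures=True):
        self.server_up = server_up
        self.ignore_failures = ignore_failures
        self._data = {}

    def set(self, key, value):
        if not self.server_up:
            if self.ignore_failures:
                # Mitigation active: the write is silently dropped.
                return
            raise ConnectionError('cache server unreachable')
        self._data[key] = value

    def get(self, key, default=None):
        if not self.server_up:
            if self.ignore_failures:
                # Mitigation active: the read returns the default.
                return default
            raise ConnectionError('cache server unreachable')
        return self._data.get(key, default)


def naive_check(cache):
    """Report healthy as long as no exception is raised."""
    try:
        cache.set('health', 1)
        cache.get('health')
    except Exception:
        return False
    return True
```

With the server down and mitigation active, `naive_check()` still reports healthy, because nothing raises. It only reports a failure once errors are allowed to propagate.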
To work around this, we need to perform a test using two separate
operations:
- Set a value in a key.
- Read the value out and make sure we don't get an uncacheable default back.
If we get our default, we know the cache backend faked a read, and we
can immediately notify the health checker that the cache backend is
down.
Even if this is intermittent, and the cache backend would safely
recover, it shouldn't pose any problems for the health checkers, which
will just notice the good result during the next check.
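A minimal sketch of the set-then-read check, assuming a cache object with Django-style `set(key, value)` and `get(key, default)` methods. The function name, the key `djblets-healthcheck`, and the two stub classes are hypothetical and are not taken from the actual change.

```python
# A unique object that can never come back from a real cache read, so
# getting it back means the backend returned our default.
_MISSING = object()


def cache_is_healthy(cache, key='djblets-healthcheck'):
    """Return True only if a written value can actually be read back."""
    try:
        cache.set(key, 'ok')
        return cache.get(key, _MISSING) is not _MISSING
    except Exception:
        # Errors that do propagate also count as unhealthy.
        return False


class WorkingCache:
    """Stub: a cache that really stores values."""

    def __init__(self):
        self._data = {}

    def set(self, key, value):
        self._data[key] = value

    def get(self, key, default=None):
        return self._data.get(key, default)


class MaskedCache:
    """Stub: a backend in mitigation mode, faking success."""

    def set(self, key, value):
        pass  # Write dropped without error.

    def get(self, key, default=None):
        return default  # Read faked with the caller's default.
```

Here `cache_is_healthy(WorkingCache())` returns `True` and `cache_is_healthy(MaskedCache())` returns `False`, even though the masked backend never raises.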
This makes a couple of other changes to help with results:
- If using Djblets's forwarding cache backend, we'd end up with two
  keys representing the same true cache backend. We now skip the
  internally-managed forwarded one.
- We were checking the Local Memory cache backends (used in production
  for things like static media serials), but there's no point to that.
  We now skip these.
- The checked key is now prefixed with "djblets-", to avoid overwriting
  some other managed key.
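The backend filtering might look roughly like the sketch below, which works on a Django-style `CACHES` dictionary. The function name and the `forwarded_backend` alias are assumptions made for illustration. The local-memory path is Django's standard `LocMemCache` backend.

```python
LOCMEM_BACKEND = 'django.core.cache.backends.locmem.LocMemCache'


def caches_to_check(caches_setting, skip_names=('forwarded_backend',)):
    """Return the cache names worth health-checking.

    Skips local-memory caches and any internally-managed forwarded
    entry. The default alias in skip_names is an assumption.
    """
    return [
        name
        for name, conf in caches_setting.items()
        if name not in skip_names
        and conf.get('BACKEND') != LOCMEM_BACKEND
    ]


# Hypothetical settings, used only to exercise the function.
example_caches = {
    'default': {
        'BACKEND': 'djblets.cache.forwarding_backend.ForwardingCacheBackend',
        'LOCATION': 'forwarded_backend',
    },
    'forwarded_backend': {
        'BACKEND': 'django.core.cache.backends.memcached.PyMemcacheCache',
        'LOCATION': 'localhost:11211',
    },
    'staticfiles': {
        'BACKEND': LOCMEM_BACKEND,
    },
}
```

With these example settings, only `default` is left to check, so the real backend is tested once, through the forwarding entry.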
Spent hours debugging why Review Board claimed to be healthy while a cache
server was downed in a Docker setup.
Dug into the cache backend and figured out its mitigation logic, and then
thoroughly tested our workaround. Verified that the cache server's true
state was always represented in the health check, and that after the
required number of attempts, Docker marked the server as unhealthy and
then restored it to healthy once I brought the cache server back.