Work around cache backend flapping mitigation in HealthCheckView.

Review Request #13318 — Created Oct. 9, 2023 and submitted

Information

Djblets
release-4.x

Reviewers

Some cache backends (pymemcache, notably, but possibly also the newer
Redis backend) employ server outage/flapping mitigation. When there are
issues communicating with the server, the cache backend silently swallows
a certain number of failures, in case there's only a temporary glitch in
talking to the server. Once that threshold is exceeded, errors propagate,
but after a cooldown period the backend is reintroduced, allowing
attempts to be made again.

This is basically the Circuit Breaker pattern, and it's a good one. The
problem is that services consuming health check endpoints tend to do the
same thing.

While mitigation is happening in the cache backend, any writes are
dropped and any reads return defaults. This leads HealthCheckView to
think that its writes went through and its reads returned a suitable
value. Depending on server load, health checks may end up noticing an
outage, but the cache backend will usually reset state and fake results
before the health checker hits its own max failure count.
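
To make that failure mode concrete, here's a rough, illustrative sketch
of what this kind of mitigation can look like. This is not pymemcache's
actual implementation; the class, thresholds, and error handling are made
up for illustration:

```python
import time


class FlappingMitigationCache:
    """Illustrative wrapper that hides the first few backend failures."""

    def __init__(self, backend, max_hidden_errors=3, retry_after=30):
        self.backend = backend
        self.max_hidden_errors = max_hidden_errors
        self.retry_after = retry_after
        self._first_error_at = None
        self._errors = 0

    def set(self, key, value):
        try:
            self.backend.set(key, value)
        except ConnectionError:
            if self._should_hide_error():
                return  # Write silently dropped; caller sees "success".

            raise

    def get(self, key, default=None):
        try:
            return self.backend.get(key, default)
        except ConnectionError:
            if self._should_hide_error():
                return default  # Faked read; caller gets its own default.

            raise

    def _should_hide_error(self):
        now = time.monotonic()

        if (self._first_error_at is None or
            now - self._first_error_at >= self.retry_after):
            # Start a new mitigation window (the backend is "reintroduced").
            self._first_error_at = now
            self._errors = 0

        self._errors += 1

        return self._errors <= self.max_hidden_errors
```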

To work around this, we need to perform a test using two separate
operations:

  1. Set a value in a key.
  2. Read the value out and make sure we don't get an uncacheable
    caller-provided default.

If we get our default, we know the cache backend faked a read, and we
can immediately notify the health checker that the cache backend is
failing.
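
A minimal sketch of that two-step check, assuming Django's cache API.
This isn't the actual HealthCheckView code; the function name, key name,
and sentinel are illustrative:

```python
import uuid

from django.core.cache import caches


class _Unset:
    """Sentinel default that no cache backend can legitimately return."""


def check_cache(alias):
    """Return True if the backend behind `alias` really stored our value."""
    backend = caches[alias]

    # The exact key name is illustrative; the important part is the
    # "djblets-" prefix, so we don't clobber someone else's key.
    key = 'djblets-health-check'
    value = str(uuid.uuid4())

    try:
        # Step 1: Write a value.
        backend.set(key, value)

        # Step 2: Read it back with a default the backend can't fabricate.
        # If mitigation is active, set() was silently dropped and get()
        # hands back our sentinel instead of the stored value.
        stored = backend.get(key, _Unset)
    except Exception:
        # A real error is also a failed health check.
        return False

    return stored is not _Unset
```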

Even if this is intermittent, and the cache backend would safely
recover, it shouldn't pose any problems for the health checkers, which
will just notice the good result during the next check.

This makes a few other changes to help with results (see the sketch
after this list):

  1. If using Djblets's forwarding cache backend, we'd end up with two
    keys representing the same true cache backend. We now skip the
    internally-managed forwarded one.

  2. We were checking the Local Memory cache backends (used in production
    for things like static media serials), but there's no point to that.
    We now skip these.

  3. The checked key is now prefixed with "djblets-", to avoid overwriting
    some other managed key.
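
Put together, the set of backends worth checking ends up looking roughly
like this. This is a sketch only; the forwarded alias name and the helper
function are assumptions, not the actual Djblets code:

```python
from django.conf import settings
from django.core.cache import caches
from django.core.cache.backends.locmem import LocMemCache

# Assumed name for the alias that Djblets's forwarding cache backend
# manages internally; it points at the same real server as the alias
# that forwards to it.
FORWARDED_CACHE_ALIAS = 'forwarded_backend'


def cache_aliases_to_check():
    """Yield the cache aliases that are worth health-checking."""
    for alias in settings.CACHES:
        if alias == FORWARDED_CACHE_ALIAS:
            # Change 1: Skip the internally-managed forwarded backend, so
            # the true backend is only checked once.
            continue

        if isinstance(caches[alias], LocMemCache):
            # Change 2: Local Memory caches can't meaningfully go down, so
            # checking them tells us nothing.
            continue

        yield alias
```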

Spent hours debugging why Review Board claimed to be healthy while a cache
server was down in a Docker setup.

Dug into the cache backend and figured out its mitigation logic, and then
thoroughly tested the workaround. Verified that the cache server's true
state was always represented in the health check, and that after the
required number of failed attempts, Docker marked the service as
unhealthy and then healthy again once I brought the cache server back up.

Summary:
Work around cache backend flapping mitigation in HealthCheckView.

ID:
ebb2a829c747cc73177321bb11a4f30428b8f112
david
  Ship It!
maubin
  Ship It!
chipx86
Review request changed
Status:
Completed
Change Summary:
Pushed to release-4.x (de8a237)