    Work around cache backend flapping mitigation in HealthCheckView.

    Review Request #13318 — Created Oct. 9, 2023 and submitted

    Information

    Djblets
    release-4.x

    Reviewers

    Some cache backends (pymemcache, notably, but also possibly the newer
    Redis backend) employ server outage/flapping mitigation. When there are
    issues communicating with the server, the cache backend starts
    swallowing failures, in case there's only a temporary glitch in talking
    to the server. After a certain number of failures, errors will
    propagate, and after a certain amount of time, the backend will be
    reintroduced, allowing attempts to be made again.

    This is basically the Circuit Breaker pattern, and it's a good one, but
    the problem is that this is also what services utilizing health check
    endpoints tend to do.
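    The mitigation described above can be sketched as a minimal circuit
    breaker. This is only an illustration of the pattern, not pymemcache's
    actual implementation, and the thresholds and exception types are
    placeholders:

```python
import time


class CircuitBreaker:
    """A minimal circuit breaker sketch.

    After ``max_failures`` consecutive failures, the circuit "opens" and
    operations are skipped entirely, returning a fake default. After
    ``reset_timeout`` seconds, attempts are allowed again.

    This is an illustration of the pattern only; real backends like
    pymemcache differ in their details.
    """

    def __init__(self, max_failures=3, reset_timeout=10):
        self.max_failures = max_failures
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None

    def call(self, operation, default=None):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                # Circuit is open: don't even attempt the operation.
                # Fake a successful result instead.
                return default

            # Cooldown elapsed: reintroduce the backend ("half-open").
            self.opened_at = None
            self.failures = 0

        try:
            result = operation()
        except OSError:
            self.failures += 1

            if self.failures >= self.max_failures:
                # Too many consecutive failures: open the circuit.
                self.opened_at = time.monotonic()

            return default

        # Success resets the failure count.
        self.failures = 0

        return result
```

    Note that while the circuit is open, callers see the default rather
    than an error, which is exactly what fools a naive health check.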

    While mitigation is happening in the cache backend, any writes are
    dropped and any reads return defaults. This leads HealthCheckView to
    think that its writes went through and its reads returned a suitable
    value. Depending on server load, health checks may end up noticing an
    outage, but the cache backend will usually reset state and fake results
    before the health checker hits its own max failure count.

    To work around this, we need to perform a test using two separate
    operations:

    1. Set a value in a key.
    2. Read the value out and make sure we don't get an uncacheable
      caller-provided default.

    If we get our default, we know the cache backend faked a read, and we
    can immediately notify the health checker that the cache backend is
    failing.

    Even if this is intermittent, and the cache backend would safely
    recover, it shouldn't pose any problems for the health checkers, which
    will just notice the good result during the next check.
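    The two-step probe can be sketched as below. The toy cache class stands
    in for a backend with mitigation active, and the key name and helper are
    illustrative, not Djblets's exact HealthCheckView code:

```python
class MitigatingCache:
    """Toy cache simulating backend mitigation.

    While ``mitigating`` is set, writes are silently dropped and reads
    fall back to the caller-provided default.
    """

    def __init__(self):
        self._data = {}
        self.mitigating = False

    def set(self, key, value):
        if not self.mitigating:
            self._data[key] = value

    def get(self, key, default=None):
        if self.mitigating:
            return default

        return self._data.get(key, default)


def cache_is_healthy(cache, key='djblets-health-check'):
    """Probe a cache backend with a write followed by a sentinel read.

    The sentinel is a brand-new object, so the backend can never store
    or return it. If we see it back, the read fell through to our
    default, meaning the write was dropped.
    """
    sentinel = object()

    cache.set(key, 1)

    return cache.get(key, sentinel) is not sentinel
```

    Because the sentinel is uncacheable by construction, a faked read is
    detected on the very first probe, rather than after the health
    checker's own failure threshold.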

    This makes a few other changes to help with results:

    1. If using Djblets's forwarding cache backend, we'd end up with two
      keys representing the same true cache backend. We now skip the
      internally-managed forwarded one.

    2. We were checking the Local Memory cache backends (used in production
      for things like static media serials), but there's no point to that.
      We now skip these.

    3. The checked key is now prefixed with "djblets-", to avoid overwriting
      some other managed key.
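    Filtering the checked backends per items 1 and 2 might look like the
    sketch below. The CACHES structure mirrors Django's setting, and the
    "forwarded_backend" alias is an assumption based on Djblets's forwarding
    cache backend defaults:

```python
# Hypothetical example configuration mirroring Django's CACHES setting.
CACHES = {
    'default': {
        'BACKEND': 'djblets.cache.forwarding_backend.ForwardingCacheBackend',
    },
    'forwarded_backend': {
        'BACKEND': 'django.core.cache.backends.memcached.PyMemcacheCache',
    },
    'static_media': {
        'BACKEND': 'django.core.cache.backends.locmem.LocMemCache',
    },
}


def checkable_cache_names(caches=CACHES):
    """Yield the cache aliases worth probing in a health check."""
    for name, info in caches.items():
        backend = info['BACKEND']

        # Skip the internally-managed forwarded alias; probing it would
        # duplicate the check against the same true cache backend.
        if name == 'forwarded_backend':
            continue

        # Local Memory caches are per-process and can't meaningfully
        # fail, so there's no point checking them.
        if 'locmem' in backend:
            continue

        yield name
```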

    Spent hours debugging why Review Board claimed to be healthy while a
    cache server was down in a Docker setup.

    Dug into the cache backend and figured out its mitigation logic, and then
    thoroughly tested our workaround. Verified that the cache server's true
    state was always represented in the health check, and that after the
    required number of failed attempts, Docker marked the server as
    unhealthy, then marked it healthy again once I brought the cache server
    back.

    Commit ID:
    ebb2a829c747cc73177321bb11a4f30428b8f112
    david
    Ship It!
    maubin
    Ship It!
    chipx86
    Review request changed
    Status:
    Completed
    Change Summary:
    Pushed to release-4.x (de8a237)