    Work around cache backend flapping mitigation in HealthCheckView.

    Review Request #13318 — Created Oct. 9, 2023 and submitted

    Information

    Djblets
    release-4.x

    Reviewers

    Some cache backends (pymemcache, notably, but also possibly the newer
    Redis backend) employ server outage/flapping mitigation. When there are
    issues communicating with the server, the cache backend starts
    swallowing failures, in case there's only a temporary glitch in talking
    to the server. After a certain number of failures, errors will
    propagate, and after a certain amount of time, the backend will be
    reintroduced, allowing attempts to be made again.

    This is basically the Circuit Breaker pattern, and it's a good one, but
    the problem is that this is also what services utilizing health check
    endpoints tend to do.
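    The mitigation described above can be sketched as a minimal circuit
    breaker. This is only an illustration of the pattern, not pymemcache's
    actual implementation, and the thresholds and exception types are
    placeholders:

```python
import time


class CircuitBreaker:
    """A minimal circuit breaker sketch.

    After ``max_failures`` consecutive failures, the circuit "opens" and
    operations are skipped entirely, returning a fake default. After
    ``reset_timeout`` seconds, attempts are allowed again.

    This is an illustration of the pattern only; real backends like
    pymemcache differ in their details.
    """

    def __init__(self, max_failures=3, reset_timeout=10):
        self.max_failures = max_failures
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None

    def call(self, operation, default=None):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                # Circuit is open: don't even attempt the operation.
                # Fake a successful result instead.
                return default

            # Cooldown elapsed: reintroduce the backend ("half-open").
            self.opened_at = None
            self.failures = 0

        try:
            result = operation()
        except OSError:
            self.failures += 1

            if self.failures >= self.max_failures:
                # Too many consecutive failures: open the circuit.
                self.opened_at = time.monotonic()

            return default

        # Success resets the failure count.
        self.failures = 0

        return result
```

    Note that while the circuit is open, callers see the default rather
    than an error, which is exactly what fools a naive health check.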

    While mitigation is happening in the cache backend, any writes are
    dropped and any reads return defaults. This leads HealthCheckView to
    think that its writes went through and its reads returned a suitable
    value. Depending on server load, health checks may end up noticing an
    outage, but the cache backend will usually reset state and fake results
    before the health checker hits its own max failure count.

    To work around this, we need to perform a test using two separate
    operations:

    1. Set a value in a key.
    2. Read the value out and make sure we don't get an uncacheable
      caller-provided default.

    If we get our default, we know the cache backend faked a read, and we
    can immediately notify the health checker that the cache backend is
    failing.

    Even if this is intermittent, and the cache backend would safely
    recover, it shouldn't pose any problems for the health checkers, which
    will just notice the good result during the next check.
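    The two-step probe can be sketched as below. The toy cache class stands
    in for a backend with mitigation active, and the key name and helper are
    illustrative, not Djblets's exact HealthCheckView code:

```python
class MitigatingCache:
    """Toy cache simulating backend mitigation.

    While ``mitigating`` is set, writes are silently dropped and reads
    fall back to the caller-provided default.
    """

    def __init__(self):
        self._data = {}
        self.mitigating = False

    def set(self, key, value):
        if not self.mitigating:
            self._data[key] = value

    def get(self, key, default=None):
        if self.mitigating:
            return default

        return self._data.get(key, default)


def cache_is_healthy(cache, key='djblets-health-check'):
    """Probe a cache backend with a write followed by a sentinel read.

    The sentinel is a brand-new object, so the backend can never store
    or return it. If we see it back, the read fell through to our
    default, meaning the write was dropped.
    """
    sentinel = object()

    cache.set(key, 1)

    return cache.get(key, sentinel) is not sentinel
```

    Because the sentinel is uncacheable by construction, a faked read is
    detected on the very first probe, rather than after the health
    checker's own failure threshold.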

    This makes a few other changes to help with results:

    1. If using Djblets's forwarding cache backend, we'd end up with two
      keys representing the same true cache backend. We now skip the
      internally-managed forwarded one.

    2. We were checking the Local Memory cache backends (used in production
      for things like static media serials), but there's no point to that.
      We now skip these.

    3. The checked key is now prefixed with "djblets-", to avoid overwriting
      some other managed key.
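    Filtering the checked backends per items 1 and 2 might look like the
    sketch below. The CACHES structure mirrors Django's setting, and the
    "forwarded_backend" alias is an assumption based on Djblets's forwarding
    cache backend defaults:

```python
# Hypothetical example configuration mirroring Django's CACHES setting.
CACHES = {
    'default': {
        'BACKEND': 'djblets.cache.forwarding_backend.ForwardingCacheBackend',
    },
    'forwarded_backend': {
        'BACKEND': 'django.core.cache.backends.memcached.PyMemcacheCache',
    },
    'static_media': {
        'BACKEND': 'django.core.cache.backends.locmem.LocMemCache',
    },
}


def checkable_cache_names(caches=CACHES):
    """Yield the cache aliases worth probing in a health check."""
    for name, info in caches.items():
        backend = info['BACKEND']

        # Skip the internally-managed forwarded alias; probing it would
        # duplicate the check against the same true cache backend.
        if name == 'forwarded_backend':
            continue

        # Local Memory caches are per-process and can't meaningfully
        # fail, so there's no point checking them.
        if 'locmem' in backend:
            continue

        yield name
```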

    Spent hours debugging why Review Board claimed to be healthy while a
    cache server was down in a Docker setup.

    Dug into the cache backend and figured out its mitigation logic, and then
    thoroughly tested our workaround. Verified that the cache server's true
    state was always represented in the health check, and that after the
    required number of failed attempts, Docker marked the server as
    unhealthy, then marked it healthy again once I brought the cache server
    back.

    Commit ID:
    ebb2a829c747cc73177321bb11a4f30428b8f112
    david
    Ship It!
    maubin
    Ship It!
    chipx86
    Review request changed
    Status:
    Completed
    Change Summary:
    Pushed to release-4.x (de8a237)