Add safer cache key normalization, switch to SHA256, and fix return type.

Review Request #12376 — Created June 15, 2022 and submitted

Information

Djblets
release-3.x

Reviewers

make_cache_key() has been with us for a long time, and though it
generally worked as expected, there's been room for improvement.

This change starts by fixing up the normalization code for invalid
characters. Rather than stripping a few select characters (space, tab,
newlines), which wasn't sufficient, we now encode the hex values of any
invalid characters. According to Django's own modern validation code,
these are any characters in the ASCII range of 0-20, or ASCII 127 (DEL).
These become literal '\x00' through '\x20' and '\x7f'. The new
code uses regexes to do this, which should also be a bit faster.

The hash for the key now uses SHA256 instead of MD5, which is less
likely to generate collisions. This will of course invalidate any
existing keys, but that won't matter because of the final point.

We now return Unicode strings instead of byte strings. Turns out we made
a mistake when porting this to Python 3. We began returning byte
strings when we should have returned native strings. Django expects
Unicode strings in the case of Python 3. Behind the scenes, what ended
up happening was that a key like 'abc123' was turning into
b'abc123' in the cache. This could in theory result in keys that were
too big for the cache, when dealing with large cache keys + large data
split in cache_memoize_iter() (but in practice this hasn't happened).

Comprehensive unit tests have been added for make_cache_key().

Unit tests pass.

Summary ID
Add safer cache key normalization, switch to SHA256, and fix return type.
`make_cache_key()` has been with us for a long time, and though it generally worked as expected, there's been room for improvement. This change starts by fixing up the normalization code for invalid characters. Rather than stripping a few select characters (space, tab, newlines), which wasn't sufficient, we now encode the hex values of any invalid characters. According to Django's own modern validation code, these are any characters in the ASCII range of 0-20, or ASCII 127 (DEL). These become literal `'\x00'` through `'\x20'` and `'\x7f'`. The new code uses regexes to do this, which should also be a bit faster. The hash for the key now uses SHA256 instead of MD5, which is less likely to generate collisions. This will of course invalidate any existing keys, but that won't matter because of the final point. We now return Unicode strings instead of byte strings. Turns out we made a mistake when porting this to Python 3. We began returning byte strings when we should have returned native strings. Django expects Unicode strings in the case of Python 3. Behind the scenes, what ended up happening was that a key like `'abc123'` was turning into `b'abc123'` in the cache. This could in theory result in keys that were too big for the cache, when dealing with large cache keys + large data split in `cache_memoize_iter()` (but in practice this hasn't happened). Comprehensive unit tests have been added for `make_cache_key()`.
8fd4c1b28702b8aee9df7fc517fce293c7d97e88
chipx86
chipx86
david
  1. Ship It!
  2. 
      
chipx86
Review request changed
Status:
Completed
Change Summary:
Pushed to release-3.x (3aef61e)