Update the Trojan Source scanner for Unicode confusables/homoglyphs.

Review Request #11908 — Created Jan. 6, 2022 and submitted — Latest diff uploaded


Review Board


The Trojan Source scanner now looks for certain Unicode characters that
appear as standard latin1 characters, like A-Z, a-z, 0-9, etc. These can
be used by a malicious developer to try to sneak in logic that appears
to define or make use of a function, class, variable, etc. with one
name, while actually using a completely different name.

This is CVE-2021-42694.

This sort of scanning must be done carefully. There are a lot of
perfectly valid Unicode characters out there, and we don't want to check
them all, assume they're all nefarious.

What we instead do is check only confusables that meet the following

  1. Has a codepoint >= 128 (avoiding issues with, say, "1" vs" "l").
  2. Can be confused with a COMMON or LATIN Unicode character (ones most
    likely to be legitimately used in function names or other code)
  3. Is not itself a COMMON or LATIN Unicode character.

To generate the mapping, we have a new
./contrib/internal/build-confusables.py file, which will pull down the
latest datasets from unicode.org and generate a resulting
reviewboard/codesafety/_unicode_confusables.py file.

This is not perfect. People may find that some comments or strings
trigger a warning. Ideally, we'd be able to selectively apply these
tests depending on where it appears, but we're not in a position to do
that yet. Still, most of these should probably not be hit often in

Possible areas of future expansion would be to allow these if beside
other characters from the same script that are not themselves
confusables. This could be attempted if we get feedback later stating
that too many false-positives are being generated.

There is one major caveat to this implementation: it largely requires
wide Unicode character support, so that surrogate pairs appear as one
character/codepoint and not multiple.

This is always the case on Python 3. For Python 2, it depends on how
CPython was compiled. If wide support is not enabled, certain characters
cannot be found.

Unit tests pass on Python 2 (without wide support) and Python 3.

Tested with all the test code sets provided on