• 
      

    Update the Trojan Source scanner for Unicode confusables/homoglyphs.

    Review Request #11908 — Created Jan. 6, 2022 and submitted

    Information

    Review Board
    release-5.0.x

    Reviewers

    The Trojan Source scanner now looks for certain Unicode characters that
    appear as standard latin1 characters, like A-Z, a-z, 0-9, etc. These can
    be used by a malicious developer to try to sneak in logic that appears
    to define or make use of a function, class, variable, etc. with one
    name, while actually using a completely different name.

    This is CVE-2021-42694.

    This sort of scanning must be done carefully. There are a lot of
    perfectly valid Unicode characters out there, and we don't want to check
    them all, assume they're all nefarious.

    What we instead do is check only confusables that meet the following
    criteria:

    1. Has a codepoint >= 128 (avoiding issues with, say, "1" vs" "l").
    2. Can be confused with a COMMON or LATIN Unicode character (ones most
      likely to be legitimately used in function names or other code)
    3. Is not itself a COMMON or LATIN Unicode character.

    To generate the mapping, we have a new
    ./contrib/internal/build-confusables.py file, which will pull down the
    latest datasets from unicode.org and generate a resulting
    reviewboard/codesafety/_unicode_confusables.py file.

    This is not perfect. People may find that some comments or strings
    trigger a warning. Ideally, we'd be able to selectively apply these
    tests depending on where it appears, but we're not in a position to do
    that yet. Still, most of these should probably not be hit often in
    practice.

    Possible areas of future expansion would be to allow these if beside
    other characters from the same script that are not themselves
    confusables. This could be attempted if we get feedback later stating
    that too many false-positives are being generated.

    There is one major caveat to this implementation: it largely requires
    wide Unicode character support, so that surrogate pairs appear as one
    character/codepoint and not multiple.

    This is always the case on Python 3. For Python 2, it depends on how
    CPython was compiled. If wide support is not enabled, certain characters
    cannot be found.

    Unit tests pass on Python 2 (without wide support) and Python 3.

    Tested with all the test code sets provided on
    https://github.com/nickboucher/trojan-source/

    Summary ID
    Update the Trojan Source scanner for Unicode confusables/homoglyphs.
    The Trojan Source scanner now looks for certain Unicode characters that appear as standard latin1 characters, like A-Z, a-z, 0-9, etc. These can be used by a malicious developer to try to sneak in logic that appears to define or make use of a function, class, variable, etc. with one name, while actually using a completely different name. This is CVE-2021-42694. This sort of scanning must be done carefully. There are a lot of perfectly valid Unicode characters out there, and we don't want to check them all, assume they're all nefarious. What we instead do is check only confusables that meet the following criteria: 1. Has a codepoint >= 128 (avoiding issues with, say, "1" vs" "l"). 2. Can be confused with a COMMON or LATIN Unicode character (ones most likely to be legitimately used in function names or other code) 3. Is not itself a COMMON or LATIN Unicode character. To generate the mapping, we have a new `./contrib/internal/build-confusables.py` file, which will pull down the latest datasets from unicode.org and generate a resulting `reviewboard/codesafety/_unicode_confusables.py` file. This is not perfect. People may find that some comments or strings trigger a warning. Ideally, we'd be able to selectively apply these tests depending on where it appears, but we're not in a position to do that yet. Still, most of these should probably not be hit often in practice. Possible areas of future expansion would be to allow these if beside other characters from the same script that are not themselves confusables. This could be attempted if we get feedback later stating that too many false-positives are being generated. There is one major caveat to this implementation: it largely requires wide Unicode character support, so that surrogate pairs appear as one character/codepoint and not multiple. This is always the case on Python 3. For Python 2, it depends on how CPython was compiled. If wide support is not enabled, certain characters cannot be found.
    84faa8cbdfda9a72b60281245b7c9cf0c53c4bb0

    Description From Last Updated

    F401 'django.utils.six.unichr' imported but unused

    reviewbotreviewbot

    Since this is running python3, can we not just use the new import paths for these?

    daviddavid

    Not necessary anymore.

    daviddavid

    Not necessary anymore.

    daviddavid

    Not necessary anymore.

    daviddavid
    Checks run (1 failed, 1 succeeded)
    flake8 failed.
    JSHint passed.

    flake8

    david
    1. 
        
    2. contrib/internal/build-confusables.py (Diff revision 1)
       
       
       
      Show all issues

      Since this is running python3, can we not just use the new import paths for these?

      1. Could. Originally it aimed to support both.

        For now, I'm going to keep these paths as-is, so that we don't break before getting a chance to display a useful error about Python compatibility.

    3. 
        
    chipx86
    david
    1. 
        
    2. contrib/internal/build-confusables.py (Diff revision 2)
       
       
      Show all issues

      Not necessary anymore.

    3. contrib/internal/build-confusables.py (Diff revision 2)
       
       
      Show all issues

      Not necessary anymore.

    4. Show all issues

      Not necessary anymore.

    5. 
        
    chipx86
    david
    1. Ship It!
    2. 
        
    maubin
    1. Ship It!
    2. 
        
    chipx86
    Review request changed
    Status:
    Completed
    Change Summary:
    Pushed to release-5.0.x (7e41ea8)