Summary

Update the Trojan Source scanner for Unicode confusables/homoglyphs.

Review Request #11908 — Created Jan. 6, 2022 and submitted June 16, 2022, 1:24 p.m.

Information

Owner

chipx86

Repository

Review Board

Branch

release-5.0.x

Bugs

Depends On

Reviewers

Groups

reviewboard

People

Description

This is CVE-2021-42694.

This sort of scanning must be done carefully. There are a lot of
perfectly valid Unicode characters out there, and we don't want to check
them all or assume they're all nefarious.

What we instead do is check only confusables that meet the following
criteria:

1. Has a codepoint >= 128 (avoiding issues with, say, "1" vs. "l").
2. Can be confused with a COMMON or LATIN Unicode character (ones most
   likely to be legitimately used in function names or other code).
3. Is not itself a COMMON or LATIN Unicode character.
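
The three criteria can be read as a single predicate. What follows is a minimal sketch, not the code from this change: `SAMPLE_SCRIPTS` is a tiny hand-written stand-in for the Unicode script data, and `should_flag` is a hypothetical helper name.

```python
# Hand-written sample of Unicode script assignments (an assumption for
# illustration; a real build would read these from the Unicode data files).
SAMPLE_SCRIPTS = {
    '1': 'COMMON',         # DIGIT ONE
    'l': 'LATIN',          # LATIN SMALL LETTER L
    'a': 'LATIN',          # LATIN SMALL LETTER A
    '\u0430': 'CYRILLIC',  # CYRILLIC SMALL LETTER A
    '\uff41': 'LATIN',     # FULLWIDTH LATIN SMALL LETTER A
}

ALLOWED_SCRIPTS = {'COMMON', 'LATIN'}


def should_flag(char, looks_like, scripts=SAMPLE_SCRIPTS):
    """Return whether a confusable pair meets all three criteria.

    ``char`` is the character found in the code and ``looks_like`` is the
    character it can be confused with.
    """
    return (
        ord(char) >= 128 and                          # Criterion 1
        scripts.get(looks_like) in ALLOWED_SCRIPTS and  # Criterion 2
        scripts.get(char) not in ALLOWED_SCRIPTS        # Criterion 3
    )


print(should_flag('\u0430', 'a'))  # Cyrillic "a" posing as Latin "a": True
print(should_flag('1', 'l'))       # ASCII pair, fails criterion 1: False
print(should_flag('\uff41', 'a'))  # Fullwidth "a" is itself LATIN: False
```

In this sketch, a character missing from the script table is treated as not COMMON or LATIN, which errs on the side of flagging it.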

To generate the mapping, we have a new
./contrib/internal/build-confusables.py file, which will pull down the
latest datasets from unicode.org and generate a resulting
reviewboard/codesafety/_unicode_confusables.py file.
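
For orientation, the confusables dataset published on unicode.org is a semicolon-delimited text file with `#` comments. Here is a minimal parsing sketch assuming that layout. The `parse_confusables` name and the sample lines are illustrative and are not taken from `build-confusables.py`, and nothing is downloaded here.

```python
# Hand-written lines in the style of the Unicode confusables data file.
SAMPLE = """\
# source ; target ; type
0430 ; 0061 ; MA  # CYRILLIC SMALL LETTER A -> LATIN SMALL LETTER A
0441 ; 0063 ; MA  # CYRILLIC SMALL LETTER ES -> LATIN SMALL LETTER C
"""


def parse_confusables(lines):
    """Map each source character to the string it can be confused with."""
    mapping = {}

    for line in lines:
        # Drop trailing comments and skip blank or comment-only lines.
        line = line.split('#', 1)[0].strip()

        if not line:
            continue

        fields = [field.strip() for field in line.split(';')]

        if len(fields) < 2:
            continue

        # The source is one hex codepoint. The target may be a sequence
        # of hex codepoints separated by spaces.
        source = chr(int(fields[0], 16))
        target = ''.join(chr(int(cp, 16)) for cp in fields[1].split())
        mapping[source] = target

    return mapping


confusables = parse_confusables(SAMPLE.splitlines())
print(sorted(hex(ord(ch)) for ch in confusables))
```

A generator script would then filter this mapping down to the entries that meet the criteria before writing out the Python module.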

This is not perfect. People may find that some comments or strings
trigger a warning. Ideally, we'd be able to selectively apply these
tests depending on where the character appears, but we're not in a
position to do that yet. Still, most of these should probably not be hit
often in practice.
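
To illustrate why comments and strings can trigger a warning, here is a minimal sketch of a context-free scan. It assumes a small hand-picked `SAMPLE_CONFUSABLES` table and a hypothetical `find_confusables` helper, neither of which comes from the checker itself.

```python
# Hand-picked sample entries (an assumption, not the generated mapping).
SAMPLE_CONFUSABLES = {
    '\u0430': 'a',  # CYRILLIC SMALL LETTER A
    '\u0441': 'c',  # CYRILLIC SMALL LETTER ES
}


def find_confusables(line, confusables=SAMPLE_CONFUSABLES):
    """Return (index, character, lookalike) for each confusable in a line."""
    return [
        (index, char, confusables[char])
        for index, char in enumerate(line)
        if char in confusables
    ]


# An identifier where the "a" in "admin" is Cyrillic.
code_line = 'def is_\u0430dmin(user):'

# A legitimate Russian-language comment.
comment_line = '# \u0441\u043f\u0430\u0441\u0438\u0431\u043e'

code_hits = find_confusables(code_line)
comment_hits = find_confusables(comment_line)

print(len(code_hits), len(comment_hits))
```

Because the scan only sees characters and not tokens, the harmless comment is reported alongside the suspicious identifier.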

One possible area of future expansion would be to allow these characters
when they sit beside other characters from the same script that are not
themselves confusables. This could be attempted if we get feedback later
that too many false positives are being generated.
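
A rough sketch of how such a neighbor check might look is below. This is speculative and is not part of the change. It approximates a character's script by the first word of its Unicode name, which is a crude stand-in for real script data, and `SAMPLE_CONFUSABLES`, `rough_script`, and `has_same_script_neighbor` are all hypothetical names.

```python
import unicodedata

# Hand-picked sample of confusable characters (an assumption).
SAMPLE_CONFUSABLES = {'\u0430', '\u0441', '\u043e'}


def rough_script(char):
    """Approximate the script using the first word of the Unicode name."""
    name = unicodedata.name(char, '')

    return name.split(' ', 1)[0] if name else ''


def has_same_script_neighbor(text, index, confusables=SAMPLE_CONFUSABLES):
    """Return whether an adjacent non-confusable shares the script."""
    script = rough_script(text[index])

    for pos in (index - 1, index + 1):
        if 0 <= pos < len(text):
            neighbor = text[pos]

            if (neighbor not in confusables and
                    rough_script(neighbor) == script):
                return True

    return False


# Cyrillic "a" inside a Russian word: the neighbor at index 1 is Cyrillic
# and is not in the sample confusables set.
word = '\u0441\u043f\u0430\u0441\u0438\u0431\u043e'
in_word = has_same_script_neighbor(word, 2)

# Cyrillic "a" inside an otherwise ASCII identifier.
identifier = 'is_\u0430dmin'
in_identifier = has_same_script_neighbor(identifier, 3)

print(in_word, in_identifier)
```

Under this heuristic the Russian word would be allowed while the mixed-script identifier would still be flagged.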

There is one major caveat to this implementation: it largely requires
wide Unicode character support, so that surrogate pairs appear as one
character/codepoint and not multiple.

This is always the case on Python 3. For Python 2, it depends on how
CPython was compiled. If wide support is not enabled, certain characters
cannot be found.
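
The difference can be seen with a character outside the Basic Multilingual Plane. The snippet below is an illustration for Python 3 only. The `has_wide_unicode` check is one plausible way to detect wide support and is an assumption, not necessarily how this change does it.

```python
import sys

# MATHEMATICAL BOLD SMALL A (U+1D41A), which sits outside the Basic
# Multilingual Plane and therefore needs a surrogate pair in UTF-16.
char = '\U0001d41a'

# On Python 3, the string holds one codepoint.
codepoints = len(char)

# In UTF-16, the same character takes two 16-bit code units. A narrow
# Python 2 build exposes those two units as separate "characters".
utf16_units = len(char.encode('utf-16-le')) // 2

# Wide builds (and all Python 3 builds) can index any codepoint directly.
has_wide_unicode = sys.maxunicode > 0xFFFF

print(codepoints, utf16_units, has_wide_unicode)
```

A lookup keyed on single characters would miss this character on a narrow build, because neither half of the surrogate pair matches the mapping entry.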

Testing Done

Unit tests pass on Python 2 (without wide support) and Python 3.

Tested with all the test code sets provided on
https://github.com/nickboucher/trojan-source/

Commits

Summary	ID
Update the Trojan Source scanner for Unicode confusables/homoglyphs.	84faa8cbdfda9a72b60281245b7c9cf0c53c4bb0

Summary

Update the Trojan Source scanner for Unicode confusables/homoglyphs.

The Trojan Source scanner now looks for certain Unicode characters that appear as standard latin1 characters, like A-Z, a-z, 0-9, etc. These can be used by a malicious developer to try to sneak in logic that appears to define or make use of a function, class, variable, etc. with one name, while actually using a completely different name.

This is CVE-2021-42694.

This sort of scanning must be done carefully. There are a lot of perfectly valid Unicode characters out there, and we don't want to check them all or assume they're all nefarious.

What we instead do is check only confusables that meet the following criteria:

1. Has a codepoint >= 128 (avoiding issues with, say, "1" vs. "l").
2. Can be confused with a COMMON or LATIN Unicode character (ones most likely to be legitimately used in function names or other code).
3. Is not itself a COMMON or LATIN Unicode character.

To generate the mapping, we have a new `./contrib/internal/build-confusables.py` file, which will pull down the latest datasets from unicode.org and generate a resulting `reviewboard/codesafety/_unicode_confusables.py` file.

This is not perfect. People may find that some comments or strings trigger a warning. Ideally, we'd be able to selectively apply these tests depending on where the character appears, but we're not in a position to do that yet. Still, most of these should probably not be hit often in practice.

One possible area of future expansion would be to allow these characters when they sit beside other characters from the same script that are not themselves confusables. This could be attempted if we get feedback later that too many false positives are being generated.

There is one major caveat to this implementation: it largely requires wide Unicode character support, so that surrogate pairs appear as one character/codepoint and not multiple.

This is always the case on Python 3. For Python 2, it depends on how CPython was compiled. If wide support is not enabled, certain characters cannot be found.

84faa8cbdfda9a72b60281245b7c9cf0c53c4bb0

Files

Issues

Description	From	Last Updated
F401 'django.utils.six.unichr' imported but unused	reviewbot	Jan. 6, 2022, 9:58 p.m.
Since this is running python3, can we not just use the new import paths for these?	david	May 23, 2022, 7:14 p.m.
Not necessary anymore.	david	May 26, 2022, 10:45 a.m.
Not necessary anymore.	david	May 26, 2022, 10:45 a.m.
Not necessary anymore.	david	May 26, 2022, 10:45 a.m.

flake8 failed.

JSHint passed.

flake8

contrib/internal/build-confusables.py (Diff revision 1)
The issue has been resolved.
```
F401 'django.utils.six.unichr' imported but unused
```

Ship it!

contrib/internal/build-confusables.py (Diff revision 1)

The issue has been resolved.

Since this is running python3, can we not just use the new import paths for these?

chipx86 Jan. 11, 2022, 4:43 p.m.

Could. Originally it aimed to support both.

For now, I'm going to keep these paths as-is, so that we don't break before getting a chance to display a useful error about Python compatibility.

Change Summary:

Updated for Review Board 5:

Removed Python 2-specific code, including the WIDE_UNICODE constant.
Removed six usage.
Removed __future__ imports.
Changed unicode to str in docstrings.
Changed SafeText to SafeString.
Updated versions in docstrings.
Updated build-confusables.py to do a better job building the output path and printing a result.

Commits:

	Summary	ID
	Update the Trojan Source scanner for Unicode confusables/homoglyphs.	21bc54a697a84aa01c2817a5631766e3a9436f2a
	Update the Trojan Source scanner for Unicode confusables/homoglyphs.	702c66b09851e56ab4ac3f78ad9950de53556243

Branch:

Changed from release-4.0.x to release-5.0.x

Diff:

Revision 2 (+4142 -20)

	contrib/internal/build-confusables.py
	reviewboard/codesafety/_unicode_confusables.py
	reviewboard/codesafety/checkers/trojan_source.py
	reviewboard/codesafety/tests/test_trojan_source_code_safety_checker.py
	reviewboard/templates/codesafety/trojan_source_alert.html

Checks run (2 succeeded)

flake8 passed.

JSHint passed.

contrib/internal/build-confusables.py (Diff revision 2)
The issue has been resolved.
```
Not necessary anymore.
```
contrib/internal/build-confusables.py (Diff revision 2)
The issue has been resolved.
```
Not necessary anymore.
```
reviewboard/codesafety/_unicode_confusables.py (Diff revision 2)
The issue has been resolved.
```
Not necessary anymore.
```

Change Summary:

Removed __future__ imports.

Commits:

	Summary	ID
	Update the Trojan Source scanner for Unicode confusables/homoglyphs.	702c66b09851e56ab4ac3f78ad9950de53556243
	Update the Trojan Source scanner for Unicode confusables/homoglyphs.	84faa8cbdfda9a72b60281245b7c9cf0c53c4bb0

Diff:

Revision 3 (+4126 -20)

	contrib/internal/build-confusables.py
	reviewboard/codesafety/_unicode_confusables.py
	reviewboard/codesafety/checkers/trojan_source.py
	reviewboard/codesafety/tests/test_trojan_source_code_safety_checker.py
	reviewboard/templates/codesafety/trojan_source_alert.html

Checks run (2 succeeded)

flake8 passed.

JSHint passed.

Ship it!

```
Ship It!
```

Ship it!

```
Ship It!
```

Status: Completed
Change Summary: Pushed to release-5.0.x (7e41ea8)