Add a Credential Checker tool for Review Bot

Review Request #11206 — Created Sept. 27, 2020 and updated

ceciliawei
ReviewBot
master
reviewbot, students
jace, jblazusi

This change adds a tool for Review Bot to check for hard-coded security
credentials. I used regex to search on the patterns of the known
credential types. The result may include false positives.

Verified the functionality by installing the tool and manual
testing.

Summary Author
Add a Credential Checker tool for Review Bot.
ceciliaccwei
Loading file attachments...

Description From Last Updated

Hmm, so I know in the project description we called it "AWS Credential Checker" but I wonder if we shouldn't ...

daviddavid

In your description, you say it "adds an integration". The lingo we use in review bot is "adds a tool"

daviddavid

Filenames should always be lowercase and consistent with files in the same directory. We don't have any others with capitalization ...

chipx86chipx86

This docstring almost reads as if "AWS Credential Checker" is a third-party tool that can be run. Let's say something ...

daviddavid

For imports, we have a standard based on Python's style guide. Imports should be broken up into 4 possible sections ...

daviddavid

Same here as the module docstring.

daviddavid

There's nothing in this that requires using the raw string (r prefix). This should also use single-quotes instead of double.

daviddavid

We're compiling the regex once per file, but of course it will never change. Can we define an __init__ method ...

daviddavid

We can avoid writing the file out to disk and reading it back in. Let's do something like: data = ...

daviddavid

I don't think we need to list the actual contents of the keys in the comment message, since it will ...

daviddavid

It seems like we could probably just do this inside the loop that is doing all the matches.

daviddavid

It'd be nice to include what services this checks for.

chipx86chipx86

This is missing a docstring.

chipx86chipx86

We always want to use super to call parent methods.

chipx86chipx86

I have some notes for improving this. 1. Raw strings Let's prefix each string with r, which is for a ...

chipx86chipx86

Is there anything more specific and easy to match than this? It feels pretty generic.

chipx86chipx86

There shouldn't be ablank line after function/method-level docstrings. Just file/class.

chipx86chipx86

Here's an interesting thing. We cannot assume UTF-8 here. Plenty of files are in other encodings. So what we should ...

chipx86chipx86

Every time we call self.pattern.search(line), we're re-fetching pattern from self. For a 5,000 line file, that means 5,000 attribute accesses. ...

chipx86chipx86

There should always be a blank line between statements (like match = ...) and blocks (if ...).

chipx86chipx86

When formatting multi-line strings, we prefer using parens and having the space at the end of a line instead of ...

chipx86chipx86

There should be a blank line between the statement and block.

chipx86chipx86

We should log if this fails. To do this, at the top, import logging and do: logger = logging.getLogger(__name__) This ...

chipx86chipx86

The name doesn't reflect the conventions used elsewhere. We may also want to distinguish it from other tools by prefixing ...

chipx86chipx86

E999 SyntaxError: invalid syntax

reviewbotreviewbot

E999 SyntaxError: invalid syntax

reviewbotreviewbot

Instead of "if a credential has been committed", perhaps "check for hard-coded security credentials"? Same in the docstrings throughout this.

daviddavid

E999 SyntaxError: invalid syntax

reviewbotreviewbot

Hmm, I'm not sure about this comment text. Perhaps something like "This appears to be a hard-coded credential, which is ...

daviddavid

I don't think any of the other tools wrap f.comment in an exception handler. Were you encountering some problem?

daviddavid

This text can wrap to a wider width (80 columns). Also same comment about the description text as in the ...

daviddavid

E999 SyntaxError: invalid syntax

reviewbotreviewbot

Exclamation points are rarely if ever appropriate for use in UI text, even for security notices. Let's just use a ...

daviddavid

Even though it's > 80 columns, let's put this all on one line. Wrapping like this is hard to read ...

daviddavid

Can we add a blank line between these two?

daviddavid

E999 SyntaxError: invalid syntax

reviewbotreviewbot

E501 line too long (94 > 79 characters)

reviewbotreviewbot
david
  1. This is a great start!

  2. Hmm, so I know in the project description we called it "AWS Credential Checker" but I wonder if we shouldn't make this more generic. It would be nice to be able to add on other common credential types (SSH keys, etc.)

    1. I actually had that as a potential task of the tool which I thought about discussing later this week.

      My thoughts are the following:
      1. Rename the tool as Secret Scanner or something similar so that it can be a generic credential checker.
      2. When the admin installs the tool, a list of checkboxes of different secret types will be presented. Then he can decides which ones to check.
      (We can also scan for every type of secret, but I was not sure if we need to.)

  3. In your description, you say it "adds an integration". The lingo we use in review bot is "adds a tool"

    1. Will update that in revision.

  4. This docstring almost reads as if "AWS Credential Checker" is a third-party tool that can be run. Let's say something like "Review Bot tool to check for hardcoded credentials"

    1. Will update that in revision.

  5. bot/reviewbot/tools/AWS_Credential_Checker.py (Diff revision 1)
     
     
     
     
     

    For imports, we have a standard based on Python's style guide. Imports should be broken up into 4 possible sections separated by newlines.

    imports from __future__
    
    standard library imports (that would be your re import)
    
    imports from third-party libraries not in the standard library. This can include imports from other packages that we maintain, such as rbtools.
    
    imports from "this package" (meaning anything in reviewbot)
    

    Within that, we do any full-package imports first (import ...), followed by any from imports. All of that should be alphabetized. What you have is simple enough that most of those rules don't apply, but the sections need to be reorganized as such:

    from __future__ import unicode_literals
    
    import re
    
    from reviewbot.tools import Tool
    
    1. Thanks for letting me know. I will keep that in mind.

  6. Same here as the module docstring.

  7. There's nothing in this that requires using the raw string (r prefix). This should also use single-quotes instead of double.

  8. We're compiling the regex once per file, but of course it will never change. Can we define an __init__ method for the tool, and store the compiled regex as an attribute?

    1. The Tool class already has an init method, if we add an init method in the AWS_Credential_Checker tool, would the previous one get overriden?

  9. bot/reviewbot/tools/AWS_Credential_Checker.py (Diff revision 1)
     
     
     
     

    We can avoid writing the file out to disk and reading it back in. Let's do something like:

    data = f.patched_file_contents
    
    if not data:
      return
    
    for i, line in enumerate(data.splitlines()):
        ...
    
    1. Oh I see what you mean now. Sorry I mixed up the concept of "patched_file_contents" with "patched_contents". Will address in revision.

  10. bot/reviewbot/tools/AWS_Credential_Checker.py (Diff revision 1)
     
     
     
     

    I don't think we need to list the actual contents of the keys in the comment message, since it will be associated with the line and show up with the context. That lets us simplify the message to something warning that they're potentially hard-coding secret credentials in their code.

    In the future, one thing to be really careful about is the use of str in Python code. On Python 2, str (aka bytes) is a bytestring and unicode is a unicode string. On Python 3, str is unicode and bytes is bytes, which makes a lot more sense. Any time there's a potential mix of string types we need to be really careful, especially since Python 3 treats such mixing as an error but Python 2 will happily attempt to cast silently.

    1. Makes sense. Will address in revision. Thanks!

  11. If we make the above change to use patched_file_contents then this point is moot, but one cool thing about Python is you can use files as a context manager:

    with open(path) as f:
        ... do stuff with f...
    

    When the code exits the context of that with block, it will automatically close the file. This happens even if there's an exception that occurs within the block, and helps prevent open file descriptor leaks.

    1. I see. Thanks for the feedback, I will keep that in mind. And I will use patched_file_contents instead of reading the actual file.

  12. bot/reviewbot/tools/AWS_Credential_Checker.py (Diff revision 1)
     
     
     
     
     
     

    It seems like we could probably just do this inside the loop that is doing all the matches.

    1. Right. I was referring to the other tools but since they are all executing a certain command and analyzing the output, there's no way to do so. I fogot to take that into account. Will update in revision.

  13. 
      
ceciliawei
ceciliawei
ceciliawei
ceciliawei
jblazusi
  1. 
      
    1. Everything looks good to me!

  2. 
      
jblazusi
  1. 
      
  2. 
      
chipx86
  1. 
      
  2. Filenames should always be lowercase and consistent with files in the same directory. We don't have any others with capitalization or underscores.

    Let's call this one rbcredentialcheck.py.

  3. It'd be nice to include what services this checks for.

  4. This is missing a docstring.

  5. We always want to use super to call parent methods.

  6. bot/reviewbot/tools/Credential_Checker.py (Diff revision 3)
     
     
     
     
     
     
     
     
     
     
     
     
     
     
     
     
     
     
     
     
     

    I have some notes for improving this.

    1. Raw strings

    Let's prefix each string with r, which is for a raw string. This allows suff like \n to literally match the text \n, not a newline. It's often handy for regexes, and helps avoid mistakes in the future.

    2. Alphabetizing

    Let's alphabetize the keys, so it's more clear where any new ones should go.

    3. Performance

    Currently, every time the tool is initialized, we reconstruct the pattern from the strings in a dictionary. I think there are two things we should do:

    1. Check for and store the pattern on the class during initialization (so we only construct it once per process) — this is probably best moved to __init__ in the process.
    2. Build this as one big regex.

    The latter would look like:

    pattern = re.compile(
        r'''(
    
        # AWS Access Key
        ((A3T[A-Z0-9]|AKIA|AGPA|AIDA|AROA|AIPA|ANPA|ANVA|ASIA)[A-Z0-9]{16})
        |
    
        # AWS Secret Key
        ([aA][wW][sS].*[0-9a-zA-Z/+]{40})
        |
    
        ..etc..
        )''', re.X)
    

    This is faster and allows expressions to be more free-form and have room to breathe.

    4. Alphabetization

    I'm tempted to say we shouldn't trust that the content is going to always be in the same casing. We don't want to miss something just because someone pasted a SSH key that had bEGIN RSA PRIVATE KEY.

    Ideally we'd use Python 3.7+'s ability to mark subsections of regexes as case-insensitive, but we can't. We could just require it for the whole string by using the re.I option on re.compile().

    Interested in hearing your thoughts on that.

    1. Thanks for the feedback! I have a few questions though -
      For 1, before I used the patterns as pattern = r"AKIA[0-9A-Z]{16}|[0-9a-zA-Z/+]{40}" and David suggested that there's nothing in the pattern that requires using the raw string (r prefix). I re-read the usage of the r prefix and I agree that nothing in those patterns requires using it. I don't think there's any \n in the patterns for now. Is this for adding more patterns in the future? i.e. we might need to add more patterns later and some of them might contain the '\n' charater. So it would be better to keep these consistent?

      For 3.1, do you mind clarifyng on this? I thought putting the self.checker_init() in the __init__ addressed this. I was thinking store the compiled regex as part of the __init__ so the patterns only need to be compiled once during initilization.

      For 4, for example I currently have [aA][wW][sS].*[0-9a-zA-Z/+]{40}, if we were to use re.I, then should I just change this to aws.*[0-9a-z/+]{40} since the match is case insensitive?

  7. Is there anything more specific and easy to match than this? It feels pretty generic.

    1. I didn't find a more reliable source but that's what is used by truffleHog and gitleaks.

  8. bot/reviewbot/tools/Credential_Checker.py (Diff revision 3)
     
     
     
     

    There shouldn't be ablank line after function/method-level docstrings. Just file/class.

  9. Here's an interesting thing. We cannot assume UTF-8 here. Plenty of files are in other encodings.

    So what we should do is leave this as a byte string, and have the compiled regex also be a byte string. Then we don't have to think about encodings. This is also faster.

    1. Got it! Will address in the revision.

  10. Every time we call self.pattern.search(line), we're re-fetching pattern from self. For a 5,000 line file, that means 5,000 attribute accesses.

    Instead, let's pull pattern out of self before the loop, and access it directly as a local variable.

  11. bot/reviewbot/tools/Credential_Checker.py (Diff revision 3)
     
     
     

    There should always be a blank line between statements (like match = ...) and blocks (if ...).

  12. bot/reviewbot/tools/Credential_Checker.py (Diff revision 3)
     
     
     

    When formatting multi-line strings, we prefer using parens and having the space at the end of a line instead of the beginning.

    However, this can just be provided inline to f.comment().

  13. bot/reviewbot/tools/Credential_Checker.py (Diff revision 3)
     
     
     

    There should be a blank line between the statement and block.

  14. bot/reviewbot/tools/Credential_Checker.py (Diff revision 3)
     
     
     

    We should log if this fails. To do this, at the top, import logging and do:

    logger = logging.getLogger(__name__)
    

    This gives us a logger that will namespace any messages, making it clear which tool this is coming from. You can then call logger.exception(...) with suitable information.

    1. Will do. I was referring to flake8 but looks like other tools all have this logging. I will have a review request to log for flake8's exception as well.

  15. bot/setup.py (Diff revision 3)
     
     
     

    The name doesn't reflect the conventions used elsewhere. We may also want to distinguish it from other tools by prefixing it. How about rbcredentialcheck?

  16. 
      
ceciliawei
Review request changed

Change Summary:

Addressed feedback.

Commits:

Summary Author
-
Add a Credential Checker tool for Review Bot
ceciliaccwei
+
Add a Credential Checker tool for Review Bot
ceciliaccwei

Diff:

Revision 4 (+238)

Show changes

Checks run (1 failed, 1 succeeded)

flake8 failed.
JSHint passed.

flake8

ceciliawei
Review request changed

Change Summary:

flake8

Commits:

Summary Author
-
Add a Credential Checker tool for Review Bot
ceciliaccwei
+
Add a Credential Checker tool for Review Bot
ceciliaccwei

Diff:

Revision 5 (+240)

Show changes

Checks run (1 failed, 1 succeeded)

flake8 failed.
JSHint passed.

flake8

ceciliawei
  1. 
      
  2. 
      
ceciliawei
  1. 
      
  2. 
      
ceciliawei
Review request changed

Change Summary:

Added tool documentation.

Commits:

Summary Author
-
Add a Credential Checker tool for Review Bot
ceciliaccwei
+
Add a Credential Checker tool for Review Bot
ceciliaccwei

Diff:

Revision 6 (+306)

Show changes

Checks run (1 failed, 1 succeeded)

flake8 failed.
JSHint passed.

flake8

david
  1. Looking pretty great!

  2. bot/reviewbot/tools/rbcredentialchecker.py (Diff revision 6)
     
     
     

    Instead of "if a credential has been committed", perhaps "check for hard-coded security credentials"?

    Same in the docstrings throughout this.

  3. bot/reviewbot/tools/rbcredentialchecker.py (Diff revision 6)
     
     
     

    Hmm, I'm not sure about this comment text. Perhaps something like "This appears to be a hard-coded credential, which is a potential security risk" ?

  4. bot/reviewbot/tools/rbcredentialchecker.py (Diff revision 6)
     
     
     
     

    I don't think any of the other tools wrap f.comment in an exception handler. Were you encountering some problem?

    1. Before I had another line inside the try block which could cause exceptions. I forgot to remove the try block after removing that.

      The feedback was addressed in the revision.

  5. docs/reviewbot/tools/rbcredentialchecker.rst (Diff revision 6)
     
     
     
     
     
     
     
     
     
     
     
     

    This text can wrap to a wider width (80 columns). Also same comment about the description text as in the credential checker source.

  6. 
      
ceciliawei
Review request changed

Change Summary:

Addressed feedback.

Description:

~  

This change adds a tool for Review Bot to detect if a secret/credential

~   is included in the commit. I used regex to search on the patterns of the
~   known credential types. The result may include false positives.

  ~

This change adds a tool for Review Bot to check for hard-coded security

  ~ credentials. I used regex to search on the patterns of the known
  ~ credential types. The result may include false positives.

Commits:

Summary Author
-
Add a Credential Checker tool for Review Bot
ceciliaccwei
+
Add a Credential Checker tool for Review Bot
ceciliaccwei

Diff:

Revision 7 (+290)

Show changes

Added Files:

Checks run (1 failed, 1 succeeded)

flake8 failed.
JSHint passed.

flake8

david
  1. 
      
  2. Exclamation points are rarely if ever appropriate for use in UI text, even for security notices. Let's just use a period.

  3. bot/setup.py (Diff revision 7)
     
     
     

    Even though it's > 80 columns, let's put this all on one line. Wrapping like this is hard to read and error-prone.

  4. Can we add a blank line between these two?

    1. Addressed in revision.

  5. 
      
ceciliawei
Review request changed

Change Summary:

Addressed feedback

Commits:

Summary Author
-
Add a Credential Checker tool for Review Bot
ceciliaccwei
+
Add a Credential Checker tool for Review Bot.
ceciliaccwei

Diff:

Revision 8 (+290)

Show changes

Checks run (1 failed, 1 succeeded)

flake8 failed.
JSHint passed.

flake8

ceciliawei
Review request changed

Change Summary:

Updated screenshot

Added Files:

Loading...