    Normalize, encode, and decode cert hostnames for storage.

    Review Request #15016 — Created April 16, 2026 and submitted

    Information

    Review Board
    release-8.x

    Reviewers

    When a certificate represents a hostname, or a client requests one, that
    hostname may be presented in any casing (uppercase, lowercase, mixed
    case), which can pose issues for comparison. Further, when dealing with
    filesystem storage, we may encounter hostnames with non-ASCII characters
    in them, which may pose challenges depending on the filesystem.

    This change introduces casing normalization of hostnames in the storage
    objects to ease comparisons, and normalization/encoding/decoding in the
    file storage backend to handle encoding and representation differences.

    The base storage objects that deal with hostnames now keep a version of
    the hostname normalized for comparison purposes. This is a Unicode
    string that can resolve to a hostname, but with casing converted to
    lowercase. This eases comparison and gives a consistent representation
    of these hostnames.
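As a minimal sketch of comparison normalization, assuming plain lowercasing as the description states: the helper name `normalize_hostname_for_comparison` is hypothetical and is not Review Board's API, and the real storage objects may normalize further.

```python
def normalize_hostname_for_comparison(hostname: str) -> str:
    """Return a lowercase form of a hostname for equality checks.

    Hypothetical helper for illustration only. It changes casing and
    nothing else, so the result is still a Unicode hostname.
    """
    return hostname.lower()


# Differently-cased spellings of the same host compare equal once normalized.
assert (normalize_hostname_for_comparison('Example.COM') ==
        normalize_hostname_for_comparison('example.com'))
```

Normalizing once, when the storage object is built, means later lookups can use plain string equality.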

    The file storage backend handles its own normalization and translation
    behavior when computing filenames for a given hostname. Encoding
    involves removing any trailing period on the hostname and then
    converting to an IDNA 2008 representation to handle Unicode
    characters. The result is an ASCII filename safe for all filesystems.
    Decoding does the inverse of this.

    IDNA handling depends on the idna library, which is a new dependency
    added to Review Board 8. This supports IDNA 2008 standards with UTS46
    normalization, which amongst other things handles casing differences.
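The encode and decode steps described above might look roughly like the following. This is a sketch under stated assumptions, not the code in this change: the function names are hypothetical, and the fallback to Python's built-in `idna` codec (which implements the older IDNA 2003 rules) exists only so the example runs where the `idna` package is not installed.

```python
try:
    import idna  # Third-party IDNA 2008 package named in the description.
except ImportError:
    idna = None  # Fall back to Python's older built-in codec (IDNA 2003).


def encode_hostname_for_filename(hostname: str) -> str:
    """Return an ASCII-only name for a hostname (hypothetical helper)."""
    # Drop a single trailing period, e.g. "example.com." -> "example.com".
    if hostname.endswith('.'):
        hostname = hostname[:-1]

    if idna is not None:
        # uts46=True applies UTS46 mapping, which includes lowercasing.
        return idna.encode(hostname, uts46=True).decode('ascii')

    # Fallback only: lowercase explicitly, then use the built-in codec.
    return hostname.lower().encode('idna').decode('ascii')


def decode_filename_to_hostname(name: str) -> str:
    """Return the Unicode hostname for an encoded name (hypothetical helper)."""
    if idna is not None:
        return idna.decode(name)

    return name.encode('ascii').decode('idna')


encoded = encode_hostname_for_filename('Éxamplé.COM.')

# The encoded form is pure ASCII, and decoding yields the normalized
# (lowercase, no trailing period) Unicode hostname.
assert encoded.isascii()
assert decode_filename_to_hostname(encoded) == 'éxamplé.com'
```

Note that decoding recovers the normalized hostname, not the original casing or trailing period, since those are discarded during encoding.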

    Note that the standard Certificate, CertificateFingerprints, etc.
    objects do not normalize hostnames. They are a representation of their
    source. Whether that source is caller-supplied input, an X.509
    certificate, or a storage object, it will reflect the version of the
    hostname on there. That allows for creating an object that can represent
    a piece of state that can then be introspected or validated, which we do
    today.

    Unit tests pass.

    Summary ID
    Normalize, encode, and decode cert hostnames for storage.
    7787cc93803afad7474d670b33238f41ef1aa460
    Description From Last Updated

    Looks like reviewboard.certs.cert.Certificate.__init__ is also storing hostname, we should normalize there.

    david david

    Could we make a central helper for normalizing the hostname? Right now we only do casefold but it seems like …

    david david

    Can we add additional tests to verify mixed-case comparisons with certificates and fingerprints?

    david david

    This comparison is happening before we do any casefold()ing

    david david

    We should normalize here too.

    david david

    We should normalize here too.

    david david

    We should probably be normalizing the hostname here at the ingress point instead of deep inside _build*

    david david

    Same with the hostname here.

    david david

    And here.

    david david

    Is there a reason to use lower() here instead of casefold()?

    david david

    Apparently python's idna codec is old. We already have the idna package available because it's a dependency of cryptography, so …

    david david

    Typo: an decoded -> a decoded

    david david

    We don't return here anymore, we raise.

    david david

    This is a little confusing. How about "The hostname {hostname} contains invalid characters and cannot be stored"?

    david david

    Given that this isn't really a fatal error, we should probably use warning instead of error

    david david

    redefinition of unused 'test_init_with_unicode_hostname' from line 830 Column: 5 Error code: F811

    reviewbot reviewbot

    SyntaxError: unterminated string literal (detected at line 2223) Column: 22 Error code: E999

    reviewbot reviewbot
    david
    2. Show all issues

      Looks like reviewboard.certs.cert.Certificate.__init__ is also storing hostname, we should normalize there.

      1. I intentionally left that out there in my final version. Originally I had it normalize there, but ultimately decided that the initially-constructed Certificate should be fully representative of any parsed state. Once it's been committed to the database and loaded back out, we'll be dealing with a normalized version, but we use a newly-constructed Certificate partly for verification and then logging purposes, so keeping it as consistent with the source data at that stage is important.

    3. Show all issues

      Could we make a central helper for normalizing the hostname? Right now we only do casefold but it seems like there might be other things we'd want to do in the future (trimming, IDNA conversion, etc).

      1. Went ahead and put this in the match_host() change.

    4. Show all issues

      Can we add additional tests to verify mixed-case comparisons with certificates and fingerprints?

    5. reviewboard/certs/storage/base.py (Diff revision 1)

      This comparison is happening before we do any casefold()ing

    6. reviewboard/certs/storage/base.py (Diff revision 1)

      We should normalize here too.

    7. Show all issues

      We should normalize here too.

    8. reviewboard/certs/storage/file_storage.py (Diff revision 1)

      We should probably be normalizing the hostname here at the ingress point instead of deep inside _build*

      1. So here are my thoughts on normalization.

        There are two reasons to normalize:

        1. When comparing hostnames, making sure that two different-cased but otherwise identical hostnames are equal.
        2. When dealing with the filesystem.

        These ultimately may have very different normalization approaches as we go forward. We will likely, for instance, need to normalize Unicode characters in a cross-filesystem manner, encoding/decoding characters, but we wouldn't want to represent that version in our objects that way. Similarly, as above, we want to avoid straying too much from the original input too high up.

        So I see the hostname handling as follows:

        1. On the high-level classes (Certificate, etc.), we take the raw input as-is, and do not normalize.
        2. In the storage classes (FileStoredCertificate, FileStoredCertificateFingerprints, etc.), we can work with a normalized form for comparison purposes, because we're dealing with a wrapper around certificates and not certificates themselves. We're dealing with comparisons and lookup. So those objects can take that form.
        3. When dealing with the filesystem, we're dealing with a separate normalization, one for storage purposes. This is independent of any normalization that may have been done for comparison purposes. I'm going to make that more clear in the code.

        Given that, we don't want to normalize here. Normalization will happen when building FileStoredCertificateFingerprints, and storage normalization will happen in _build*.

        I'm also going to be improving storage normalization, so they'll be differing anyway.
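The three levels outlined in this reply can be illustrated with a small sketch. Everything here is hypothetical: `RawHostRecord` and `StoredHostRecord` stand in for the high-level and storage classes and do not reflect their real interfaces, and `build_storage_name` uses Python's built-in `idna` codec purely to keep the example dependency-free.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RawHostRecord:
    """Hypothetical stand-in for a high-level object such as Certificate."""

    hostname: str  # Kept exactly as supplied by the source.


@dataclass(frozen=True)
class StoredHostRecord:
    """Hypothetical stand-in for a storage wrapper used for lookups."""

    hostname: str

    def __post_init__(self) -> None:
        # Normalize once, at construction, for comparison purposes.
        object.__setattr__(self, 'hostname', self.hostname.lower())


def build_storage_name(hostname: str) -> str:
    """Hypothetical storage-only normalization, separate from the above."""
    # Strips trailing periods, lowercases, then encodes with the built-in
    # codec. This is an illustration, not the backend's actual encoding.
    return hostname.rstrip('.').lower().encode('idna').decode('ascii')


raw = RawHostRecord('Example.COM')
stored = StoredHostRecord(raw.hostname)

assert raw.hostname == 'Example.COM'     # Level 1: raw input, untouched.
assert stored.hostname == 'example.com'  # Level 2: normalized to compare.
assert build_storage_name(raw.hostname) == 'example.com'  # Level 3: storage.
```

Keeping the storage encoding in its own function means it can change (for example, to a different IDNA version) without affecting how objects compare.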

    9. reviewboard/certs/storage/file_storage.py (Diff revision 1)

      Same with the hostname here.

    10. reviewboard/certs/storage/file_storage.py (Diff revision 1)

      And here.

    chipx86
    david
    2. Show all issues

      Is there a reason to use lower() here instead of casefold()?

      1. So the following is based on using .encode('idna'), and I spent all this time writing this, but things change with the idna package. I'm looking into that now.

        We're encoding for IDNA, and it ultimately doesn't matter in this case which we use when using .encode('idna'). lower() is cheaper (casefold() is more aggressive in how it modifies things), but .encode('idna') will do the right thing with either. When Unicode characters are present, it'll ultimately lowercase it all correctly, but when not, it'll leave it alone. So here's how it ends up working:

        # Differences between lower() and casefold() for a string with certain mixed-case unicode characters:
        >>> 'Straße.de'.lower()
        'straße.de'
        
        >>> 'Straße.de'.casefold()
        'strasse.de'
        
        >>> 'Éxamplé.COM'.lower()
        'éxamplé.com'
        
        >>> 'Éxamplé.COM'.casefold()
        'éxamplé.com'
        
        
        # IDNA encoding with mixed-case Unicode and mixed-case ASCII:
        >>> 'Straße.de'.encode('idna')
        b'strasse.de'
        
        >>> 'FooBar'.encode('idna')
        b'FooBar'
        
        >>> 'Éxamplé.COM'.encode('idna')
        b'xn--xampl-9raf.COM'
        
        
        # Combining the two:
        >>> 'Straße.de'.lower().encode('idna')
        b'strasse.de'
        
        >>> 'Straße.de'.casefold().encode('idna')
        b'strasse.de'
        
        >>> 'Éxamplé.COM'.lower().encode('idna')
        b'xn--xampl-9raf.com'
        
        >>> 'Éxamplé.COM'.casefold().encode('idna')
        b'xn--xampl-9raf.com'
        

        Okay, that was fun. Now all that said, the idna package has a uts46=True mode, which is probably the right answer if we go with this package. It handles further normalization for user-provided domains.

    3. Show all issues

      Apparently python's idna codec is old. We already have the idna package available because it's a dependency of cryptography, so we should probably use idna.encode() here instead.

      1. I can't find anything indicating it's a dependency of cryptography. I only have it through requests. The cryptography source doesn't reference it except as a suggestion in an error message when failing to encode a provided hostname. If we want to use this, we'll need to add it explicitly.

      2. Worth pointing out, this module's a little bit heavyweight compared to the built-in stuff.

        I'm trying to decide if this is overkill or not. We ultimately just need something we can store in a predictable format. The IDNA version may not matter too much, but I need to learn the differences between them (and figure out what happens if the encoding changes).

    4. Show all issues

      Typo: an decoded -> a decoded

    chipx86
    Review request changed
    Change Summary:
    • Added a dependency on the idna library, which we now use for the IDNA encoding/decoding (including case normalization).
    • Added new error handling if encoding/decoding fails.
    • Added unit tests and data files that cover these new failure conditions.
    • Fixed a typo in a docstring.
    Description:

    When a certificate represents a hostname, or a client requests one, that
    hostname may be presented in any casing (uppercase, lowercase, mixed
    case), which can pose issues for comparison. Further, when dealing with
    filesystem storage, we may encounter hostnames with non-ASCII characters
    in them, which may pose challenges depending on the filesystem.

    This change introduces casing normalization of hostnames in the storage
    objects to ease comparisons, and normalization/encoding/decoding in the
    file storage backend to handle encoding and representation differences.

    The base storage objects that deal with hostnames now keep a version of
    the hostname normalized for comparison purposes. This is a Unicode
    string that can resolve to a hostname, but with casing converted to
    lowercase. This eases comparison and gives a consistent representation
    of these hostnames.

    The file storage backend handles its own normalization and translation
    behavior when computing filenames for a given hostname. Encoding
    ~ involves removing any trailing period on the hostname, converting to
    ~ lowercase, and then converting to an IDNA representation to handle
    ~ Unicode characters. The result is an ASCII filename safe for all
    ~ filesystems. Decoding does the inverse of this.

    ~ involves removing any trailing period on the hostname and then
    ~ converting to an IDNA 2008 representation to handle Unicode
    ~ characters. The result is an ASCII filename safe for all filesystems.
    ~ Decoding does the inverse of this.

    + IDNA handling depends on the idna library, which is a new dependency
    + added to Review Board 8. This supports IDNA 2008 standards with UTS46
    + normalization, which amongst other things handles casing differences.

    Note that the standard Certificate, CertificateFingerprints, etc.
    objects do not normalize hostnames. They are a representation of their
    source. Whether that source is caller-supplied input, an X.509
    certificate, or a storage object, it will reflect the version of the
    hostname on there. That allows for creating an object that can represent
    a piece of state that can then be introspected or validated, which we do
    today.


    Commits:
    Summary ID
    Normalize, encode, and decode cert hostnames for storage.
    When a certificate represents a hostname, or a client requests one, that hostname may be presented in any casing (uppercase, lowercase, mixed case), which can pose issues for comparison. Further, when dealing with filesystem storage, we may encounter hostnames with non-ASCII characters in them, which may pose challenges depending on the filesystem. This change introduces casing normalization of hostnames in the storage objects to ease comparisons, and normalization/encoding/decoding in the file storage backend to handle encoding and representation differences. The base storage objects that deal with hostnames now keep a version of the hostname normalized for comparison purposes. This is a Unicode string that can resolve to a hostname, but with casing converted to lowercase. This eases comparison and gives a consistent representation of these hostnames. The file storage backend handles its own normalization and translation behavior when computing filenames for a given hostname. Encoding involves removing any trailing period on the hostname, converting to lowercase, and then converting to an IDNA representation to handle Unicode characters. The result is an ASCII filename safe for all filesystems. Decoding does the inverse of this. Note that the standard `Certificate`, `CertificateFingerprints`, etc. objects do not normalize hostnames. They are a representation of their source. Whether that source is caller-supplied input, an X.509 certificate, or a storage object, it will reflect the version of the hostname on there. That allows for creating an object that can represent a piece of state that can then be introspected or validated, which we do today.
    8ef5b55bd70f83aa73a5c59d1bcaa217d5bc32c6
    Normalize, encode, and decode cert hostnames for storage.
    7d5a227b55167c5f78acd8999437877bc1539693

    Checks run (1 failed, 1 succeeded)

    flake8 failed.
    JSHint passed.

    flake8

    chipx86
    david
    2. reviewboard/certs/storage/file_storage.py (Diff revisions 2 - 4)

      We don't return here anymore, we raise.

    3. reviewboard/certs/storage/file_storage.py (Diff revisions 2 - 4)

      This is a little confusing. How about "The hostname {hostname} contains invalid characters and cannot be stored"?

    4. reviewboard/certs/storage/file_storage.py (Diff revisions 2 - 4)

      Given that this isn't really a fatal error, we should probably use warning instead of error

    chipx86
    Review request changed
    Change Summary:

    Improved logging, errors, and comments.
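The review feedback above (raise instead of return, a clearer message, and a warning-level log) could combine along these lines. This is only a sketch: `HostnameStorageError` and `encode_hostname_or_raise` are hypothetical names, the comments may have applied to different code paths in the actual diff, and the built-in `idna` codec is used here for brevity.

```python
import logging

logger = logging.getLogger(__name__)


class HostnameStorageError(ValueError):
    """Hypothetical error raised when a hostname cannot be stored."""


def encode_hostname_or_raise(hostname: str) -> str:
    """Encode a hostname for storage, raising on invalid input."""
    try:
        # Built-in codec used for brevity. The idna package's IDNAError is
        # also a UnicodeError subclass, so the same except clause applies.
        return hostname.lower().encode('idna').decode('ascii')
    except UnicodeError as e:
        # Logged as a warning, since a bad hostname is not a fatal error.
        logger.warning('Unable to encode hostname %r: %s', hostname, e)

        raise HostnameStorageError(
            f'The hostname {hostname} contains invalid characters and '
            f'cannot be stored'
        ) from e
```

Chaining with `from e` keeps the underlying codec error available to callers that want the specific cause.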

    Commits:
    Summary ID
    Normalize, encode, and decode cert hostnames for storage.
    a3938a0688f72a0326a9e502725faccbae72b81c
    Normalize, encode, and decode cert hostnames for storage.
    7787cc93803afad7474d670b33238f41ef1aa460

    Checks run (1 failed, 1 succeeded)

    flake8 failed.
    JSHint passed.

    flake8

    chipx86
    Review request changed
    Status:
    Completed
    Change Summary:
    Pushed to release-8.x (8983438)