• 
      

    Rework encoding of Revisions during diff processing to avoid regressions.

    Review Request #12825 — Created Feb. 6, 2023 and submitted

    Information

    Review Board
    release-4.0.x

    Reviewers

    Due to a bug during a refactor late in 4.0.x, we were attempting to
    store a raw Revision object in extra_data. The intent of the code
    was to convert a revision to something we could store by using
    convert_to_unicode() and factoring in the repository's configured
    encoding list. However, this didn't know how to convert Revision
    objects.

    We received a patch that addresses this by performing a conversion
    before passing it in, which guaranteed we'd get a string out of it. This
    used force_str()/force_text(). However, that performs its decoding
    from byte strings to Unicode strings, and makes its own assumptions
    about the encoding type that should be used. That could result in an
    incorrectly-decoded string.

    This isn't a major issue for revisions on most SCMs, which are generally
    using ASCII (though, there's no guarantee here). However, the filename
    was also force-decoded prematurely, which could impact some setups.

    This change removes the force-decoding for filenames entirely, which was
    probably just added as a "just in case" measure, and then does a very
    explicit Revision to bytes conversion before passing to
    convert_to_unicode(). This is the safest way to handle this, avoiding
    the heuristics in Django's string conversion machinery, and gets us the
    result we want.

    Unit tests were added to cover this case.

    Unit tests pass. Verified the new test case failed with the reported
    error without the fix.

    Summary ID
    Rework encoding of Revisions during diff processing to avoid regressions.
    Due to a bug during a refactor late in 4.0.x, we were attempting to store a raw `Revision` object in `extra_data`. The intent of the code was to convert a revision to something we could store by using `convert_to_unicode()` and factoring in the repository's configured encoding list. However, this didn't know how to convert `Revision` objects. We received a patch that addresses this by performing a conversion before passing it in, which guaranteed we'd get a string out of it. This used `force_str()`/`force_text()`. However, that performs its decoding from byte strings to Unicode strings, and makes its own assumptions about the encoding type that should be used. That could result in an incorrectly-decoded string. This isn't a major issue for revisions on most SCMs, which are generally using ASCII (though, there's no guarantee here). However, the filename was also force-decoded prematurely, which could impact some setups. This change removes the force-decoding for filenames entirely, which was probably just added as a "just in case" measure, and then does a very explicit `Revision` to `bytes` conversion before passing to `convert_to_unicode()`. This is the safest way to handle this, avoiding the heuristics in Django's string conversion machinery, and gets us the result we want. Unit tests were added to cover this case.
    7dc042e96131b0d8069da4a01801ba2b48d90ba4
    david
    1. Ship It!
    2. 
        
    chipx86
    Review request changed
    Status:
    Completed
    Change Summary:
    Pushed to release-4.0.x (c1e830a)