Give FileDiffs control over handling of encodings for source files.

Review Request #10775 — Created Nov. 1, 2019 and submitted

Information

Review Board
release-4.0.x
78b4aa4...

Reviewers

Historically, Review Board has depended on source files to either us a
UTF-8 or ISO-8859-15 encoding, or for the repository to supply a list of
compatible encodings. If a special encoding were ever used for a file,
and the repository later had its encodings list changed to exclude it,
file would no longer be able to be viewed in the diff viewer.

This encoding list was computed up-front when generating a diff for
file. We'd fetch the repository encoding list, pass it to calls like
get_original_file(), which would then pass that on to
convert_to_unicode(), which would try the built-in encodings and the
provided list of encodings. The caller was always responsible for
fetching or computing that encoding list up-front before calling any of
these methods.

This change rewrites all of this to make the process more dependent on
the FileDiff we're working on, and to record a known working encoding
for future use.

FileDiff.extra_data can now contain an 'encoding' key (conveniently
exposed as FileDiff.encoding), which specifies an explicit encoding to
use. This could in theory be set during diff parsing (which requires
some additional work still), but it will be set after we've successfully
converted a source file to Unicode and back.

If not set, we still respect the provided encoding list (though this is
now deprecated in get_original_file()), and then look up from the
repository. This will allow us to later work with FileDiffs backed by
multiple repositories without placing unnecessary burdens on the caller.

The machinery for diff chunk generation has been updated as well. Now,
RawDiffChunkGenerator.generate_chunks() can be given an encoding list
for the old and new content. This defaults to the encoding list normally
provided to the chunk generator, but subclasses (DiffChunkGenerator)
can pass in separate encoding lists, allowing for two FileDiffs on an
interdiff to be in separate encodings without breaking chunk generation.

Note that it's still possible for a patch to contain content in another
encoding that doesn't match what's in FileDiff.encoding. The chunk
generator will still try to do the right thing in this case, falling
back to the repository encodings.

Unit tests passed.

Made sure my various test diffs and interdiffs were still working as
expected.

Description From Last Updated

F811 redefinition of unused 'test_with_filediff_encoding_set' from line 3548

reviewbotreviewbot

Can we call this FALLBACK_ENCODING? Our real default is utf-8

daviddavid
Checks run (1 failed, 1 succeeded)
flake8 failed.
JSHint passed.

flake8

chipx86
chipx86
david
  1. 
      
  2. reviewboard/scmtools/models.py (Diff revision 3)
     
     
    Show all issues

    Can we call this FALLBACK_ENCODING? Our real default is utf-8

  3. 
      
chipx86
david
  1. Ship It!
  2. 
      
chipx86
Review request changed
Status:
Completed
Change Summary:
Pushed to release-4.0.x (4f98605)