Fix output of binary diff data in Python 3

Review Request #11279 — Created Nov. 13, 2020 and submitted — Latest diff uploaded

Information

RBTools
master

Reviewers

Fix output of non-UTF-8 diff data in Python 3

In Python 3, "binary" (bytes) and "unicode" (i.e. string) data are
distinct types, and only string data can be printed directly to
standard out. Currently, diff.py attempts to decode diff data as
UTF-8, and the command fails if that decoding fails.
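The failure mode can be illustrated with a short sketch. This is not the code in diff.py; `diff_bytes` and `try_decode_utf8` are hypothetical names, and the sample content is an assumed Big5-encoded diff line.

```python
# Hypothetical diff content: a Big5-encoded added line.
# The escapes spell two Chinese characters; Big5 bytes are not valid UTF-8.
diff_bytes = '+\u4e2d\u6587\n'.encode('big5')


def try_decode_utf8(data):
    """Return data decoded as UTF-8, or None if it is not valid UTF-8."""
    try:
        return data.decode('utf-8')
    except UnicodeDecodeError:
        # This is the point where a decode-then-print approach breaks down.
        return None


if __name__ == '__main__':
    print(try_decode_utf8(b'+plain ascii\n') is not None)
    print(try_decode_utf8(diff_bytes) is None)
```

Any code path that insists on a successful UTF-8 decode before printing will hit the `except` branch for content like this.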

Diff data from many SCMs (Git, for example) is not guaranteed
to be UTF-8. Though UTF-8 is often a safe assumption, it
is not always the case, and Git will happily output binary
data to the terminal if it does not detect that data as
binary.

As an example, Chinese language data encoded as Big5 or GB2312
will not be detected as binary data, even though it is also not
UTF-8 data. This means that changing a file encoded in Big5
(in our case, Chinese localization data) makes it impossible
to run rbt diff to visually validate your changes.

In our use case, we also have a workflow that runs rbt diff to write
the diff to a file, processes and validates that diff, and then runs
rbt post to submit it, so this breaks our workflow in Python 3.

The documentation for Python 3 sys.stdin/out/err notes that
if writing binary data to these streams is required, the stream's
buffer object should be used, so we apply this change when printing
the diff object.
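A minimal sketch of that approach follows, assuming the diff is held as a bytes object. The helper name `write_diff` and its `stream` parameter are hypothetical and are not taken from the patch itself.

```python
import sys


def write_diff(diff_bytes, stream=None):
    """Write raw diff bytes to a stream without decoding them."""
    if stream is None:
        stream = sys.stdout

    # Python 3 text streams expose their underlying binary stream as
    # `.buffer`. If the attribute is missing (for example on Python 2,
    # where sys.stdout accepts bytes directly), write to the stream itself.
    out = getattr(stream, 'buffer', stream)
    out.write(diff_bytes)
    out.flush()


if __name__ == '__main__':
    # Big5 bytes pass through untouched; no UTF-8 decode is attempted.
    write_diff('+\u4e2d\u6587\n'.encode('big5'))
```

Because the bytes are never decoded, redirecting the output to a file preserves the diff exactly as the SCM produced it.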

This patch was made against master, but the same change applies
to both the 1.x and 2.x branches.

How SCMs detect binary data:

  • Git - Look for a nul byte in the first 8000 bytes of the buffer.
  • SVN - Ensure the first 1024 bytes are 15% printable ASCII with no nul bytes.
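The Git rule above can be sketched in a few lines to show why Big5 content slips through. This is an approximation of the rule as described here, not Git's own code; `git_looks_binary` is a hypothetical name, and the SVN rule is not sketched.

```python
def git_looks_binary(data, limit=8000):
    """Approximate the Git rule: a nul byte within the first `limit` bytes."""
    return b'\0' in data[:limit]


if __name__ == '__main__':
    big5_text = '\u4e2d\u6587\n'.encode('big5')

    # Big5 text contains no nul bytes, so it is treated as text...
    print(git_looks_binary(big5_text))

    # ...even though it cannot be decoded as UTF-8.
    try:
        big5_text.decode('utf-8')
    except UnicodeDecodeError:
        print('not valid UTF-8')
```

A nul byte that first appears after the 8000-byte window would also go unnoticed under this rule.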

Simple testing:

  1. Add a Big5-encoded file to a Git repository
  2. git --no-pager diff the repository; this should show that the diff contains binary data
  3. rbt diff the repository; this previously failed, but now succeeds
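For step 1, a Big5-encoded sample file can be generated with a short script. This is only a convenience sketch: the function name `write_big5_sample`, the file name, and the sample text are all assumptions. Adding the file to a repository and running the commands in steps 2 and 3 is left to the tester.

```python
def write_big5_sample(path, text='\u4e2d\u6587\n'):
    """Write `text` to `path` encoded as Big5 and return the bytes written."""
    data = text.encode('big5')
    # Open in binary mode so no newline translation or re-encoding occurs.
    with open(path, 'wb') as f:
        f.write(data)
    return data


if __name__ == '__main__':
    # Hypothetical file name; commit it, edit it, then compare
    # `git --no-pager diff` with `rbt diff`.
    write_big5_sample('big5_sample.txt')
```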

Diff Revision 4

This is not the most recent revision of the diff. The latest diff is revision 8.

Commits

  • Output binary diff content as binary, rather than trying to decode it
    ID: 8542ddcfe8e96bad4637baf685fac6238d6b80b0, Author: Daniel Fox
  • Fix patch when git output contains non-UTF8 characters
    ID: 6312d90b7a4ffd7edae1c8c760da0d2c538b5de4, Author: Daniel Fox
  • Fix regression in diff command for Python 3
    ID: 9cdec9006e6e1b5e54f7286b5aa0a5d0df93a836, Author: Daniel Fox
  • Update patch --print for non-UTF-8 diff content
    ID: 3a6bcf149cf1e44b1fe2b9b970744261290d8e58, Author: Daniel Fox

rbtools/clients/git.py
rbtools/commands/diff.py
rbtools/commands/patch.py