Fix output of binary diff data in python 3

Review Request #11279 — Created Nov. 13, 2020 and submitted

danudey
RBTools
master
rbtools

Fix output of non-UTF8 diff data in python 3

In Python 3, "binary" and "unicode" (i.e. string) data are
distinct types, and only UTF-8 unicode data can be printed
to standard out. In diff.py currently, we attempt to decode
diff data as UTF-8, and we fail if this fails.

Diff data from many SCMs (Git as an example) is not guaranteed
to be UTF-8 data; though this is often a safe assumption, it
is not always the case, and Git will happily output binary
data to the terminal if it does not detect that binary data
as being binary data.

As an example, Chinese language data encoded as Big5 or GB2312
will not be detected as binary data, even though it is also not
UTF-8 data. This means that changing a file encoded in Big5
(in our case, Chinese localization data) makes it impossible
to run rbt diff to visually validate your changes.

In our use case, we also have a workflow of rbt diff to output
to a file, process and validate that diff, and then run rbt post
to submit the diff, so this breaks our workflow in Python 3.

The documentation for Python 3 sys.stdin/out/err notes that
if writing binary data to these streams is required, the stream's
buffer object should be used, so we apply this change when printing
the diff object.

This patch was made against master, but the same change applies
for both the 1.x and 2.x branches.

How SCMs detect binary data:

  • Git - Look for nul byte in the first 8000 bytes of the buffer.
  • SVN - Ensure the first 1024 bytes is 15% printable ASCII with no nul bytes

SImple testing:

  1. Add a Bit5-encoded file to a Git repository
  2. git --no-pager diff the repository; this should show that the diff contains binary data
  3. rbt diff the repository; this previously failed, but now succeeds
Summary
Output binary diff content as binary, rather than trying to decode it
Fix patch when git output contains non-UTF8 characters
Fix regression in diff command for Python 3
Update patch --print for non-UTF-8 diff content
Add newlines to Python 3 'sys.stdout.buffer.write'
Fix output of svn diffs in Python 3
Description From Last Updated

Worth noting that this only allows to push binary data to ReviewBoard; there are significant bugs preventing rbtools from actually …

danudeydanudey

F821 undefined name 'sys'

reviewbotreviewbot

One difference being introduced here is that print(...) will always add a trailing newline (even if the content ends with …

chipx86chipx86

Can you add a blank line before the if and between else's block and the print(), like: print() if six.PY2: …

chipx86chipx86

E501 line too long (86 > 79 characters)

reviewbotreviewbot

E261 at least two spaces before inline comment

reviewbotreviewbot

E501 line too long (86 > 79 characters)

reviewbotreviewbot

E261 at least two spaces before inline comment

reviewbotreviewbot

E501 line too long (90 > 79 characters)

reviewbotreviewbot

E261 at least two spaces before inline comment

reviewbotreviewbot

E501 line too long (83 > 79 characters)

reviewbotreviewbot
danudey
danudey
  1. 
      
    1. This is only the first step towards binary diff support; other aspects of the rbtools workflow (such as patching) do not current work with non-UTF-8 diff files.

  2. Worth noting that this only allows to push binary data to ReviewBoard; there are significant bugs preventing rbtools from actually fetching and applying those patches at the moment.

  3. 
      
danudey
Review request changed

Change Summary:

Add binary output handling from git during rbt patch

Commits:

Summary
-
Fix output of binary diff data in python 3
+
Output binary diff content as binary, rather than trying to decode it
+
Fix patch when git output contains non-UTF8 characters

Diff:

Revision 2 (+16 -6)

Show changes

Checks run (1 failed, 1 succeeded)

flake8 failed.
JSHint passed.

flake8

danudey
danudey
chipx86
  1. This looks great! And good timing: We're releasing RBTools 2.0 basically this weekend. I have a couple changes that need to be made to ensure we don't regress on some output, but otherwise this looks very solid.

  2. rbtools/commands/diff.py (Diff revision 4)
     
     
     

    One difference being introduced here is that print(...) will always add a trailing newline (even if the content ends with one), but sys.stdout.buffer.write(...) will only add one if the content ends with one. We probably want to follow this statement up with a print().

    This applies to the other lines that move to sys.stdout.buffer as well.

  3. rbtools/commands/patch.py (Diff revision 4)
     
     
     
     
     
     
     

    Can you add a blank line before the if and between else's block and the print(), like:

    print()
    
    if six.PY2:
        ...
    else:
        ...
    
    print()
    

    We separate out statements from new blocks, to help visually distinguish them.

  4. 
      
danudey
Review request changed

Change Summary:

Re-add newlines to sys.stdout.buffer.write instances.

Commits:

Summary
-
Output binary diff content as binary, rather than trying to decode it
-
Fix patch when git output contains non-UTF8 characters
-
Fix regression in diff command for Python 3
-
Update patch --print for non-UTF-8 diff content
+
Output binary diff content as binary, rather than trying to decode it
+
Fix patch when git output contains non-UTF8 characters
+
Fix regression in diff command for Python 3
+
Update patch --print for non-UTF-8 diff content
+
Add newlines to Python 3 'sys.stdout.buffer.write'

Diff:

Revision 5 (+40 -10)

Show changes

Checks run (1 failed, 1 succeeded)

flake8 failed.
JSHint passed.

flake8

danudey
danudey
Review request changed

Change Summary:

Fix up svn diff parsing/output for Python 3

Commits:

Summary
-
Output binary diff content as binary, rather than trying to decode it
-
Fix patch when git output contains non-UTF8 characters
-
Fix regression in diff command for Python 3
-
Update patch --print for non-UTF-8 diff content
-
Add newlines to Python 3 'sys.stdout.buffer.write'
+
Output binary diff content as binary, rather than trying to decode it
+
Fix patch when git output contains non-UTF8 characters
+
Fix regression in diff command for Python 3
+
Update patch --print for non-UTF-8 diff content
+
Add newlines to Python 3 'sys.stdout.buffer.write'
+
Fix output of svn diffs in Python 3

Diff:

Revision 7 (+46 -16)

Show changes

Checks run (1 failed, 1 succeeded)

flake8 failed.
JSHint passed.

flake8

danudey
chipx86
  1. Ship It!
  2. 
      
danudey
Review request changed

Status: Closed (submitted)

Change Summary:

Pushed to release-2.0.x (0c1d630)
Loading...