Fix unicode errors generting diffs with git-p4

Review Request #12255 — Created April 25, 2022 and submitted

Information

RBTools
master

Reviewers

When using git-p4 as the underlying client with python 3 runtime, a
UnicodeDecodeError is thrown if any of the files to be diffed contained
non-ASCII text.

This was because the `make_perforce_diff` function was using a mixture
of bytes and str objects. The solution is to keep everything in bytes
and never try to intepret the characters returned from the diff other
than in very specific ways it always has.

This was a frequent error for us at work. I posted the solution to the issue tracker a long time ago, but I've since been unable to find the specific issue that was raised. I have been using this patch for at least 3 years.


Add a file named test-Œ.txt with contents ŒŒŒŒ.
rbt post that.

You can see inthe --debug output that the commands were successful:

>>> Running: p4 files //depot/MMA/core/main/test/test-Œ.txt@2399980
>>> Running: p4 files //depot/MMA/core/main/test/test.txt@2399980

I added print() statement and inspected the generated diff too:

Unicode: --- //depot/MMA/core/main/test/test-Œ.txt      //depot/MMA/core/main/test/test-Œ.txt#1
+++ //depot/MMA/core/main/test/test-Œ.txt       TIMESTAMP
@@ -0,0 +1 @@
+ŒŒŒŒ
--- //depot/MMA/core/main/test/test.txt //depot/MMA/core/main/test/test.txt#1
+++ //depot/MMA/core/main/test/test.txt TIMESTAMP
@@ -0,0 +1 @@
+ŒŒŒŒŒŒŒ

b'Raw: --- //depot/MMA/core/main/test/test-\xc5\x92.txt\t//depot/MMA/core/main/test/test-\xc5\x92.txt#1\n+++ //depot/MMA/core/main/test/test-\xc5\x92.txt\tTIMESTAMP\n@@ -0,0 +1 @@\n+\xc5\x92\xc5\x92\xc5\x92\xc5\x92\n--- //depot/MMA/core/main/test/test.txt\t//depot/MMA/core/main/test/test.txt#1\n+++ //depot/MMA/core/main/test/test.txt\tTIMESTAMP\n@@ -0,0 +1 @@\n+\xc5\x92\xc5\x92\xc5\x92\xc5\x92\xc5\x92\xc5\x92\xc5\x92\n'

Funny enought he server version we're on just aborts the request, but i think the client behaved correctly:

>>> Command line: rbt post --debug -r 93808 HEAD
>>> Running: git rev-parse HEAD
>>> Running: git rev-parse 48ecac22ee23eb524754fd08b1c651584104786c^
>>> Running: git branch --remotes
>>> Running: git rev-list 6e25de453e378d6f3a5927a5ed1d17b15a7517b9 --not --remotes=p4
>>> Making HTTP GET request to http://reviews.eu.fidessa.priv/api/review-requests/93808/?only-fields=absolute_url%2Cbugs_closed%2Cid%2Cstatus%2Cpublic&only-links=diffs%2Cdraft
>>> Cached response for HTTP GET http://reviews.eu.fidessa.priv/api/review-requests/93808/?only-fields=absolute_url%2Cbugs_closed%2Cid%2Cstatus%2Cpublic&only-links=diffs%2Cdraft expired and was not modified
>>> Running: git version
>>> Running: git -c core.quotepath=false diff --no-color --no-prefix -r -u --no-ext-diff 6e25de453e378d6f3a5927a5ed1d17b15a7517b9..48ecac22ee23eb524754fd08b1c651584104786c
>>> Running: git log 6e25de453e378d6f3a5927a5ed1d17b15a7517b9
>>> Running: p4 files //depot/MMA/core/main/test/test-Œ.txt@2399980
>>> Running: p4 files //depot/MMA/core/main/test/test.txt@2399980
>>> Making HTTP GET request to http://reviews.eu.fidessa.priv/api/validation/diffs/
>>> Cached response for HTTP GET http://reviews.eu.fidessa.priv/api/validation/diffs/ expired and was modified
>>> Making HTTP POST request to http://reviews.eu.fidessa.priv/api/validation/diffs/
Traceback (most recent call last):
  File "/mma/users/benj/.local/bin/rbt", line 33, in <module>
    sys.exit(load_entry_point('RBTools', 'console_scripts', 'rbt')())
  File "/mma/users/benj/proj/rbtools/rbtools/commands/main.py", line 207, in main
    command.run_from_argv([RB_MAIN, command_name] + args)
  File "/mma/users/benj/proj/rbtools/rbtools/commands/__init__.py", line 1043, in run_from_argv
    exit_code = self.main(*args) or 0
  File "/mma/users/benj/proj/rbtools/rbtools/commands/post.py", line 1022, in main
    self._validate_squashed_diff(squashed_diff)
  File "/mma/users/benj/proj/rbtools/rbtools/commands/post.py", line 1383, in _validate_squashed_diff
    **validate_kwargs)
  File "/mma/users/benj/proj/rbtools/rbtools/api/decorators.py", line 27, in request_method
    *args, **kwargs)
  File "/mma/users/benj/proj/rbtools/rbtools/api/transport/sync.py", line 83, in execute_request_method
    return self._execute_request(request)
  File "/mma/users/benj/proj/rbtools/rbtools/api/transport/sync.py", line 92, in _execute_request
    rsp = self.server.make_request(request)
  File "/mma/users/benj/proj/rbtools/rbtools/api/request.py", line 827, in make_request
    request.url, body, headers, request.method))
  File "/mma/users/benj/proj/rbtools/rbtools/api/cache.py", line 209, in make_request
    return self.urlopen(request)
  File "/opt/rh/rh-python36/root/usr/lib64/python3.6/urllib/request.py", line 223, in urlopen
    return opener.open(url, data, timeout)
  File "/opt/rh/rh-python36/root/usr/lib64/python3.6/urllib/request.py", line 526, in open
    response = self._open(req, data)
  File "/opt/rh/rh-python36/root/usr/lib64/python3.6/urllib/request.py", line 544, in _open
    '_open', req)
  File "/opt/rh/rh-python36/root/usr/lib64/python3.6/urllib/request.py", line 504, in _call_chain
    result = func(*args)
  File "/opt/rh/rh-python36/root/usr/lib64/python3.6/urllib/request.py", line 1346, in http_open
    return self.do_open(http.client.HTTPConnection, req)
  File "/opt/rh/rh-python36/root/usr/lib64/python3.6/urllib/request.py", line 1321, in do_open
    r = h.getresponse()
  File "/opt/rh/rh-python36/root/usr/lib64/python3.6/http/client.py", line 1331, in getresponse
    response.begin()
  File "/opt/rh/rh-python36/root/usr/lib64/python3.6/http/client.py", line 297, in begin
    version, status, reason = self._read_status()
  File "/opt/rh/rh-python36/root/usr/lib64/python3.6/http/client.py", line 266, in _read_status
    raise RemoteDisconnected("Remote end closed connection without"
http.client.RemoteDisconnected: Remote end closed connection without response
Summary ID
Fix unicode errors generting diffs with git-p4
When using git-p4 as the underlying client with python 3 runtime, a UnicodeDecodeError is thrown if any of the files to be diffed contained non-ASCII text. This was because the `make_perforce_diff` function was using a mixture of bytes and str objects. The solution is to keep everything in bytes and never try to intepret the characters returned from the diff other than in very specific ways it always has.
a95e1f9b2b80ec8610faa7fc6c13c62c9fcd9fcb
Description From Last Updated

I'm a little concerned about the server response. That just shouldn't happen. Was that repeatable? I'd love to see the …

chipx86chipx86

execute() expects commands to be built as Unicode strings, for greatest compatibility across Python versions. I wanted to determine whether …

chipx86chipx86

The outer parens shouldn't be needed here, as % will take precedence and the parens of args will handle line …

chipx86chipx86
chipx86
  1. Thanks, this is a good fix! A couple notes and then we can get this in for the next release.

  2. rbtools/clients/git.py (Diff revision 1)
     
     
     
     
     
    Show all issues

    execute() expects commands to be built as Unicode strings, for greatest compatibility across Python versions. I wanted to determine whether it was safe to maybe accept bytes, so I dug into the calls being made. There are three cases where the command is used:

    1. subprocess.Popen(), which does accept bytes on Python 2.7 and 3.6-3.10, but is documented to only accept strings. Unsure if this behavior may change in the future.

    2. subprocess.list2cmdline, which does not accept bytes on Python 3.6 or 3.7. We call this, but we do force strings to be Unicode before passing them in.

    3. Logging and error handling, which returns a string representation (regardless of type).

    I think ideally we'd accept bytes where possible, but we already do force to Unicode for list2cmdline. Given that, and the documentation around Popen()'s string types, my feeling is that we should build this up as Unicode strings instead of bytes.

    You can pass the base_path + ... to force_unicode() to make a best-attempt at that.

    1. I'll double check this, but my recollection was that the issue in question here was that one of base_path or filename is bytes due to having come back from some other subprocess call. My thought was that either have to stick with bytes or assume some encoding to .decode() one or both of them.

      But if that's what force_unicode() does, then i'll use that. Thanks.

    2. Done, using force_unicode()

  3. rbtools/clients/git.py (Diff revision 1)
     
     
     
     
    Show all issues

    The outer parens shouldn't be needed here, as % will take precedence and the parens of args will handle line continuation.

    Same below.

    1. Yeaah, sorry, i dont' remember why i added them :/ (probably refactoring then un-factoring!)

    2. fixed

  4. 
      
puremourning
chipx86
  1. 
      
  2. Show all issues

    I'm a little concerned about the server response. That just shouldn't happen. Was that repeatable?

    I'd love to see the logs from the server on that one.

    1. I was sure that i'd tested this and it worked with later reviewboard server, but it turns out i hadn't. Here's the log message from /var/log/reviewboard/reviewboard.log:

      2022-06-08 12:18:49,435 - DEBUG -  - reviewboard.diffviewer.parser - PerforceDiffParser.parse_diff: Beginning parse of diff, size = 171
      2022-06-08 12:18:49,435 - DEBUG -  - reviewboard.diffviewer.parser - PerforceDiffParser.parse_diff: Finished parsing diff.
      2022-06-08 12:18:49,442 - DEBUG -  - reviewboard.diffviewer.parser - PerforceDiffParser.parse_diff: Beginning parse of diff, size = 194
      2022-06-08 12:18:49,443 - DEBUG -  - reviewboard.diffviewer.parser - PerforceDiffParser.parse_diff: Finished parsing diff.
      2022-06-08 12:18:49,443 - ERROR -  - reviewboard.webapi.resources.validate_diff - Unexpected error when validating diff.
      Traceback (most recent call last):
        File "/usr/local/lib/python3.6/site-packages/reviewboard/webapi/resources/validate_diff.py", line 168, in create
          validate_only=True)
        File "/usr/local/lib/python3.6/site-packages/reviewboard/diffviewer/managers.py", line 775, in create_from_upload
          **kwargs)
        File "/usr/local/lib/python3.6/site-packages/reviewboard/diffviewer/managers.py", line 1070, in create_from_data
          validate_only=validate_only
        File "/usr/local/lib/python3.6/site-packages/reviewboard/diffviewer/filediff_creator.py", line 211, in create_filediffs
          encoding_list)[1],
        File "/usr/local/lib/python3.6/site-packages/reviewboard/diffviewer/diffutils.py", line 112, in convert_to_unicode
          raise TypeError('Value to convert is unexpected type %s', type(s))
      TypeError: ('Value to convert is unexpected type %s', <class 'reviewboard.scmtools.core.Revision'>)
      

      Corresponding client log (with additional logging of the generated diff):

      >>> Command line: rbt post -S --server ukfd-reviews2 --debug HEAD
      >>> Running: git rev-parse HEAD
      >>> Running: git rev-parse dc94be58c0161c5f653e235c2b01b433fb2c4a66^
      >>> Running: git branch --remotes
      >>> Running: git rev-list a8d4940a3cd86ec4c6835640e3657f063876ab52 --not --remotes=p4
      >>> Running: git rev-parse a8d4940a3cd86ec4c6835640e3657f063876ab52^
      >>> Found youngest remote git commit fd8a5f35639624f329125e4f6e11399a7a417237
      >>> Running: git version
      >>> Running: git -c core.quotepath=false diff --no-color --no-prefix -r -u --no-ext-diff a8d4940a3cd86ec4c6835640e3657f063876ab52..dc94be58c0161c5f653e235c2b01b433fb2c4a66
      >>> Running: git log fd8a5f35639624f329125e4f6e11399a7a417237
      >>> Running: p4 files //depot/MMA/core/main/test/test-Œ.txt@2111515
      >>> Generated perforce diff: --- //depot/MMA/core/main/test/test-Œ.txt  //depot/MMA/core/main/test/test-Œ.txt#1
      +++ //depot/MMA/core/main/test/test-Œ.txt       2022-06-08 13:34:29
      @@ -0,0 +1 @@
      +ŒŒŒŒ
      
      >>> Generated raw perforce diff: b'--- //depot/MMA/core/main/test/test-\xc5\x92.txt\t//depot/MMA/core/main/test/test-\xc5\x92.txt#1\n+++ //depot/MMA/core/main/test/test-\xc5\x92.txt\t2022-06-08 13:34:29\n@@ -0,0 +1 @@\n+\xc5\x92\xc5\x92\xc5\x92\xc5\x92\n'
      >>> Running: git -c core.quotepath=false diff --no-color --no-prefix -r -u --no-ext-diff fd8a5f35639624f329125e4f6e11399a7a417237..a8d4940a3cd86ec4c6835640e3657f063876ab52
      >>> Running: git log fd8a5f35639624f329125e4f6e11399a7a417237
      >>> Running: p4 files //depot/MMA/core/main/readme.txt@2111515
      >>> Generated perforce diff: --- //depot/MMA/core/main/readme.txt       //depot/MMA/core/main/readme.txt#8
      +++ //depot/MMA/core/main/readme.txt    2022-06-08 13:34:29
      @@ -1,3 +1,5 @@
      +READ ME!
      +
       INTRODUCTION - $Id$
       ============
      
      
      >>> Generated raw perforce diff: b'--- //depot/MMA/core/main/readme.txt\t//depot/MMA/core/main/readme.txt#8\n+++ //depot/MMA/core/main/readme.txt\t2022-06-08 13:34:29\n@@ -1,3 +1,5 @@\n+READ ME!\n+\n INTRODUCTION - $Id$\n ============\n \n'
      >>> Generated parent diff size: 194 bytes
      >>> Making HTTP GET request to http://ukfd-reviews2/api/validation/diffs/
      >>> HTTP GET request to http://ukfd-reviews2/api/validation/diffs/ cannot be cached
      >>> Making HTTP POST request to http://ukfd-reviews2/api/validation/diffs/
      >>> Got API Error 224 (HTTP code 400): Unexpected error when validating the diff: ('Value to convert is unexpected type %s', <class 'reviewboard.scmtools.core.Revision'>)
      >>> Error data: {'err': {'code': 224, 'msg': "Unexpected error when validating the diff: ('Value to convert is unexpected type %s', <class 'reviewboard.scmtools.core.Revision'>)"}, 'stat': 'fail'}
      Traceback (most recent call last):
        File "/mma/users/benj/proj/rbtools/rbtools/api/request.py", line 827, in make_request
          request.url, body, headers, request.method))
        File "/mma/users/benj/proj/rbtools/rbtools/api/cache.py", line 209, in make_request
          return self.urlopen(request)
        File "/opt/rh/rh-python36/root/usr/lib64/python3.6/urllib/request.py", line 223, in urlopen
          return opener.open(url, data, timeout)
        File "/opt/rh/rh-python36/root/usr/lib64/python3.6/urllib/request.py", line 532, in open
          response = meth(req, response)
        File "/mma/users/benj/proj/rbtools/rbtools/api/request.py", line 370, in http_response
          response.info())
        File "/opt/rh/rh-python36/root/usr/lib64/python3.6/urllib/request.py", line 570, in error
          return self._call_chain(*args)
        File "/opt/rh/rh-python36/root/usr/lib64/python3.6/urllib/request.py", line 504, in _call_chain
          result = func(*args)
        File "/opt/rh/rh-python36/root/usr/lib64/python3.6/urllib/request.py", line 650, in http_error_default
          raise HTTPError(req.full_url, code, msg, hdrs, fp)
      urllib.error.HTTPError: HTTP Error 400: Bad Request
      
      During handling of the above exception, another exception occurred:
      
      Traceback (most recent call last):
        File "/mma/users/benj/proj/rbtools/rbtools/commands/post.py", line 958, in main
          self._validate_squashed_diff(squashed_diff)
        File "/mma/users/benj/proj/rbtools/rbtools/commands/post.py", line 1567, in _validate_squashed_diff
          **validate_kwargs)
        File "/mma/users/benj/proj/rbtools/rbtools/api/decorators.py", line 27, in request_method
          *args, **kwargs)
        File "/mma/users/benj/proj/rbtools/rbtools/api/transport/sync.py", line 83, in execute_request_method
          return self._execute_request(request)
        File "/mma/users/benj/proj/rbtools/rbtools/api/transport/sync.py", line 92, in _execute_request
          rsp = self.server.make_request(request)
        File "/mma/users/benj/proj/rbtools/rbtools/api/request.py", line 829, in make_request
          self.process_error(e.code, e.read())
        File "/mma/users/benj/proj/rbtools/rbtools/api/request.py", line 803, in process_error
          rsp['err']['msg'])
      rbtools.api.errors.BadRequestError: Unexpected error when validating the diff: ('Value to convert is unexpected type %s', <class 'reviewboard.scmtools.core.Revision'>) (API Error 224: Diff Parsing Failed)
      
      During handling of the above exception, another exception occurred:
      
      Traceback (most recent call last):
        File "/mma/users/benj/.local/bin/rbt", line 33, in <module>
          sys.exit(load_entry_point('RBTools', 'console_scripts', 'rbt')())
        File "/mma/users/benj/proj/rbtools/rbtools/commands/main.py", line 207, in main
          command.run_from_argv([RB_MAIN, command_name] + args)
        File "/mma/users/benj/proj/rbtools/rbtools/commands/__init__.py", line 1096, in run_from_argv
          exit_code = self.main(*args) or 0
        File "/mma/users/benj/proj/rbtools/rbtools/commands/post.py", line 968, in main
          % (msg_prefix, e))
      rbtools.commands.CommandError: Error validating diff
      
      Unexpected error when validating the diff: ('Value to convert is unexpected type %s', <class 'reviewboard.scmtools.core.Revision'>) (API Error 224: Diff Parsing Failed)
      

      Using the pure p4 client works, so it must actually be a bug in this change. I'll take a look.

    2. Actually this appears to be unrelated, and i can reproduce wwithout any of my changes. Steps to repro:

      • git p4 clone //some/path
      • edit a file that already exists, git commit
      • create a new file, git commit
      • rbt post HEAD

      I tested this with rbtools master, using python 2.7 (rbtools master doesn't work with python3).

      ukfd-ltp4l(benj) main.git/test>python2 ~/proj/rbtools/runtime2/bin/rbt post --server ukfd-reviews2 HEAD
      Review request #90602 posted.
      
      http://ukfd-reviews2/r/90602/
      http://ukfd-reviews2/r/90602/diff/
      ukfd-ltp4l(benj) main.git/test>git log
      ukfd-ltp4l(benj) main.git/test>ls
      log
      ukfd-ltp4l(benj) main.git/test>echo test > test
      ukfd-ltp4l(benj) main.git/test>git add test
      ukfd-ltp4l(benj) main.git/test>git commit -m 'test'
      [master f4c87d4] test
       1 file changed, 1 insertion(+)
       create mode 100644 test/test
      ukfd-ltp4l(benj) main.git/test>python2 ~/proj/rbtools/runtime2/bin/rbt post --server ukfd-reviews2 HEAD
      ERROR: Error validating diff
      
      Unexpected error when validating the diff: ('Value to convert is unexpected type %s', <class 'reviewboard.scmtools.core.Revision'>) (HTTP 400, API Error 224)
      

      server log:

      2022-06-08 13:13:54,353 - DEBUG -  - root - Reloading logging settings
      2022-06-08 13:13:54,412 - DEBUG -  - reviewboard.diffviewer.parser - PerforceDiffParser.parse_diff: Beginning parse of diff, size = 184
      2022-06-08 13:13:54,412 - DEBUG -  - reviewboard.diffviewer.parser - PerforceDiffParser.parse_diff: Finished parsing diff.
      2022-06-08 13:14:14,521 - DEBUG -  - reviewboard.diffviewer.parser - PerforceDiffParser.parse_diff: Beginning parse of diff, size = 136
      2022-06-08 13:14:14,522 - DEBUG -  - reviewboard.diffviewer.parser - PerforceDiffParser.parse_diff: Finished parsing diff.
      2022-06-08 13:14:14,527 - DEBUG -  - reviewboard.diffviewer.parser - PerforceDiffParser.parse_diff: Beginning parse of diff, size = 184
      2022-06-08 13:14:14,527 - DEBUG -  - reviewboard.diffviewer.parser - PerforceDiffParser.parse_diff: Finished parsing diff.
      2022-06-08 13:14:14,528 - ERROR -  - reviewboard.webapi.resources.validate_diff - Unexpected error when validating diff.
      Traceback (most recent call last):
        File "/usr/local/lib/python3.6/site-packages/reviewboard/webapi/resources/validate_diff.py", line 168, in create
          validate_only=True)
        File "/usr/local/lib/python3.6/site-packages/reviewboard/diffviewer/managers.py", line 775, in create_from_upload
          **kwargs)
        File "/usr/local/lib/python3.6/site-packages/reviewboard/diffviewer/managers.py", line 1070, in create_from_data
          validate_only=validate_only
        File "/usr/local/lib/python3.6/site-packages/reviewboard/diffviewer/filediff_creator.py", line 211, in create_filediffs
          encoding_list)[1],
        File "/usr/local/lib/python3.6/site-packages/reviewboard/diffviewer/diffutils.py", line 112, in convert_to_unicode
          raise TypeError('Value to convert is unexpected type %s', type(s))
      TypeError: ('Value to convert is unexpected type %s', <class 'reviewboard.scmtools.core.Revision'>)
      

      Confirmed that without the '2' new commits in local repo, the changes in this branch work:

      ukfd-ltp4l(benj) main.git/test>git reset --hard remotes/p4/master
      HEAD is now at fd8a5f3 Initial import of //depot/MMA/core/main/ from the state at revision #head
      ukfd-ltp4l(benj) main.git/test>echo 'ŒŒŒŒ' > test-Œ.txt
      ukfd-ltp4l(benj) main.git/test>git add test-Œ.txt
      ukfd-ltp4l(benj) main.git/test>git commit -m 'add test-ŒŒŒŒ'
      [master 28a865f] add test-ŒŒŒŒ
       1 file changed, 1 insertion(+)
       create mode 100644 "test/test-\305\222.txt"
      ukfd-ltp4l(benj) main.git/test>~/proj/rbtools/runtime3/bin/rbt post -S --server ukfd-reviews2 HEAD
      Review request #90603 posted.
      
      http://ukfd-reviews2/r/90603/
      http://ukfd-reviews2/r/90603/diff/
      

      so in summary, this is not an issue introduced in this patch.

  3. 
      
puremourning
  1. Is anything further required from me on this?

    1. Hey Ben. Sorry, we've just been heads-down with a few concurrent deadlines (Review Board betas, RBTools work that has to be done before the next macOS comes out due to some changes there, and some obligations with partners). As soon as I come up for air, I'll be working with you on whatever may still be needed to get this landed.

    2. Filling you in on the latest here.

      This is still going in. A lot of the diff building is being rewritten right now for type safety and for multiple diff implementations. The type safety will avoid these sorts of issues in the future. Your work will be rolled into this for RBTools 4, and we'll probably backport this part of it for RBTools 3 after testing is complete.

  2. 
      
chipx86
  1. Ship It!
  2. 
      
puremourning
Review request changed

Status: Closed (submitted)

Change Summary:

Pushed to release-3.x (44772b7)
Loading...