Fix some unicode-related errors in RBTools SVN parsing.
Review Request #7044 — Created March 11, 2015 and submitted
There are a few issues that have been cropping up with the diff parsing (and
other interfacing with output from 'svn' commands) since we moved over to
unicode_literals. This change fixes three of them.First of all, the results from 'svn info' can include non-UTF8 characters
depending on the locales that are available and the configuration of the
system. Luckily, all of the information we need from this can be parsed out in
binary mode.Second, when doing diffs of files that contain non-UTF8 characters, we were
attempting to process the diff as a bunch of unicode objects, rather than as
bytes. This was happening because we'd format some of the lines into unicode,
and then join everything together. I've fixed things so that all the processing
happens as bytes, so that we don't need to care about the encoding, and added a
new unit test for this behavior.Finally, Griffin Myers suggested a third test case, with a non-utf8 filename.
This has also been fixed, along with a new unit test for that case.Testing done:
Ran unit tests.
Ran unit tests.
-
I'm way out of my realm when it comes to unicode, locales, etc. -- I don't even have the basic vocabulary down right. But, your comment about 'svn info' returning non-UTF8 characters made me a little curious about why you didn't need to add results_unicode=False in the call to _run_svn() from svn_info(). I cooked up a test which currently fails, but I don't honestly know if it is a valid test. It attempts to copy an existing file (foo.txt) to a new filename which contains a non-ascii character: def test_diff_non_unicode_filename(self): """Testing SVNClient diff with a non-utf8 filename""" self.options.svn_show_copies_as_adds = 'y' filename = '\xe2' self._run_svn(['copy', 'foo.txt', filename]) self._run_svn(['propset', 'svn:mime-type', 'text/plain', filename]) revisions = self.client.parse_revision_spec() result = self.client.diff(revisions) self.assertTrue(isinstance(result, dict)) self.assertTrue('diff' in result) self.assertEqual(md5(result['diff']).hexdigest(), 'bfa99e54b8c23b97b1dee23d2763c4fd') I'll let you decide if there is any validity to this test. My, likely naive, fix was: diff --git a/rbtools/clients/svn.py b/rbtools/clients/svn.py index 55c74aa..23e7d3f 100644 --- a/rbtools/clients/svn.py +++ b/rbtools/clients/svn.py @@ -602,7 +602,7 @@ class SVNClient(SCMClient): root = info["Repository Root"] path = unquote(url[len(root):]) - line = b'%s %s%s' % (front, path.encode('utf-8'), rest) + line = b'%s %s%s' % (front, path, rest) result.append(line) @@ -622,7 +622,8 @@ class SVNClient(SCMClient): result = self._run_svn(["info", path], split_lines=True, ignore_errors=ignore_errors, - none_on_ignored_error=True) + none_on_ignored_error=True, + results_unicode=False) if result is None: return None diff --git a/rbtools/utils/process.py b/rbtools/utils/process.py index c6c79a0..a327d93 100644 --- a/rbtools/utils/process.py +++ b/rbtools/utils/process.py @@ -35,7 +35,7 @@ def execute(command, results_unicode=True): """Utility function to execute a command and return the output.""" if isinstance(command, list): - logging.debug('Running: ' + subprocess.list2cmdline(command)) + logging.debug(b'Running: ' + subprocess.list2cmdline(command)) else: logging.debug('Running: ' + command)
- Description:
-
There are a few issues that have been cropping up with the diff parsing (and
other interfacing with output from 'svn' commands) since we moved over to ~ unicode_literals. This change fixes two of them, one of which was posted to the ~ unicode_literals. This change fixes three of them. - mailing list, and the other to a bug. First of all, the results from 'svn info' can include non-UTF8 characters
depending on the locales that are available and the configuration of the system. Luckily, all of the information we need from this can be parsed out in binary mode. Second, when doing diffs of files that contain non-UTF8 characters, we were
attempting to process the diff as a bunch of unicode objects, rather than as bytes. This was happening because we'd format some of the lines into unicode, and then join everything together. I've fixed things so that all the processing happens as bytes, so that we don't need to care about the encoding, and added a new unit test for this behavior. + + Finally, Griffin Myers suggested a third test case, with a non-utf8 filename.
+ This has also been fixed, along with a new unit test for that case. + + Testing done:
+ Ran unit tests. - Commit:
-
2ee2646f0c7181d9c8f0c12a22e04a2db994a6da672ae8c39519efdf829cf3d792614f19b94b5d09