Fix URL encoding issues in pysvn when viewing diffs in the UI
Review Request #10671 — Created Aug. 30, 2019 and discarded
Resolve URL encoding issue for pysvn that results in diffs not being displayable in the UI if the file's path contains a '#' (and possibly other special URL characters, such as '?')
The original code appeared to append the file path to the repo URL, then parse the result as a URL in order to URL-encode the file path. This is nonsensical, so most of the code has been removed and replaced with simply URL-encoding the path before appending it to the SVN repository URL.
Please see [here|https://groups.google.com/d/msg/reviewboard/GufvNZe9MUE/A4tAwtPMAwAJ] for more context on the issue.
Currently running in our production server. No known side-effects so far.
Description | From | Last Updated |
---|---|---|
Looks like this doesn't pass our unit tests: ERROR: Testing SVN (pysvn) get_file ---------------------------------------------------------------------- Traceback (most recent call last): File … |
chipx86 | |
E501 line too long (81 > 79 characters) |
reviewbot | |
E501 line too long (82 > 79 characters) |
reviewbot | |
E501 line too long (83 > 79 characters) |
reviewbot | |
W291 trailing whitespace |
reviewbot | |
W291 trailing whitespace |
reviewbot | |
W291 trailing whitespace |
reviewbot | |
I think rewriting URL parsing and rebuilding logic isn't the way to go. We're now essentially doing the work that … |
chipx86 |
-
I'm sorry we never got to this, Rob. It definitely got lost in the shuffle of this past few months, especially with all the student project and 4.0 beta 1 work. I really do appreciate the change, and understand your concerns. I promise that despite the lack of visibility lately, behind-the-scenes the project and our company backing it are in great shape.
I dug into the reasoning for the old code, and I agree, it's not the right approach. I see why it was implemented that way -- they were trying to avoid quoting the
file://
andhttp://
bit -- but with years of small code changes it became an increasingly bad and broken approach. There's some subtle things it's doing that the new code is not doing, and I'll need to do some testing around this to make sure nothing appears to break, but we'll this ready to land in 3.0.16.
-
-
Looks like this doesn't pass our unit tests:
ERROR: Testing SVN (pysvn) get_file ---------------------------------------------------------------------- Traceback (most recent call last): File "/Users/chipx86/src/rb/trees/reviewboard/3.0/reviewboard/scmtools/tests/test_svn.py", line 101, in test_get_file self.tool.get_file(self.repository.path + '/' + file, rev), File "/Users/chipx86/src/rb/trees/reviewboard/3.0/reviewboard/scmtools/svn/__init__.py", line 160, in get_file return self.client.get_file(path, revision) File "/Users/chipx86/src/rb/trees/reviewboard/3.0/reviewboard/scmtools/svn/pysvn.py", line 93, in get_file return self._do_on_path(self._get_file_data, path, revision) File "/Users/chipx86/src/rb/trees/reviewboard/3.0/reviewboard/scmtools/svn/pysvn.py", line 66, in _do_on_path raise FileNotFoundError(path, revision, detail=exc) FileNotFoundError: The file "file:///Users/chipx86/src/rb/trees/reviewboard/3.0/reviewboard/scmtools/testdata/svn_repo/trunk/doc/misc-docs/Makefile" (revision 2) could not be found in the repository: '/file:/Users/chipx86/src/rb/trees/reviewboard/3.0/reviewboard/scmtools/testdata/svn_repo/trunk/doc/misc-docs/Makefile' path not found
- Change Summary:
-
Since the root cause of the issue was peforming a urlsplit on an unescaped path, we can instead use a different method for splitting out the path portion for escaping. This follows the original intent of the code more closely, while still avoiding the issues with special url characters in the path.
- Diff:
-
Revision 3 (+3 -5)
-
-
I think rewriting URL parsing and rebuilding logic isn't the way to go. We're now essentially doing the work that
urlsplit
andurlunsplit
were meant to do, except it won't handle a variety of cases (Windows file paths or network locations likefile://\\Host\Path
paths, URLs with ports, or URLs with IPv6 addresses). Allowed URL formats are just too complex for anything but a really complex regex (which I don't want us to have to maintain).So keeping what we had was probably the closest to "correct" at this stage, given requirements of this function. But it's not ideal.
So let's re-examine what we're actually wanting to do and what we are currently doing here.
- We're taking in a path that comes from a diff, which may be a path relative to a repository, or may be an absolute path (file-based or URL-based), depending on how the diff is constructed.
- We're then currently normalizing that to ensure we have an absolute, URL-based path that we can ultimately feed to libsvn.
- After we've normalized it, we're ensuring the path is quoted. Here's where things go wrong: we're losing anything after a
?
or#
, becauseurlsplit
attempts to parse it out. - We rebuild the absolute URL and begin work with it.
We're doing too much.
At stage 2, we are normalizing a URL. We're making a relative URL absolute. That means we already have the file path. This is where we should be handling any quoting.
Your original approach tried to do that, but it did it by quoting the path going into
normalize_path()
. That prevented some code paths from matching correctly. The first check is whether the file path starts with the repo path, and since we were passing in a quoted scheme (file%3A//
), we weren't going to get the result we wanted.The proper approach is to ditch all this URL building code here and instead do the right thing at the right time in
normalize_path()
.That method looks like this:
if path.startswith(self.repopath): norm_path = path elif path.startswith('//'): norm_path = self.repopath + path[1:] elif path[0] == '/': norm_path = self.repopath + path else: norm_path = '%s/%s' % (self.repopath, path) return norm_path.rstrip('/')
In the first case, we can assume we have a fully normalized path, so we're good to go. We don't need to do anything here.
The remaining cases are building a path. Now, a root SVN URL may, like any other file, containg special characters, but we assume that a URL coming in is going to be a proper URL, and that means characters like
#
and?
are going to be escaped. It's also unlikely that we'll find them here, but still. I think we can focus onpath
, since that's the bit that's almost guaranteed not to be escaped.So this would be the proper time to do a
quote()
operation, when we're addingpath
ontorepopath
.I'd suggest doing that here. We won't have to muck around with URL parsing or rebuilding, we're just taking what's from the diff file and making sure we can build a suitable URL out of it.
The benefit to this is that we can share this fix with other functions that need to normalize paths but currently aren't doing so. They're also very likely to be broken as well.
It'll be important to have unit tests for
normalize_path
for the various inputs. Our process is to introduce unit tests when we find a regression or bug and have new code fixing it, so we can be sure it works (now and in the future, and across Python versions). One unit test per condition, demonstrating paths containing normal characters,?
, and#
.I understand if this is getting to the point where it's more of a time investment than you expected to put into it and you'd rather we take care of the rest.