Fix URL encoding issues in pysvn when viewing diffs in the UI

Review Request #10671 — Created Aug. 30, 2019 and updated

rpetti
Review Board
reviewboard

Resolve URL encoding issue for pysvn that results in diffs not being displayable in the UI if the file's path contains a '#' (and possibly other special URL characters, such as '?')

The original code appeared to append the file path to the repo URL, then parse the result as a URL in order to URL-encode the file path. This is nonsensical, so most of the code has been removed and replaced with simply URL-encoding the path before appending it to the SVN repository URL.

Please see [here|https://groups.google.com/d/msg/reviewboard/GufvNZe9MUE/A4tAwtPMAwAJ] for more context on the issue.

Currently running in our production server. No known side-effects so far.

Description From Last Updated

Looks like this doesn't pass our unit tests: ERROR: Testing SVN (pysvn) get_file ---------------------------------------------------------------------- Traceback (most recent call last): File ...

chipx86chipx86

E501 line too long (81 > 79 characters)

reviewbotreviewbot

E501 line too long (82 > 79 characters)

reviewbotreviewbot

E501 line too long (83 > 79 characters)

reviewbotreviewbot

W291 trailing whitespace

reviewbotreviewbot

W291 trailing whitespace

reviewbotreviewbot

W291 trailing whitespace

reviewbotreviewbot

I think rewriting URL parsing and rebuilding logic isn't the way to go. We're now essentially doing the work that ...

chipx86chipx86
Checks run (1 failed, 1 succeeded)
flake8 failed.
JSHint passed.

flake8

rpetti
chipx86
  1. I'm sorry we never got to this, Rob. It definitely got lost in the shuffle of this past few months, especially with all the student project and 4.0 beta 1 work. I really do appreciate the change, and understand your concerns. I promise that despite the lack of visibility lately, behind-the-scenes the project and our company backing it are in great shape.

    I dug into the reasoning for the old code, and I agree, it's not the right approach. I see why it was implemented that way -- they were trying to avoid quoting the file:// and http:// bit -- but with years of small code changes it became an increasingly bad and broken approach. There's some subtle things it's doing that the new code is not doing, and I'll need to do some testing around this to make sure nothing appears to break, but we'll this ready to land in 3.0.16.

    1. Sounds good, thank you!

  2. 
      
chipx86
  1. Ship It!

  2. 
      
chipx86
  1. 
      
  2. Looks like this doesn't pass our unit tests:

    ERROR: Testing SVN (pysvn) get_file
    ----------------------------------------------------------------------
    Traceback (most recent call last):
      File "/Users/chipx86/src/rb/trees/reviewboard/3.0/reviewboard/scmtools/tests/test_svn.py", line 101, in test_get_file
        self.tool.get_file(self.repository.path + '/' + file, rev),
      File "/Users/chipx86/src/rb/trees/reviewboard/3.0/reviewboard/scmtools/svn/__init__.py", line 160, in get_file
        return self.client.get_file(path, revision)
      File "/Users/chipx86/src/rb/trees/reviewboard/3.0/reviewboard/scmtools/svn/pysvn.py", line 93, in get_file
        return self._do_on_path(self._get_file_data, path, revision)
      File "/Users/chipx86/src/rb/trees/reviewboard/3.0/reviewboard/scmtools/svn/pysvn.py", line 66, in _do_on_path
        raise FileNotFoundError(path, revision, detail=exc)
    FileNotFoundError: The file "file:///Users/chipx86/src/rb/trees/reviewboard/3.0/reviewboard/scmtools/testdata/svn_repo/trunk/doc/misc-docs/Makefile" (revision 2) could not be found in the repository: '/file:/Users/chipx86/src/rb/trees/reviewboard/3.0/reviewboard/scmtools/testdata/svn_repo/trunk/doc/misc-docs/Makefile' path not found
    
    1. The file:// url being passed into pysvn looks correct to me, so why it pysvn not interpreting it correctly?

  3. 
      
rpetti
Review request changed

Change Summary:

Since the root cause of the issue was peforming a urlsplit on an unescaped path, we can instead use a different method for splitting out the path portion for escaping. This follows the original intent of the code more closely, while still avoiding the issues with special url characters in the path.

Diff:

Revision 3 (+3 -5)

Show changes

Checks run (1 failed, 1 succeeded)

flake8 failed.
JSHint passed.

flake8

rpetti
Review request changed

Checks run (1 failed, 1 succeeded)

flake8 failed.
JSHint passed.

flake8

rpetti
chipx86
  1. 
      
  2. reviewboard/scmtools/svn/pysvn.py (Diff revision 5)
     
     
     
     
     
     
     
     
     
     
     

    I think rewriting URL parsing and rebuilding logic isn't the way to go. We're now essentially doing the work that urlsplit and urlunsplit were meant to do, except it won't handle a variety of cases (Windows file paths or network locations like file://\\Host\Path paths, URLs with ports, or URLs with IPv6 addresses). Allowed URL formats are just too complex for anything but a really complex regex (which I don't want us to have to maintain).

    So keeping what we had was probably the closest to "correct" at this stage, given requirements of this function. But it's not ideal.

    So let's re-examine what we're actually wanting to do and what we are currently doing here.

    1. We're taking in a path that comes from a diff, which may be a path relative to a repository, or may be an absolute path (file-based or URL-based), depending on how the diff is constructed.
    2. We're then currently normalizing that to ensure we have an absolute, URL-based path that we can ultimately feed to libsvn.
    3. After we've normalized it, we're ensuring the path is quoted. Here's where things go wrong: we're losing anything after a ? or #, because urlsplit attempts to parse it out.
    4. We rebuild the absolute URL and begin work with it.

    We're doing too much.

    At stage 2, we are normalizing a URL. We're making a relative URL absolute. That means we already have the file path. This is where we should be handling any quoting.

    Your original approach tried to do that, but it did it by quoting the path going into normalize_path(). That prevented some code paths from matching correctly. The first check is whether the file path starts with the repo path, and since we were passing in a quoted scheme (file%3A//), we weren't going to get the result we wanted.

    The proper approach is to ditch all this URL building code here and instead do the right thing at the right time in normalize_path().

    That method looks like this:

    if path.startswith(self.repopath):
        norm_path = path
    elif path.startswith('//'):
        norm_path = self.repopath + path[1:]
    elif path[0] == '/':
        norm_path = self.repopath + path
    else:
        norm_path = '%s/%s' % (self.repopath, path)
    
    return norm_path.rstrip('/')
    

    In the first case, we can assume we have a fully normalized path, so we're good to go. We don't need to do anything here.

    The remaining cases are building a path. Now, a root SVN URL may, like any other file, containg special characters, but we assume that a URL coming in is going to be a proper URL, and that means characters like # and ? are going to be escaped. It's also unlikely that we'll find them here, but still. I think we can focus on path, since that's the bit that's almost guaranteed not to be escaped.

    So this would be the proper time to do a quote() operation, when we're adding path onto repopath.

    I'd suggest doing that here. We won't have to muck around with URL parsing or rebuilding, we're just taking what's from the diff file and making sure we can build a suitable URL out of it.

    The benefit to this is that we can share this fix with other functions that need to normalize paths but currently aren't doing so. They're also very likely to be broken as well.

    It'll be important to have unit tests for normalize_path for the various inputs. Our process is to introduce unit tests when we find a regression or bug and have new code fixing it, so we can be sure it works (now and in the future, and across Python versions). One unit test per condition, demonstrating paths containing normal characters, ?, and #.

    I understand if this is getting to the point where it's more of a time investment than you expected to put into it and you'd rather we take care of the rest.

  3. 
      
rpetti
Review request changed

Change Summary:

Here's a reimplementation, following your recommendations. I'm not able to create a test case since I'd need to add a file to the test SVN repo, which I don't believe I can do in the form of a diff review.

Diff:

Revision 6 (+10 -19)

Show changes

Checks run (2 succeeded)

flake8 passed.
JSHint passed.
Loading...