Awesome new diff parser

Review Request #131 — Created Aug. 4, 2007 and submitted


Review Board SVN (deprecated)
124, 129


Our old method of parsing uploaded diffs was inefficient, prone to bugs and hard to extend. We used lsdiff in order to retrieve the indexes of each file section in a diff, but lsdiff has no knowledge of the special headers ("Index", "RCS", etc.) present in many diffs. This meant that we had to backtrack through the file and try to find known special headers. This led to ugly code that was hard to extend.

Furthermore, this was slower than it should have been. We made several passes through the diff. First, lsdiff parsed the file to give us the indexes. Then we split the file up into segments and backtracked to find special headers we cared about. Some SCMTools in development even did their own parsing through the file.

The new parser code fixes a lot of these problems by removing the lsdiff dependency and parsing the diff files itself. We now make a single pass through the file, looking for the start of a file header. We call a function (parse_change_header) that in turn calls two overridable functions: parse_special_header and parse_diff_header.

parse_special_header is designed to look for special headers, like the "Index:" line and such. It's passed a dictionary that it can stuff data into, which parse_change_header will use to construct the filediff. parse_special_header should look for what it cares about and then return the next line number after the special headers. If it finds nothing, it just returns the line number passed to it.

parse_diff_header is similar to parse_special_header, but understands the standard diff headers. Right now we just look for context and unified diff headers, and this is rarely useful to override, but there are cases (such as for the CVS support) where you may want to.

There are future improvements we can make, such as letting the parser skip lines when it encounters a diff chunk descriptor indicating how many lines are in the following chunk. This can be implemented at a later date, though.

The other change this makes, which SCMTool authors should be mindful of, is that SCMTool.getParser() has now been renamed to get_parser() for consistency.
Wrote several new unit tests. All diff parsing unit tests pass. I'm going to work on new unit tests before this goes in to cover everything we can think about, but wanted to get the rest of the code up for review.

Also uploaded several real diffs generated by Subversion. I still need to make sure everything works on Perforce with large complex diffs.
  2. /trunk/reviewboard/reviews/ (Diff revision 2)
    I don't think getting rid of this is right.  The idea was that form.create would set an error in its dict and then throw something.  If we don't catch the error here, the user will get a traceback instead of a nice error message.
    1. That was a mistake and I actually reverted it in my copy. Just forgot to update the diff.
    2. I know it sounds crazy, but maybe it'd be useful to add a comment explaining it!
      Yes, I am trying to get rid of that warning on our ohloh page ;-)
  3. Aside from that, this looks functionally OK.  I want a lot m ore comments, though.
  1. Looks good.