Switch to Eugene Myer's O(ND) diff algorithm

Review Request #79 — Created June 17, 2007 and submitted


Review Board SVN (deprecated)


Python's SequenceMatcher had a lot of flaws we had to work around and produced unoptimal diffs. It also prevented us from doing intelligent whitespace trimming or extending it for function name scanning (which we'd like to do someday).

I added an implementation of Eugene Myers's O(ND) diff algorithm, which is used in GNU diff and other diff programs. While the diffs aren't always necessarily better, they're a step in the right direction. The whitespace trimming alone should improve many diffs though, especially when indenting code.

We remain compatible with the old SequenceMatcher diff wrapper code. A version compatibility flag is stored in the DiffSet model that indicates whether to use the old SequenceMatcher or the new MyersDiffer. This will keep existing review requests and comments intact.
Unit tests passed. I tested several diffs I have here and they all rendered as correctly as one can expect.
  1. A bunch of nits.  I don't know this algorithm, but I'm assuming it works since you spent so much time pulling your hair out.
  2. /trunk/reviewboard/diffviewer/myersdiff.py (Diff revision 1)
    Is this really necessary?  Won't list(self.get_opcodes()) work?
    1. Yep. Silly me. Removed this function. I actually meant to add the [opcode for ...] to the unit tests, but I was using this from the python shell to test. Should have realized list() would work though.
  3. Private functions should just use single underscores
  4. I'd rather see a "class NotReachedError(Exception)" somewhere and raise that, instead of a typeless string.
    1. Agreed. I was surprised this didn't exist in some form already. Any idea where we should put this? Djblets?
    2. How about just reviewboard.utils.  It's so tiny that putting it in djblets seems silly
  5. difflib's a standar library module, and should stay in that group
    1. Not sure what you mean? That I should just "import difflib"?
    2. No, just keep in the group with the other system modules
      (a-la PEP-8).  You moved it down into the local module
  6. /trunk/reviewboard/diffviewer/diffutils.py (Diff revision 1)
    raise "string" is deprecated.
    Use raise Exception("string")
    1. Ok. Should probably use NotReachedError here too.
    2. This isn't really a not-reached, just a normal input validation error.
  7. /trunk/reviewboard/diffviewer/diffutils.py (Diff revision 1)
    Does this mean we're diffing twice?
    1. This is only for the interline diffs. We'd never use the same differ for the files and for the lines.
    2. Ok, cool.
  8. One question; do we need to do anything for migration to work when updating the database schema or will default=0 handle that for us?
  1. Man, now that I look a little closer this really does feel like C code.  Oh well.  After you do the tests you talked to me about, this looks fine.  There are certainly little style cleanups that can be done, and I'm sure some careful examination could yield some more pythonic code, but you're the maintainer of this bit, not me ;-)
  2. You're inconsistent in your number of lines between methods in this class.
  3. What does the comment here mean?  It's not apparent whether it's explaining anything or just old code.
  4. This can be made a little bit sexier using "for i, item in enumerate(data)"
  5. Same here with enumerate
  1. It looks like this change was already checked in, but...
  2. s/dicards/discards/
    1. Eep, good catch, thanks.
  3. Are there any tests that demonstrate how this new diff is better than before?