Detect and display moved parts in a file.

Review Request #926 — Created July 21, 2009 and submitted

Information

Review Board SVN (deprecated)

Reviewers

This diff builds up on my previous whitespace diff, so it also contains those changes, as mercurial could not keep them apart.

Further improving the diff viewer is the detection of whole blocks move inside a file. A move is characterized by a delete in one place and the insert of the same text on another location. The main problem for this effect is the complexity of the analysis required for this detection. So for now, I opted to do whole blocks only, but the interface and internal representation supports an arbitrary technique, so it is possible to build upon them.

Once a move is detected it is shown according to the screenshots below. Each moved line gets a handle containing information as from where that line came from, or were it went. When the handle is clicked the page jumps to the surrounding chunk, that is highlighted, and the corresponding line is highlighted for 2 seconds, with a bright yellow.
Tested all sorts of moves. File with only moves, mixed moves and edit and files where there is several instances of the destination (i.e. when the user cuts a chunk and pastes several times). In the case of multiple destinations, the last one found is the one considered target, while the others are ignored.

The effects it self were tested on Safari 4.0, Chrome 3.0 preview, Firefox 3.5 and Firefox 3.0
chipx86
  1. 
      
    1. Many of these comments have to do with the whitespace change, so I'll just move them over to there.
  2. This doesn't really describe the function. Might be better as a comment.
    
    Also, "Access"
    1. Fixed on whitespace patch.
  3. Trailing whitespace.
  4. Should be removed before committing.
  5. Same here.
  6. Should also be removed.
  7. Trailing whitespace.
    
    I don't know that the comment is really necessary.
    1. Fixed on whitespace patch
  8. "WS" -> "whitespace"
  9. /trunk/reviewboard/diffviewer/diffutils.py (Diff revision 1)
     
     
     
    Space after the :
    
    Also, on the comments, there should be a space after #, and the comment should end with a period.
  10. /trunk/reviewboard/diffviewer/diffutils.py (Diff revision 1)
     
     
     
     
     
     
     
    Would you mind adding some comments explaining this code? It's sufficiently complex.
  11. Space on both sides of +
  12. When does r ever change?
    1. Major infinite loop bug, that wasn't triggered in my tests because of the break inside the 'if' statement. Fixed.
  13. /trunk/reviewboard/diffviewer/diffutils.py (Diff revision 1)
     
     
     
    Since we do each range twice, we might as well store this in variables and reuse it.
  14. Space before "left"
  15. Can be: 0px 15px;
  16. Space before "right"
  17. Please add a documentation comment above this describing the function.
  18. Since we reference table.sidebyside instances many times, we should store the result of $("table.sidebyside") in a variable and reference that variable for all other uses.
  19. Can you put comments on hteir own lines? With a blank line preceding them.
  20. Blank line before this.
  21. Two blank lines here.
  22. No space after parent().
  23. Should have translation for the "Moved from ..."
  24. Translation here too.
  25. This is probably part of the other change, but rather than having onclick, I'd prefer that diffviewer.js register these actions for these links. You can give them IDs to reference them.
  26. 
      
edufelipe
chipx86
  1. Hey Eduardo,
    
    Can you update this diff now that the whitespace change is in?
    
    I'm trying to think of ways to optimize the algorithm so we don't loop so much at the end. I think we can do this a few different ways.
    
    While doing the initial iteration and deciding whether to dump the range into an inserts or deletes bucket, we could generate some sort of a hash of the data and store it in a dictionary. The dictionary would map the hash to a list of groups. This may only be needed for the removes bucket.
    
    Then, after putting them into buckets, we can go through the inserts list and see if its generated hash exists in the removes bucket. If it does, loop through the list associated with that hash and do the comparison.
    
    This has the benefit of only introducing one mandatory loop, the one for the inserts list. Instead of looping through all the remaining removes on each insert, we only need to do one cheap lookup, and then iterate through a small list (probably 1 or 2) of items for comparison.
    
    The hash tag could be something as simple as "%d-%d-%s-%s" % (i2 - i1, j2 - j1, differ.a[:10], differ.a[-10:]), which would potentially have more matches to search through, but would be simple and quick to generate. Probably better than going through all the data and generating a checksum, which would cost more than a couple extra small loops.
    
    So, something like:
    
    
        for ... in differ.get_opcodes():
            ...
            if tag == 'delete':
                key = "%d-%d-%s-%s" % (j2 - j1, i2 - i1, differ.b[10:], differ.b[-10:])
    
                if not key in removes:
                    removes[key] = []
    
                removes[key].append(groups[-1])
            else:
                key = "%d-%d-%s-%s" % (i2 - i1, j2 - j1, differ.a[:10], differ.a[-10:])
    
                if not key in inserts:
                    inserts[key] = []
    
                inserts[key].append(groups[-1])
    
        for insert_key, groups in inserts.iteritems():
            if insert_key in removes:
                # Do comparisons here
    
    
    This is just a rough idea at 1 in the morning, so it's probably buggy at best, but try out a version of this and see if you can make it work. I want to be sure this algorithm is as optimal as possible.
    
    Thanks for your hard work on this so far!
    1. Er, that differ.a[:10] should be differ.a[10:]
    2. I'll look into it and post an update later, for now I'll just post without whitespace.
  2. /trunk/reviewboard/diffviewer/diffutils.py (Diff revision 2)
     
     
     
     
    For this, you can use:
    
    for r in xrange(0, len(removes)):
  3. As an optimization, I think we can check to see if the length of each range is the same before doing any data comparison.
  4. /trunk/reviewboard/diffviewer/diffutils.py (Diff revision 2)
     
     
     
    This pretty much has the same effect as:
    
    return groups
    1. And I miss the obvious thing. Thanks.
  5. 
      
edufelipe
Review request changed