Git SCMTool

Review Request #80 — Created June 18, 2007 and submitted

Information

Review Board SVN (deprecated)
trunk
87

Reviewers

Update:

Style changes as per David's review. I also switched SCMException to SCMError

======================================================================

Update:

Include changes to the diff parsing test :)

======================================================================

Update:

Only change is storing the whole diff for a file, including the 'diff --git' line

======================================================================

Update:

This change now relies on the DiffParser class in http://reviews.review-board.org/r/99/

You can still only talk to a local git repository.

 - Refactored diff parsing to better support Git diff format
 - Added Unittests for diff parsing

======================================================================

Update:

New patch coming without DiffParser changes, better git diff parsing, tests.

======================================================================

Update:

 - made new DiffParser class
 - updated unittests
 - minimized code duplication in diff parsing 

I have made a DiffParser object now, as it seemed to be the easiest way to override specific behaviour without duplicating chunks of code. 

For other SCM systems, it may pay to split out the DiffParser class even more, so you only need to override discreet operations, but it works for parsing diffs generated with Git, and there is no real code duplication.


======================================================================

Update:

 - changed form.py call the parseDiff method on the SCMTool
 - made SCMTool.parseDiff use the default parser
 - update GitTool to reflect the above changes, but with custom per file parser

I have not looked into abstracting out the per file parser to remove code duplication

======================================================================

Update:

This diff now abstracts out the diff parsing in reviewboard a little bit:

 - an SCMTool subclass can optionally provide a whole new diff parsing method
 - an SCMTool subclass can now optionally provide a per-file diff parsing method (this was needed for Git)
 - the GitTool has been updated to provide its own per-file diff parsing method

The CVS tool could be updated to override the main diff parsing method, extract the information needed and then proceed by calling the standard parse function. This would abstract away the need for all SCMTools to have a parse_extra_info_headers method.

======================================================================

This is the start of my effort to get Git support for Review Board.

I don't think it is really ready, but I would like to see if I am heading in the right direction. At the moment it only handles talking to a Git repo on the local filesystem. This can be expanded when needed (works fine for us).

I feel like the whole usage of a revision is a bit clunky, as Git uses objects with unique identifiers. I have pretty much just used an object id as a revision, just so i could get it working. 
Update:

Unittest passes again :)

======================================================================

Update:

Current tests still pass, and I have added a unittest for testing the parsing of Git patches.


======================================================================

Update: 

I updated the unittests for using the DiffParser class, and it works fine. Again, I don't have/use Perforce, so it may pay off to try that with the new class.


======================================================================

Update:

All tests still pass

======================================================================

Update:

All current unit tests for diff parsing work. The Perforce tests are skipped, as I do not use this SCM or have the python modules installed.

======================================================================

We have it talking to a Git repository, applying patches and general review functionality is working.

No accompanying unittests yet.
david
  1. 
      
  2. trunk/reviewboard/diffviewer/parser.py (Diff revision 1)
     
     
     
     
     
     
     
     
     
    You might want to coordinate with Robin Ericsson on this -- he's doing something similar for the CVS tool.
    1. Agreed. A few people are having a chat about how this would work at http://groups.google.com/group/reviewboard/browse_thread/thread/10d3338f1202ea17
      
      I originally had that a SCMTool could provide an alternative to diffviewer.parser.parseFile, and that worked fine for Git, but maybe we shoud be able to provide an alternative implementation to the whole diffviewer.parser.parse function. 
      
      I only had to change 2 lines in diffviewer.parser.parseFile to get Git diffs to work, and the rest of that function was just a duplication of the original, so I thought I might provide a suggestion of abstracting out diffviewer.parser.parse in a seperate patch.
    2. See the latest diff for my suggestions
  3. 
      
chipx86
  1. 
      
  2. trunk/reviewboard/diffviewer/forms.py (Diff revision 2)
     
     
     
     
     
     
     
     
     
     
     
     
     
    I think what we should do instead of these checks is to always call SCMTool.parseDiff. The default implementation can just turn around and call diffparser.parse. This allows this whole block of code to simply be:
    
       files = tool.parseDiff(file["content"])
    
    
    And in the SCMTool:
    
        def parseDiff(self, data):
            return diffparser.parse(data)
    1. yeah sounds good, why didn't I think of that :)
      
      Ill update
  3. trunk/reviewboard/scmtools/git.py (Diff revision 2)
     
     
     
     
     
     
     
     
     
     
     
     
     
     
     
     
     
     
     
     
     
     
     
    We need to find out what we can abstract or make available in some form so that all this logic is never duplicated.
    1. Ill make the change above... and see what i can come up with here.
  4. 
      
david
  1. I've got a bunch of (mostly style nitpicking) comments for you.
  2. trunk/reviewboard/scmtools/tests.py (Diff revision 7)
     
     
     
    Add "XXX" in front of this so it's easily searchable
  3. trunk/reviewboard/scmtools/tests.py (Diff revision 7)
     
     
    Trailing whitespace
  4. trunk/reviewboard/scmtools/git.py (Diff revision 7)
     
     
     
    Add a blank line between standard library imports and local package imports
  5. trunk/reviewboard/scmtools/git.py (Diff revision 7)
     
     
    Trailing whitespace -- here and two below (see red highlighting in reviewboard diff viewer)
  6. trunk/reviewboard/scmtools/git.py (Diff revision 7)
     
     
    Another place to write XXX
  7. trunk/reviewboard/scmtools/git.py (Diff revision 7)
     
     
    You can just return false here.
  8. trunk/reviewboard/scmtools/git.py (Diff revision 7)
     
     
     
     
    You shouldn't need to implement this.
  9. trunk/reviewboard/scmtools/git.py (Diff revision 7)
     
     
    Two lines between classes
  10. trunk/reviewboard/scmtools/git.py (Diff revision 7)
     
     
     
     
     
     
     
     
     
     
     
     
     
     
     
     
     
     
     
     
     
     
     
     
     
     
     
    This seems pretty dense and deeply nested.  It'd be nice to try to split it up a bit with some comments explaining what's going on.
  11. trunk/reviewboard/scmtools/git.py (Diff revision 7)
     
     
    Wrap to 80 columns
  12. trunk/reviewboard/scmtools/git.py (Diff revision 7)
     
     
    Wrap to 80 columns if possible.
  13. 
      
chipx86
  1. Thanks for your hard work on this. I have some additional changes I'd like to see before this goes in.
    1. I haven't forgotten about this. My computer melted and im about to go away for a break. If it hasn't been done after that by someone else i will pick it up again.
      
      Cheers
  2. trunk/reviewboard/scmtools/tests.py (Diff revision 8)
     
     
     
     
    No need for the \ on the first line here or the second line. Python doesn't need this when the statements are in the parens.
    
    This applies to the other tests as well.
    
    It might be best to have a utility function in that class that takes the base filename ("git_simple.diff"), opens it, and returns the result of parse()[0].
  3. trunk/reviewboard/scmtools/git.py (Diff revision 8)
     
     
     
     
     
    Small thing, but can you alphabnetize the namespaces and things we're importing?
  4. trunk/reviewboard/scmtools/git.py (Diff revision 8)
     
     
     
     
     
     
    For consistency, put the text on the same alignment as the """
  5. trunk/reviewboard/scmtools/git.py (Diff revision 8)
     
     
     
     
     
    init before setting the variables.
  6. trunk/reviewboard/scmtools/git.py (Diff revision 8)
     
     
    Can you describe what the problem is in more detail here? It'll help if other people are looking to contribute.
    1. What's there is basically a duplicate of SCMTool.file_exists() -- but I would guess that Nick is looking for a way to do this without having to get the whole file.
      
      You can use the return status of 'git cat-file -e <sha1>' to just check for existence, or use 'git cat-file -t <sha1>' to check that the sha1 represents a blob.
    2. Is that really any better/cheaper than just doing 'git cat-file blob <sha1>' and letting it fail if the ref doesn't exist (or isn't a blob)? I haven't looked at the code, but I imagine git would have to do as much work either way.
    3. My (informal) measurements on the linux tree show that 'cat-file -e' is definitely fastest, with 'cat-file -t' about 5-10% slower.  Whereas 'cat-file blob' is slower still, but by how much varies greatly.  The variance seems to be correlated to size and age.  I don't know a lot about git internals, but I think that '-e' and '-t' are just lookups into the object directory, whereas 'blob' requires actually extracting from the object pack.  With the blob you're also reading the entire file into Python just to throw it away.
      
      The times here are still very small -- 1-2 ms per file for '-e' and '-t', and anywhere from 2-20ms for 'blob'.  So it does make some difference, but it probably won't affect RB much as a whole.  Thanks to Amdahl's Law, optimizations here may very well fall into the noise.
    4. Yes, you're right. I was only thinking about the case where the file didn't exist (so there was no blob to extract and read into memory), but in the general case it will. Ok, I will have a look at adding something.
  7. trunk/reviewboard/scmtools/git.py (Diff revision 8)
     
     
    Trailing whitespace.
  8. trunk/reviewboard/scmtools/git.py (Diff revision 8)
     
     
    I'm hoping we can reduce this in the future. I assume the main problem is that lsdiff doesn't return the correct things for git diffs?
    
    I'll be writing a replacement for lsdiff that will likely be part of DiffParser. It would be nice if we could then condense this further so that the parsing of this diff would be handled once in the base class and then you could just override parseFile and such.
  9. trunk/reviewboard/scmtools/git.py (Diff revision 8)
     
     
     
     
    Same comment about the indentation.
  10. trunk/reviewboard/scmtools/git.py (Diff revision 8)
     
     
    Can you put this in the class instead of in __init__?
  11. trunk/reviewboard/scmtools/git.py (Diff revision 8)
     
     
     
     
     
     
     
     
     
    Adding a blank line between all these comments will help for readability so that the code is less clumped together. Also before try: for:, etc.
  12. trunk/reviewboard/scmtools/git.py (Diff revision 8)
     
     
    Don't keep the trailing newlines. We now handle newlines in a different area.
  13. trunk/reviewboard/scmtools/git.py (Diff revision 8)
     
     
    Are we guaranteed that paths will start with a/ and b/? If so, it'd be nice to document this, but that seems strange to me.
  14. trunk/reviewboard/scmtools/git.py (Diff revision 8)
     
     
    It'd be nice to have some small comment showing how exactly the resulting fields will look so that the filenames[0][3:] makes more sense.
  15. trunk/reviewboard/scmtools/git.py (Diff revision 8)
     
     
    Another magic number of 3. What does this mean and can we make a constant in the class for it?
  16. trunk/reviewboard/scmtools/git.py (Diff revision 8)
     
     
    Do we not need a space or newline here at the end of --git?
  17. trunk/reviewboard/scmtools/git.py (Diff revision 8)
     
     
     
    Blank line before each of these.
  18. trunk/reviewboard/scmtools/git.py (Diff revision 8)
     
     
    Two lines before classes. This goes for the next class as well.
  19. trunk/reviewboard/scmtools/git.py (Diff revision 8)
     
     
    Trailing whitespace.
  20. 
      
chipx86
  1. 
      
  2. trunk/reviewboard/scmtools/git.py (Diff revision 8)
     
     
     
     
     
     
     
     
     
     
     
     
     
     
     
     
     
     
     
     
     
     
     
     
     
     
     
     
     
     
     
     
     
     
    There's a new diff parser in SVN (r852) that will change how this all works. Now you should have no need to overload parse(). Instead, you want to provide new parse_diff_header() and parse_special_header() methods.
    
    General rule is that if the header you care about is a replacement for the --- and +++ lines, you want to use parse_diff_header(). Otherwise, if it comes before that, use parse_special_header(). You can see how this works in the other SCMTools.
    1. I have a new patch that works with the new DiffParser, as well as other general clean up and increased test coverage. I'm not sure whether uploading it over Nick's work here is the polite thing to do...
      
      In any case, it's available here: http://mojain.com/static/review-board-git-support-v9.diff, along with a tarball of the test repo that the unit tests expect to find at scmtools/testdata/git_repo: http://mojain.com/static/git_repo.tar.gz.
    2. Oh, it looks like I don't have access to upload a new one anyway. :) Nick, or David/Christian, feel free to grab it from the above URL.
  3.