Add extra context for added and deleted empty files.

Review Request #5802 — Created May 12, 2014 and submitted

Information

RBTools
master
fb2d6fd...

Reviewers

Review Board could not handle added or deleted empty files in SVN, Mercurial,
and Perforce diffs since the diff output from svn diff/hg diff did not
provide enough context on empty files. This change adds extra processing in
RBTools to generate a diff with extra information on any added or deleted empty
files, if the Review Board server has the empty_files capability for the
SCM tool in question.

The original SVN diff output of an added/deleted empty file was:

Index: foo
===================================================================

After this change, the SVN diff of an added empty file becomes:

Index: foo\t(added)
===================================================================
--- foo\t(<base_revision>)
+++ foo\t(<tip_revision>)

For Mercurial, the diff of an added empty file is now:

diff -r <base_revision> -r <tip_revision> foo
--- /dev/null\tThu Jan 01 00:00:00 1970 +0000
+++ b/foo\t<date>

For Perforce, the diff of an added empty file is:

==== //depot/foo#0 ==A== //depot/foo ====

Ran rbt post on a RB server with the empty_files capabilities with different
SVN, Mercurial, and Perforce diffs, and verified that the modified diffs were
as expected:
- Diff with 1 added empty file
- Diff with 1 added empty file, 1 added non-empty file
- Diff with 1 deleted empty file
- Diff with 1 deleted empty file, 1 deleted non-empty file
- Diff with 1 added empty file, 1 deleted empty file
- Diff with 1 added empty file, 1 added non-empty file, 1 deleted empty file,
1 deleted non-empty file, 1 modified file

Ran rbt post on a RB server without the empty_files capabilities, and
verified that the generated diffs did not include any extra information on
added/deleted empty files. These added/deleted empty files were not included in
the review requests.

Ran unit tests. More testing was done in /r/5785 as well.

Description From Last Updated

It would be nice to use re.compile() outside the for loop, so it won't recompile the regex each iteration.

daviddavid

It would be nice to use re.compile() outside the for loop, so it won't recompile the regex each iteration.

daviddavid

'files' will always be a list, so you can just use the list comprehension version and skip the if/else. If …

daviddavid

Should this be _handle_empty_files?

daviddavid

Put spaces around the +s

daviddavid

This could be if dl == [] and not is_move and changetype_short not in ('A', 'D'):

daviddavid

Col: 5 E129 visually indented line with same indent as next logical line

reviewbotreviewbot

I dare say that parsing diffs should never be the answer. When you emailed me, I neglected to tell you …

IN indygreg

I would use hg locate to obtain the set of all files in each changeset. Take the set difference to …

IN indygreg

You should be using :: instead of : for the revision set. : is revision order (order changesets were added …

IN indygreg

I would return a set from _get_files_in_changeset() and use the overloaded operators on set. e.g. added_empty_files = tip_empty_files - base_files

IN indygreg

This logic is kind of confusing. How about this: is_empty_and_changed = (self._supports_empty_files() and changetype_short in ('A', 'D')) if dl == …

daviddavid
anselina
anselina
david
  1. 
      
  2. rbtools/clients/mercurial.py (Diff revision 3)
     
     
    Show all issues

    It would be nice to use re.compile() outside the for loop, so it won't recompile the regex each iteration.

    1. I thought about this, but couldn't think of a way to pre-compile the regex with the variable filename since it changes with each iteration of the loop. I may just be overlooking something though?

    2. Oh, you're right, I hadn't noticed that it's formatting it in there.

      But not that I have noticed it, you should be formatting in re.escape(filename) so that any regex characters in the filename won't bleed through.

  3. rbtools/clients/mercurial.py (Diff revision 3)
     
     
    Show all issues

    It would be nice to use re.compile() outside the for loop, so it won't recompile the regex each iteration.

  4. rbtools/clients/mercurial.py (Diff revision 3)
     
     
     
     
     
     
    Show all issues

    'files' will always be a list, so you can just use the list comprehension version and skip the if/else. If files is empty, it's the same as returning []

  5. rbtools/clients/svn.py (Diff revision 3)
     
     
    Show all issues

    Should this be _handle_empty_files?

  6. rbtools/clients/svn.py (Diff revision 3)
     
     
     
     
    Show all issues

    Put spaces around the +s

  7. 
      
anselina
reviewbot
  1. This is a review from Review Bot.
      Tool: PEP8 Style Checker
      Processed Files:
        rbtools/clients/svn.py
        rbtools/clients/mercurial.py
        rbtools/clients/perforce.py
      Ignored Files:
    
    
  2. rbtools/clients/perforce.py (Diff revision 4)
     
     
    Show all issues
    Col: 5
     E129 visually indented line with same indent as next logical line
    
  3. 
      
reviewbot
  1. This is a review from Review Bot.
      Tool: Pyflakes
      Processed Files:
        rbtools/clients/svn.py
        rbtools/clients/mercurial.py
        rbtools/clients/perforce.py
      Ignored Files:
    
    
  2. 
      
david
  1. 
      
  2. rbtools/clients/perforce.py (Diff revision 4)
     
     
     
    Show all issues

    This could be if dl == [] and not is_move and changetype_short not in ('A', 'D'):

  3. 
      
anselina
reviewbot
  1. This is a review from Review Bot.
      Tool: PEP8 Style Checker
      Processed Files:
        rbtools/clients/svn.py
        rbtools/clients/mercurial.py
        rbtools/clients/perforce.py
      Ignored Files:
    
    
  2. 
      
reviewbot
  1. This is a review from Review Bot.
      Tool: Pyflakes
      Processed Files:
        rbtools/clients/svn.py
        rbtools/clients/mercurial.py
        rbtools/clients/perforce.py
      Ignored Files:
    
    
  2. 
      
IN
  1. I only looked at the Mercurial changes.

    I would rewrite things in terms of hg locate. I believe the implementation will be much cleaner.

    I apologize for not informing you of the existence of filesets via email. It simply slipped my mind. I didn't intend to make you look bad. If another problem I was trying to solve hadn't required me to look at the help page for filesets and see it offers a size() query, I likely wouldn't have remembered at all!

    1. Thanks a lot for taking the time to look at this - I had no idea Mercurial had such a powerful query language. :)

      I definitely agree with using hg locate instead. Just to make sure I have the right idea here, this would then require 4 hg process calls?
      I.e. To obtain added empty files, we need hg locate -r <base> and hg locate -r <tip> 'set:size(0)'. To obtain deleted empty files, we need hg locate -r <base> 'set:size(0)' and hg locate -r <tip>.

      Also, is there an inexpensive way to get the date when a changeset was committed (<date> in the diff below), besides using hg log with a template? I could just leave this date info out since it isn't needed for RB, but I'm inclined to follow the diff's usual format.

      diff -r <base_revision> -r <tip_revision> foo
      --- /dev/null\tThu Jan 01 00:00:00 1970 +0000
      +++ b/foo\t<date>
      
    2. In the minimal case, we'd have two process calls: hg locate -r <base> and hg locate -r <tip>. If the sets of files don't vary, we have no work to do. You would need an extra process each to obtain the set of 0-size adds and deletes, for a maximum of 4 process invocations.

      Using hg log with a custom template (or any other command for that matter) is the preferred way to extract data from Mercurial. Please note that the date() template function accepts a 2nd argument that controls the format (if you need it).

      FTR, the number of process invocations to obtain Mercurial metadata is quickly getting out of hand. IMO we should expand the Mercurial extension to provide Mercurial commands that give RBTools exactly what it wants. e.g. hg rbtdiff <base> <tip>. We can perform all the calculations inside the extension in one Mercurial context. That will make things much more efficient, especially for larger repos. Maybe I'll do this the next time I sit down to hack on RBtools.

  2. rbtools/clients/mercurial.py (Diff revision 5)
     
     

    For performance reasons, it's unfortunate we have to invoke hg more than once to obtain this data.

    That being said, this will only be an issue on very large repositories (like Firefox's). We already have enough performance problems with rbt on large repositories. I wouldn't worry too much about it.

  3. rbtools/clients/mercurial.py (Diff revision 5)
     
     
    Show all issues

    I dare say that parsing diffs should never be the answer.

    When you emailed me, I neglected to tell you about a Mercurial feature called filesets. (See http://www.selenic.com/hg/help/filesets). Essentially, any command that takes file arguments can take these fileset queries to dynamically query Mercurial.

    In the case of file adds, we can look for 0-size files in an arbitrary changeset:

    $ hg locate -r <rev> 'set:size(0)'

    So, you can take the output of hg log --template '{join(file_adds, "\n")}' -r A::B and intersect that set with the output of hg locate from above to find added empty files. This should perform much better than generating a diff (which is relatively slow for Mercurial). Alternatively, you can take the output of hg locate for the base and tip revisions and find added entries manually. This may be better than parsing hg log output.

  4. rbtools/clients/mercurial.py (Diff revision 5)
     
     
    Show all issues

    I would use hg locate to obtain the set of all files in each changeset. Take the set difference to obtain deleted files. Limit the query of the base revision with 'set:size(0)' to only show files that were empty initially.

  5. rbtools/clients/mercurial.py (Diff revision 5)
     
     
    Show all issues

    You should be using :: instead of : for the revision set. : is revision order (order changesets were added to Mercurial). :: is DAG order. : may include changesets from unrelated branches/heads.

  6. 
      
anselina
reviewbot
  1. This is a review from Review Bot.
      Tool: PEP8 Style Checker
      Processed Files:
        rbtools/clients/svn.py
        rbtools/clients/mercurial.py
        rbtools/clients/perforce.py
      Ignored Files:
    
    
  2. 
      
reviewbot
  1. This is a review from Review Bot.
      Tool: Pyflakes
      Processed Files:
        rbtools/clients/svn.py
        rbtools/clients/mercurial.py
        rbtools/clients/perforce.py
      Ignored Files:
    
    
  2. 
      
reviewbot
  1. This is a review from Review Bot.
      Tool: PEP8 Style Checker
      Processed Files:
        rbtools/clients/svn.py
        rbtools/clients/mercurial.py
        rbtools/clients/perforce.py
      Ignored Files:
    
    
  2. 
      
reviewbot
  1. This is a review from Review Bot.
      Tool: Pyflakes
      Processed Files:
        rbtools/clients/svn.py
        rbtools/clients/mercurial.py
        rbtools/clients/perforce.py
      Ignored Files:
    
    
  2. 
      
IN
  1. The Mercurial logic looks sound.

  2. rbtools/clients/mercurial.py (Diff revision 6)
     
     
    Show all issues

    I would return a set from _get_files_in_changeset() and use the overloaded operators on set.

    e.g. added_empty_files = tip_empty_files - base_files

  3. 
      
anselina
anselina
david
  1. What happens if you use rbt post with this code on a review board server that doesn't have /r/5785? Does it still parse the diffs OK?

    1. Added/deleted empty files in SVN and Mercurial diffs give the same error on the View Diff page:
      "There was an error displaying this diff.

      The patch to '<filename>' didn't apply cleanly. The temporary files have been left in '/tmp/reviewboard.Ufw0Ik' for debugging purposes. patch returned: patch: **** Only garbage was found in the patch input."

      For Perforce diffs, an added empty file gives the same error, but a deleted empty file shows up fine.

    2. OK. We probably want to add a server capability so that we don't generate broken diffs for old servers.

  2. 
      
anselina
reviewbot
  1. This is a review from Review Bot.
      Tool: PEP8 Style Checker
      Processed Files:
        rbtools/clients/svn.py
        rbtools/clients/mercurial.py
        rbtools/clients/perforce.py
      Ignored Files:
    
    
  2. 
      
reviewbot
  1. This is a review from Review Bot.
      Tool: Pyflakes
      Processed Files:
        rbtools/clients/svn.py
        rbtools/clients/mercurial.py
        rbtools/clients/perforce.py
      Ignored Files:
    
    
  2. 
      
david
  1. 
      
  2. rbtools/clients/perforce.py (Diff revision 8)
     
     
     
     
     
     
     
     
     
     
     
     
     
     
     
    Show all issues

    This logic is kind of confusing. How about this:

    is_empty_and_changed = (self._supports_empty_files() and
                            changetype_short in ('A', 'D'))
    if dl == [] and (is_move or is_empty_and_changed):
        dl.insert(0, '==== %s#%s ==%s== %s ====\n' %
                  (depot_file, base_revision, changetype_short,
                  new_local_path))
        dl.append('\n')
    else:
        if ignore_unmodified:
            return []
        else:
            print 'Warning: %s in your changeset is unmodified' % \
                  local_path
    
  3. 
      
anselina
reviewbot
  1. This is a review from Review Bot.
      Tool: PEP8 Style Checker
      Processed Files:
        rbtools/clients/svn.py
        rbtools/clients/mercurial.py
        rbtools/clients/perforce.py
      Ignored Files:
    
    
  2. 
      
reviewbot
  1. This is a review from Review Bot.
      Tool: Pyflakes
      Processed Files:
        rbtools/clients/svn.py
        rbtools/clients/mercurial.py
        rbtools/clients/perforce.py
      Ignored Files:
    
    
  2. 
      
david
  1. This looks good to me. Was there anything else you were planning on adding?

    One thing which should be tested (and might require a separate change) is whether empty files are correctly handled in 'rbt patch'

    1. That's it for this!

      rbt patch works as it did previously for SVN, Mercurial, and Perforce, so all empty files are simply ignored/unhandled. I'll look into dealing with empty files in rbt patch separately later.

  2. 
      
anselina
Review request changed
Status:
Completed
Change Summary:
Pushed to release-0.6.x (bfec7b3)