Get parent diffs working with Mercurial.

Review Request #841 — Created April 28, 2009 and submitted

Review Board SVN (deprecated)
The existing parent diff code apparently assumes that each version of a file has an ID independent of the ID for the whole revision. This doesn't work for Mercurial because Mercurial only has revision IDs, not file version IDs.
This fix alters the behaviour of parent diffs with Mercurial to use the source ID of the parent diff for all files, not just files that are in the parent diff. I've added a method to SCMTool to ensure that the behaviour with git (or any other tool) is not affected.
Tested various diffs between revisions that are not in the master repository, including ones with changed files, added files and removed files.
  1. It's going to be important to test this with Git before this goes in.
    I don't fully get this change though. If these are indeed atomic revisions, shouldn't the same revision be listed for each file in a diff, rather than each file having its own?
    1. Yes this is correct. Mercurial diffs between two changesets always have the same source revision for every file (i.e. that of the first changeset), regardless of when that file last changed. Not sure why it repeats this information for every file, but it does.
      I agree it needs to be tested with Git. I don't have a Git installation but I could try and find time to set one up and do a quick sanity check. Or maybe it would be better for someone more familiar with Git to test it within their environment?
    2. After thinking about this a bit, I realize we do need two concepts. Otherwise I imagine we're going to be breaking Subversion, since the origInfo field for a diff will have different revisions for the source files being modified, and we have uses_atomic_revisions set, so this will end up altering the revisions stored.
      I'm also wondering if a better solution may be to just normalize the origInfo field for Mercurial diffs when processing the diffs in HgDiffParser. What's stored there right now? If fetching files from diffs was already working, what's making it not work for parent diffs?
      Just trying to understand this some more and make sure we have the right model. I don't really feel like we do yet.
    3. I'm going to try to illustrate what I think is going on in the different tools, so we can figure out what concepts are needed and how to make parent diffs work everywhere. My understanding of the tools other than Mercurial is limited so I may get some of this wrong.
      The terms I'll use are:
      Base Commit: The most recent commit that exists in both the local repo and the upstream repo.
      Parent Commit: The commit against which the new work is being compared (which I'm assuming is not the same as Base; if it is a parent diff is not needed).
      New Commit: The commit containing the new work being reviewed.
      This example uses the following files:
      filea.txt: Changed between Base and Parent, and again between Parent and New.
      fileb.txt: Changed between Parent and New but not between Base and Parent.
      filec.txt: Newly created in Parent, changed in New, not present in Base.
      filed.txt: Newly created in New, not present in Base or Parent.
      I'll use numbers for changeset IDs and letters for file version IDs; obviously in reality these will be big long hash values for the distributed SCM tools.
      Base: rev 1		Parent: rev 2		New: rev 3
      filea.txt: rev 1	filea.txt: rev 2	filea.txt: rev 3
      fileb.txt: rev 1	fileb.txt: rev 2	filec.txt: rev 3
      			filec.txt: rev 2	filec.txt: rev 3
      						filed.txt: rev 3
      So in the case of Mercurial, as far as the diffs are concerned the file version ID _is_ the changeset ID. Even though fileb.txt hasn't changed between Base and Parent, "hg diff" uses 2 as its old ID and 3 as its new ID since it doesn't care that it wasn't actually changed in 2.
      Base: rev 1		Parent: rev 2		New: rev 3
      filea.txt: rev A	filea.txt: rev B	filea.txt: rev C
      fileb.txt: rev A	fileb.txt: rev A	filec.txt: rev B
      			filec.txt: rev A	filec.txt: rev B
      						filed.txt: rev A
      Subversion (please correct me if I'm wrong here):
      Base: rev 1		Parent: rev 2		New: rev 3
      filea.txt: rev 1	filea.txt: rev 2	filea.txt: rev 3
      fileb.txt: rev 1	fileb.txt: rev 1	filec.txt: rev 3
      			filec.txt: rev 2	filec.txt: rev 3
      						filed.txt: rev 3
      My understanding is that the version ID for a file in Subversion is the ID of the changeset in which it was last modified; is this correct? (I'm not sure how relevant this actually is since Subversion is not a distributed tool so probably would never need to use parent diffs).
      So... what was failing before my patch was the reconstruction of the parent file, particularly in the case of fileb.txt. ReviewBoard was assuming Git-like file version IDs, so having seen that fileb.txt was not referenced in the parent diff it expected the version used in the Parent commit (version A in the Git scenario) would also be present in the Base commit.
      This would also work for Subversion, in theory, because the parent version is 1, which is present in base. It doesn't work for Mercurial though because fileb.txt in Parent looks like it has a different ID from fileb.txt in Base, even though it didn't actually change. So ReviewBoard tries to fetch fileb.txt revision 2, which it can't because revision 2 is not in any repository that it can see.
      So boiling this down, I reckon the property that is true for Mercurial but not true for the others is something like diff_uses_changeset_ids; i.e. the revision numbers in the diff files refers to the entire changeset, not individual files.
      It would certainly be nice to normalize this so that we only have one changeset ID instead of IDs for each file that happen to be identical, I'm not sure how easy this would be to implement though. You'd probably have to devolve more functionality to the SCMTool class so that this kind of thing can be more specialized for each tool. If Mercurial is the only tool that works like this then my quick hack might be the simplest way to resolve the issue.
      Does this make sense...?
    4. That clears a lot of things up, thanks!
      I'll have a few small remaining issues in the diff to fix up and then we'll get it in for RC3.
  2. /trunk/reviewboard/diffviewer/ (Diff revision 1)
    I was going to say that this seems wrong to me, since it means we're clobbering the parent revision each time, but then I realized it's only for atomic commits. Can you make a note saying that, and also indicating above where you first declare parent_rev that it's only used for atomic commits?
  3. /trunk/reviewboard/diffviewer/ (Diff revision 1)
    This should check tool.uses_atomic_revisions instead.
    Also, blank line before this conditional, and since this is a multi-line conditional, use parens instead of \
    1. OK. I somehow didn't notice uses_atomic_revisions.
      Presumably this property implies that the tool uses atomic revisions _instead_ of file version IDs, as opposed to _as well_? For example Git uses atomic revisions but also has file IDs, so this patch would not work for Git.
    2. Hmm, good point. The atomic revisions are used to indicate whether, well, the SCM has one revision per change, or per file committed. It's whatever is used for looking up that particular change on the file. In reality, nothing uses our tool.uses_atomic_revisions flag right now, so this would be the first usage of it. I would say that since we'd be using this for looking up information per-file, then Git wouldn't want to have this set.
      Maybe we want two concepts, a uses_atomic_file_revisions and uses_atomic_commit_revisions. Or something like that.
  4. /trunk/reviewboard/scmtools/ (Diff revision 1)
    As mentioned above, we shouldn't have this. Instead, hg should make sure to set uses_atomic_revisions in the constructor (which it currently doesn't do).
  2. /trunk/reviewboard/diffviewer/ (Diff revision 3)
    Excess whitespace.
    1. Fixed. I wish Visual Studio would check for those...
  3. /trunk/reviewboard/diffviewer/ (Diff revision 3)
    Won't we hit this case with parent diffs? Presumably the file would be in the list, and we'd end up using parent_file.origInfo, which I imagine would be incorrect.
    1. Technically it does seem incorrect, but source_rev has to be a revision that's in ReviewBoard's repository; if it's one that's only in the developer's repo then RB can't get it and you see exception text in place of the diff. It works well the way it is in all the cases I've tested.
  4. /trunk/reviewboard/diffviewer/ (Diff revision 3)
    f.origInfo should be aligned with tool.diff_uses_changeset_ids.
    Can you do one check per line? So parent_cset would be on its own line.
  5. /trunk/reviewboard/diffviewer/ (Diff revision 3)
    I'd actually prefer that we spell it out. "origChangesetId" in both file and info. I know we're not super consistent there, but "Cset" isn't a very common abbreviation.
    Should also use this in for parent_cset. parent_changeset_id.
  6. /trunk/reviewboard/scmtools/ (Diff revision 3)
    Blank line after blocks, so before "self.uses_atomic_revisions"
Review request changed

Change Summary:

Made suggested fixes.

Note I've updated my repo since posting the first three diffs, so if you look at the diff between 3 and 4 you'll see some changes that are not part of this fix.


Revision 4 (+25 -2)

Show changes

  1. Awesome, thanks! Committed as r1984.