Fix interdiff changes with line reorders.
Review Request #15055 — Created May 13, 2026 and submitted
The new interdiff algorithm had an edge case where changes could get
lost when reordering a sequence of lines. There were a few ways this
could be triggered.An easy example would be to have a change in revision 1 that looks like:
99 +... 100 aaaa 101 +bbbb 102 +...And in revision 2:
99 +... 100 +bbbb 101 aaaa 102 103 +...(Assume 0-based indexes above, instead of 1-based line numbers, to help
the following comparisons make more sense.)The resulting unfiltered interdiff could look like:
99 ... | 99 ... - 100 aaaa | 101 bbbb | 100 bbbb + | 101 aaaa 102 ... | 102 ...In this case, the whole set of changes would be filtered out. The reason
is that the computed ranges for side A would be[99,100)and side B
would be[99,101). This wouldn't match the diff opcodes, which are:("delete", 100, 101, 99, 99), ("equal", 101, 102, 100, 101), ("insert", 102, 102, 101, 102),That
deleteof[100,101)falls out of the range of[99,100), and
theinsertof[101,102)falls out of the range of[99,101). In the
existing implementation, this would result in these being filtered out.In the cases above, note how the op starts at the exclusive end of
the change ranges (which I refer to as the "boundary"). This is due to
aninsertbecoming adeleteor vice-versa.Issues can also occur for
deletes in-between ranges related to the
above cases, where, for example,aaaawould be surrounded byequal
lines. The resulting interdiff could continue to moveaaaaas above,
but in a way where the intermediatedeletewould be lost.This has been addressed with some new heuristics that check the
boundaries for shifted changes, and to check gaps between boundaries
for adelete(which should be from the legitimate interdiff and not
upstream changes). Guards were added that make sure we only do this if
the previous change range (which may be the remnants of a split range)
don't overlap with an opcode (indicating it may have shifted).This is done separately in Phase 1 and Phase 2 of the algorithm, and by
keeping track of previous ranges for both sides of the diff and which
ranges have already been entered once (a change range in a revision's
diff that hasn't been used in the interdiff is a good candidate for
something that shifted to/below the boundary).If starting on a boundary in Phase 1 (pre-range split), a split will be
considered if the previous range hasn't yet been entered and there's a
current range for processing. If the normally-filtered part starts
right on the boundary, it will be emitted. This will catch the cases
above. It may result in upstream changes not being filtered out, but it
does a good job of catching these reorder issues.In Phase 2 (starts-in-range validity + cap), a check is then made if the
op exists either at the boundary of a previously unused range, or
sometime after it but with a change on the other side of the diff also
in a gap between ranges. If so, it's emitted as a valid op, capped at
the start of the next range, and with the remainder queued for
re-processing.This should catch the cases most likely to trigger shifted lines from
nearby line reorders without the risk of loss or bringing in much of the
upstream changes (though this is always possible with these heuristics).
All interdiff unit tests (using real datasets) pass.
The original repro case is now fixed.
Built a bunch of hand-crafted and tested diffs that exhibited variations
on this issue, including them in my testing tool and in the unit tests.
Verified that they're all showing the desired output.Haven't seen any regression with upstream changes, but we'll need more
real-world testing.
- Description:
-
The new interdiff algorithm had an edge case where changes could get
lost when reordering a sequence of lines. There were a few ways this could be triggered. An easy example would be to have a change in revision 1 that looks like:
99 +...100 aaaa
101 +bbbb
102 +...
And in revision 2:
99 +...100 +bbbb
101 aaaa
102
103 +...
(Assume 0-based indexes above, instead of 1-based line numbers, to help
the following comparisons make more sense.) The resulting unfiltered interdiff could look like:
99 ... | 99 ...- 100 aaaa |
101 bbbb | 100 bbbb
+ | 101 aaaa
102 ... | 102 ...
In this case, the whole set of changes would be filtered out. The reason
is that the computed ranges for side A would be [99,100)and side Bwould be [99,101). This wouldn't match the diff opcodes, which are:("delete", 100, 101, 99, 99),("equal", 101, 102, 100, 101),("insert", 102, 102, 101, 102),That
deleteof[100,101)falls out of the range of[99,100), andthe insertof[101,102)falls out of the range of[99,101). In theexisting implementation, this would result in these being filtered out. + In the cases above, note how the op starts at the exclusive end of
+ the change ranges (which I refer to as the "boundary"). This is due to + an insertbecoming adeleteor vice-versa.+ Issues can also occur for
deletes in-between ranges related to theabove cases, where, for example, aaaawould be surrounded byequallines. The resulting interdiff could continue to move aaaaas above,but in a way where the intermediate deletewould be lost.This change adds boundary-aware heuristics to catch these kinds of
problems. Now, we check for when an op starts on the boundary (at the exclusive end of the range). This is done separately in Phase 1 and Phase 2 of the algorithm, and by keeping track of previous ranges for both sides of the diff and which ranges have already been entered once (a change range in a revision's diff that hasn't been used in the interdiff is a good candidate for something that shifted to/below the boundary). If starting on a boundary in Phase 1 (pre-range split), a split will be
considered if the previous range hasn't yet been entered and there's a current range for processing. If the normally-filtered part starts right on the boundary, it will be emitted. This will catch the cases above. It may result in upstream changes not being filtered out, but it does a good job of catching these reorder issues. In Phase 2 (starts-in-range validity + cap), a check is then made if the
op exists either at the boundary of a previously unused range, or sometime after it but with a change on the other side of the diff also in a gap between ranges. If so, it's emitted as a valid op, capped at the start of the next range, and with the remainder queued for re-processing. This should catch the cases most likely to trigger shifted lines from
nearby line reorders without the risk of loss or bringing in much of the upstream changes (though this is always possible with these heuristics). - Testing Done:
-
+ All interdiff unit tests (using real datasets) pass.
+ + The original repro case is now fixed.
+ + Built a bunch of hand-crafted and tested diffs that exhibited variations
+ on this issue, including them in my testing tool and in the unit tests. + Verified that they're all showing the desired output. + + Haven't seen any regression with upstream changes, but we'll need more
+ real-world testing.
- Description:
-
The new interdiff algorithm had an edge case where changes could get
lost when reordering a sequence of lines. There were a few ways this could be triggered. An easy example would be to have a change in revision 1 that looks like:
99 +...100 aaaa
101 +bbbb
102 +...
And in revision 2:
99 +...100 +bbbb
101 aaaa
102
103 +...
(Assume 0-based indexes above, instead of 1-based line numbers, to help
the following comparisons make more sense.) The resulting unfiltered interdiff could look like:
99 ... | 99 ...- 100 aaaa |
101 bbbb | 100 bbbb
+ | 101 aaaa
102 ... | 102 ...
In this case, the whole set of changes would be filtered out. The reason
is that the computed ranges for side A would be [99,100)and side Bwould be [99,101). This wouldn't match the diff opcodes, which are:("delete", 100, 101, 99, 99),("equal", 101, 102, 100, 101),("insert", 102, 102, 101, 102),That
deleteof[100,101)falls out of the range of[99,100), andthe insertof[101,102)falls out of the range of[99,101). In theexisting implementation, this would result in these being filtered out. In the cases above, note how the op starts at the exclusive end of
the change ranges (which I refer to as the "boundary"). This is due to an insertbecoming adeleteor vice-versa.Issues can also occur for
deletes in-between ranges related to theabove cases, where, for example, aaaawould be surrounded byequallines. The resulting interdiff could continue to move aaaaas above,but in a way where the intermediate deletewould be lost.~ This change adds boundary-aware heuristics to catch these kinds of
~ problems. Now, we check for when an op starts on the boundary (at the ~ exclusive end of the range). This is done separately in Phase 1 and ~ Phase 2 of the algorithm, and by keeping track of previous ranges for ~ both sides of the diff and which ranges have already been entered once ~ (a change range in a revision's diff that hasn't been used in the ~ interdiff is a good candidate for something that shifted to/below the ~ boundary). ~ This has been addressed with some new heuristics that check the
~ boundaries for shifted changes, and to check gaps between boundaries ~ for a delete(which should be from the legitimate interdiff and not~ upstream changes). Guards were added that make sure we only do this if ~ the previous change range (which may be the remnants of a split range) ~ don't overlap with an opcode (indicating it may have shifted). ~ ~ This is done separately in Phase 1 and Phase 2 of the algorithm, and by
+ keeping track of previous ranges for both sides of the diff and which + ranges have already been entered once (a change range in a revision's + diff that hasn't been used in the interdiff is a good candidate for + something that shifted to/below the boundary). If starting on a boundary in Phase 1 (pre-range split), a split will be
considered if the previous range hasn't yet been entered and there's a current range for processing. If the normally-filtered part starts right on the boundary, it will be emitted. This will catch the cases above. It may result in upstream changes not being filtered out, but it does a good job of catching these reorder issues. In Phase 2 (starts-in-range validity + cap), a check is then made if the
op exists either at the boundary of a previously unused range, or sometime after it but with a change on the other side of the diff also in a gap between ranges. If so, it's emitted as a valid op, capped at the start of the next range, and with the remainder queued for re-processing. This should catch the cases most likely to trigger shifted lines from
nearby line reorders without the risk of loss or bringing in much of the upstream changes (though this is always possible with these heuristics).