Summary

Fix interdiff changes with line reorders.

Review Request #15055 — Created May 13, 2026 and submitted May 18, 2026, 3:25 p.m.

Information

Owner

chipx86

Repository

Review Board

Branch

release-8.x

Bugs

Depends On

Reviewers

Groups

reviewboard

People

Description

The new interdiff algorithm had an edge case where changes could get
lost when reordering a sequence of lines. There were a few ways this
could be triggered.

An easy example would be to have a change in revision 1 that looks like:

99   +...
100   aaaa
101  +bbbb
102  +...

And in revision 2:

99   +...
100  +bbbb
101   aaaa
102
103  +...

(Assume 0-based indexes above, instead of 1-based line numbers, to help
the following comparisons make more sense.)

The resulting unfiltered interdiff could look like:

  99   ...   |   99  ...
- 100  aaaa  |
  101  bbbb  |  100  bbbb
+            |  101  aaaa
  102  ...   |  102  ...

In this case, the whole set of changes would be filtered out. The reason
is that the computed ranges for side A would be [99,100) and side B
would be [99,101). This wouldn't match the diff opcodes, which are:

("delete", 100, 101,  99, 99),
("equal",  101, 102, 100, 101),
("insert", 102, 102, 101, 102),

That delete of [100,101) falls out of the range of [99,100), and
the insert of [101,102) falls out of the range of [99,101). In the
existing implementation, this would result in these being filtered out.

In the cases above, note how the op starts at the exclusive end of
the change ranges (which I refer to as the "boundary"). This is due to
an insert becoming a delete or vice-versa.

Issues can also occur for deletes in-between ranges related to the
above cases, where, for example, aaaa would be surrounded by equal
lines. The resulting interdiff could continue to move aaaa as above,
but in a way where the intermediate delete would be lost.

This has been addressed with some new heuristics that check the
boundaries for shifted changes, and to check gaps between boundaries
for a delete (which should be from the legitimate interdiff and not
upstream changes). Guards were added that make sure we only do this if
the previous change range (which may be the remnants of a split range)
don't overlap with an opcode (indicating it may have shifted).

This is done separately in Phase 1 and Phase 2 of the algorithm, and by
keeping track of previous ranges for both sides of the diff and which
ranges have already been entered once (a change range in a revision's
diff that hasn't been used in the interdiff is a good candidate for
something that shifted to/below the boundary).

If starting on a boundary in Phase 1 (pre-range split), a split will be
considered if the previous range hasn't yet been entered and there's a
current range for processing. If the normally-filtered part starts
right on the boundary, it will be emitted. This will catch the cases
above. It may result in upstream changes not being filtered out, but it
does a good job of catching these reorder issues.

In Phase 2 (starts-in-range validity + cap), a check is then made if the
op exists either at the boundary of a previously unused range, or
sometime after it but with a change on the other side of the diff also
in a gap between ranges. If so, it's emitted as a valid op, capped at
the start of the next range, and with the remainder queued for
re-processing.

This should catch the cases most likely to trigger shifted lines from
nearby line reorders without the risk of loss or bringing in much of the
upstream changes (though this is always possible with these heuristics).

Testing Done

All interdiff unit tests (using real datasets) pass.

The original repro case is now fixed.

Built a bunch of hand-crafted and tested diffs that exhibited variations
on this issue, including them in my testing tool and in the unit tests.
Verified that they're all showing the desired output.

Haven't seen any regression with upstream changes, but we'll need more
real-world testing.

Commits

Summary	ID
Fix interdiff changes with line reorders. The new interdiff algorithm had an edge case where changes could get lost when reordering a sequence of lines. There were a few ways this could be triggered. An easy example would be to have a change in revision 1 that looks like: ``` 99 +... 100 aaaa 101 +bbbb 102 +... ``` And in revision 2: ``` 99 +... 100 +bbbb 101 aaaa 102 103 +... ``` (Assume 0-based indexes above, instead of 1-based line numbers, to help the following comparisons make more sense.) The resulting unfiltered interdiff could look like: ``` 99 ... \| 99 ... - 100 aaaa \| 101 bbbb \| 100 bbbb + \| 101 aaaa 102 ... \| 102 ... ``` In this case, the whole set of changes would be filtered out. The reason is that the computed ranges for side A would be `[99,100)` and side B would be `[99,101)`. This wouldn't match the diff opcodes, which are: ``` ("delete", 100, 101, 99, 99), ("equal", 101, 102, 100, 101), ("insert", 102, 102, 101, 102), ``` That `delete` of `[100,101)` falls out of the range of `[99,100)`, and the `insert` of `[101,102)` falls out of the range of `[99,101)`. In the existing implementation, this would result in these being filtered out. Issues can also occur for `delete`s in-between ranges related to the above cases, where, for example, `aaaa` would be surrounded by `equal` lines. The resulting interdiff could continue to move `aaaa` as above, but in a way where the intermediate `delete` would be lost. This change adds boundary-aware heuristics to catch these kinds of problems. Now, we check for when an op starts on the boundary (at the exclusive end of the range). This is done separately in Phase 1 and Phase 2 of the algorithm, and by keeping track of previous ranges for both sides of the diff and which ranges have already been entered once (a change range in a revision's diff that hasn't been used in the interdiff is a good candidate for something that shifted to/below the boundary). If starting on a boundary in Phase 1 (pre-range split), a split will be considered if the previous range hasn't yet been entered and there's a current range for processing. If the normally-filtered part starts right on the boundary, it will be emitted. This will catch the cases above. It may result in upstream changes not being filtered out, but it does a good job of catching these reorder issues. In Phase 2 (starts-in-range validity + cap), a check is then made if the op exists either at the boundary of a previously unused range, or sometime after it but with a change on the other side of the diff also in a gap between ranges. If so, it's emitted as a valid op, capped at the start of the next range, and with the remainder queued for re-processing. This should catch the cases most likely to trigger shifted lines from nearby line reorders without the risk of loss or bringing in much of the upstream changes (though this is always possible with these heuristics).	22ac26b8117cd242e8a76ab169026c1f2e47ad9c

Summary

Fix interdiff changes with line reorders.

The new interdiff algorithm had an edge case where changes could get lost when reordering a sequence of lines. There were a few ways this could be triggered. An easy example would be to have a change in revision 1 that looks like: ``` 99 +... 100 aaaa 101 +bbbb 102 +... ``` And in revision 2: ``` 99 +... 100 +bbbb 101 aaaa 102 103 +... ``` (Assume 0-based indexes above, instead of 1-based line numbers, to help the following comparisons make more sense.) The resulting unfiltered interdiff could look like: ``` 99 ... | 99 ... - 100 aaaa | 101 bbbb | 100 bbbb + | 101 aaaa 102 ... | 102 ... ``` In this case, the whole set of changes would be filtered out. The reason is that the computed ranges for side A would be `[99,100)` and side B would be `[99,101)`. This wouldn't match the diff opcodes, which are: ``` ("delete", 100, 101, 99, 99), ("equal", 101, 102, 100, 101), ("insert", 102, 102, 101, 102), ``` That `delete` of `[100,101)` falls out of the range of `[99,100)`, and the `insert` of `[101,102)` falls out of the range of `[99,101)`. In the existing implementation, this would result in these being filtered out. Issues can also occur for `delete`s in-between ranges related to the above cases, where, for example, `aaaa` would be surrounded by `equal` lines. The resulting interdiff could continue to move `aaaa` as above, but in a way where the intermediate `delete` would be lost. This change adds boundary-aware heuristics to catch these kinds of problems. Now, we check for when an op starts on the boundary (at the exclusive end of the range). This is done separately in Phase 1 and Phase 2 of the algorithm, and by keeping track of previous ranges for both sides of the diff and which ranges have already been entered once (a change range in a revision's diff that hasn't been used in the interdiff is a good candidate for something that shifted to/below the boundary). If starting on a boundary in Phase 1 (pre-range split), a split will be considered if the previous range hasn't yet been entered and there's a current range for processing. If the normally-filtered part starts right on the boundary, it will be emitted. This will catch the cases above. It may result in upstream changes not being filtered out, but it does a good job of catching these reorder issues. In Phase 2 (starts-in-range validity + cap), a check is then made if the op exists either at the boundary of a previously unused range, or sometime after it but with a change on the other side of the diff also in a gap between ranges. If so, it's emitted as a valid op, capped at the start of the next range, and with the remainder queued for re-processing. This should catch the cases most likely to trigger shifted lines from nearby line reorders without the risk of loss or bringing in much of the upstream changes (though this is always possible with these heuristics).

22ac26b8117cd242e8a76ab169026c1f2e47ad9c

Issues

Description	From	Last Updated
Language is weird here. How about causes -> happens?	david	May 14, 2026, 1:25 a.m.
Seems like there's a missing "of" between "differ" and "the".	maubin	May 14, 2026, 2:19 p.m.

flake8 passed.

JSHint passed.

Description:: The new interdiff algorithm had an edge case where changes could get
lost when reordering a sequence of lines. There were a few ways this
could be triggered.

An easy example would be to have a change in revision 1 that looks like:

99 +...

100 aaaa

101 +bbbb

102 +...

And in revision 2:

99 +...

100 +bbbb

101 aaaa

102

103 +...

(Assume 0-based indexes above, instead of 1-based line numbers, to help
the following comparisons make more sense.)

The resulting unfiltered interdiff could look like:

99 ... | 99 ...

- 100 aaaa |

101 bbbb | 100 bbbb

+ | 101 aaaa

102 ... | 102 ...

In this case, the whole set of changes would be filtered out. The reason
is that the computed ranges for side A would be [99,100) and side B
would be [99,101). This wouldn't match the diff opcodes, which are:

("delete", 100, 101, 99, 99),

("equal", 101, 102, 100, 101),

("insert", 102, 102, 101, 102),

That delete of [100,101) falls out of the range of [99,100), and
the insert of [101,102) falls out of the range of [99,101). In the
existing implementation, this would result in these being filtered out.

+
In the cases above, note how the op starts at the exclusive end of
+ the change ranges (which I refer to as the "boundary"). This is due to
+ an insert becoming a delete or vice-versa.
+

Issues can also occur for deletes in-between ranges related to the
above cases, where, for example, aaaa would be surrounded by equal
lines. The resulting interdiff could continue to move aaaa as above,
but in a way where the intermediate delete would be lost.

This change adds boundary-aware heuristics to catch these kinds of
problems. Now, we check for when an op starts on the boundary (at the
exclusive end of the range). This is done separately in Phase 1 and
Phase 2 of the algorithm, and by keeping track of previous ranges for
both sides of the diff and which ranges have already been entered once
(a change range in a revision's diff that hasn't been used in the
interdiff is a good candidate for something that shifted to/below the
boundary).

If starting on a boundary in Phase 1 (pre-range split), a split will be
considered if the previous range hasn't yet been entered and there's a
current range for processing. If the normally-filtered part starts
right on the boundary, it will be emitted. This will catch the cases
above. It may result in upstream changes not being filtered out, but it
does a good job of catching these reorder issues.

In Phase 2 (starts-in-range validity + cap), a check is then made if the
op exists either at the boundary of a previously unused range, or
sometime after it but with a change on the other side of the diff also
in a gap between ranges. If so, it's emitted as a valid op, capped at
the start of the next range, and with the remainder queued for
re-processing.

This should catch the cases most likely to trigger shifted lines from
nearby line reorders without the risk of loss or bringing in much of the
upstream changes (though this is always possible with these heuristics).
Testing Done:: +
All interdiff unit tests (using real datasets) pass.
+
+
The original repro case is now fixed.
+
+
Built a bunch of hand-crafted and tested diffs that exhibited variations
+ on this issue, including them in my testing tool and in the unit tests.
+ Verified that they're all showing the desired output.
+
+
Haven't seen any regression with upstream changes, but we'll need more
+ real-world testing.

Description:: The new interdiff algorithm had an edge case where changes could get
lost when reordering a sequence of lines. There were a few ways this
could be triggered.

An easy example would be to have a change in revision 1 that looks like:

99 +...

100 aaaa

101 +bbbb

102 +...

And in revision 2:

99 +...

100 +bbbb

101 aaaa

102

103 +...

(Assume 0-based indexes above, instead of 1-based line numbers, to help
the following comparisons make more sense.)

The resulting unfiltered interdiff could look like:

99 ... | 99 ...

- 100 aaaa |

101 bbbb | 100 bbbb

+ | 101 aaaa

102 ... | 102 ...

In this case, the whole set of changes would be filtered out. The reason
is that the computed ranges for side A would be [99,100) and side B
would be [99,101). This wouldn't match the diff opcodes, which are:

("delete", 100, 101, 99, 99),

("equal", 101, 102, 100, 101),

("insert", 102, 102, 101, 102),

That delete of [100,101) falls out of the range of [99,100), and
the insert of [101,102) falls out of the range of [99,101). In the
existing implementation, this would result in these being filtered out.

In the cases above, note how the op starts at the exclusive end of
the change ranges (which I refer to as the "boundary"). This is due to
an insert becoming a delete or vice-versa.

Issues can also occur for deletes in-between ranges related to the
above cases, where, for example, aaaa would be surrounded by equal
lines. The resulting interdiff could continue to move aaaa as above,
but in a way where the intermediate delete would be lost.

~
This change adds boundary-aware heuristics to catch these kinds of
~ problems. Now, we check for when an op starts on the boundary (at the
~ exclusive end of the range). This is done separately in Phase 1 and
~ Phase 2 of the algorithm, and by keeping track of previous ranges for
~ both sides of the diff and which ranges have already been entered once
~ (a change range in a revision's diff that hasn't been used in the
~ interdiff is a good candidate for something that shifted to/below the
~ boundary).
~
This has been addressed with some new heuristics that check the
~ boundaries for shifted changes, and to check gaps between boundaries
~ for a delete (which should be from the legitimate interdiff and not
~ upstream changes). Guards were added that make sure we only do this if
~ the previous change range (which may be the remnants of a split range)
~ don't overlap with an opcode (indicating it may have shifted).
~
~
This is done separately in Phase 1 and Phase 2 of the algorithm, and by
+ keeping track of previous ranges for both sides of the diff and which
+ ranges have already been entered once (a change range in a revision's
+ diff that hasn't been used in the interdiff is a good candidate for
+ something that shifted to/below the boundary).

If starting on a boundary in Phase 1 (pre-range split), a split will be
considered if the previous range hasn't yet been entered and there's a
current range for processing. If the normally-filtered part starts
right on the boundary, it will be emitted. This will catch the cases
above. It may result in upstream changes not being filtered out, but it
does a good job of catching these reorder issues.

In Phase 2 (starts-in-range validity + cap), a check is then made if the
op exists either at the boundary of a previously unused range, or
sometime after it but with a change on the other side of the diff also
in a gap between ranges. If so, it's emitted as a valid op, capped at
the start of the next range, and with the remainder queued for
re-processing.

This should catch the cases most likely to trigger shifted lines from
nearby line reorders without the risk of loss or bringing in much of the
upstream changes (though this is always possible with these heuristics).

Ship it!

reviewboard/diffviewer/processors.py (Diff revision 1)
The issue has been resolved. Show all issues
```
Language is weird here. How about causes -> happens?
```

Ship it!

```
LGTM?
```
reviewboard/diffviewer/tests/test_processors.py (Diff revision 1)
The issue has been resolved. Show all issues
```
Seems like there's a missing "of" between "differ" and "the".
```

Status:: Completed
Change Summary:: Pushed to release-8.x (4c752fc)

		The new interdiff algorithm had an edge case where changes could get
		lost when reordering a sequence of lines. There were a few ways this
		could be triggered.

		An easy example would be to have a change in revision 1 that looks like:



		`99 +...`
		100 aaaa
		101 +bbbb
		102 +...




		And in revision 2:



		`99 +...`
		100 +bbbb
		101 aaaa
		102
		103 +...




		(Assume 0-based indexes above, instead of 1-based line numbers, to help
		the following comparisons make more sense.)

		The resulting unfiltered interdiff could look like:



		`99 ... \| 99 ...`
		- 100 aaaa \|
		101 bbbb \| 100 bbbb
		+ \| 101 aaaa
		102 ... \| 102 ...




		In this case, the whole set of changes would be filtered out. The reason
		is that the computed ranges for side A would be `[99,100)` and side B
		would be `[99,101)`. This wouldn't match the diff opcodes, which are:



		`("delete", 100, 101, 99, 99),`
		("equal", 101, 102, 100, 101),
		("insert", 102, 102, 101, 102),




		That `delete` of `[100,101)` falls out of the range of `[99,100)`, and
		the `insert` of `[101,102)` falls out of the range of `[99,101)`. In the
		existing implementation, this would result in these being filtered out.

	+	In the cases above, note how the op starts at the exclusive end of
	+	the change ranges (which I refer to as the "boundary"). This is due to
	+	an `insert` becoming a `delete` or vice-versa.
	+
		Issues can also occur for `delete`s in-between ranges related to the
		above cases, where, for example, `aaaa` would be surrounded by `equal`
		lines. The resulting interdiff could continue to move `aaaa` as above,
		but in a way where the intermediate `delete` would be lost.

		This change adds boundary-aware heuristics to catch these kinds of
		problems. Now, we check for when an op starts on the boundary (at the
		exclusive end of the range). This is done separately in Phase 1 and
		Phase 2 of the algorithm, and by keeping track of previous ranges for
		both sides of the diff and which ranges have already been entered once
		(a change range in a revision's diff that hasn't been used in the
		interdiff is a good candidate for something that shifted to/below the
		boundary).

		If starting on a boundary in Phase 1 (pre-range split), a split will be
		considered if the previous range hasn't yet been entered and there's a
		current range for processing. If the normally-filtered part starts
		right on the boundary, it will be emitted. This will catch the cases
		above. It may result in upstream changes not being filtered out, but it
		does a good job of catching these reorder issues.

		In Phase 2 (starts-in-range validity + cap), a check is then made if the
		op exists either at the boundary of a previously unused range, or
		sometime after it but with a change on the other side of the diff also
		in a gap between ranges. If so, it's emitted as a valid op, capped at
		the start of the next range, and with the remainder queued for
		re-processing.

		This should catch the cases most likely to trigger shifted lines from
		nearby line reorders without the risk of loss or bringing in much of the
		upstream changes (though this is always possible with these heuristics).