• 
      

    Fix interdiff changes with line reorders.

    Review Request #15055 — Created May 13, 2026 and submitted

    Information

    Review Board
    release-8.x

    Reviewers

    The new interdiff algorithm had an edge case where changes could get
    lost when reordering a sequence of lines. There were a few ways this
    could be triggered.

    An easy example would be to have a change in revision 1 that looks like:

    99   +...
    100   aaaa
    101  +bbbb
    102  +...
    

    And in revision 2:

    99   +...
    100  +bbbb
    101   aaaa
    102
    103  +...
    

    (Assume 0-based indexes above, instead of 1-based line numbers, to help
    the following comparisons make more sense.)

    The resulting unfiltered interdiff could look like:

      99   ...   |   99  ...
    - 100  aaaa  |
      101  bbbb  |  100  bbbb
    +            |  101  aaaa
      102  ...   |  102  ...
    

    In this case, the whole set of changes would be filtered out. The reason
    is that the computed ranges for side A would be [99,100) and side B
    would be [99,101). This wouldn't match the diff opcodes, which are:

    ("delete", 100, 101,  99, 99),
    ("equal",  101, 102, 100, 101),
    ("insert", 102, 102, 101, 102),
    

    That delete of [100,101) falls out of the range of [99,100), and
    the insert of [101,102) falls out of the range of [99,101). In the
    existing implementation, this would result in these being filtered out.

    In the cases above, note how the op starts at the exclusive end of
    the change ranges (which I refer to as the "boundary"). This is due to
    an insert becoming a delete or vice-versa.

    Issues can also occur for deletes in-between ranges related to the
    above cases, where, for example, aaaa would be surrounded by equal
    lines. The resulting interdiff could continue to move aaaa as above,
    but in a way where the intermediate delete would be lost.

    This has been addressed with some new heuristics that check the
    boundaries for shifted changes, and to check gaps between boundaries
    for a delete (which should be from the legitimate interdiff and not
    upstream changes). Guards were added that make sure we only do this if
    the previous change range (which may be the remnants of a split range)
    don't overlap with an opcode (indicating it may have shifted).

    This is done separately in Phase 1 and Phase 2 of the algorithm, and by
    keeping track of previous ranges for both sides of the diff and which
    ranges have already been entered once (a change range in a revision's
    diff that hasn't been used in the interdiff is a good candidate for
    something that shifted to/below the boundary).

    If starting on a boundary in Phase 1 (pre-range split), a split will be
    considered if the previous range hasn't yet been entered and there's a
    current range for processing. If the normally-filtered part starts
    right on the boundary, it will be emitted. This will catch the cases
    above. It may result in upstream changes not being filtered out, but it
    does a good job of catching these reorder issues.

    In Phase 2 (starts-in-range validity + cap), a check is then made if the
    op exists either at the boundary of a previously unused range, or
    sometime after it but with a change on the other side of the diff also
    in a gap between ranges. If so, it's emitted as a valid op, capped at
    the start of the next range, and with the remainder queued for
    re-processing.

    This should catch the cases most likely to trigger shifted lines from
    nearby line reorders without the risk of loss or bringing in much of the
    upstream changes (though this is always possible with these heuristics).

    All interdiff unit tests (using real datasets) pass.

    The original repro case is now fixed.

    Built a bunch of hand-crafted and tested diffs that exhibited variations
    on this issue, including them in my testing tool and in the unit tests.
    Verified that they're all showing the desired output.

    Haven't seen any regression with upstream changes, but we'll need more
    real-world testing.

    Summary ID
    Fix interdiff changes with line reorders.
    The new interdiff algorithm had an edge case where changes could get lost when reordering a sequence of lines. There were a few ways this could be triggered. An easy example would be to have a change in revision 1 that looks like: ``` 99 +... 100 aaaa 101 +bbbb 102 +... ``` And in revision 2: ``` 99 +... 100 +bbbb 101 aaaa 102 103 +... ``` (Assume 0-based indexes above, instead of 1-based line numbers, to help the following comparisons make more sense.) The resulting unfiltered interdiff could look like: ``` 99 ... | 99 ... - 100 aaaa | 101 bbbb | 100 bbbb + | 101 aaaa 102 ... | 102 ... ``` In this case, the whole set of changes would be filtered out. The reason is that the computed ranges for side A would be `[99,100)` and side B would be `[99,101)`. This wouldn't match the diff opcodes, which are: ``` ("delete", 100, 101, 99, 99), ("equal", 101, 102, 100, 101), ("insert", 102, 102, 101, 102), ``` That `delete` of `[100,101)` falls out of the range of `[99,100)`, and the `insert` of `[101,102)` falls out of the range of `[99,101)`. In the existing implementation, this would result in these being filtered out. Issues can also occur for `delete`s in-between ranges related to the above cases, where, for example, `aaaa` would be surrounded by `equal` lines. The resulting interdiff could continue to move `aaaa` as above, but in a way where the intermediate `delete` would be lost. This change adds boundary-aware heuristics to catch these kinds of problems. Now, we check for when an op starts on the boundary (at the exclusive end of the range). This is done separately in Phase 1 and Phase 2 of the algorithm, and by keeping track of previous ranges for both sides of the diff and which ranges have already been entered once (a change range in a revision's diff that hasn't been used in the interdiff is a good candidate for something that shifted to/below the boundary). If starting on a boundary in Phase 1 (pre-range split), a split will be considered if the previous range hasn't yet been entered and there's a current range for processing. If the normally-filtered part starts right on the boundary, it will be emitted. This will catch the cases above. It may result in upstream changes not being filtered out, but it does a good job of catching these reorder issues. In Phase 2 (starts-in-range validity + cap), a check is then made if the op exists either at the boundary of a previously unused range, or sometime after it but with a change on the other side of the diff also in a gap between ranges. If so, it's emitted as a valid op, capped at the start of the next range, and with the remainder queued for re-processing. This should catch the cases most likely to trigger shifted lines from nearby line reorders without the risk of loss or bringing in much of the upstream changes (though this is always possible with these heuristics).
    22ac26b8117cd242e8a76ab169026c1f2e47ad9c
    Description From Last Updated

    Language is weird here. How about causes -> happens?

    david david

    Seems like there's a missing "of" between "differ" and "the".

    maubin maubin
    chipx86
    chipx86
    david
    1. 
        
    2. reviewboard/diffviewer/processors.py (Diff revision 1)
       
       
      Show all issues

      Language is weird here. How about causes -> happens?

    3. 
        
    maubin
    1. LGTM?

    2. Show all issues

      Seems like there's a missing "of" between "differ" and "the".

    3. 
        
    chipx86
    Review request changed
    Status:
    Completed
    Change Summary:
    Pushed to release-8.x (4c752fc)