    Introduce the Interdiff Filtering Algorithm v3.

    Review Request #15022 — Created April 27, 2026

    Information

    Repository: Review Board
    Branch: release-8.x

    We've had interdiff filtering for 20 years now, and have gone through a
    couple of major revisions of the algorithm. The latest, v2, has been
    around the longest, but has a few edge cases that can lead to
    over-filtered results, or to longer stretches of changes that don't
    accurately filter out all of the upstream changes.

    This change introduces the new version 3 of the algorithm, with a few
    primary differences:

    1. We no longer parse diff hunks to gather changed range information.
      Instead, we generate our own using our differ and settings, ensuring
      the algorithm used for placement is the same between the interdiff
      and the change regions.

      Diff algorithms may choose different strategies when deciding which
      of a set of similar lines should be inserted or deleted. Braces are a
      good example. The addition of a couple braces and a line of code
      within could end up appearing after an existing closing brace or
      before an existing opening brace.

      By choosing the same algorithm across the board, we should ensure the
      same strategies for insert/delete generation.

    2. Finer-grained change ranges.

      The old hunk-parsing approach gave us hunk-wide change ranges. This
      was fine for most cases, but increased the likelihood of a bug (now
      fixed) regarding over-aggressive filtering of trailing opcode ranges.
      In other cases, it resulted in unmodified lines appearing in the
      resulting diff instead of being filtered out, which was more common
      in larger changed hunks containing gaps of up to 6 unmodified lines
      (the number of context lines allowed before a hunk is split).

    3. Splitting of opcodes at leading range boundaries.

      The v2 implementation validated and capped opcodes at the trailing
      end. If an opcode straddled a valid and an invalid range boundary,
      it would be split at that boundary and the leading part would be
      yielded while the trailing would be filtered.

      This wasn't sufficient, as opcodes could exist that started before a
      valid range and extended into or past it, and those would end up not
      being seen as valid, causing them to disappear. This would most
      commonly happen at the end of a file, depending on the length of the
      opcode and where it starts.

      This work is now split into two phases: a pre-range split phase,
      which covers this case, and a starts-in-range validity and cap
      phase.

      We now catch all starting and ending conditions of an opcode,
      splitting appropriately and ensuring that we don't end up with any
      pieces that are mistakenly left out.

    4. Off-by-one fixes.

      We had some off-by-one issues with some of the range calculations,
      which could lead to a range being advanced prematurely, depending on
      the position of changed lines. These have all been addressed.
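
    The boundary handling in points 2-4 can be sketched as follows. This
    is a simplified illustration with hypothetical names, not Review
    Board's actual implementation: opcodes here are (tag, i1, i2)
    half-open line ranges, and valid ranges are sorted, non-overlapping
    (start, end) pairs.

```python
def split_opcodes(opcodes, valid_ranges):
    """Yield only the pieces of each opcode that fall in a valid range."""
    for tag, i1, i2 in opcodes:
        for rstart, rend in valid_ranges:
            if i2 <= rstart:
                # The opcode ends before this range; later ranges start
                # even further out, so nothing more can match.
                break

            if i1 >= rend:
                # The opcode starts after this range; try the next one.
                continue

            # Phase 1: pre-range split. Drop the leading piece that
            # starts before the valid range (the case v2 missed).
            start = max(i1, rstart)

            # Phase 2: starts-in-range validity + cap. Trim the
            # trailing piece that extends past the valid range.
            end = min(i2, rend)

            yield (tag, start, end)


print(list(split_opcodes([('replace', 0, 10)], [(2, 5), (7, 9)])))
# → [('replace', 2, 5), ('replace', 7, 9)]
```

    The pre-range split handles opcodes that start before a valid range,
    and the cap trims pieces extending past it, so no valid piece is
    silently dropped.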

    The new algorithm is heavily documented and remains efficient. There is
    added complexity in that we need to perform a diff on both ends of the
    range, but like our standard diff machinery, we progressively iterate
    as we consume ranges, rather than building up-front and storing it all
    in memory.
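
    As a rough sketch of generating change ranges straight from a differ
    (Python's difflib stands in for our own differ here; the names are
    illustrative), each non-equal opcode yields one tight range, and
    ranges are produced lazily as they're consumed:

```python
import difflib


def iter_change_ranges(orig_lines, patched_lines):
    """Lazily yield (start, end) ranges of changed lines in the patched file."""
    matcher = difflib.SequenceMatcher(None, orig_lines, patched_lines)

    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != 'equal':
            # One tight range per change, rather than one per hunk.
            yield (j1, j2)


orig = ['a', 'b', 'c', 'd', 'e']
patched = ['a', 'B', 'c', 'd', 'E', 'f']
print(list(iter_change_ranges(orig, patched)))
# → [(1, 2), (4, 6)]
```

    Because the same differ produces both the interdiff opcodes and these
    ranges, insert/delete placement strategies (as in the brace example
    above) can't disagree between the two.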

    Since this new process requires original and patched lines in order to
    perform the diff, we now have additional state to funnel into
    filter_interdiff_opcodes(). The opcode generator already has these
    lines, so there's no re-fetching required. We now just save them for
    the duration of the process, passing in the original lines and then
    the differ's copy of the patched lines. Both were already in memory,
    so there's no real additional cost.
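
    A purely illustrative sketch of that state-funneling (the names and
    signatures here are hypothetical, not the actual
    filter_interdiff_opcodes() API):

```python
import difflib


def generate_opcodes(orig_lines, patched_lines):
    """Yield filtered opcodes, handing the filter lines we already hold."""
    matcher = difflib.SequenceMatcher(None, orig_lines, patched_lines)

    # The differ needed both sides in memory anyway, so saving them for
    # the duration of filtering requires no re-fetching.
    yield from filter_opcodes(matcher.get_opcodes(),
                              orig_lines=orig_lines,
                              patched_lines=patched_lines)


def filter_opcodes(opcodes, *, orig_lines, patched_lines):
    """Drop 'equal' opcodes; a real filter would diff both ends here."""
    for op in opcodes:
        if op[0] != 'equal':
            yield op


print(list(generate_opcodes(['a', 'b'], ['a', 'c'])))
# → [('replace', 1, 2, 1, 2)]
```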

    Inline functions have also been removed, which simplifies the
    function and reduces its overhead.

    All existing unit tests have been updated to ease testing. The
    expected states still match, except that some have been adjusted for
    the tighter change ranges we now produce. New tests
    have been added based on our collection of customer interdiff data
    (which only includes opcodes and ranges, no file content of any sort).
    This will give us a good built-in set of tests for future work based on
    real-world edge cases that were hit with the v2 algorithm.

    This does not include any visual changes. There are still cases where
    an interdiff may look strange due to the nature of the filtering, as the
    act of interdiff filtering avoids showing some lines that would help make
    the diff make sense. I'd like to address this in the future, but it's
    no worse than it was before, and it's a larger set of work.

    All unit tests passed.

    Went through every known bad interdiff we had, carefully comparing
    the results from the v2 and v3 algorithm, the filtered and unfiltered
    cases, and the independent diffs at both ends of the revision. Verified
    that the interdiffs weren't missing anything and, in many cases, were more
    fine-grained.
