    Introduce the Interdiff Filtering Algorithm v3.

    Review Request #15022 — Created April 27, 2026

    Repository: Review Board
    Branch: release-8.x

    We've had interdiff filtering for 20 years now, and have gone through a
    couple of major revisions of the algorithm. The latest, v2, has been
    around the longest, but had a few edge cases that could sometimes lead
    to over-filtered results or longer stretches of changes that didn't
    accurately filter out all of the upstream changes.

    This change introduces the new version 3 of the algorithm, with a few
    primary differences:

    1. We no longer parse diff hunks to gather changed range information.
      Instead, we generate our own ranges using our differ and settings,
      ensuring the algorithm used for placement is the same between the
      interdiff and the change regions.

      Diff algorithms may choose different strategies when deciding which
      of a set of similar lines should be inserted or deleted. Braces are a
      good example. The addition of a couple braces and a line of code
      within could end up appearing after an existing closing brace or
      before an existing opening brace.

      By choosing the same algorithm across the board, we should ensure the
      same strategies for insert/delete generation.

    2. Finer-grained change ranges.

      The old hunk-parsing approach gave us hunk-wide change ranges. This
      was fine for most cases, but increased the likelihood of a bug (now
      fixed) regarding over-aggressive filtering of trailing opcode ranges.
      In other cases, it resulted in unmodified lines appearing in the
      resulting diff instead of being filtered out. This was more common
      in larger changed hunks with line gaps of up to six lines (the
      amount of context allowed before a hunk is split).

    3. Splitting of opcodes at leading range boundaries.

      The v2 implementation validated and capped opcodes at the trailing
      end. If an opcode straddled a valid and an invalid range boundary,
      it would be split at that boundary and the leading part would be
      yielded while the trailing would be filtered.

      This wasn't sufficient, as opcodes could exist that started before a
      valid range and extended into or past it, and those would end up not
      being seen as valid, causing them to disappear. This would most
      commonly happen at the end of a file, depending on the length of the
      opcode and where it starts.

      This work is now split into two phases: a pre-range split phase,
      which covers this case, and a starts-in-range validity and capping
      phase.

      We now catch all starting and ending conditions of an opcode,
      splitting appropriately and ensuring that we don't end up with any
      pieces that are mistakenly left out.

    4. Off-by-one fixes.

      We had some off-by-one issues with some of the range calculations,
      which could lead to a range being advanced prematurely, depending on
      the position of changed lines. These have all been addressed.
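    The brace-placement ambiguity from point 1 can be illustrated with
    Python's difflib (used here purely as a stand-in differ; it is not
    the differ this change uses):

```python
import difflib

# Two revisions of a file; the new one adds an "if (y)" block right
# after an existing closing brace. A differ must decide whether the
# inserted lines sit after the old "}" or whether the old "}" instead
# pairs with the new block's "}" -- both choices produce a valid diff.
old = ["if (x) {", "    a();", "}", ""]
new = ["if (x) {", "    a();", "}", "if (y) {", "    b();", "}", ""]

sm = difflib.SequenceMatcher(None, old, new)
for tag, i1, i2, j1, j2 in sm.get_opcodes():
    print(tag, old[i1:i2], new[j1:j2])

# SequenceMatcher anchors on the longest common run here and inserts
# the whole new block after the old "}". A differently-tuned algorithm
# could legally match the old "}" to the new block's "}" instead,
# shifting the inserted range. Using one differ for both the interdiff
# and the change regions removes this kind of mismatch.
```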

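    The pre-range split idea from point 3 can be sketched as a small
    helper (the name and shape are hypothetical, not the actual Review
    Board code, and it only handles opcodes whose two sides advance in
    lockstep, such as "equal"):

```python
def split_at_boundary(opcode, boundary):
    """Split a (tag, i1, i2, j1, j2) opcode at an original-side row
    boundary.

    Hypothetical sketch of the v3 pre-range split phase. Only valid
    for opcodes whose sides advance in lockstep (e.g. "equal"), where
    offsets on both sides stay in sync.
    """
    tag, i1, i2, j1, j2 = opcode
    if i1 < boundary < i2:
        offset = boundary - i1
        return [
            (tag, i1, boundary, j1, j1 + offset),
            (tag, boundary, i2, j1 + offset, j2),
        ]
    # Boundary outside the opcode: nothing to split.
    return [opcode]


# An "equal" opcode covering rows 10-20 straddles a valid range that
# starts at row 15: after splitting, the leading piece can be
# validated or filtered independently of the trailing piece.
print(split_at_boundary(("equal", 10, 20, 110, 120), 15))
# -> [('equal', 10, 15, 110, 115), ('equal', 15, 20, 115, 120)]
```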
    The new algorithm is heavily documented and remains efficient. There is
    added complexity in that we need to perform a diff on both ends of the
    range, but like our standard diff machinery, we progressively iterate
    as we consume ranges, rather than building up-front and storing it all
    in memory.
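    The progressive, range-consuming style can be sketched as a
    generator. This is illustrative only: the names and the clamping
    rule are simplified assumptions, not the real
    filter_interdiff_opcodes() logic, and for brevity each opcode is
    clamped against at most one range.

```python
def clamp_to_ranges(opcodes, valid_ranges):
    """Yield only the pieces of each opcode that fall inside valid
    original-side (start, end) ranges, consuming ranges lazily.

    Simplified sketch of the streaming approach: nothing is built
    up-front; ranges advance only as opcodes pass them. Assumes
    lockstep opcodes (e.g. "equal") for the clamping math.
    """
    ranges = iter(valid_ranges)
    cur = next(ranges, None)
    for tag, i1, i2, j1, j2 in opcodes:
        # Skip ranges that end before this opcode begins.
        while cur is not None and cur[1] <= i1:
            cur = next(ranges, None)
        if cur is None:
            break
        lo = max(i1, cur[0])
        hi = min(i2, cur[1])
        if lo < hi:
            yield (tag, lo, hi, j1 + (lo - i1), j1 + (hi - i1))


print(list(clamp_to_ranges([("equal", 0, 10, 0, 10)], [(3, 6)])))
# -> [('equal', 3, 6, 3, 6)]
```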

    Since this new process requires the original and patched lines in
    order to perform the diff, we now have additional state to funnel
    into filter_interdiff_opcodes(). The opcode generator already has
    these lines, so there's no re-fetching required. We now just save
    them for the duration of the process, passing in the original lines
    and the differ's copy of the patched lines. These were already in
    memory, so there's no real additional cost.

    Inline functions have also been removed, which simplifies the
    function and reduces its overhead.

    All existing unit tests have been updated to ease testing. The
    expected states all still match, except that some have adjustments
    for the tighter change ranges we now produce. New tests
    have been added based on our collection of customer interdiff data
    (which only includes opcodes and ranges, no file content of any sort).
    This will give us a good built-in set of tests for future work based on
    real-world edge cases that were hit with the v2 algorithm.

    This does not include any visual changes. There are still cases where
    an interdiff may look strange due to the nature of the filtering, as the
    act of interdiff filtering avoids showing some lines that would help make
    the diff make sense. I'd like to address this in the future, but it's
    no worse than it was before, and it's a larger set of work.

    All unit tests passed.

    Went through every known bad interdiff we had, carefully comparing
    the results from the v2 and v3 algorithms, the filtered and unfiltered
    cases, and the independent diffs at both ends of the revision. Verified
    that the interdiffs weren't missing anything and, in many cases, were more
    fine-grained.

    Commit:
    Introduce the Interdiff Filtering Algorithm v3.
    bd724272dc45d24e2e32dfc02db046ce3533028b
    Issues from reviewbot (flake8):

    'reviewboard.diffviewer.myersdiff.MyersDiffer' imported but unused
    (column 1, error code F401)

    'reviewboard.diffviewer.settings.DiffSettings' imported but unused
    (column 1, error code F401)
    Checks run (1 failed, 1 succeeded)
    flake8 failed.
    JSHint passed.

    chipx86
    Review request changed
    Change Summary:

    Removed unused imports.

    Commits:
    Introduce the Interdiff Filtering Algorithm v3.
    e3267404f9acc4514107c35b867e0fcbf135bb05
    bd724272dc45d24e2e32dfc02db046ce3533028b

    Checks run (2 succeeded)

    flake8 passed.
    JSHint passed.