Introduce the Interdiff Filtering Algorithm v3.
Review Request #15022 — Created April 27, 2026
We've had interdiff filtering for 20 years now, and have gone through a
couple of major revisions of the algorithm. The latest, v2, has been
around the longest, but had a few edge cases that could sometimes lead
to over-filtered results or longer stretches of changes that didn't
accurately filter out all of the upstream changes.

This change introduces the new version 3 of the algorithm, with a few
primary differences:
We no longer parse diff hunks to gather changed range information.
Instead, we generate our own using our differ and settings, ensuring
that the algorithm used for placement is the same between the interdiff
and the change regions.

Diff algorithms may choose different strategies when deciding which
of a set of similar lines should be inserted or deleted. Braces are a
good example. The addition of a couple of braces and a line of code
within them could end up appearing after an existing closing brace or
before an existing opening brace.

By choosing the same algorithm across the board, we should ensure the
same strategies for insert/delete generation.

Finer-grained change ranges.
The old hunk-parsing approach gave us hunk-wide change ranges. This
was fine for most cases, but increased the likelihood of a bug (now
fixed) involving over-aggressive filtering of trailing opcode ranges.
In other cases, it resulted in unmodified lines appearing in the
resulting diff instead of being filtered out, which was more common
in larger changed hunks with gaps of up to 6 lines (the number of
lines of context allowed before a hunk is split).

Splitting of opcodes at leading range boundaries.
The v2 implementation validated and capped opcodes at the trailing
end. If an opcode straddled a valid and an invalid range boundary,
it would be split at that boundary, and the leading part would be
yielded while the trailing part would be filtered.

This wasn't sufficient, as opcodes could exist that started before a
valid range and extended into or past it, and those would end up not
being seen as valid, causing them to disappear. This would most
commonly happen at the end of a file, depending on the length of the
opcode and where it started.

This work is now split into two phases: a pre-range split phase, which
covers this case, and the starts-in-range validity and capping phase.

We now catch all starting and ending conditions of an opcode,
splitting appropriately and ensuring that we don't end up with any
pieces that are mistakenly left out.

Off-by-one fixes.
We had some off-by-one issues in some of the range calculations,
which could lead to a range being advanced prematurely, depending on
the position of changed lines. These have all been addressed.

The new algorithm is heavily documented and remains efficient. There is
added complexity in that we need to perform a diff on both ends of the
range, but like our standard diff machinery, we progressively iterate
as we consume ranges, rather than building up-front and storing it all
in memory.

Since this new process requires the original and patched lines in
order to perform the diff, we now have additional state to funnel into
filter_interdiff_opcodes(). The opcode generator already has these
lines, so there's no re-fetching required. We now just save them for
the duration of the process, passing in the original lines and then the
differ's copy of the patched lines. These were already in memory, so
there's no real additional cost.

Inline functions have also been removed, simplifying the function and
reducing its overhead.

All existing unit tests have been updated to ease testing. The expected
states all still match, except that some of them have adjustments
for the tighter change ranges we now have. New tests
have been added based on our collection of customer interdiff data
(which only includes opcodes and ranges, no file content of any sort).
This will give us a good built-in set of tests for future work based on
real-world edge cases that were hit with the v2 algorithm.

This does not include any visual changes. There are still cases where
an interdiff may look strange due to the nature of the filtering, as the
act of interdiff filtering avoids showing some lines that would help make
the diff make sense. I'd like to address this in the future, but it's
no worse than it was before, and it's a larger body of work.
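To make the opcode-splitting behavior described above concrete, here's a
minimal sketch of the two-phase approach (a pre-range split, followed by
the starts-in-range validity and cap step). This is an illustrative
simplification using difflib-style (tag, i1, i2, j1, j2) opcodes, not
the actual filter_interdiff_opcodes() code; split_opcode and
filter_opcodes are hypothetical names for this sketch.

```python
def split_opcode(op, split_i):
    """Split an opcode at original-side offset split_i into two pieces.

    Assumes line-for-line opcodes ('equal'/'replace') for simplicity;
    real interdiff opcodes need more careful handling of inserts and
    deletes, where one side doesn't advance.
    """
    tag, i1, i2, j1, j2 = op
    offset = split_i - i1
    return ((tag, i1, split_i, j1, j1 + offset),
            (tag, split_i, i2, j1 + offset, j2))


def filter_opcodes(opcodes, valid_ranges):
    """Yield only the pieces of each opcode inside a valid range.

    valid_ranges is a sorted list of half-open (start, end) ranges on
    the original side.
    """
    for op in opcodes:
        for start, end in valid_ranges:
            tag, i1, i2, j1, j2 = op

            if i2 <= start:
                break     # Ranges are sorted; nothing later overlaps.

            if i1 >= end:
                continue  # Entirely past this range; try the next one.

            # Phase 1: pre-range split. If the opcode starts before the
            # valid range, split off the leading (invalid) piece so the
            # in-range remainder isn't discarded wholesale.
            if i1 < start:
                _, op = split_opcode(op, start)

            # Phase 2: the opcode now starts in-range. Cap it at the
            # trailing boundary, keeping any remainder for later ranges.
            tag, i1, i2, j1, j2 = op
            if i2 > end:
                head, op = split_opcode(op, end)
                yield head
            else:
                yield op
                break


# An opcode spanning lines 0-10, with only lines 2-4 and 6-8 valid:
ops = [('equal', 0, 10, 0, 10)]
print(list(filter_opcodes(ops, [(2, 4), (6, 8)])))
# → [('equal', 2, 4, 2, 4), ('equal', 6, 8, 6, 8)]
```

In the v2-style approach, only the trailing cap existed, so the leading
piece of a straddling opcode could cause the whole opcode to be dropped;
the phase-1 split is what rescues the in-range remainder.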
All unit tests passed.
Went through every known bad interdiff we had, carefully comparing
the results from the v2 and v3 algorithms, the filtered and unfiltered
cases, and the independent diffs at both ends of the revision. Verified
that the interdiffs weren't missing anything and, in many cases, were more
fine-grained.
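
As a footnote to the strategy point above (about where braces land in a
diff), here's a small standalone illustration using Python's difflib,
unrelated to the Review Board code itself: when a new function is added
whose closing brace matches an existing line, the differ has to pick
which brace counts as the insertion.

```python
import difflib

# Old and new versions of a file. g()'s closing brace is identical to
# f()'s, so a differ must decide which "}" is pre-existing and which
# is part of the insertion.
old = ['void f() {', '    a();', '}']
new = ['void f() {', '    a();', '}', '', 'void g() {', '    b();', '}']

matcher = difflib.SequenceMatcher(None, old, new)
for tag, i1, i2, j1, j2 in matcher.get_opcodes():
    print(tag, i1, i2, j1, j2)
# → equal 0 3 0 3
# → insert 3 3 3 7
#
# difflib anchors on the longest matching block, keeping the original
# "}" in place and reporting everything after it as inserted. A
# different algorithm could instead insert a "}" before the existing
# one, shifting where the change appears to be.
```

Generating both sets of ranges with the same differ and settings, as v3
does, avoids mixing two such placements in one interdiff.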