Update the types for diff chunks and lines.
Review Request #14399 — Created April 14, 2025 and updated
This change defines a
TypedDict
for diff chunks, and atuple
for
diff lines, with definitions and documentation for each element.Before, these were just an opaque dictionary, and a heterogenous list.
The diff lines were particularly annoying because they were variable
length (if there was no metadata, that element was just left off the
end), and types were annoying (for "real" line numbers, we'd have an
int if the line was present on that side of the diff, but an empty
string if not). This change technically breaks compatibility in those
two cases.We now always include the metadata field, set to
None
if there's no
metadata for the line. This increases the cached size of a very large
diff with thousands of lines by less that 0.1% (much better than many
years ago with older pickle protocols--the tuple is probably also
helping here).In the case of real line numbers which are not present on one side of
the diff (for example, the line number of an added line for the
left-hand side of the diff), we set the data for that toNone
. In
order to maintain compatibility with current versions of Review Bot, we
changeNone
line numbers back to an empty string for the filediff
resource'sdiff_data
mode.This change only touches code related to chunks and lines. Other diff
viewer cleanup is coming in later changes.
- Tested a ton of different diffs, both in the diff viewer as well as
the text and markdown Review UIs. Verified that all types of edits
(added/edited/deleted/moved) lines all worked as expected. - Ran unit tests.
Summary | ID |
---|---|
9daae1bba689c7145420b91cfb69c0aee56928ea |
Description | From | Last Updated | ||
---|---|---|---|---|
closing bracket does not match indentation of opening bracket's line Column: 17 Error code: E123 |
![]() |
|||
'reviewboard.diffviewer.chunk_generator.DiffLine' imported but unused Column: 5 Error code: F401 |
![]() |
|||
too many blank lines (2) Column: 5 Error code: E303 |
![]() |
|||
undefined name 'none' Column: 42 Error code: F821 |
![]() |
|||
Let's move this and other non-runtime types to a TYPE_CHECKING block. |
|
|||
Can we order the keys in here alphabetically? |
|
|||
I'm trying to understand this set of changes. Bear with me. We were conditionally normalizing, and now we always do. … |
|
|||
All diff operations use i1:i2, j1:j2, and tag as the variables. I'd like to keep these as they were wherever … |
|
|||
Do we have to cast to this? It'd be nice to find away around having to do this for every … |
|
|||
While here, can we pull self.filediff out into a local variable? We access it a lot. |
|
|||
This needs to be a : TypeAlias. Can we define it near the top of the module where we normally … |
|
|||
Let's keep all this as None. It's an explicit "There are no ranges" response, and avoids creating a new object … |
|
|||
The old code was this way to avoid needing to perform comparisons when we know we have our answer. Any … |
|
|||
These can be linenum1 or '' and linenum2 or ''. But I'm not sure this matters, because in the cases … |
|
|||
'typing.Union' imported but unused Column: 1 Error code: F401 |
![]() |
|||
Can we move this into the TYPE_CHECKING? |
|
|||
convert_to_unicode should be before the get_*. |
|
|||
This sentence ("This compatibility ...") is missing a word, I think. |
|
|||
TypeAlias can live in TYPE_CHECKING. |
|
|||
There are no open issues |
- Commits:
-
Summary ID c07cac59fee7c3f20506e9999d70f54eb99148c6 541e055e3978c23ea3bba8f9ee0b67c26fc4134f - Diff:
-
Revision 2 (+1054 -462)
Checks run (2 succeeded)
- Commits:
-
Summary ID 541e055e3978c23ea3bba8f9ee0b67c26fc4134f b65d93070af3cf1bc40238f661da52ad4eeacd2d - Diff:
-
Revision 3 (+1044 -452)
Checks run (2 succeeded)
- Commits:
-
Summary ID b65d93070af3cf1bc40238f661da52ad4eeacd2d e2861454c701dc0aa04163b7aa4a1982233c2cb2 - Diff:
-
Revision 4 (+1044 -452)
Checks run (2 succeeded)
- Commits:
-
Summary ID e2861454c701dc0aa04163b7aa4a1982233c2cb2 111341255393a8d9dc42826a2f5e7e443ddb1ed0 - Diff:
-
Revision 5 (+1044 -452)
Checks run (2 succeeded)
- Commits:
-
Summary ID 111341255393a8d9dc42826a2f5e7e443ddb1ed0 c53a6fb50e23eaecc39f7239a5998d6553c13cee - Diff:
-
Revision 6 (+1044 -452)
Checks run (2 succeeded)
- Commits:
-
Summary ID c53a6fb50e23eaecc39f7239a5998d6553c13cee 43f82cea94288ecb0ca769be4b38c080b27fd028 - Diff:
-
Revision 7 (+1454 -658)
Checks run (2 succeeded)
- Description:
-
This change defines a
TypedDict
for diff chunks, and aNamedTuple
for diff lines, with definitions and documentation for each element. Before, these were just an opaque dictionary, and a heterogenous list.
The diff lines were particularly annoying because they were variable length (if there was no metadata, that element was just left off the end), and types were annoying (for "real" line numbers, we'd have an int if the line was present on that side of the diff, but was set to an empty string if it wasn't). This change technically breaks compatibility in those two cases. We now always include the metadata field, set to None
if there's no metadata for the line. In the case ofreal line numbers which are not present on one side of the diff (for example, the line number of an added line for the left-hand side of the diff), we set the data for that to None
.The
diff_lines
template tag and other consumers now use names insteadof indexing magically into the structure, making it much clearer what's going on. + This also updates the regions field to always be a list. In some cases
+ we had empty lists, and in others we had None
.+ This change only touches code related to chunks and lines. Other diff
viewer cleanup is coming in later changes.
-
There are a couple things I'm going to be really mindful of when reviewing anything touching diffs:
- Terminology. Diff operations use "tags", "i1:i2" ranges, and "j1:j2" ranges, which don't necessarily mean "lines". That's deliberate. I want to keep those i/j naming wherever possible for diffs when the code is generic. I have some old work I intend to dig up some day to add a higher-level process for diffing some kinds of non-text data.
- Performance. Diff code is some of the most expensive code in the codebase. As such, I want to keep things tight where we can, avoid unnecessary work (keep checks optimized, avoid expense if it's just for typing-related benefits). Much of this code is called per-line across potentially tens of thousands of lines or more across a page of diffs, so it matters.
- Regressions/changes in behavior. This code's been around a long time and there are a lot of edge cases covered. Plus we have public APIs around this. I want to avoid changing data types where it's not needed.
For reasons 2 and 3, I'm also pretty hesistant to move to a NamedTuple. Nice as it is, I ultimately don't think we should. It's easier to read and follow, but it carries a performance cost for no reason but code maintenance, and while that's not a bad reason, I think the impact on performance in diff code paths make it less of a gain. NamedTuples cost more to construct and cost more to access. Even if it's just a little, we do it a lot.
-
-
-
I'm trying to understand this set of changes. Bear with me.
We were conditionally normalizing, and now we always do. Looks like before, we weren't using
{old,new}_encoding_list
if there wasn't a class-level encoding list set, causing us to ignore what was passed in. That's fine.However, it changes things when the case where neither is provided, forcing encoding to take place to utf-8. I'm not sure if this is right or wrong, but it's not how we had it before. But what we had before may have been broken. Trying to walk through the scenarios.
And then we're joining to strings. I don't think any of this is necessary. We certainly never did it before, and looking through this code, we never touch
old
,new
,old_str
, ornew_str
in theis_lists
case. We only care aboutold_lines
andnew_lines
. So I think we can just delete that code outright, which is good because it's an expense we don't want.As a tidbit of historical information, at one point I was trying to keep some of this free of assuming byte strings so that we could pass higher-level objects that a Differ could understand, and then generate string results from all that. I have an ancient branch to help get some of this in shape for that, but it's stale. Not too worried about it now, but this is in part the reason we call things
a
andb
and notold_lines
andnew_lines
. -
All diff operations use i1:i2, j1:j2, and tag as the variables. I'd like to keep these as they were wherever we can. They're technically ranges, and not specifically lines (even if we are using them as lines in this use case).
-
Do we have to cast to this? It'd be nice to find away around having to do this for every range of opcodes.
If we have to, we shouldn't redefine the type for every line. We should have a type alias to refer to. But it'd be nice to figure out how to avoid this entirely.
-
-
This needs to be a
: TypeAlias
.Can we define it near the top of the module where we normally define these?
-
Let's keep all this as
None
. It's an explicit "There are no ranges" response, and avoids creating a new object for each time (ideal since this is all called quite a lot). -
The old code was this way to avoid needing to perform comparisons when we know we have our answer. Any optimizations we can get out of diff code is beneficial, since this is an expensive path.
-
These can be
linenum1 or ''
andlinenum2 or ''
.But I'm not sure this matters, because in the cases where these are
None
, we don't make use of them in the template. So theNone
value is fine.
- Change Summary:
-
- Switch back to a regular tuple for lines.
- Make other requested changes.
- Description:
-
~ This change defines a
TypedDict
for diff chunks, and aNamedTuple
~ for diff lines, with definitions and documentation for each element. ~ This change defines a
TypedDict
for diff chunks, and atuple
for~ diff lines, with definitions and documentation for each element. Before, these were just an opaque dictionary, and a heterogenous list.
The diff lines were particularly annoying because they were variable length (if there was no metadata, that element was just left off the end), and types were annoying (for "real" line numbers, we'd have an ~ int if the line was present on that side of the diff, but was set to ~ an empty string if it wasn't). This change technically breaks ~ compatibility in those two cases. We now always include the metadata ~ int if the line was present on that side of the diff, but an empty ~ string if not). This change technically breaks compatibility in those ~ two cases. - field, set to None
if there's no metadata for the line. In the case of- real line numbers which are not present on one side of the diff (for - example, the line number of an added line for the left-hand side of the - diff), we set the data for that to None
.~ The
diff_lines
template tag and other consumers now use names instead~ of indexing magically into the structure, making it much clearer what's ~ going on. ~ We now always include the metadata field, set to
None
if there's no~ metadata for the line. This increases the cached size of a very large ~ diff with thousands of lines by less that 0.1% (much better than many + years ago with older pickle protocols--the tuple is probably also + helping here). ~ This also updates the regions field to always be a list. In some cases
~ we had empty lists, and in others we had None
.~ In the case of real line numbers which are not present on one side of
~ the diff (for example, the line number of an added line for the + left-hand side of the diff), we set the data for that to None
. In+ order to maintain compatibility with current versions of Review Bot, we + change None
line numbers back to an empty string for the filediff+ resource's diff_data
mode.This change only touches code related to chunks and lines. Other diff
viewer cleanup is coming in later changes. - Commits:
-
Summary ID 43f82cea94288ecb0ca769be4b38c080b27fd028 c5d90d8cfabbec35fda0b34155bf69a2f4bf300c - Diff:
-
Revision 8 (+1498 -616)
- Commits:
-
Summary ID c5d90d8cfabbec35fda0b34155bf69a2f4bf300c 0f3328ecb62eb5bcd98f8721e4dca0bebe733e84 - Diff:
-
Revision 9 (+1498 -616)
Checks run (2 succeeded)
- Commits:
-
Summary ID 0f3328ecb62eb5bcd98f8721e4dca0bebe733e84 9daae1bba689c7145420b91cfb69c0aee56928ea - Diff:
-
Revision 10 (+1502 -616)