Update the types for diff chunks and lines.

Review Request #14399 — Created April 14, 2025 and updated

Information

Review Board
master

Reviewers

This change defines a TypedDict for diff chunks, and a tuple for
diff lines, with definitions and documentation for each element.

Before, these were just an opaque dictionary, and a heterogenous list.
The diff lines were particularly annoying because they were variable
length (if there was no metadata, that element was just left off the
end), and types were annoying (for "real" line numbers, we'd have an
int if the line was present on that side of the diff, but an empty
string if not). This change technically breaks compatibility in those
two cases.

We now always include the metadata field, set to None if there's no
metadata for the line. This increases the cached size of a very large
diff with thousands of lines by less that 0.1% (much better than many
years ago with older pickle protocols--the tuple is probably also
helping here).

In the case of real line numbers which are not present on one side of
the diff (for example, the line number of an added line for the
left-hand side of the diff), we set the data for that to None. In
order to maintain compatibility with current versions of Review Bot, we
change None line numbers back to an empty string for the filediff
resource's diff_data mode.

This change only touches code related to chunks and lines. Other diff
viewer cleanup is coming in later changes.

  • Tested a ton of different diffs, both in the diff viewer as well as
    the text and markdown Review UIs. Verified that all types of edits
    (added/edited/deleted/moved) lines all worked as expected.
  • Ran unit tests.
Summary ID
Update the types for diff chunks and lines.
This change defines a `TypedDict` for diff chunks, and a `tuple` for diff lines, with definitions and documentation for each element. Before, these were just an opaque dictionary, and a heterogenous list. The diff lines were particularly annoying because they were variable length (if there was no metadata, that element was just left off the end), and types were annoying (for "real" line numbers, we'd have an int if the line was present on that side of the diff, but an empty string if not). This change technically breaks compatibility in those two cases. We now always include the metadata field, set to `None` if there's no metadata for the line. This increases the cached size of a very large diff with thousands of lines by less that 0.1% (much better than many years ago with older pickle protocols--the tuple is probably also helping here). In the case of real line numbers which are not present on one side of the diff (for example, the line number of an added line for the left-hand side of the diff), we set the data for that to `None`. In order to maintain compatibility with current versions of Review Bot, we change `None` line numbers back to an empty string for the filediff resource's `diff_data` mode. This change only touches code related to chunks and lines. Other diff viewer cleanup is coming in later changes. Testing Done: - Tested a ton of different diffs, both in the diff viewer as well as the text and markdown Review UIs. Verified that all types of edits (added/edited/deleted/moved) lines all worked as expected. - Ran unit tests.
9daae1bba689c7145420b91cfb69c0aee56928ea
Description From Last Updated

closing bracket does not match indentation of opening bracket's line Column: 17 Error code: E123

reviewbotreviewbot

'reviewboard.diffviewer.chunk_generator.DiffLine' imported but unused Column: 5 Error code: F401

reviewbotreviewbot

too many blank lines (2) Column: 5 Error code: E303

reviewbotreviewbot

undefined name 'none' Column: 42 Error code: F821

reviewbotreviewbot

Let's move this and other non-runtime types to a TYPE_CHECKING block.

chipx86chipx86

Can we order the keys in here alphabetically?

chipx86chipx86

I'm trying to understand this set of changes. Bear with me. We were conditionally normalizing, and now we always do. …

chipx86chipx86

All diff operations use i1:i2, j1:j2, and tag as the variables. I'd like to keep these as they were wherever …

chipx86chipx86

Do we have to cast to this? It'd be nice to find away around having to do this for every …

chipx86chipx86

While here, can we pull self.filediff out into a local variable? We access it a lot.

chipx86chipx86

This needs to be a : TypeAlias. Can we define it near the top of the module where we normally …

chipx86chipx86

Let's keep all this as None. It's an explicit "There are no ranges" response, and avoids creating a new object …

chipx86chipx86

The old code was this way to avoid needing to perform comparisons when we know we have our answer. Any …

chipx86chipx86

These can be linenum1 or '' and linenum2 or ''. But I'm not sure this matters, because in the cases …

chipx86chipx86

'typing.Union' imported but unused Column: 1 Error code: F401

reviewbotreviewbot

Can we move this into the TYPE_CHECKING?

chipx86chipx86

convert_to_unicode should be before the get_*.

chipx86chipx86

This sentence ("This compatibility ...") is missing a word, I think.

chipx86chipx86

TypeAlias can live in TYPE_CHECKING.

chipx86chipx86
There are no open issues
Checks run (1 failed, 1 succeeded)
flake8 failed.
JSHint passed.

flake8

david
david
david
david
david
david
david
chipx86
  1. There are a couple things I'm going to be really mindful of when reviewing anything touching diffs:

    1. Terminology. Diff operations use "tags", "i1:i2" ranges, and "j1:j2" ranges, which don't necessarily mean "lines". That's deliberate. I want to keep those i/j naming wherever possible for diffs when the code is generic. I have some old work I intend to dig up some day to add a higher-level process for diffing some kinds of non-text data.
    2. Performance. Diff code is some of the most expensive code in the codebase. As such, I want to keep things tight where we can, avoid unnecessary work (keep checks optimized, avoid expense if it's just for typing-related benefits). Much of this code is called per-line across potentially tens of thousands of lines or more across a page of diffs, so it matters.
    3. Regressions/changes in behavior. This code's been around a long time and there are a lot of edge cases covered. Plus we have public APIs around this. I want to avoid changing data types where it's not needed.

    For reasons 2 and 3, I'm also pretty hesistant to move to a NamedTuple. Nice as it is, I ultimately don't think we should. It's easier to read and follow, but it carries a performance cost for no reason but code maintenance, and while that's not a bad reason, I think the impact on performance in diff code paths make it less of a gain. NamedTuples cost more to construct and cost more to access. Even if it's just a little, we do it a lot.

  2. Show all issues

    Let's move this and other non-runtime types to a TYPE_CHECKING block.

  3. Show all issues

    Can we order the keys in here alphabetically?

  4. reviewboard/diffviewer/chunk_generator.py (Diff revision 7)
     
     
     
     
     
     
     
     
    Show all issues

    I'm trying to understand this set of changes. Bear with me.

    We were conditionally normalizing, and now we always do. Looks like before, we weren't using {old,new}_encoding_list if there wasn't a class-level encoding list set, causing us to ignore what was passed in. That's fine.

    However, it changes things when the case where neither is provided, forcing encoding to take place to utf-8. I'm not sure if this is right or wrong, but it's not how we had it before. But what we had before may have been broken. Trying to walk through the scenarios.

    And then we're joining to strings. I don't think any of this is necessary. We certainly never did it before, and looking through this code, we never touch old, new, old_str, or new_str in the is_lists case. We only care about old_lines and new_lines. So I think we can just delete that code outright, which is good because it's an expense we don't want.

    As a tidbit of historical information, at one point I was trying to keep some of this free of assuming byte strings so that we could pass higher-level objects that a Differ could understand, and then generate string results from all that. I have an ancient branch to help get some of this in shape for that, but it's stale. Not too worried about it now, but this is in part the reason we call things a and b and not old_lines and new_lines.

    1. In the constructor, we have:

      self.encoding_list = encoding_list or ['iso-8859-15']
      

      So there's always a class-level list, and we wouldn't have skipped it in the past. In the case where theoretically would have skipped the normalize calls, old and new would still have been set to bytes, which would break things later on. I'm not sure iso-8859-15 is the right thing in the modern era, but that's a change I'd want to make much more cautiously.

      Joining to strings is something I was doing to make it so old_str, old_bytes, new_str, and new_bytes were always bound. Given that we don't apply syntax highlighting for the is_lists case, I'm just going to rearrange things so this isn't an issue. If we wanted to make it so highlighting could apply in the is_lists case as well, we'd need to create the bytes and str versions of the files. From what I can tell, the lists version is only used for the rendered version of text-based review UIs, where we don't want to apply highlighting.

  5. reviewboard/diffviewer/chunk_generator.py (Diff revision 7)
     
     
     
     
     
     
    Show all issues

    All diff operations use i1:i2, j1:j2, and tag as the variables. I'd like to keep these as they were wherever we can. They're technically ranges, and not specifically lines (even if we are using them as lines in this use case).

  6. reviewboard/diffviewer/chunk_generator.py (Diff revision 7)
     
     
     
     
     
     
     
     
     
     
     
     
    Show all issues

    Do we have to cast to this? It'd be nice to find away around having to do this for every range of opcodes.

    If we have to, we shouldn't redefine the type for every line. We should have a type alias to refer to. But it'd be nice to figure out how to avoid this entirely.

  7. reviewboard/diffviewer/chunk_generator.py (Diff revision 7)
     
     
     
     
     
     
     
     
     
     
    Show all issues

    While here, can we pull self.filediff out into a local variable? We access it a lot.

  8. reviewboard/diffviewer/diffutils.py (Diff revision 7)
     
     
    Show all issues

    This needs to be a : TypeAlias.

    Can we define it near the top of the module where we normally define these?

  9. reviewboard/diffviewer/diffutils.py (Diff revision 7)
     
     
    Show all issues

    Let's keep all this as None. It's an explicit "There are no ranges" response, and avoids creating a new object for each time (ideal since this is all called quite a lot).

  10. reviewboard/diffviewer/templatetags/difftags.py (Diff revision 7)
     
     
     
     
     
     
     
     
     
     
     
     
     
     
    Show all issues

    The old code was this way to avoid needing to perform comparisons when we know we have our answer. Any optimizations we can get out of diff code is beneficial, since this is an expensive path.

  11. reviewboard/diffviewer/templatetags/difftags.py (Diff revision 7)
     
     
     
     
     
     
     
     
     
     
    Show all issues

    These can be linenum1 or '' and linenum2 or ''.

    But I'm not sure this matters, because in the cases where these are None, we don't make use of them in the template. So the None value is fine.

  12. 
      
david
Review request changed
Change Summary:
  • Switch back to a regular tuple for lines.
  • Make other requested changes.
Description:
~  

This change defines a TypedDict for diff chunks, and a NamedTuple

~   for diff lines, with definitions and documentation for each element.

  ~

This change defines a TypedDict for diff chunks, and a tuple for

  ~ diff lines, with definitions and documentation for each element.

   
   

Before, these were just an opaque dictionary, and a heterogenous list.

    The diff lines were particularly annoying because they were variable
    length (if there was no metadata, that element was just left off the
    end), and types were annoying (for "real" line numbers, we'd have an
~   int if the line was present on that side of the diff, but was set to
~   an empty string if it wasn't). This change technically breaks
~   compatibility in those two cases. We now always include the metadata
  ~ int if the line was present on that side of the diff, but an empty
  ~ string if not). This change technically breaks compatibility in those
  ~ two cases.

-   field, set to None if there's no metadata for the line. In the case of
-   real line numbers which are not present on one side of the diff (for
-   example, the line number of an added line for the left-hand side of the
-   diff), we set the data for that to None.

   
~  

The diff_lines template tag and other consumers now use names instead

~   of indexing magically into the structure, making it much clearer what's
~   going on.

  ~

We now always include the metadata field, set to None if there's no

  ~ metadata for the line. This increases the cached size of a very large
  ~ diff with thousands of lines by less that 0.1% (much better than many
  + years ago with older pickle protocols--the tuple is probably also
  + helping here).

   
~  

This also updates the regions field to always be a list. In some cases

~   we had empty lists, and in others we had None.

  ~

In the case of real line numbers which are not present on one side of

  ~ the diff (for example, the line number of an added line for the
  + left-hand side of the diff), we set the data for that to None. In
  + order to maintain compatibility with current versions of Review Bot, we
  + change None line numbers back to an empty string for the filediff
  + resource's diff_data mode.

   
   

This change only touches code related to chunks and lines. Other diff

    viewer cleanup is coming in later changes.

Commits:
Summary ID
Update the types for diff chunks and lines.
This change defines a `TypedDict` for diff chunks, and a `NamedTuple` for diff lines, with definitions and documentation for each element. Before, these were just an opaque dictionary, and a heterogenous list. The diff lines were particularly annoying because they were variable length (if there was no metadata, that element was just left off the end), and types were annoying (for "real" line numbers, we'd have an int if the line was present on that side of the diff, but was set to an empty string if it wasn't). This change technically breaks compatibility in those two cases. We now always include the metadata field, set to `None` if there's no metadata for the line. In the case of real line numbers which are not present on one side of the diff (for example, the line number of an added line for the left-hand side of the diff), we set the data for that to `None`. The `diff_lines` template tag and other consumers now use names instead of indexing magically into the structure, making it much clearer what's going on. This also updates the regions field to always be a list. In some cases we had empty lists, and in others we had `None`. This change only touches code related to chunks and lines. Other diff viewer cleanup is coming in later changes. Testing Done: - Tested a ton of different diffs, both in the diff viewer as well as the text and markdown Review UIs. Verified that all types of edits (added/edited/deleted/moved) lines all worked as expected. - Ran unit tests.
43f82cea94288ecb0ca769be4b38c080b27fd028
Update the types for diff chunks and lines.
This change defines a `TypedDict` for diff chunks, and a `tuple` for diff lines, with definitions and documentation for each element. Before, these were just an opaque dictionary, and a heterogenous list. The diff lines were particularly annoying because they were variable length (if there was no metadata, that element was just left off the end), and types were annoying (for "real" line numbers, we'd have an int if the line was present on that side of the diff, but an empty string if not). This change technically breaks compatibility in those two cases. We now always include the metadata field, set to `None` if there's no metadata for the line. This increases the cached size of a very large diff with thousands of lines by less that 0.1% (much better than many years ago with older pickle protocols--the tuple is probably also helping here). In the case of real line numbers which are not present on one side of the diff (for example, the line number of an added line for the left-hand side of the diff), we set the data for that to `None`. In order to maintain compatibility with current versions of Review Bot, we change `None` line numbers back to an empty string for the filediff resource's `diff_data` mode. This change only touches code related to chunks and lines. Other diff viewer cleanup is coming in later changes. Testing Done: - Tested a ton of different diffs, both in the diff viewer as well as the text and markdown Review UIs. Verified that all types of edits (added/edited/deleted/moved) lines all worked as expected. - Ran unit tests.
c5d90d8cfabbec35fda0b34155bf69a2f4bf300c
Diff:

Revision 8 (+1498 -616)

Show changes

reviewboard/diffviewer/chunk_generator.py
reviewboard/diffviewer/diffutils.py
reviewboard/diffviewer/templatetags/difftags.py
reviewboard/diffviewer/tests/test_diff_chunk_generator.py
reviewboard/diffviewer/tests/test_raw_diff_chunk_generator.py
reviewboard/reviews/tests/test_markdown_review_ui.py
reviewboard/reviews/tests/test_text_based_review_ui.py
reviewboard/webapi/resources/filediff.py
1 more

Checks run (1 failed, 1 succeeded)

flake8 failed.
JSHint passed.

flake8

david
chipx86
  1. 
      
  2. Show all issues

    Can we move this into the TYPE_CHECKING?

  3. reviewboard/diffviewer/chunk_generator.py (Diff revision 9)
     
     
     
    Show all issues

    convert_to_unicode should be before the get_*.

  4. reviewboard/diffviewer/chunk_generator.py (Diff revision 9)
     
     
     
    Show all issues

    This sentence ("This compatibility ...") is missing a word, I think.

  5. reviewboard/diffviewer/diffutils.py (Diff revision 9)
     
     
    Show all issues

    TypeAlias can live in TYPE_CHECKING.

  6. 
      
david
Review request changed
Commits:
Summary ID
Update the types for diff chunks and lines.
This change defines a `TypedDict` for diff chunks, and a `tuple` for diff lines, with definitions and documentation for each element. Before, these were just an opaque dictionary, and a heterogenous list. The diff lines were particularly annoying because they were variable length (if there was no metadata, that element was just left off the end), and types were annoying (for "real" line numbers, we'd have an int if the line was present on that side of the diff, but an empty string if not). This change technically breaks compatibility in those two cases. We now always include the metadata field, set to `None` if there's no metadata for the line. This increases the cached size of a very large diff with thousands of lines by less that 0.1% (much better than many years ago with older pickle protocols--the tuple is probably also helping here). In the case of real line numbers which are not present on one side of the diff (for example, the line number of an added line for the left-hand side of the diff), we set the data for that to `None`. In order to maintain compatibility with current versions of Review Bot, we change `None` line numbers back to an empty string for the filediff resource's `diff_data` mode. This change only touches code related to chunks and lines. Other diff viewer cleanup is coming in later changes. Testing Done: - Tested a ton of different diffs, both in the diff viewer as well as the text and markdown Review UIs. Verified that all types of edits (added/edited/deleted/moved) lines all worked as expected. - Ran unit tests.
0f3328ecb62eb5bcd98f8721e4dca0bebe733e84
Update the types for diff chunks and lines.
This change defines a `TypedDict` for diff chunks, and a `tuple` for diff lines, with definitions and documentation for each element. Before, these were just an opaque dictionary, and a heterogenous list. The diff lines were particularly annoying because they were variable length (if there was no metadata, that element was just left off the end), and types were annoying (for "real" line numbers, we'd have an int if the line was present on that side of the diff, but an empty string if not). This change technically breaks compatibility in those two cases. We now always include the metadata field, set to `None` if there's no metadata for the line. This increases the cached size of a very large diff with thousands of lines by less that 0.1% (much better than many years ago with older pickle protocols--the tuple is probably also helping here). In the case of real line numbers which are not present on one side of the diff (for example, the line number of an added line for the left-hand side of the diff), we set the data for that to `None`. In order to maintain compatibility with current versions of Review Bot, we change `None` line numbers back to an empty string for the filediff resource's `diff_data` mode. This change only touches code related to chunks and lines. Other diff viewer cleanup is coming in later changes. Testing Done: - Tested a ton of different diffs, both in the diff viewer as well as the text and markdown Review UIs. Verified that all types of edits (added/edited/deleted/moved) lines all worked as expected. - Ran unit tests.
9daae1bba689c7145420b91cfb69c0aee56928ea
Diff:

Revision 10 (+1502 -616)

Show changes

reviewboard/diffviewer/chunk_generator.py
reviewboard/diffviewer/diffutils.py
reviewboard/diffviewer/templatetags/difftags.py
reviewboard/diffviewer/tests/test_diff_chunk_generator.py
reviewboard/diffviewer/tests/test_raw_diff_chunk_generator.py
reviewboard/reviews/tests/test_markdown_review_ui.py
reviewboard/reviews/tests/test_text_based_review_ui.py
reviewboard/webapi/resources/filediff.py
1 more

Checks run (2 succeeded)

flake8 passed.
JSHint passed.
Loading...