Introduce loose commit validation rules for Mercurial diffs.

Review Request #14034 — Created July 12, 2024 and submitted

Information

Review Board
release-7.x

Reviewers

Mercurial diffs (both standard and Git-like) lack any information
connecting a file to the last revision or commit that modified or
introduced that file. This poses problems for the way we do diff
validation, since we can't determine the true parent for any given file,
and the end result today is that files that were introduced in a parent
commit would fail to validate in a later commit.

To address this, we need a second, looser validation mode where we only
compare filenames, along with storage of the most-likely revision.

Normally, when performing a validation, we walk through the parent ID
sequence in the commit validation graph and try to find a file entry
that matches both the path and the revision. If we can't find a match,
we check the repository. If we still can't find it, the file doesn't
exist.

In a loose validation mode, we still walk through the graph but we only
check the file paths, not the revisions. We assume if we find any file
path match, then that commit is the correct parent for the file we're
looking for. We can then store that commit ID and store it as the new
source revision for the file, so that we can look it up later when
walking ancestors of a diff during rendering of the diff viewer.

This isn't ideal, as the strict validation is intended to catch
malformed comit chains, merge commits, or just bad uploads from tools.
We now can't know for sure if commits are in the right order or if
anything is missing. But unfortunately, without assistance from a
Mercurial diff, this is the best we can do.

This change implements the loose validation mode through a new
has_per_file_revisions flag stored in a ParsedDiff and therefore
FileLookupContext's diff-wide extra_data. This is considered True
by default, but can be changed on BaseDiffParser subclasses by
setting has_per_file_revisions = False.

During loose validation, get_file_exists_in_history() will look for a
probable-matching file and, if found, record the commit ID in
FileLookupContext.file_extra_Data['__validated_parent_id']. This is
then extracted and removed when creating the FileDiffs and set as the
new source_revision.

All of this logic is restricted to diff parsers that opt out of per-file
revisions, so just Mercurial. If we were to add DiffX support in the
future, we'd be able to return to strict validation for Mercurial.

All unit tests passed.

Tested this with some Mercurial repro cases where diff validation
would fail. The repro case consists of 3 commits. The first two
commits each introduce a new file, and the third commit modifies
one or both of them. This previously failed diff validation, and
now succeeds.

Tested the resulting review requests, making sure I could view
every combination of commits.

Tested these with Git, making sure no behavioral changes were
introduced.

Summary ID
Introduce loose commit validation rules for Mercurial diffs.
Mercurial diffs (both standard and Git-like) lack any information connecting a file to the last revision or commit that modified or introduced that file. This poses problems for the way we do diff validation, since we can't determine the true parent for any given file, and the end result today is that files that were introduced in a parent commit would fail to validate in a later commit. To address this, we need a second, looser validation mode where we only compare filenames, along with storage of the most-likely revision. Normally, when performing a validation, we walk through the parent ID sequence in the commit validation graph and try to find a file entry that matches both the path and the revision. If we can't find a match, we check the repository. If we still can't find it, the file doesn't exist. In a loose validation mode, we still walk through the graph but we only check the file paths, not the revisions. We assume if we find any file path match, then that commit is the correct parent for the file we're looking for. We can then store that commit ID and store it as the new source revision for the file, so that we can look it up later when walking ancestors of a diff during rendering of the diff viewer. This isn't ideal, as the strict validation is intended to catch malformed comit chains, merge commits, or just bad uploads from tools. We now can't know for sure if commits are in the right order or if anything is missing. But unfortunately, without assistance from a Mercurial diff, this is the best we can do. This change implements the loose validation mode through a new `has_per_file_revisions` flag stored in a `ParsedDiff` and therefore `FileLookupContext`'s diff-wide `extra_data`. This is considered `True` by default, but can be changed on `BaseDiffParser` subclasses by setting `has_per_file_revisions = False`. During loose validation, `get_file_exists_in_history()` will look for a probable-matching file and, if found, record the commit ID in `FileLookupContext.file_extra_Data['__validated_parent_id']`. This is then extracted and removed when creating the `FileDiff`s and set as the new `source_revision`. All of this logic is restricted to diff parsers that opt out of per-file revisions, so just Mercurial. If we were to add DiffX support in the future, we'd be able to return to strict validation for Mercurial.
37e7dcb4306c48278b32ba840249d465eb4595c3
Description From Last Updated

'typing.Dict' imported but unused Column: 1 Error code: F401

reviewbotreviewbot

'typing.Sequence' imported but unused Column: 1 Error code: F401

reviewbotreviewbot

block comment should start with '# ' Column: 9 Error code: E265

reviewbotreviewbot

This is wrong, has_per_file_revisions has to be set to False in order to opt in to loose validation.

maubinmaubin

Typo: il-formed -> ill-formed

maubinmaubin
Checks run (1 failed, 1 succeeded)
flake8 failed.
JSHint passed.

flake8

chipx86
david
  1. Ship It!
  2. 
      
maubin
  1. 
      
  2. reviewboard/diffviewer/commit_utils.py (Diff revision 2)
     
     
     
     
     
     
     
     
    Show all issues

    This is wrong, has_per_file_revisions has to be set to False in order to opt in to loose validation.

  3. reviewboard/scmtools/hg.py (Diff revision 2)
     
     
    Show all issues

    Typo: il-formed -> ill-formed

  4. 
      
chipx86
maubin
  1. Ship It!
  2. 
      
chipx86
Review request changed
Status:
Completed
Change Summary:
Pushed to release-7.x (c896678)