Introduce loose commit validation rules for Mercurial diffs.

Review Request #14034 — Created July 12, 2024 and submitted — Latest diff uploaded

Information

Review Board
release-7.x

Reviewers

Mercurial diffs (both standard and Git-like) lack any information
connecting a file to the last revision or commit that modified or
introduced that file. This poses problems for the way we do diff
validation, since we can't determine the true parent for any given file,
and the end result today is that files that were introduced in a parent
commit would fail to validate in a later commit.

To address this, we need a second, looser validation mode where we only
compare filenames, along with storage of the most-likely revision.

Normally, when performing a validation, we walk through the parent ID
sequence in the commit validation graph and try to find a file entry
that matches both the path and the revision. If we can't find a match,
we check the repository. If we still can't find it, the file doesn't
exist.

In a loose validation mode, we still walk through the graph but we only
check the file paths, not the revisions. We assume if we find any file
path match, then that commit is the correct parent for the file we're
looking for. We can then store that commit ID and store it as the new
source revision for the file, so that we can look it up later when
walking ancestors of a diff during rendering of the diff viewer.

This isn't ideal, as the strict validation is intended to catch
malformed comit chains, merge commits, or just bad uploads from tools.
We now can't know for sure if commits are in the right order or if
anything is missing. But unfortunately, without assistance from a
Mercurial diff, this is the best we can do.

This change implements the loose validation mode through a new
has_per_file_revisions flag stored in a ParsedDiff and therefore
FileLookupContext's diff-wide extra_data. This is considered True
by default, but can be changed on BaseDiffParser subclasses by
setting has_per_file_revisions = False.

During loose validation, get_file_exists_in_history() will look for a
probable-matching file and, if found, record the commit ID in
FileLookupContext.file_extra_Data['__validated_parent_id']. This is
then extracted and removed when creating the FileDiffs and set as the
new source_revision.

All of this logic is restricted to diff parsers that opt out of per-file
revisions, so just Mercurial. If we were to add DiffX support in the
future, we'd be able to return to strict validation for Mercurial.

All unit tests passed.

Tested this with some Mercurial repro cases where diff validation
would fail. The repro case consists of 3 commits. The first two
commits each introduce a new file, and the third commit modifies
one or both of them. This previously failed diff validation, and
now succeeds.

Tested the resulting review requests, making sure I could view
every combination of commits.

Tested these with Git, making sure no behavioral changes were
introduced.

Diff Revision 1

This is not the most recent revision of the diff. The latest diff is revision 3. See what's changed.

orig
1
2
3

Commits

First Last Summary ID Author
Introduce loose commit validation rules for Mercurial diffs.
Mercurial diffs (both standard and Git-like) lack any information connecting a file to the last revision or commit that modified or introduced that file. This poses problems for the way we do diff validation, since we can't determine the true parent for any given file, and the end result today is that files that were introduced in a parent commit would fail to validate in a later commit. To address this, we need a second, looser validation mode where we only compare filenames, along with storage of the most-likely revision. Normally, when performing a validation, we walk through the parent ID sequence in the commit validation graph and try to find a file entry that matches both the path and the revision. If we can't find a match, we check the repository. If we still can't find it, the file doesn't exist. In a loose validation mode, we still walk through the graph but we only check the file paths, not the revisions. We assume if we find any file path match, then that commit is the correct parent for the file we're looking for. We can then store that commit ID and store it as the new source revision for the file, so that we can look it up later when walking ancestors of a diff during rendering of the diff viewer. This isn't ideal, as the strict validation is intended to catch malformed comit chains, merge commits, or just bad uploads from tools. We now can't know for sure if commits are in the right order or if anything is missing. But unfortunately, without assistance from a Mercurial diff, this is the best we can do. This change implements the loose validation mode through a new `has_per_file_revisions` flag stored in a `ParsedDiff` and therefore `FileLookupContext`'s diff-wide `extra_data`. This is considered `True` by default, but can be changed on `BaseDiffParser` subclasses by setting `has_per_file_revisions = False`. During loose validation, `get_file_exists_in_history()` will look for a probable-matching file and, if found, record the commit ID in `FileLookupContext.file_extra_Data['__validated_parent_id']`. This is then extracted and removed when creating the `FileDiff`s and set as the new `source_revision`. All of this logic is restricted to diff parsers that opt out of per-file revisions, so just Mercurial. If we were to add DiffX support in the future, we'd be able to return to strict validation for Mercurial.
493902803b1bfbd4cc971669f1b6b22962d1d204 Christian Hammond
reviewboard/diffviewer/commit_utils.py
Revision 29d89e73285a70dd9aaf2169d0d05af95daa4e8f New Change
1
"""Utilities for dealing with DiffCommits."""
1
"""Utilities for dealing with DiffCommits."""
2

    
   
2

   
3
from __future__ import annotations
3
from __future__ import annotations
4

    
   
4

   
5
import base64
5
import base64
6
import json
6
import json
7
from itertools import chain
7
from itertools import chain
2

    
   
8
from typing import Any, Dict, Optional, Sequence, TYPE_CHECKING
8

    
   
9

   
9
from django.utils.encoding import force_bytes
10
from django.utils.encoding import force_bytes
10
from typing_extensions import NotRequired, TypedDict
11
from typing_extensions import NotRequired, TypedDict
11

    
   
12

   
12
from reviewboard.scmtools.core import PRE_CREATION, UNKNOWN
13
from reviewboard.scmtools.core import PRE_CREATION, UNKNOWN
13

    
   
14

   

    
   
15
if TYPE_CHECKING:

    
   
16
    from reviewboard.scmtools.core import FileLookupContext

    
   
17
    from reviewboard.scmtools.models import Repository
14

    
   
18

   
15
def get_file_exists_in_history(validation_info, repository, parent_id, path,
19

   
16
                               revision, **kwargs):
20
def get_file_exists_in_history(

    
   
21
    validation_info: dict[str, Any],

    
   
22
    repository: Repository,

    
   
23
    parent_id: str,

    
   
24
    path: str,

    
   
25
    revision: str,

    
   
26
    base_commit_id: Optional[str] = None,

    
   
27
    *,

    
   
28
    context: Optional[FileLookupContext] = None,

    
   
29
    **kwargs,

    
   
30
) -> bool:
17
    """Return whether or not the file exists, given the validation information.
31
    """Return whether or not the file exists, given the validation information.
18

    
   
32

   

    
   
33
    This will walk through the commit chain in ``validation_info``, starting

    
   
34
    with ``parent_id``, looking for a file entry that matches the given

    
   
35
    ``path`` and (by default) ``revision``. If found, the file is considered

    
   
36
    to exist. If not found, it will fall back to checking the repository.

    
   
37

   

    
   
38
    This can also operate in a loose validation mode. In this mode, only

    
   
39
    file paths are compared, not revisions. This is required for diffs that

    
   
40
    don't provide per-file revision information (such as Mercurial's plain

    
   
41
    and Git-style diffs). If a match is found, the associated commit ID is

    
   
42
    stored as ``context.file_extra_data['__validated_parent_id']``, allowing

    
   
43
    for post-processing of the source revision to occur.

    
   
44

   

    
   
45
    Strict validation mode is the default. Loose validation must be opted into

    
   
46
    by setting ``context.diff_extra_data['has_per_file_revisions'] = True``.

    
   
47
    This is done automatically when a

    
   
48
    :py:class:`~reviewboard.diffviewer.parser.BaseDiffParser` subclass sets

    
   
49
    :py:attr:`has_per_file_revisions

    
   
50
    <reviewboard.diffviewer.parser.BaseDiffParser.has_per_file_revisions>`

    
   
51
    to ``True``.

    
   
52

   

    
   
53
    Version Changed:

    
   
54
        7.0.2:

    
   
55
        * Added the ``context`` argument.

    
   
56
        * Added the loose validation mode.

    
   
57

   
19
    Version Changed:
58
    Version Changed:
20
        4.0.5:
59
        4.0.5:
21
        Removed explicit arguments for ``base_commit_id`` and ``request``, and
60
        Removed explicit arguments for ``base_commit_id`` and ``request``, and
22
        added ``**kwargs``.
61
        added ``**kwargs``.
23

    
   
62

   
4 lines
def get_file_exists_in_history(validation_info, repository, parent_id, path,
28
            ValidateDiffCommitResource`.
67
            ValidateDiffCommitResource`.
29

    
   
68

   
30
        repository (reviewboard.scmtools.models.Repository):
69
        repository (reviewboard.scmtools.models.Repository):
31
            The repository.
70
            The repository.
32

    
   
71

   
33
        parent_id (unicode):
72
        parent_id (str):
34
            The parent commit ID of the commit currently being processed.
73
            The parent commit ID of the commit currently being processed.
35

    
   
74

   
36
        path (unicode):
75
        path (str):
37
            The file path.
76
            The file path.
38

    
   
77

   
39
        revision (unicode):
78
        revision (str):
40
            The revision of the file to retrieve.
79
            The revision of the file to retrieve.
41

    
   
80

   

    
   
81
        context (reviewboard.scmtools.core.FileLookupContext, optional):

    
   
82
            The file lookup context used to validate this repository.

    
   
83

   

    
   
84
            Version Added:

    
   
85
                7.0.2

    
   
86

   
42
        **kwargs (dict):
87
        **kwargs (dict):
43
            Additional keyword arguments normally expected by
88
            Additional keyword arguments normally expected by
44
            :py:meth:`Repository.get_file_exists
89
            :py:meth:`Repository.get_file_exists
45
            <reviewboard.scmtools.models.Repository.get_file_exists>`.
90
            <reviewboard.scmtools.models.Repository.get_file_exists>`.
46

    
   
91

   
47
            Version Added:
92
            Version Added:
48
                4.0.5
93
                4.0.5
49

    
   
94

   
50
    Returns:
95
    Returns:
51
        bool:
96
        bool:
52
        Whether or not the file exists.
97
        Whether or not the file exists.
53
    """
98
    """

    
   
99
    match_parent_revisions = (

    
   
100
        context is None or

    
   
101
        context.diff_extra_data.get('has_per_file_revisions', True))

    
   
102

   
54
    while parent_id in validation_info:
103
    while parent_id in validation_info:
55
        entry = validation_info[parent_id]
104
        entry = validation_info[parent_id]
56
        tree = entry['tree']
105
        tree = entry['tree']
57

    
   
106

   
58
        if revision == UNKNOWN:
107
        if revision == UNKNOWN:
59
            for removed_info in tree['removed']:
108
            for removed_info in tree['removed']:
60
                if removed_info['filename'] == path:
109
                if removed_info['filename'] == path:
61
                    return False
110
                    return False
62

    
   
111

   
63
        for added_info in chain(tree['added'], tree['modified']):
112
        for change_info in chain(tree['added'], tree['modified']):
64
            if (added_info['filename'] == path and
113
            if change_info['filename'] == path:
65
                (revision == UNKNOWN or
114
                if revision == UNKNOWN:
66
                 added_info['revision'] == revision)):
115
                    return True

    
   
116

   

    
   
117
                if match_parent_revisions:

    
   
118
                    # In the standard case, we have per-file revisions, and

    
   
119
                    # can use that to get a specific match within the

    
   
120
                    # validation history.

    
   
121
                    if change_info['revision'] == revision:

    
   
122
                        return True

    
   
123
                else:

    
   
124
                    # In a more limited case, we may not know the per-file

    
   
125
                    # revision, and instead have to limit our scan to

    
   
126
                    # the nearest filename. This is the case with Mercurial

    
   
127
                    # diffs.

    
   
128
                    #

    
   
129
                    # We'll be recording what we found, temporarily. This

    
   
130
                    # will be used to update the source filename of a

    
   
131
                    # generated FileDiff.

    
   
132
                    assert context is not None

    
   
133

   

    
   
134
                    context.file_extra_data['__validated_parent_id'] = \

    
   
135
                        parent_id

    
   
136

   
67
                return True
137
>>>>                return True
68

    
   
138

   
69
        parent_id = entry['parent_id']
139
        parent_id = entry['parent_id']
70

    
   
140

   
71
    # We did not find an entry in our validation info, so we need to fall back
141
    # We did not find an entry in our validation info, so we need to fall back
72
    # to checking the repository.
142
    # to checking the repository.
73
    return repository.get_file_exists(path=path,
143
    return repository.get_file_exists(path=path,
74
                                      revision=revision,
144
                                      revision=revision,

    
   
145
                                      base_commit_id=base_commit_id,

    
   
146
                                      context=context,
75
                                      **kwargs)
147
                                      **kwargs)
76

    
   
148

   
77

    
   
149

   
78
def exclude_ancestor_filediffs(to_filter, all_filediffs=None):
150
def exclude_ancestor_filediffs(to_filter, all_filediffs=None):
79
    """Exclude all ancestor FileDiffs from the given list and return the rest.
151
    """Exclude all ancestor FileDiffs from the given list and return the rest.
420 lines
reviewboard/diffviewer/filediff_creator.py
reviewboard/diffviewer/parser.py
reviewboard/diffviewer/tests/test_commit_utils.py
reviewboard/diffviewer/tests/test_filediff_creator.py
reviewboard/scmtools/hg.py
Loading...