Summary

Add better MySQL compatibility, new arguments, and fixes to condensediffs.

Review Request #10909 — Created Feb. 21, 2020 and submitted March 12, 2020, 6:53 p.m.

Information

Owner

chipx86

Repository

Review Board

Branch

release-3.0.x

Bugs

Depends On

Commit

0a0acb8...

Reviewers

Groups

reviewboard

People

Description

condensediffs has been around since Review Board 1.7, and has changed
a bit over the years to accommodate changes to our diff storage, but has
largely still remained the same command. This is the first major change
to condensediffs and to the diff migration machinery since the
introduction of the compressed diff storage in Review Board 2.5.

When we moved to the new storage mechanism and updated the migration
process, we uncovered a bug that affected MySQL users using InnoDB for
the diffviewer_filediffdata table (in other words, most MySQL users)
where fetching a row count on that table could take a very long time.
This is due to the fact that InnoDB does not store accurate row counts,
requiring that the table is walked every time a count is needed
(database-side query caching helps to mitigate this), and this can cause
some serious issues with large databases.

In order to address that, and to improve migration overall for large
databases, this change makes a number of improvements to condensediffs
and to the overall diff migration process.

On MySQL, we now use an alternative method to fetch row counts. On
MyISAM, this will be accurate, but on InnoDB it will be an
approximation (which may be quite wrong). Still, it's fast. We warn
the user about this when it happens, and tell them that it may be
inaccurate but it's safe.
Row counts are no longer a factor in the migration process. They are
now purely informative, used for progress reporting. That means that
an inaccurate row count on InnoDB won't break anything.
condensediffs can now be told to perform a migration without even
fetching any counts by using --no-progress. This provides a simple
"XX diffs migrated" status message, and is suitable for running in
scheduled migration operations.
condensediffs can also now be told to migrate only a specified
number of diffs by using --max-diffs=<value>. This is also helpful
for scheduled migrations, when you may want to process a certain
number per night while the server isn't being used as heavily.
It can also now show the total number of unmigrated FileDiff and
LegacyFileDiffData objects, allowing people to see what's left
without having to run a migration and Control-C it.
Some legacy code around FileDiff migrations weren't showing
accurate diff savings (showing 0% or close to it). The calculations
made sense when converting to the LegacyFileDiffData, but not
RawFileDiffData. They've been updated to show the savings from that
conversion.
Calculating the number of unmigrated FileDiffs is now faster. We're
no longer trying to compare diff content against empty strings, and
instead just comparing references to the diff data tables.

Documentation has also been updated to better help understand the code
and processes.

Testing Done

Populated a database with thousands of FileDiff and LegacyFileDiffData
objects. Performed dozens and dozens of migrations against copies of this
database, using all the new command line arguments to test that the
functionality all worked as expected.

Tested --max-diffs with 0, with values that align with the batch and
object fetch limit boundaries, with the total number of FileDiffs,
total number of LegacyFileDiffDatas, the total number for both, values
that surpass that, and all sorts of numbers in-between (including funky,
non-round numbers). Verified that the requested number never broke
anything, and that that number of diffs was always migrated (unless it
was more than the number in the database).

Tested progress reporting with a bad estimate (both under and over the
actual row count). Verified the finalized progress update showed 100%
and reflected the actual number of diffs processed.

Tested the row calculation on MySQL, and that the warning was displayed
when using InnoDB.

Issues

Description	From	Last Updated
These should be passed in as args to logger.exception instead of formatting ahead of time.	david	Feb. 24, 2020, 1:33 p.m.
Missing the format parameter	david	Feb. 24, 2020, 1:33 p.m.
hrm, this is pretty confusing and I think it might be wrong. Shouldn't be be doing queryset[total_processed:limit] here?	david	Feb. 23, 2020, 4:18 p.m.

flake8 passed.

JSHint passed.

reviewboard/diffviewer/managers.py (Diff revision 1)

The issue has been resolved. Show all issues

These should be passed in as args to logger.exception instead of formatting ahead of time.

reviewboard/diffviewer/managers.py (Diff revision 1)
The issue has been resolved. Show all issues
```
Missing the format parameter
```

reviewboard/diffviewer/managers.py (Diff revision 1)

The issue has been dropped. Show all issues

hrm, this is pretty confusing and I think it might be wrong.

Shouldn't be be doing queryset[total_processed:limit] here?

chipx86 Feb. 23, 2020, 4:18 p.m.

It's correct, with the explanation about why being in the comment block above (though, no longer immediately above the for loop, since we're now calculating the limit as part of the operation).

When we're starting that loop, we're in one of two modes:

We don't have anything in the batch of pending objects yet, so batch_len == 0. We're therefore fetching up to the limit.
We have a partial batch, but we haven't yielded it yet. This is because we've already fetched the per-query limt (object_limit), but haven't filled up the batch to batch_size yet. So we want to start where we left off, and then go up to the limit.

When we have filled up a batch, we yield it. The migration code for that type then processes it and clears out the items that would be matched by that queryset (outright deleting LegacyFileDiffData entries, or making changes to FileDiffs that disassociate the diffs -- in either case, reducing the subset of items matched in queryset).

Because of that, the next time we do queryset[:limit].iterator(), we're getting the next set of items.

If we instead used total_processed:limit, we'd end up skipping items. If our yielded batch was 20 items, and total_processed was then 20, then queryset[total_processed:limit] would skip a batch (original 20 items excluded as normal from the queryset, and then going 20 into the next batch, skipping the 20 in-between).

Change Summary:

Fixed some arguments to logging functions.

Commit:

67c5fd7bf8340fbbff3a56d52b412737614cf09e

0a0acb83cf0a63abd5d3f1e83ff2210f49e67a16

Diff:

Revision 2 (+478 -127)

Show changes

	reviewboard/diffviewer/managers.py
	reviewboard/diffviewer/management/commands/condensediffs.py

Checks run (2 succeeded)

flake8 passed.

JSHint passed.

Ship it!

```
Ship It!
```

Status:: Completed
Change Summary:: Pushed to release-3.0.x (5279387)