Add better MySQL compatibility, new arguments, and fixes to condensediffs.

Review Request #10909 — Created Feb. 21, 2020 and submitted

Information

Review Board
release-3.0.x
0a0acb8...

Reviewers

condensediffs has been around since Review Board 1.7, and has changed
a bit over the years to accommodate changes to our diff storage, but has
largely still remained the same command. This is the first major change
to condensediffs and to the diff migration machinery since the
introduction of the compressed diff storage in Review Board 2.5.

When we moved to the new storage mechanism and updated the migration
process, we uncovered a bug that affected MySQL users using InnoDB for
the diffviewer_filediffdata table (in other words, most MySQL users)
where fetching a row count on that table could take a very long time.
This is because InnoDB does not store accurate row counts, so the table
has to be walked every time a count is needed (database-side query
caching helps mitigate this), which can cause serious problems on large
databases.

In order to address that, and to improve migration overall for large
databases, this change makes a number of improvements to condensediffs
and to the overall diff migration process.

  1. On MySQL, we now use an alternative method to fetch row counts
    (roughly sketched after this list). On MyISAM, this will be accurate,
    but on InnoDB it will be an approximation (which may be quite wrong).
    Still, it's fast. When this happens, we warn the user that the count
    may be inaccurate, and that this is safe.

  2. Row counts are no longer a factor in the migration process. They are
    now purely informative, used for progress reporting. That means that
    an inaccurate row count on InnoDB won't break anything.

  3. condensediffs can now be told to perform a migration without even
    fetching any counts by using --no-progress. This provides a simple
    "XX diffs migrated" status message, and is suitable for running in
    scheduled migration operations.

  4. condensediffs can also now be told to migrate only a specified
    number of diffs by using --max-diffs=<value>. This is also helpful
    for scheduled migrations, when you may want to process a certain
    number per night while the server isn't being used as heavily.

  5. It can also now show the total number of unmigrated FileDiff and
    LegacyFileDiffData objects, allowing people to see what's left
    without having to run a migration and Control-C it.

  6. Some legacy code around FileDiff migrations wasn't showing
    accurate diff savings (showing 0% or close to it). The calculations
    made sense when converting to LegacyFileDiffData, but not to
    RawFileDiffData. They've been updated to show the savings from that
    conversion.

  7. Calculating the number of unmigrated FileDiffs is now faster.
    Instead of comparing diff content against empty strings, we now just
    compare references to the diff data tables (also sketched below).
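
As a rough illustration of the approximate row-count approach in item 1,
the idea on MySQL is to read the row estimate the storage engine already
tracks instead of issuing a COUNT(*) query. This is only a sketch under
assumptions (the helper name and cursor handling are illustrative, not
the actual implementation):

    from django.db import connection

    def approximate_row_count(table_name):
        """Return a fast, possibly inexact row count for a MySQL table.

        MyISAM stores exact row counts in its table metadata. InnoDB only
        stores an estimate, so the value may be off (sometimes by a lot),
        but fetching it avoids walking the whole table.
        """
        cursor = connection.cursor()

        try:
            cursor.execute(
                'SELECT TABLE_ROWS'
                '  FROM information_schema.TABLES'
                ' WHERE TABLE_SCHEMA = DATABASE()'
                '   AND TABLE_NAME = %s',
                [table_name])
            row = cursor.fetchone()
        finally:
            cursor.close()

        return row[0] if row else 0

Since row counts are now purely informative (item 2), a bad estimate here
only affects progress reporting, never the migration itself.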
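
For item 7, counting by references rather than content boils down to a
queryset along these lines. The field names (diff_hash and
legacy_diff_hash) are assumptions based on the model layout, so treat
this as a sketch rather than the exact query:

    from reviewboard.diffviewer.models import FileDiff

    # A FileDiff still needs migrating if it doesn't yet point at a row
    # in one of the diff data tables. Checking the foreign keys is a
    # cheap, indexed lookup, unlike comparing the (potentially large)
    # diff content against an empty string.
    unmigrated = FileDiff.objects.filter(diff_hash__isnull=True,
                                         legacy_diff_hash__isnull=True)

    print('%d unmigrated FileDiffs' % unmigrated.count())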

Documentation has also been updated to better help understand the code
and processes.

Populated a database with thousands of FileDiff and LegacyFileDiffData
objects. Performed dozens and dozens of migrations against copies of this
database, using all the new command line arguments to test that the
functionality all worked as expected.

Tested --max-diffs with 0, with values that align with the batch and
object fetch limit boundaries, with the total number of FileDiffs,
total number of LegacyFileDiffDatas, the total number for both, values
that surpass that, and all sorts of numbers in-between (including funky,
non-round numbers). Verified that the requested number never broke
anything, and that that number of diffs was always migrated (unless it
was more than the number in the database).

Tested progress reporting with a bad estimate (both under and over the
actual row count). Verified the finalized progress update showed 100%
and reflected the actual number of diffs processed.

Tested the row calculation on MySQL, and that the warning was displayed
when using InnoDB.

david
  1. reviewboard/diffviewer/managers.py (Diff revision 1)

    These should be passed in as args to logger.exception instead of
    formatting ahead of time.
  2. reviewboard/diffviewer/managers.py (Diff revision 1)

    Missing the format parameter

  3. reviewboard/diffviewer/managers.py (Diff revision 1)

    hrm, this is pretty confusing and I think it might be wrong.

    Shouldn't we be doing queryset[total_processed:limit] here?

    1. It's correct, with the explanation about why being in the comment block above (though, no longer immediately above the for loop, since we're now calculating the limit as part of the operation).

      When we're starting that loop, we're in one of two modes:

      1. We don't have anything in the batch of pending objects yet, so batch_len == 0. We're therefore fetching up to the limit.
      2. We have a partial batch, but we haven't yielded it yet. This is because we've already fetched the per-query limit (object_limit), but haven't filled up the batch to batch_size yet. So we want to start where we left off, and then go up to the limit.

      When we have filled up a batch, we yield it. The migration code for that type then processes it and clears out the items that would be matched by that queryset (outright deleting LegacyFileDiffData entries, or making changes to FileDiffs that disassociate the diffs -- in either case, reducing the subset of items matched in queryset).

      Because of that, the next time we do queryset[:limit].iterator(), we're getting the next set of items.

      If we instead used total_processed:limit, we'd end up skipping items. If our yielded batch was 20 items, and total_processed was then 20, then queryset[total_processed:limit] would skip a batch (the original 20 items are already excluded from the queryset as normal, so the slice would then start 20 items into what's left, skipping the 20 in-between).
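
      To make the shape of that loop a little more concrete, here's a
      stripped-down sketch of the pattern being described. The names
      (object_limit, batch_size) match the ones above, but this is a
      simplified stand-in, not the actual migration code:

          def iter_migration_batches(queryset, object_limit=200,
                                      batch_size=40):
              """Yield batches of unmigrated objects until none remain.

              The caller migrates each yielded batch, which removes those
              rows from the set matched by queryset. Because of that, each
              new queryset[...:object_limit] slice naturally begins at the
              next unprocessed object; offsetting by the total processed
              so far would skip items.
              """
              batch = []

              while True:
                  found = False

                  # Start after any partial batch we're still holding
                  # (those objects haven't been migrated yet, so the
                  # queryset still matches them), and fetch up to the
                  # per-query limit.
                  for obj in queryset[len(batch):object_limit].iterator():
                      found = True
                      batch.append(obj)

                      if len(batch) == batch_size:
                          yield batch
                          batch = []

                  if not found:
                      break

              if batch:
                  yield batch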

chipx86
david
  1. Ship It!
chipx86
Review request changed
Status: Completed
Change Summary:
Pushed to release-3.0.x (5279387)