Fish Trophy

chipx86 got a fish trophy!

Fish Trophy

Store new and migrated diffs as raw, compressed binary data.

Review Request #6226 — Created Aug. 13, 2014 and submitted

Information

Review Board
master
5a3696c...

Reviewers

Historically, due to the limitations of Django's support for database
field types, we had to store diff data as base64-encoded data. This used
to be stored on FileDiffs, and then we migrated de-duplicated
FileDiffData entries. Still, the size of the diffs added up
considerably.

This change modernizes the storage by making use of Django 1.6's
BinaryField, which allows for storing raw binary content. That means we
can store a diff as-is without dealing with base64. We then go a step
further to bzip2 compress the content. If the resulting compressed size
is smaller than the raw diff content, we store that instead. For larger
diffs, this saves a lot of space.

This new content is stored in a new RawFileDiffData table, which is
similar to FileDiffData, but with a few key differences. It uses
BinaryField, of course, but it also has a field for the compression type
(we can add additional entries down the road), and it uses numeric IDs
for the primary key. FileDiffData used hashes of the diff content, which
inflated the FileDiff table, requiring 40 bytes for the key value. That
also slowed down lookups, since a 40 byte string is more expensive than
a numeric ID. This hash still exists, but is in its own field now.

FileDiffData has been renamed to LegacyFileDiffData. Likewise, the
fields in FileDiff referencing that table have been renamed in the
model. To prevent expensive database upgrades, the table name and
the columns in FileDiff referencing it have not been renamed. The
renames are done purely on the Python side.

When accessing the diff content associated with a FileDiff, we
automatically perform the migration to RawFileDiffData. This happens
whether FileDiff is storing its own diff content, or whether it's using
a LegacyFileDiffData for storage.

This results in a typical savings of ~80% on production databases.

Clicked several diffs and saw that the content was migrated out of
LegacyFileDiffData and into RawFileDiffData for those diffs.

Unit tests pass.

Ran condensediffs (coming in another change) on sqlite, MySQL, and
PostgreSQL. Ran this many times during development, actually. It
succeeded with no LegacyFileDiffData entries remaining, and the
same number of RawFileDiffData entries to match.

On sqlite and MySQL, I wrote some code to dump out FileDiff IDs and
their diff and parent diff binary hashes. I ran this before and after
a conversion. The results were identical, meaning that all data was
properly associated and represented.

After all this, I had no issues loading diffs that were converted.

Description From Last Updated

This doesn't seem like it really ought to be localized, but if you really want to, the string should be …

daviddavid

Col: 80 E501 line too long (80 > 79 characters)

reviewbotreviewbot
reviewbot
  1. Tool: Pyflakes
    Processed Files:
        reviewboard/diffviewer/models.py
        reviewboard/diffviewer/tests.py
        reviewboard/diffviewer/evolutions/__init__.py
        reviewboard/diffviewer/evolutions/raw_diff_file_data.py
        setup.py
        reviewboard/diffviewer/managers.py
    
    
    
    Tool: PEP8 Style Checker
    Processed Files:
        reviewboard/diffviewer/models.py
        reviewboard/diffviewer/tests.py
        reviewboard/diffviewer/evolutions/__init__.py
        reviewboard/diffviewer/evolutions/raw_diff_file_data.py
        setup.py
        reviewboard/diffviewer/managers.py
    
    
  2. reviewboard/diffviewer/tests.py (Diff revision 1)
     
     
    Show all issues
    Col: 80
     E501 line too long (80 > 79 characters)
    
  3. 
      
david
  1. 
      
  2. reviewboard/diffviewer/models.py (Diff revision 1)
     
     
     
    Show all issues

    This doesn't seem like it really ought to be localized, but if you really want to, the string should be formatted using named format specifiers.

  3. 
      
chipx86
reviewbot
  1. Tool: Pyflakes
    Processed Files:
        reviewboard/diffviewer/models.py
        reviewboard/diffviewer/tests.py
        reviewboard/diffviewer/evolutions/__init__.py
        reviewboard/diffviewer/evolutions/raw_diff_file_data.py
        setup.py
        reviewboard/diffviewer/managers.py
    
    
    
    Tool: PEP8 Style Checker
    Processed Files:
        reviewboard/diffviewer/models.py
        reviewboard/diffviewer/tests.py
        reviewboard/diffviewer/evolutions/__init__.py
        reviewboard/diffviewer/evolutions/raw_diff_file_data.py
        setup.py
        reviewboard/diffviewer/managers.py
    
    
  2. 
      
david
  1. Ship It!

  2. 
      
chipx86
Review request changed
Status:
Completed
Change Summary:
Pushed to master (5e05802)