• 
      
    Fish Trophy

    chipx86 got a fish trophy!

    Fish Trophy

    Store new and migrated diffs as raw, compressed binary data.

    Review Request #6226 — Created Aug. 13, 2014 and submitted

    Information

    Review Board
    master
    5a3696c...

    Reviewers

    Historically, due to the limitations of Django's support for database
    field types, we had to store diff data as base64-encoded data. This used
    to be stored on FileDiffs, and then we migrated de-duplicated
    FileDiffData entries. Still, the size of the diffs added up
    considerably.

    This change modernizes the storage by making use of Django 1.6's
    BinaryField, which allows for storing raw binary content. That means we
    can store a diff as-is without dealing with base64. We then go a step
    further to bzip2 compress the content. If the resulting compressed size
    is smaller than the raw diff content, we store that instead. For larger
    diffs, this saves a lot of space.

    This new content is stored in a new RawFileDiffData table, which is
    similar to FileDiffData, but with a few key differences. It uses
    BinaryField, of course, but it also has a field for the compression type
    (we can add additional entries down the road), and it uses numeric IDs
    for the primary key. FileDiffData used hashes of the diff content, which
    inflated the FileDiff table, requiring 40 bytes for the key value. That
    also slowed down lookups, since a 40 byte string is more expensive than
    a numeric ID. This hash still exists, but is in its own field now.

    FileDiffData has been renamed to LegacyFileDiffData. Likewise, the
    fields in FileDiff referencing that table have been renamed in the
    model. To prevent expensive database upgrades, the table name and
    the columns in FileDiff referencing it have not been renamed. The
    renames are done purely on the Python side.

    When accessing the diff content associated with a FileDiff, we
    automatically perform the migration to RawFileDiffData. This happens
    whether FileDiff is storing its own diff content, or whether it's using
    a LegacyFileDiffData for storage.

    This results in a typical savings of ~80% on production databases.

    Clicked several diffs and saw that the content was migrated out of
    LegacyFileDiffData and into RawFileDiffData for those diffs.

    Unit tests pass.

    Ran condensediffs (coming in another change) on sqlite, MySQL, and
    PostgreSQL. Ran this many times during development, actually. It
    succeeded with no LegacyFileDiffData entries remaining, and the
    same number of RawFileDiffData entries to match.

    On sqlite and MySQL, I wrote some code to dump out FileDiff IDs and
    their diff and parent diff binary hashes. I ran this before and after
    a conversion. The results were identical, meaning that all data was
    properly associated and represented.

    After all this, I had no issues loading diffs that were converted.

    Description From Last Updated

    This doesn't seem like it really ought to be localized, but if you really want to, the string should be …

    daviddavid

    Col: 80 E501 line too long (80 > 79 characters)

    reviewbotreviewbot
    reviewbot
    1. Tool: Pyflakes
      Processed Files:
          reviewboard/diffviewer/models.py
          reviewboard/diffviewer/tests.py
          reviewboard/diffviewer/evolutions/__init__.py
          reviewboard/diffviewer/evolutions/raw_diff_file_data.py
          setup.py
          reviewboard/diffviewer/managers.py
      
      
      
      Tool: PEP8 Style Checker
      Processed Files:
          reviewboard/diffviewer/models.py
          reviewboard/diffviewer/tests.py
          reviewboard/diffviewer/evolutions/__init__.py
          reviewboard/diffviewer/evolutions/raw_diff_file_data.py
          setup.py
          reviewboard/diffviewer/managers.py
      
      
    2. reviewboard/diffviewer/tests.py (Diff revision 1)
       
       
      Show all issues
      Col: 80
       E501 line too long (80 > 79 characters)
      
    3. 
        
    david
    1. 
        
    2. reviewboard/diffviewer/models.py (Diff revision 1)
       
       
       
      Show all issues

      This doesn't seem like it really ought to be localized, but if you really want to, the string should be formatted using named format specifiers.

    3. 
        
    chipx86
    reviewbot
    1. Tool: Pyflakes
      Processed Files:
          reviewboard/diffviewer/models.py
          reviewboard/diffviewer/tests.py
          reviewboard/diffviewer/evolutions/__init__.py
          reviewboard/diffviewer/evolutions/raw_diff_file_data.py
          setup.py
          reviewboard/diffviewer/managers.py
      
      
      
      Tool: PEP8 Style Checker
      Processed Files:
          reviewboard/diffviewer/models.py
          reviewboard/diffviewer/tests.py
          reviewboard/diffviewer/evolutions/__init__.py
          reviewboard/diffviewer/evolutions/raw_diff_file_data.py
          setup.py
          reviewboard/diffviewer/managers.py
      
      
    2. 
        
    david
    1. Ship It!

    2. 
        
    chipx86
    Review request changed
    Status:
    Completed
    Change Summary:
    Pushed to master (5e05802)