• 
      
    Fish Trophy

    chipx86 got a fish trophy!

    Fish Trophy

    Improve performance when rendering diff lines.

    Review Request #5555 — Created Feb. 27, 2014 and submitted

    Information

    Review Board
    release-2.0.x
    00d477d...

    Reviewers

    This makes a number of optimizations to our diff rendering process:

    1. Some unnecessary string comparisons from the diff_lines template tag
      have been simplified, so that we're only ever comparing one string.
      Small change, but one that should add up over the course of diffs.
      From performance testing, this appears to be roughly 30% faster
      overall.

    2. We were checking if we needed to highlight regions for every
      single line, not just replace lines. This meant we did empty string
      checks, length checks, and string comparisons for every single line.
      These also added up, even though it only ever triggered the
      highlight region code for replace lines.

    Instead of all this work, we now check if it's a replace line
    up-front, preventing all the other checks.

    1. Highlighting regions for a replace line is now twice as fast.
      Instead of parsing every character and keeping track of state,
      we're making use of str.find, which is much faster. We're also
      grabbing ranges of a string at a time, and reducing accesses to
      indexes of regions.

    All unit tests pass. Viewed some diffs, and they worked.

    Ran a series of performance tests. The following correspond to the itemized
    tasks above:

    1. Made a test script for the comparisons and ran it 2,000,000 times.
      The old code took 1.2 seconds. The new code took 0.84 seconds.

    2. Made a test script simulating comparisons for 200 equal lines, 50 inserts,
      20 deletes, and 40 replaces. Ran this 20,000 times. Without the "replace"
      check, it took 2.65 seconds. With the check, it took 1.85 seconds.

    3. Another test script for highlightregion. Ran it on a fairly standard set
      of replaces, 200,000 times. The old code took 6.52 seconds. The new code
      took 3.38 seconds.

    Description From Last Updated

    Can we make this an argument to _diff_line? I don't like using 'self' to pass arguments. Same with _cur_meta, actually.

    daviddavid

    I'm not sure I like the new code. It's very verbose. If you wanted to avoid unnecessary string compares, you …

    daviddavid
    david
    1. 
        
    2. Show all issues

      Can we make this an argument to _diff_line? I don't like using 'self' to pass arguments. Same with _cur_meta, actually.

      1. I can if I pass in a lambda *args: _diff_line(...., *args). I did that at first and then moved to _cur_tag to be consistent with _cur_meta. Felt the overhead of one more function call wasn't worth it, but I didn't benchmark it.

      2. Might be worth including functools.partial() in benchmarking this.

      3. Ran a couple benchmarks that just tested looping and calling the function.

        Test 1 set a member variable for the tag and meta and called the function, which did a brief comparison on the variables. Ran it over 500 "lines", 20,000 times. It took 3.75 seconds on average.

        Test 2 did the same thing, but on each iteration, it used functools.partial() and then called the resulting function. That took an average of 6.53 seconds.

        That's not super real-world, as that's really a lot of lines to process. Shortened both tests to 500 lines, 200 times, and got 0.06 seconds vs. 0.1 second.

        So not a huge difference, especially given the other operations. I can go either way.

      4. I think this new diff is much more clear here.

    3. reviewboard/diffviewer/templatetags/difftags.py (Diff revision 1)
       
       
       
       
       
       
       
       
       
       
       
       
       
       
      Show all issues

      I'm not sure I like the new code. It's very verbose. If you wanted to avoid unnecessary string compares, you could chain the state:

      is_equal = (change == 'equal')
      is_replace = (not is_equal) and (change == 'replace')
      is_insert = (not is_replace) and (change == 'insert')
      is_delete = (not is_insert) and (change == 'delete')
      

      Alternatively, we could have an enum for the chunk type, and use a dict to map from strings to that. Then the code could change from if is_equal: to something like if chunk_type is EQUAL:

      1. Given the options, I prefer my new change, even though it's more verbose. It's fewer comparisons, which is what I'm going for here.

        The idea of an enum is something I am actually intending to do as part of a larger refactor, after 2.0. I have some plans to reduce the number of string compares and significantly decrease the amount of data we have to cache.

      2. Hmm. OK. As long as we do something about this ugliness in the future...

      3. Definitely. There's a lot of performance improvements we can get if we stop doing string comparisons everywhere.

    4. 
        
    chipx86
    david
    1. Ship It!

    2. 
        
    chipx86
    Review request changed
    Status:
    Completed