Fish Trophy

chipx86 got a fish trophy!

Fish Trophy

Improve performance when rendering diff lines.

Review Request #5555 — Created Feb. 27, 2014 and submitted

Information

Review Board
release-2.0.x
00d477d...

Reviewers

This makes a number of optimizations to our diff rendering process:

  1. Some unnecessary string comparisons from the diff_lines template tag
    have been simplified, so that we're only ever comparing one string.
    Small change, but one that should add up over the course of diffs.
    From performance testing, this appears to be roughly 30% faster
    overall.

  2. We were checking if we needed to highlight regions for every
    single line, not just replace lines. This meant we did empty string
    checks, length checks, and string comparisons for every single line.
    These also added up, even though it only ever triggered the
    highlight region code for replace lines.

Instead of all this work, we now check if it's a replace line
up-front, preventing all the other checks.

  1. Highlighting regions for a replace line is now twice as fast.
    Instead of parsing every character and keeping track of state,
    we're making use of str.find, which is much faster. We're also
    grabbing ranges of a string at a time, and reducing accesses to
    indexes of regions.

All unit tests pass. Viewed some diffs, and they worked.

Ran a series of performance tests. The following correspond to the itemized
tasks above:

  1. Made a test script for the comparisons and ran it 2,000,000 times.
    The old code took 1.2 seconds. The new code took 0.84 seconds.

  2. Made a test script simulating comparisons for 200 equal lines, 50 inserts,
    20 deletes, and 40 replaces. Ran this 20,000 times. Without the "replace"
    check, it took 2.65 seconds. With the check, it took 1.85 seconds.

  3. Another test script for highlightregion. Ran it on a fairly standard set
    of replaces, 200,000 times. The old code took 6.52 seconds. The new code
    took 3.38 seconds.

Description From Last Updated

Can we make this an argument to _diff_line? I don't like using 'self' to pass arguments. Same with _cur_meta, actually.

daviddavid

I'm not sure I like the new code. It's very verbose. If you wanted to avoid unnecessary string compares, you …

daviddavid
david
  1. 
      
  2. Show all issues

    Can we make this an argument to _diff_line? I don't like using 'self' to pass arguments. Same with _cur_meta, actually.

    1. I can if I pass in a lambda *args: _diff_line(...., *args). I did that at first and then moved to _cur_tag to be consistent with _cur_meta. Felt the overhead of one more function call wasn't worth it, but I didn't benchmark it.

    2. Might be worth including functools.partial() in benchmarking this.

    3. Ran a couple benchmarks that just tested looping and calling the function.

      Test 1 set a member variable for the tag and meta and called the function, which did a brief comparison on the variables. Ran it over 500 "lines", 20,000 times. It took 3.75 seconds on average.

      Test 2 did the same thing, but on each iteration, it used functools.partial() and then called the resulting function. That took an average of 6.53 seconds.

      That's not super real-world, as that's really a lot of lines to process. Shortened both tests to 500 lines, 200 times, and got 0.06 seconds vs. 0.1 second.

      So not a huge difference, especially given the other operations. I can go either way.

    4. I think this new diff is much more clear here.

  3. reviewboard/diffviewer/templatetags/difftags.py (Diff revision 1)
     
     
     
     
     
     
     
     
     
     
     
     
     
     
    Show all issues

    I'm not sure I like the new code. It's very verbose. If you wanted to avoid unnecessary string compares, you could chain the state:

    is_equal = (change == 'equal')
    is_replace = (not is_equal) and (change == 'replace')
    is_insert = (not is_replace) and (change == 'insert')
    is_delete = (not is_insert) and (change == 'delete')
    

    Alternatively, we could have an enum for the chunk type, and use a dict to map from strings to that. Then the code could change from if is_equal: to something like if chunk_type is EQUAL:

    1. Given the options, I prefer my new change, even though it's more verbose. It's fewer comparisons, which is what I'm going for here.

      The idea of an enum is something I am actually intending to do as part of a larger refactor, after 2.0. I have some plans to reduce the number of string compares and significantly decrease the amount of data we have to cache.

    2. Hmm. OK. As long as we do something about this ugliness in the future...

    3. Definitely. There's a lot of performance improvements we can get if we stop doing string comparisons everywhere.

  4. 
      
chipx86
david
  1. Ship It!

  2. 
      
chipx86
Review request changed
Status:
Completed