chipx86 got a fish trophy!

Summary

Improve performance when rendering diff lines.

Review Request #5555 — Created Feb. 27, 2014 and submitted Feb. 28, 2014, 9:53 p.m.

Information

Owner

chipx86

Repository

Review Board

Branch

release-2.0.x

Bugs

Depends On

Commit

00d477d...

Reviewers

Groups

reviewboard

People

Description

This makes a number of optimizations to our diff rendering process:

Some unnecessary string comparisons from the diff_lines template tag
have been simplified, so that we're only ever comparing one string.
Small change, but one that should add up over the course of diffs.
From performance testing, this appears to be roughly 30% faster
overall.
We were checking if we needed to highlight regions for every
single line, not just replace lines. This meant we did empty string
checks, length checks, and string comparisons for every single line.
These also added up, even though it only ever triggered the
highlight region code for replace lines.

Instead of all this work, we now check if it's a replace line
up-front, preventing all the other checks.

Highlighting regions for a replace line is now twice as fast.
Instead of parsing every character and keeping track of state,
we're making use of str.find, which is much faster. We're also
grabbing ranges of a string at a time, and reducing accesses to
indexes of regions.

Testing Done

All unit tests pass. Viewed some diffs, and they worked.

Ran a series of performance tests. The following correspond to the itemized
tasks above:

Made a test script for the comparisons and ran it 2,000,000 times.
The old code took 1.2 seconds. The new code took 0.84 seconds.
Made a test script simulating comparisons for 200 equal lines, 50 inserts,
20 deletes, and 40 replaces. Ran this 20,000 times. Without the "replace"
check, it took 2.65 seconds. With the check, it took 1.85 seconds.
Another test script for highlightregion. Ran it on a fairly standard set
of replaces, 200,000 times. The old code took 6.52 seconds. The new code
took 3.38 seconds.

Issues

Description	From	Last Updated
Can we make this an argument to _diff_line? I don't like using 'self' to pass arguments. Same with _cur_meta, actually.	david	Feb. 28, 2014, 12:46 a.m.
I'm not sure I like the new code. It's very verbose. If you wanted to avoid unnecessary string compares, you …	david	Feb. 28, 2014, 1:38 p.m.

reviewboard/diffviewer/chunk_generator.py (Diff revision 1)

The issue has been resolved. Show all issues

Can we make this an argument to _diff_line? I don't like using 'self' to pass arguments. Same with _cur_meta, actually.

chipx86 Feb. 27, 2014, 8:15 p.m.

I can if I pass in a lambda *args: _diff_line(...., *args). I did that at first and then moved to _cur_tag to be consistent with _cur_meta. Felt the overhead of one more function call wasn't worth it, but I didn't benchmark it.

david Feb. 28, 2014, 12:01 a.m.

Might be worth including functools.partial() in benchmarking this.

chipx86 Feb. 28, 2014, 12:40 a.m.

Ran a couple benchmarks that just tested looping and calling the function.

Test 1 set a member variable for the tag and meta and called the function, which did a brief comparison on the variables. Ran it over 500 "lines", 20,000 times. It took 3.75 seconds on average.

Test 2 did the same thing, but on each iteration, it used functools.partial() and then called the resulting function. That took an average of 6.53 seconds.

That's not super real-world, as that's really a lot of lines to process. Shortened both tests to 500 lines, 200 times, and got 0.06 seconds vs. 0.1 second.

So not a huge difference, especially given the other operations. I can go either way.

david Feb. 28, 2014, 12:56 a.m.

I think this new diff is much more clear here.

reviewboard/diffviewer/templatetags/difftags.py (Diff revision 1)

The issue has been dropped. Show all issues

I'm not sure I like the new code. It's very verbose. If you wanted to avoid unnecessary string compares, you could chain the state:

is_equal = (change == 'equal')
is_replace = (not is_equal) and (change == 'replace')
is_insert = (not is_replace) and (change == 'insert')
is_delete = (not is_insert) and (change == 'delete')


Alternatively, we could have an enum for the chunk type, and use a dict to map from strings to that. Then the code could change from if is_equal: to something like if chunk_type is EQUAL:

chipx86 Feb. 27, 2014, 8:15 p.m.

Given the options, I prefer my new change, even though it's more verbose. It's fewer comparisons, which is what I'm going for here.

The idea of an enum is something I am actually intending to do as part of a larger refactor, after 2.0. I have some plans to reduce the number of string compares and significantly decrease the amount of data we have to cache.

david Feb. 28, 2014, 1 a.m.

Hmm. OK. As long as we do something about this ugliness in the future...

chipx86 Feb. 28, 2014, 1:38 p.m.

Definitely. There's a lot of performance improvements we can get if we stop doing string comparisons everywhere.

Change Summary:


Used functools.partial instead of storing meta and tag on the class.

Commit:

dd332fb16699b7013e12393401d74d3392574113

00d477d1e01afc8fff79cbd18eb52da43a0347c2

Diff:

Revision 2 (+53 -41)

Show changes

	reviewboard/diffviewer/chunk_generator.py
	reviewboard/diffviewer/templatetags/difftags.py

Ship it!

```
Ship It!
```

Status:: Completed