[Performance] Improve parsing of huge diffs (hours -> minutes)
Review Request #8075 — Created March 25, 2016 and discarded
This patch is simple, but significantly reduces parsing time for huge diffs (size > 10MB, lines > 500K). The current implementation concatenates strings line by line, so it consumes O(N*S) memory, where N is the number of lines per file in the diff and S is the file's diff size. The proposed implementation collects the lines into an array and then concatenates them into one string, so it requires less memory: O(N).
The issue was found with a huge diff file (500K lines, 30MB). Original processing took around 30 minutes, and the process was killed by the Linux OOM killer. After the fix, processing finished successfully in a few minutes (2-3).
For reference, cProfile & perf graphs are attached to the review request: the method parse_diff_header consumes 95% of the time, most of it spent on string concatenation.
The fix has been live in our production environment for the last 2 weeks, with no regressions found.
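As a rough illustration of the pattern being described (my own sketch, not the actual patch; function names are illustrative):

```python
# Sketch of the two approaches described above; not the actual patch.

def parse_concat(lines):
    # Before: each += copies the entire accumulated string, so the total
    # work and allocation churn grow with the square of the diff size.
    data = b''
    for line in lines:
        data += line + b'\n'
    return data

def parse_join(lines):
    # After: collect the lines and join once at the end. Each line object
    # is reused as-is, and the full string is built in a single pass.
    return b'\n'.join(lines) + b'\n'
```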
Description | From | Last Updated |
---|---|---|
Col: 38 E261 at least two spaces before inline comment | reviewbot | |
Col: 39 E262 inline comment should start with '# ' | reviewbot | |
Col: 9 E265 block comment should start with '# ' | reviewbot | |
I believe this will result in an extra newline. How about: `file.data += b'\n'.join(buf) + '\n'` | david | |
Style nit: blank line between these. | brennie | |
Same comment here with the extra newline. | david | |
This isn't necessary, since we're done with the buffer. | david | |
-
Tool: Pyflakes
Processed Files: reviewboard/diffviewer/parser.py
Tool: PEP8 Style Checker
Processed Files: reviewboard/diffviewer/parser.py
-
Change Summary:
Thanks for the review! The patch has been updated according to David's & Barret's remarks.
-
Diff:
Revision 3 (+15 -4)
-
Tool: Pyflakes
Processed Files: reviewboard/diffviewer/parser.py
Tool: PEP8 Style Checker
Processed Files: reviewboard/diffviewer/parser.py
-
Hey Michael,
This looks like a great improvement! We've been writing new code to use this same method, but the diff parser is much older, and we weren't aware of this performance issue before.
I'll need to take some time to dig into this (the change is small, but I'll want to do some additional testing anyway). This will be great to get in, though! Thanks!
-
Hey, thanks again for the contribution. Your e-mail just now reminded me of it (been busy with other things, as usual).
The fundamental idea of moving away from string concatenation and toward list processing is probably the right move. String concatenation in Python is actually faster than appending to lists and then joining these days, but there is the memory requirement, as you note. By storing in a list and then joining, we can reuse the strings coming in and concat once at the end, saving a good deal of memory.
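For the curious, here's a quick way to measure that trade-off (my own sketch; results vary by CPython version, since in-place str += is only optimized when the string has a single reference):

```python
# Rough micro-benchmark sketch for the speed/memory trade-off noted above.
import timeit

lines = ['x' * 60] * 100000

def concat():
    s = ''
    for line in lines:
        s += line + '\n'
    return s

def join():
    return '\n'.join(lines) + '\n'

print('concat:', timeit.timeit(concat, number=10))
print('join:  ', timeit.timeit(join, number=10))
```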
However, the way it's being done here only solves part of the problem (other SCMTools, like Git, have custom diff parsers that also manipulate `file.data` and aren't receiving any benefit here; in fact, they likely negate the performance benefits, if they don't flat-out generate incorrect results). There are other concerns I have about assumptions being made with regard to the file contents (I'll cover this below), and API breakage (`parse_diff_line`'s signature changes, which can break subclasses).

For the performance gains to be optimal and the results to be correct, the following must be true:
- Data must always go in the list, and we must never concatenate to `file.data` until we're finished.
- Data from `self.lines` should always go into the list as-is, unless we're modifying the line in question. By retaining the exact string, we take advantage of the string's ref counting, saving memory.
- Since subclasses can freely override parts of the diff parsing, this function cannot safely assume that the joined list (`b'\n'.join(buf)`) can be appended to `file.data` at the places where it's assuming this now (see the sketch after this list).
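To make that last point concrete, here's a minimal sketch (hypothetical, not Review Board code) of how a subclass writing to `file.data` before the buffer is flushed scrambles the output:

```python
# Hypothetical sketch of the ordering hazard: the base parser buffers lines,
# while a subclass hook appends directly to file.data in between.
buf = [b'@@ -1 +1 @@', b'-old line']    # lines the base parser has buffered

file_data = b''                          # stands in for file.data
file_data += b'index abc123..def456\n'   # subclass writes directly, too early

file_data += b'\n'.join(buf) + b'\n'     # base parser flushes its buffer last
# The subclass's bytes now precede lines that were parsed before them,
# so the reconstructed diff is out of order.
```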
The last one is a problem. Right now, what's happening is you're building this buffer and assuming at the end that you have a complete picture of the state of the diff, later appending its contents. However, subclasses are also going to assume they can modify `file.data`, and will do so before you append anything, causing lines to be out of order. This would happen with Git if it didn't override `parse` (which also means Git isn't getting any performance benefits here).

So to fix all this without having to rewrite and break the world, I suggest the following:
- Introduce a new object, let's call it `DiffBytes`, that acts like a string (it can be concatenated using `+=`) but wraps appending to a list. Something like:

  ```python
  class DiffBytes(object):
      def __init__(self):
          self._buf = []

      def __iadd__(self, s):
          self._buf.append(s)
          return self

      def __str__(self):
          return ''.join(self._buf)
  ```

- Set `file.data` to this initially, and normalize it back to a string when parsing has completed for the file. What the code is all doing today, adding more data, is fine, but string concatenation in Python isn't very optimal. Lists and buffers are, but we can't change every call without breaking API compatibility (and we can't change function signatures for the same reasons). So instead, how about taking `data` and making it something other than a string.
- When joining, don't put newlines in-between. Just do `''.join(buf)`. Leave newlines to the code doing the `+=` (or you'll break code assuming it can build strings piecemeal).
- For those callers doing a `+=` on a line coming in from `self.lines`, split those into two calls: one `+=` for the line itself (so we can reuse an existing string in memory), and one for the trailing newline.
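Putting those pieces together, a hypothetical usage sketch (my own illustration, reusing the `DiffBytes` class above; `ParsedFile` is a stand-in, not Review Board's actual file object):

```python
class ParsedFile(object):
    """Stand-in for the parser's per-file object (an assumption)."""
    def __init__(self):
        self.data = DiffBytes()  # the DiffBytes class sketched above

f = ParsedFile()

for line in ['diff --git a/x b/x', '+new line']:
    f.data += line   # the existing line string is reused as-is
    f.data += '\n'   # newline appended as a separate +=, per the last point

# Once parsing of the file has completed, normalize back to a plain string:
f.data = str(f.data)
print(repr(f.data))  # 'diff --git a/x b/x\n+new line\n'
```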
This should give you the same speed benefits your change already provides for all SCMs but Git, introduce those benefits to Git and any other subclasses modifying data themselves, and keep the API stable.
One thing I noted in the review request is that there was no mention of the unit tests passing. That's also going to be important in order for us to accept the change, as will explicit testing with Git.