[Performance] Improve parsing of huge diffs (hours -> minutes)

Review Request #8075 — Created March 25, 2016 and discarded

Information

Review Board

Reviewers

This patch is simple, but significantly reduces parsing time for huge diffs (size > 10 MB, more than 500K lines). The current implementation concatenates the data string line by line, so the total memory traffic is O(N*S), where N is the number of lines per file in the diff and S is the size of that file's diff. The proposed implementation collects the lines into a list and then concatenates them into one string at the end, so the cost stays linear in the number of lines, O(N).
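
As a rough sketch of the difference (the lines value below is a made-up stand-in for one file's diff, not the actual parser code):

    lines = [b'@@ -1 +1 @@', b'-old', b'+new']   # stand-in for one file's diff lines

    # Current approach: each += copies the accumulated string, so total copying
    # can grow roughly quadratically with the number of lines.
    data = b''
    for line in lines:
        data += line + b'\n'

    # Proposed approach: collect the lines in a list and join them once at the end.
    buf = []
    for line in lines:
        buf.append(line)
    data = b'\n'.join(buf) + b'\n'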

The issue was found with a huge diff file (500K lines, 30 MB). The original processing time was around 30 minutes, and the process was killed by the Linux OOM killer. After the fix, processing took a few minutes (2-3) and finished successfully.

For reference, cProfile and perf graphs are attached to the review request: the parse_diff_header method consumes 95% of the time, and most of that time is spent on string concatenation.
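
A profile like that can be captured with the standard-library profiler; for example (parse_diff() here is a placeholder for whatever entry point is being measured):

    import cProfile
    import pstats

    # Profile the parsing call and dump the stats to a file.
    cProfile.run('parse_diff()', 'parse.prof')

    # Print the 20 most expensive functions by cumulative time.
    pstats.Stats('parse.prof').sort_stats('cumulative').print_stats(20)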

The fix has been running in our production environment for the last two weeks, with no regressions found.


Open issues (Description / From)

Col: 38 E261 at least two spaces before inline comment (reviewbot)

Col: 39 E262 inline comment should start with '# ' (reviewbot)

Col: 9 E265 block comment should start with '# ' (reviewbot)

I believe this will result in an extra newline. How about: file.data += b'\n'.join(buf) + '\n' (david)

Style nit: blank line between these. (brennie)

Same comment here with the extra newline. (david)

This isn't necessary, since we're done with the buffer. (david)

reviewbot
  1. Tool: Pyflakes
    Processed Files:
        reviewboard/diffviewer/parser.py

    Tool: PEP8 Style Checker
    Processed Files:
        reviewboard/diffviewer/parser.py

  2. reviewboard/diffviewer/parser.py (Diff revision 1)
    Col: 38 E261 at least two spaces before inline comment

  3. reviewboard/diffviewer/parser.py (Diff revision 1)
    Col: 39 E262 inline comment should start with '# '

  4. reviewboard/diffviewer/parser.py (Diff revision 1)
    Col: 9 E265 block comment should start with '# '

reviewbot
  1. Tool: Pyflakes
    Processed Files:
        reviewboard/diffviewer/parser.py

    Tool: PEP8 Style Checker
    Processed Files:
        reviewboard/diffviewer/parser.py

david
  1. Perhaps we should use BytesIO?

  2. reviewboard/diffviewer/parser.py (Diff revision 2)

    I believe this will result in an extra newline. How about:

    file.data += b'\n'.join(buf) + '\n'

    1. Thanks! To avoid an extra malloc, let's append an empty string:

      buf.append(b'')
      file.data += b'\n'.join(buf)
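
      A quick illustration, with made-up values, of why the empty trailing element preserves the final newline:

      b'\n'.join([b'a', b'b', b''])   # == b'a\nb\n'
      b'\n'.join([b'a', b'b'])        # == b'a\nb'  (no trailing newline)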
      
  3. reviewboard/diffviewer/parser.py (Diff revision 2)

    Same comment here with the extra newline.

  4. reviewboard/diffviewer/parser.py (Diff revision 2)

    This isn't necessary, since we're done with the buffer.

brennie
  1. reviewboard/diffviewer/parser.py (Diff revision 2)

    Style nit: blank line between these.

reviewbot
  1. Tool: Pyflakes
    Processed Files:
        reviewboard/diffviewer/parser.py

    Tool: PEP8 Style Checker
    Processed Files:
        reviewboard/diffviewer/parser.py

chipx86
  1. Hey Michael,

    This looks like a great improvement! We've been writing new code to use this same method, but the diff parser is much older, and we weren't aware of this performance issue before.

    I'll need to take some time to dig into this (the change is small, but I'll want to do some additional testing anyway). This will be great to get in, though! Thanks!

chipx86
  1. Hey, thanks again for the contribution. Your e-mail just now reminded me of it (been busy with other things, as usual).

    The fundamental idea of moving away from string concatenation and toward list processing is probably the right move. String concatenation in Python is actually faster than appending to lists and then joining these days, but there is the memory requirement, as you note. By storing lines in a list and joining once at the end, we can reuse the strings coming in and concatenate only once, saving a good deal of memory.

    However, the way it's being done here only solves a part of the problem (other SCMTools, like Git, have custom diff parsers that also manipulate file.data and aren't receiving any benefit here and in fact likely negate the performance benefits, if they don't flat-out generate incorrect results). There are other concerns I have about assumptions being made with regard to the file contents (I'll cover this below), and API breakage (parse_diff_line's signature changes, which can break subclasses).

    For the performance gains to be optimal and the results to be correct, the following must be true:

    1. Data must always go into the list, and we must never concatenate to file.data until we're finished.
    2. Data from self.lines should always go into the list as-is, unless we're modifying the line in question. By retaining the exact string, we take advantage of the ref counting of the string, saving memory.
    3. Since subclasses can freely override parts of the diff parsing, this function cannot safely assume that the joined list (b'\n'.join(buf)) can be appended to file.data at the places where it assumes this now.

    The last one is a problem. Right now, what's happening is you're building this buffer and assuming at the end that you have a complete picture of the state of the diff, later appending its contents. However, subclasses are also going to assume they can modify file.data, and will do so before you append anything, causing lines to be out of order. This would happen with Git if it didn't override parse (which also means it's not getting any performance benefits here).
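
    As a toy illustration of that ordering problem (the line contents and variable names here are made up, not the real parser code):

    buf = []
    file_data = b''

    # The base parser collects its lines in a buffer, planning to append them at the end.
    buf.append(b'line the base parser saw first')

    # A subclass hook appends straight to the file's data in the middle of parsing.
    file_data += b'line a subclass appended afterwards\n'

    # The base parser flushes its buffer only when it finishes.
    file_data += b'\n'.join(buf) + b'\n'

    # file_data now holds the subclass's line before the line the base parser
    # actually saw first, so the output is out of order.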

    So to fix all this without having to rewrite and break the world, I suggest the following:

    1. Introduce a new object, let's call it DiffBytes, that acts like a string (can be concatenated using +=) but wraps appending to a list. Something like:

    class DiffBytes(object):
        """Accumulate byte strings in a list behind a string-like += API."""

        def __init__(self):
            self._buf = []

        def __iadd__(self, s):
            # Appending keeps a reference to the incoming string instead of
            # copying the accumulated data on every concatenation.
            self._buf.append(s)
            return self

        def __str__(self):
            # Join once at the end to produce the final byte string.
            return b''.join(self._buf)
    

    2. Set file.data to this initially, and normalize it back to a string when parsing has completed for the file (see the usage sketch below).
      What the code is doing today, adding more data, is fine, but string concatenation in Python isn't very optimal. Lists and buffers are, but we can't change every call without breaking API compatibility (and can't change function signatures for the same reasons). So instead, how about taking data and making it something other than a string.
    3. When joining, don't put newlines in-between. Just do b''.join(buf). Leave newlines to the code doing the += (or you'll break code assuming it can build strings piece-meal).
    4. For those callers doing a += on a line coming in from self.lines, split those into two calls: one for the line (so we can reuse an existing string in memory), and one for the +=.

    This should give you the same speed benefits that your change has for all SCMs but Git, and introduce those benefits to Git, and any other subclasses modifying data themselves. It should also keep the API stable.
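
    A rough sketch of how that could look (the lines value is a made-up stand-in for the parser's input; only the DiffBytes class above is assumed):

    lines = [b'--- a/foo.py', b'+++ b/foo.py']   # stand-in for one file's diff lines

    data = DiffBytes()
    for line in lines:
        # Existing += call sites keep working; each incoming line is stored
        # as-is, and the newline is appended as a separate piece.
        data += line
        data += b'\n'

    # Normalize back to a plain string once parsing for the file is done
    # (under Python 2, str() invokes DiffBytes.__str__, joining the buffer).
    data = str(data)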

    One thing I noted in the review request is that there was no mention of the unit tests passing. That's also going to be important in order for us to accept the change, as is explicit testing with Git.

    1. Thank you for the detailed review, Christian! Superb! I totally agree with the proposed solution, and it looks easy to do.
      But I'm about to leave for a long holiday, so I'll come back in the middle of May (sorry). I'll then try to find a way to speed this up.

      Thx!

    2. I think using/wrapping BytesIO would be even better than appending to a list and then joining.

    3. I ran some profiling on the various methods. In each scenario, I built up 24,214,784 bytes worth of 1,193,636 lines from a data file, stored as a list. I then processed it, building up a string using the different methods discussed (string concat, list + join, BytesIO). My tests calculated total times for each step, and memory used in the process.

      Setup code was common between each, with a RSS of ~85MB after building up the source list of lines.

      I also instrumented each iteration of the loops, to see how RSS grew.

      Test 1: String concatenation

      Concatenation took ~0.35 seconds, and added ~32MB to RSS.

      Test 2: List join

      Building the list, before joining it into a string, took ~0.33 seconds and added ~25MB to RSS.

      Turning it into a string added another ~0.2 seconds, consuming an additional ~33MB.

      Grand total: ~0.53 seconds and ~60MB to RSS.

      (Note that the list of strings isn't necessarily going to be garbage collected right away.)

      Test 3: BytesIO

      Building the BytesIO took ~0.48 seconds, and added ~34MB to RSS.

      Conclusion

      The list method saves some memory during parsing, at the expense of time, but the memory savings is lost after parsing has completed, until Python is able to reclaim the memory. However, this does appear to be worse overall than we had anticipated.

      BytesIO doesn't appear to be any faster or more efficient than string concatenation.

      So what I'm now wondering, Michael, is whether the problem you initially saw was indeed a performance difference in diff parsing, or whether there was something else going on with the server when the tests were run? Perhaps an Apache worker/thread had ballooned up and was in some bad state, and restarting Apache (when the new patch was deployed) appeared to have fixed things because it was starting fresh?

      Or, maybe my code is wrong. Here it is: https://gist.github.com/chipx86/ae3cf4e7cf353b69b1ba2c96f123852a

      By the way, what version of Review Board were you running?

    4. Running each version a single time isn't a good way to do performance testing. You should make your code do each method several thousand times and take an average.
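
      A minimal sketch of that kind of repeated timing, using the standard timeit module (the input data and iteration counts are illustrative and are not taken from the gist above):

      import io
      import timeit

      # Illustrative input; the real test used ~1.2M lines read from a data file.
      lines = [b'x' * 20 for _ in range(100000)]

      def concat():
          data = b''
          for line in lines:
              data += line + b'\n'
          return data

      def list_join():
          buf = []
          for line in lines:
              buf.append(line)
          return b'\n'.join(buf) + b'\n'

      def bytes_io():
          stream = io.BytesIO()
          for line in lines:
              stream.write(line + b'\n')
          return stream.getvalue()

      # Run each method many times and compare averages rather than a single run.
      for func in (concat, list_join, bytes_io):
          total = timeit.timeit(func, number=20)
          print('%s: %.3f sec/run' % (func.__name__, total / 20))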

david
Review request changed
Status:
Discarded