[Performance] Improve parsing of huge diffs (hours -> minutes)
Review Request #8075 — Created March 25, 2016 and discarded — Latest diff uploaded
This patch is simple, but it significantly reduces parsing time for huge diffs (size > 10 MB, lines > 500K). The current implementation concatenates the string line by line, so each append copies the whole accumulated buffer, giving O(N*S) total copying cost (where N is the number of lines per file in the diff and S is the file diff size), and the allocation churn drives peak memory up. The proposed implementation collects the lines into a list and then joins them into one string in a single pass, which is linear in S.
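A minimal sketch of the pattern (function and variable names are illustrative, not the actual Review Board code): replace per-line string concatenation with list accumulation plus a single join.

```python
def build_file_data_slow(lines):
    # Old approach: each += copies the entire accumulated string,
    # so total work grows quadratically with the diff size.
    data = b''
    for line in lines:
        data += line + b'\n'
    return data


def build_file_data_fast(lines):
    # New approach: list.append is amortized O(1) per line, and a
    # single join concatenates everything in one linear pass.
    parts = []
    for line in lines:
        parts.append(line)
    return b'\n'.join(parts) + b'\n'
```

Both functions produce identical output; only the allocation behavior differs, which is what matters at the 500K-line scale described above.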
The issue was found on a huge diff file (500K lines, 30 MB). The original run took around 30 minutes before the process was killed by the Linux OOM killer. After the fix, processing finished successfully in a few minutes (2-3).
For reference, cProfile and perf graphs are attached to the review request: the method parse_diff_header consumes 95% of the time, and most of that is spent on string concatenation.
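For anyone wanting to reproduce this kind of measurement, a minimal cProfile workflow (the workload function here is a stand-in, not parse_diff_header) looks like:

```python
import cProfile
import io
import pstats


def hot_function():
    # Stand-in workload: repeated concatenation, the same pattern
    # that dominated the attached profile.
    s = ''
    for _ in range(10000):
        s += 'x' * 10
    return s


profiler = cProfile.Profile()
profiler.enable()
hot_function()
profiler.disable()

# Sort by cumulative time to see which functions dominate the run.
out = io.StringIO()
pstats.Stats(profiler, stream=out).sort_stats('cumulative').print_stats(5)
report = out.getvalue()
print(report)
```

The top entries of the report show where the time goes, which is how a single hot method like parse_diff_header stands out.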
The fix has been live in our production environment for the last 2 weeks. No regressions have been found.