Enabling lexer guessing

Review Request #3156 — Created June 22, 2012 and discarded

Prolog is not Perl, and Pygments is smart enough to differentiate the two if we
give it a chance. Pygments' *_lexer_for_filename() functions can guess the lexer
based on two things...
- the filename
- the contents of the file, especially the shebang line

The latter of the two was disabled in commit '36b761a', making Pygments pretty
dumb about applying syntax highlighting.

For instance, if you ever see red rectangles around '$' characters in Perl code
then that's what's happening. Both Prolog and Perl use the '.pl' file extension,
and without the file contents Pygments effectively flips a coin and half the
time guesses the former (it sorts candidates by their content heuristics, which
are both zero when there's no content to analyze).
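
As a rough illustration (not part of the patch itself), here's how the two
Pygments lookups behave for an ambiguous '.pl' file; the shebang line is what
lets guess_lexer_for_filename() pick the right lexer (the filename and sample
contents below are made up):

    from pygments.lexers import get_lexer_for_filename, guess_lexer_for_filename

    data = '#!/usr/bin/perl\nprint "hello world\\n";\n'

    # Filename only: both the Perl and Prolog lexers claim '*.pl', so the
    # result may be either of them.
    print(get_lexer_for_filename('example.pl'))

    # Filename plus contents: the perl shebang scores highly in the Perl
    # lexer's analyse_text(), so this should reliably pick Perl.
    print(guess_lexer_for_filename('example.pl', data))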

If there's only one possible option (for instance, a '.java' extension) then
this doesn't have any extra runtime cost. If there is ambiguity (such as Perl
vs Prolog) then it's an O(n) operation over the file contents. To avoid having
this (or the following pygments.highlight() call) take too long, we're imposing
a two second timeout.
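
As a minimal sketch of that timeout approach (the helper below is illustrative,
not the patch itself), the idea is to wrap the Pygments lookup in a
SIGALRM-based alarm and always clear the alarm afterward:

    import signal

    from pygments.lexers import get_lexer_for_filename, guess_lexer_for_filename
    from pygments.util import ClassNotFound

    class LexerTimeout(Exception):
        pass

    def _on_alarm(signum, frame):
        raise LexerTimeout()

    def guess_lexer_with_timeout(filename, data, timeout=2):
        old_handler = signal.signal(signal.SIGALRM, _on_alarm)
        signal.alarm(timeout)

        try:
            return guess_lexer_for_filename(filename, data)
        except (LexerTimeout, ClassNotFound):
            # Guessing timed out or matched nothing, so fall back to a
            # filename-only lookup.
            try:
                return get_lexer_for_filename(filename)
            except ClassNotFound:
                return None
        finally:
            # Clear the alarm and restore the previous handler so a stray
            # SIGALRM can't fire later in the request.
            signal.alarm(0)
            signal.signal(signal.SIGALRM, old_handler)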

We're exercising this change with a ReviewBoard 1.5 instance, and it has
slightly increased ReviewBoardDiffFragment latency (p50 rose from 0.1 to 0.2
seconds), but that's about it. IMHO this latency cost is well worth having a
far more readable diff (I've had syntax highlighting turned off for years
because it's distracting to read code incorrectly marked as being full of
syntax errors).

Note that this patch itself has not been tested against master - we're running
an identical change against RB 1.5.

 
AT
  1. Change pushed to...
    https://github.com/atagar/ReviewBoard/commit/71de26348ad16089e5fc4e5cc3fa9b445065fd5f
    
chipx86
  1. The reason we disabled this was that some file types (I recall XML files, particularly larger ones, being a key problem) take a *very* long time to guess. Those file types are what pushed us to turn guessing off.
    
    I'd appreciate some testing with XML files (ones with a couple hundred lines would be good) and Java files.
    1. Also, what I'd like before re-enabling this is to have a new field in the FileDiff model that allows us to store the type that's guessed, so that we only ever guess once per file, even if it falls out of cache.
    2. > The reason we disabled this was that some file types (I recall XML files, particularly larger ones, being a key problem)
      
      Special XML handling was a separate change (note the 'if filename.endswith(".xml"):' check). Was there another use case that prompted the complete removal of lexer guessing? I just rendered a 500 line Java file without any issue, and I can look around for some XML if you'd like, though it shouldn't make a difference since XML files hit the old code path.
      
      > Also, what I'd like before re-enabling this is to have a new field in the FileDiff model
      
      That would be a nice optimization, but I don't see why it needs to block this change. This has negligible impact on performance, and even in the worst case means hitting a two second timeout (which I have yet to see happen).
      
    3. It's possible that the real big issues were fixed in Pygments between when we initially disabled that and now.
      
      I'll give the code a look over in a bit and see about getting it into 1.6.10.
    4. I thought one of our major problems was large XML-ish files that didn't end with a .xml extension.
    5. That may have been it.
    6. > I thought one of our major problems was large XML-ish files that didn't end with a .xml extension.
      
      Pygments runs 'fnmatch.fnmatch(fn, filename)' to match various patterns (e.g. '*.py') against the base filename to narrow down the options. Then, after that, it uses the file contents to pick the best lexer. If the XML content analysis is what's taking too long then this would only arise with the following file extensions...
      *.xml, *.xsl, *.rss, *.xslt, *.xsd, and *.wsdl
      
      And the following other lexers...
      VelocityXmlLexer, XmlErbLexer, XmlPhpLexer, XmlSmartyLexer, XmlDjangoLexer, JspLexer, SspLexer, and XsltLexer
      
      Pygments caches looks_like_xml() results, though the regexes might still be slow the first time around...
      https://bitbucket.org/birkenfeld/pygments-main/src/5cc94956e233/pygments/util.py#cl-194
      
      If this concerns you then I'd suggest submitting a patch to pygments so looks_like_xml() can be disabled, and for now monkey patching the function to always return True. I still think that the cure of disabling code analysis is worse than the disease since it makes ReviewBoard's syntax highlighting for several very common programming languages worse than not having it at all, but up to you.
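
      For reference, the narrowing step mentioned above looks roughly like this (a simplified sketch, not Pygments' actual code; 'example.pl' is just an illustrative path):

          import fnmatch

          from pygments.lexers import get_all_lexers

          def matching_lexers(path):
              # Yield the lexers whose filename patterns (e.g. '*.pl') match
              # the given path. Content analysis only comes into play when
              # more than one lexer survives this step.
              for name, aliases, patterns, mimetypes in get_all_lexers():
                  if any(fnmatch.fnmatch(path, pattern) for pattern in patterns):
                      yield name

          print(list(matching_lexers('example.pl')))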
      
AT
  2. reviewboard/diffviewer/diffutils.py (Diff revision 1)

    Oops, minor mistake - lexer may be undefined or None at this point.
    1. Huh, when I try to press "Fixed" for this I get a '403 FORBIDDEN' response (from a UI perspective the buttons are simply grayed out after I click them). That sucks.
      
      I was able to resolve the issues that Yazan mentioned so maybe it's because I filed this one? I've tried it several times now but no luck.
      
    2. Oh I see, it was just taking a long while to register and further requests were getting a forbidden response. Makes sense.
      
ME
  1. Found a few issues that may need some attention.
  2. reviewboard/diffviewer/diffutils.py (Diff revision 1)

    Alphabetical order: Should be before the "subprocess" import.
  3. reviewboard/diffviewer/diffutils.py (Diff revision 1)

    Use same line for import.
    
    i.e.:
    
    from pygments.lexers import get_lexer_for_filename, guess_lexer_for_filename
    
    ref: http://www.python.org/dev/peps/pep-0008/#imports
  4. reviewboard/diffviewer/diffutils.py (Diff revision 1)

    No need for this empty line.
    1. Personally I think that it's more readable to have an empty line between multi-line comments and code. Changed comment to a single line where it belongs so we'll both be happy. :)
  5. reviewboard/diffviewer/diffutils.py (Diff revision 1)

    These two calls can raise pygments.util.ClassNotFound.
    
    I'm not sure if you want to handle that at this point.
    1. Ahhh, good point. Our caller will catch the ClassNotFound but I'm pretty sure that our alarm will then raise a TimeoutException at some random point further along (ick!). Disabling the alarm in a finally clause.
  6. reviewboard/diffviewer/diffutils.py (Diff revision 1)

    Not your code, I know, but this should probably use:
    
    if hasattr(lexer, "add_filter")
    
    Rather than try/except AttributeError.
    1. Good idea, changed.
AT
  1. Hi Chris, it has been seven months since I posted an update. Any thoughts?
    1. Hi Damian,
      
      We don't always get every change in for various reasons, and unfortunately sometimes changes don't really get on our plate due to just being overwhelmed by patches, feature requests, support, and just all the development tasks that we fit into the couple hours a night we have to do all that. I'm sorry for this.
      
      As far as this patch goes, I'm fine with lexer guessing, but I'm not comfortable with the signal code that's in there. Modifying global state like that is scary to me.
      
      We also would require caching the guessed type in the database so that this is only ever done once per file.
      
      And we'd need testing shown for this in the Testing Done. Ideally, with unit tests.
    2. Thanks Chris.
      
      > We don't always get every change in for various reasons, and unfortunately sometimes changes don't really get on our plate due to just being overwhelmed by patches, feature requests, support, and just all the development tasks that we fit into the couple hours a night we have to do all that. I'm sorry for this.
      
      No worries, I know the feeling. I've been trying for days to scrape together the time to get our ducks in a row for GSoC and still no luck. That said, the part about being 'overwhelmed by patches' makes me a little jealous - we'd love to have that problem over at Tor. ;)
      
      > As far as this patch goes, I'm fine with lexer guessing, but I'm not comfortable with the signal code that's in there. Modifying global state like that is scary to me.
      
      Makes sense. The global state actually hadn't crossed my mind. This is what I use on occasion in my own code for timeouts since it's the least-hacky option I could find...
      
      https://gitweb.torproject.org/stem.git/blob/HEAD:/stem/process.py#l109
      
      ... but with a bit more looking it seems that arbitrary function timeouts are simply something that no language can do safely...
      
      http://eli.thegreenplace.net/2011/08/22/how-not-to-set-a-timeout-on-a-computation-in-python/
      
      We can drop the timeout if you want. We've been running this patch with a ReviewBoard instance rivaling the scale of KDE for a year without issue. The timeouts are infrequent but they do happen, so we'll likely keep it in our instance (at a quick glance of the logs there have been ten in the last hour, ~0.09% of file requests, likely load related since a spot check of some of the files didn't turn up anything odd about them).
      
      > We also would require caching the guessed type in the database so that this is only ever done once per file.
      
      Good idea. The lexer guessing runtimes for our instances are...
      
      max: 1.526s
      min: 0.100s
      avg: 0.290s
      
      I'll propose this to my team, though honestly I doubt I'll be given the time to work on this any time soon.
      
      > And we'd need testing shown for this in the Testing Done. Ideally, with unit tests.
      
      As mentioned above we've been running with this for around a year without issue. I disagree that a Selenium test makes sense here - the only thing being 'tested' would be the pygments lexer guessing functionality, which would be better covered by a pygments test (... and no doubt has one).
      
    3. Ah, GSoC. Yeah, I didn't even include that :) We're doing GSoC as well, and a similar program called UCOSP, which is fantastic and takes place during school semesters. Worth looking into. 
      
      I didn't realize Tor's using Review Board. That's awesome. Can we list you guys on our Happy Users page?
      
      So, my hope is that a lot of the lengthy highlighting issues have been worked out. We've added heuristics to disable highlighting for certain conditions (really long lines, which covers minified files, and really long files, for instance). There's still a couple cases, but we'll fix those as well.
      
      For now, yeah, I'd say let's drop the timeout handling from this patch.
      
      As for the tests, I didn't mean Selenium. Our Selenium test suite is basically not in use anymore. Just standard unit tests. One I think would be good here is to feed get_chunks a file with an extension that would fail to be looked up without guessing, but succeeds when guessing. Another would be to just generally check that we have successfully stored the guessed lexer. Another still would be to stick the guessed lexer in the database and ensure that we don't end up re-guessing.
      
      (Maybe move apply_pygments out into its own function and just put the tests around that).
    4. > Ah, GSoC. Yeah, I didn't even include that :) We're doing GSoC as well
      
      Neat! Hopefully I'll meet you at the summit.
      
      > and a similar program called UCOSP, which is fantastic and takes place during school semesters. Worth looking into.
      
      Will do. OPW (the Outreach Program for Women) is another, which just started a couple years back. It has a less technical focus (more on documentation, video tutorials, etc) but is definitely worth a look.
      
      > I didn't realize Tor's using Review Board.
      
      Unfortunately we don't. There has been a little discussion about setting up a ReviewBoard instance (https://trac.torproject.org/6402) but our project-to-people ratio is horrible. Actually, we have more projects than developers, so code reviews very rarely happen in practice. :(
      
      Rather, I maintain a ReviewBoard instance at work. Though unfortunately they shy away from me talking about it (sorry!).
      
      > For now, yeah, I'd say let's drop the timeout handling from this patch.
      
      Sounds good. Are we gonna hold off on merging this until we have the database type caching?
      
      > As for the tests, I didn't mean Selenium. Our Selenium test suite is basically not in use anymore. Just standard unit tests.
      
      Oh. I checked for unit tests when you initially suggested it but I mostly just found SeleniumUnitTest. I know RBTools has unit tests but I'm not seeing much in ReviewBoard. Are they just getting started?
      
      ~/src/ReviewBoard% grep -R "import unittest" ./*
      ./contrib/tools/tests.py:import unittest
      ./reviewboard/diffviewer/tests.py:import unittest
      
    5. Yeah, I'd like to get the database caching in as part of this. I'd like to do what we can to reduce load times.
      
      We have a few hundred unit tests. Look for any file called tests.py. The reason the "import unittest" didn't show much was that we often inherit from Django's test classes.
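
      A rough sketch of the kind of unit test suggested above, assuming apply_pygments() gets pulled out of get_chunks() into its own function as proposed (the name, signature, and None-on-failure behaviour here are assumptions, not the current code):

          from django.test import TestCase

          from reviewboard.diffviewer.diffutils import apply_pygments

          class LexerGuessingTests(TestCase):
              def test_guess_lexer_from_contents(self):
                  """Testing that ambiguous extensions are resolved via file contents"""
                  data = '#!/usr/bin/perl\nprint "Hello, world!\\n";\n'

                  # '*.pl' matches both the Perl and Prolog lexers, so a
                  # filename-only lookup is unreliable; with content-based
                  # guessing enabled we expect highlighted output, not None.
                  self.assertNotEqual(apply_pygments(data, 'script.pl'), None)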
SW
  1. I have a one week stability test ongoing in that environment, so this log collection can be done this Friday.
AT
Review request changed
Status: Discarded