Enable detection of variable rename on Diff Viewer.

Review Request #1701 — Created July 3, 2010 and discarded

Review Board
Detection is only active if the user also decides to enable syntax
highlight. The rationale is that if he wants to pay for a time
consuming feature, he can also pay for the other and both features
depends of Pygments for tokenization.

That being said, the most time consuming part is the tokenization.
I did write a tokenizer by hand (based on regular expressions). but
I got caught up in several corner cases (such as the use of $ as a valid
token in javascript). Latter on, I just figured that it would be better to
use a more robust, albeit heavier, tokenizer to do that job.

As a guideline, renames should be one to one, AKA one token changing to another.
If there are multiple renames on the same line, it's a special case that shouldn't be
overlooked by the reviewer.

Also refactor lexer guessing out of apply_pygments so we have to guess it only once.
I'll be adding a test case here, for future automatic testing, but I did test it on several types of diffs.
  1. Awesome. Glad to see this coming together :)
    Can you post some screenshots demonstrating this?
    1. I missed your reviews! Now I have a reason for therapy again, hehehe.
  2. reviewboard/accounts/forms.py (Diff revision 1)
    Can you wrap this to 80 chars?
  3. reviewboard/diffviewer/diffutils.py (Diff revision 1)
    Blank line between these.
  4. reviewboard/diffviewer/diffutils.py (Diff revision 1)
    Blank line between these.
  5. reviewboard/diffviewer/diffutils.py (Diff revision 1)
    Can you default lexer to None?
    1. Is it really necessary? There is only one point in the code that calls this function, and it's always putting that parameter.
  6. reviewboard/diffviewer/diffutils.py (Diff revision 1)
    Blank line here.
  7. reviewboard/diffviewer/diffutils.py (Diff revision 1)
    "it isn't a rename."
  8. reviewboard/diffviewer/diffutils.py (Diff revision 1)
    Should use "is pygments.token.Name", since it's a class.
    Or, maybe we want to handle subclasses? I don't know if they even have those or how their lexer really works. Looking at token.py, I see Name, Name.Attribute, Name.Property, etc., which we would want to support, and I don't know how that maps.
    1. Looking at the code I found that we can deal with subclassing using the "in" operator. Just changed it.
  9. reviewboard/diffviewer/diffutils.py (Diff revision 1)
    This looks to be indented too far in.
  10. reviewboard/diffviewer/diffutils.py (Diff revision 1)
    This should be removed.
  11. reviewboard/diffviewer/diffutils.py (Diff revision 1)
    Does get_tokens_unprocessed parse the line? I assume so. In this case, do we really even need syntax highlighting enabled for this feature? It seems that even if it's disabled, we should be able to parse the line.
    It would be nice to be able to get to the original tokens used when syntax highlighting as a performance boost, but I suspect that that's tricky/impossible.
    1. That is really tricky. But I just found a way to do it, though it will increase memory use during the diff process, as a stream of tokens for the entire file can be huge. During highlight it uses generators, so memory consumption is not an issue, but if we were to store the token stream of the entire file for latter use that can become a problem.
      The idea behind this is that the cost of highlight and of rename detection are about the same, so if a user chooses to pay for one, it could pay for the other, even though they are not dependent on each other.
      I thought this wouldn't be a major issue, as we are only processing the changed lines (a subset of what gets highlighted).
      I'll post an update of the other corrections in this review and then we can discuss the tradeoffs of the token reuse.
  12. reviewboard/diffviewer/diffutils.py (Diff revision 1)
    We calculate (i2 - i1) twice now, so let's calculate it once and then compare against that.
  13. This should be indented 4 spaces from the $(
  14. This probably should too.
    And actually:
  15. Maybe "This file only contains lines with renamed functions, variables, or other content."
    I want to start making Review Board "sound" better. We're far from being there, but I think there's a lot we could do to make things sound more user-friendly and descriptive.
  2. reviewboard/accounts/forms.py (Diff revision 2)
    Instead of the parentheses, I'd just say "and variable rename detection" (or "identifier rename detection")
    It might also be worth somehow adding a note that there's a non-trivial performance overhead for enabling pygments-based features.
    1. Not sure how to phrase such a note. I think it's in the user's benefit to enable syntax highlight, and variable detection, even if it has an overhead.
      What do you think?
  3. reviewboard/diffviewer/diffutils.py (Diff revision 2)
    It doesn't look like you actually do anything with "rename" here?
    1. I do. I use it to check if the rename change result in the chunk. I renamed this variable to be clearer.
  4. reviewboard/diffviewer/diffutils.py (Diff revision 2)
    "chunck" -> "chunk"
Review request changed

Status: Discarded