Implement Thumbnail Rendering for Text-ish File Attachments (.rst and .md)

Review Request #3454 — Created Oct. 24, 2012 and submitted

slchen
Review Board
master
reviewboard
Implemented thumbnail rendering for ReStructuredText (.rst) 
and MarkDown (.md) file attachment types. The rendered 
thumbnails are displayed as HTML scaled down by css (instead
of rendered as image thumbs), and are mem cached. 

Also changed the way MimetypeHandlers are registered to match
existing infrastructure for registering Review UIs, replacing
the old way of doing it through __subclass__
1. Ran unit tests for reviewboard.attachments.tests:

- Updated tests now all pass after adding setUp() and tearDown()
  to register / unregister the test Mimetype Handlers.

2. Manual testing on localhost:

Tested by uploading to a new review-request:
- raw .txt file
- raw .rst file
- raw .md file
- raw .jpg file
- raw .rst file with javascript injection
- raw .md file with javascript injection

Visual inspection of the rendered thumbnails all pass: 
- Thumbnails appear immediately (without a refresh), reflecting
  latest feature updates to master.
- Scaling, cropping and rendering all appear to be correct.
- Javascript injection correctly escaped for malicious
  .rst and .md files

See updated screenshots (2012-11-24, in the files attachment section)
Loading file attachments...

  • 0
  • 0
  • 46
  • 4
  • 50
Description From Last Updated
  1. 
      
  2. reviewboard/attachments/mimetypes.py (Diff revision 1)
     
     
    Can also write this as to make it a bit shorter:
    
    if extension.lower() == ".rst"
  3. 
      
david
  1. Can you add some screenshots?
  2. 
      
chipx86
  1. 
      
  2. reviewboard/attachments/mimetypes.py (Diff revision 4)
     
     
     
     
     
     
    imports are a bit wrong.
    
    The froms should all be together and the imports should all be together. They also need to be alphabetical.
    
    djblets is a third-party dependency (as far as RB is concerned), os it should be with the others above. No blank line.
    
    The docutils import should probably just do an import of docutils.core and docutils.io. Importing core and io will make code harder to read. The names are just too common.
  3. reviewboard/attachments/mimetypes.py (Diff revision 4)
     
     
     
     
     
     
    This is probably not the right place to do this.
    
    We should probably define a new map below MIMETYPE_ICON_ALIASES that maps file extensions to mimetypes. Then we can do:
    
    extension = os.path.splitext(attachment.filename)[1].lower()
    
    if extension in MIMETYPE_EXTENSIONS:
        mimetype = MIMETYPE_EXTENSIONS[extension]
    
    
    And that should be above the original parse_mime_type call.
  4. reviewboard/attachments/mimetypes.py (Diff revision 4)
     
     
     
     
    No blank line.
  5. reviewboard/attachments/mimetypes.py (Diff revision 4)
     
     
    I'm not sure we want to escape this here?
    
    A better name would be "data".
  6. reviewboard/attachments/mimetypes.py (Diff revision 4)
     
     
     
     
     
     
    Should use format strings, and no camel casing. You can also combine with the below.
    
    return mark_safe('<div class="file-thumbnail-clipped">%s%s%s</div>'
                     % (rst_parts['title'], ......)
    
    What happens if we have a document without those bits? Can that ever happen?
    1. Going with rst_parts['html_body'] which will always have that bit
  7. reviewboard/attachments/mimetypes.py (Diff revision 4)
     
     
     
     
     
     
     
     
     
     
    Same comments from above apply.
  8. reviewboard/attachments/mimetypes.py (Diff revision 4)
     
     
    Can you limit the number of lines actually shown? Maybe do splitlines()[:5] on the result.
    1. Limiting to 20 on the rationale that:
      a) 20 lines is enough to render sufficient content to fill the thumbnails if we need to change the scaling ratio in the future, and
      b) 20 lines is few enough such that there's only negligible efficiency difference between say 5 lines and 20 lines 
      
  9. reviewboard/static/rb/css/reviews.less (Diff revision 4)
     
     
     
     
     
    Should be:
    
    h1, h2, body, p {
      font-size: 30%;
    }
    1. changed to * { font-size: 60% } to scale everything down in the thumbnail area.
  10. 
      
chipx86
  1. Think we're just about there! There's some minor style stuff (mostly where blank lines should go, and some statements that could be easily combined).
  2. reviewboard/attachments/mimetypes.py (Diff revision 5)
     
     
     
     
    Two blank lines.
  3. reviewboard/attachments/mimetypes.py (Diff revision 5)
     
     
     
    This comment should be made more generic, since this is no longer just about .rst files.
  4. reviewboard/attachments/mimetypes.py (Diff revision 5)
     
     
     
    Blank line between these.
  5. reviewboard/attachments/mimetypes.py (Diff revision 5)
     
     
     
    You can just combine these two lines.
  6. reviewboard/attachments/mimetypes.py (Diff revision 5)
     
     
     
     
     
    Combine these too.
  7. reviewboard/attachments/mimetypes.py (Diff revision 5)
     
     
     
     
     
     
     
     
    Same comments as above.
  8. reviewboard/attachments/mimetypes.py (Diff revision 5)
     
     
     
     
    Two blank lines.
  9. reviewboard/static/rb/css/reviews.less (Diff revision 5)
     
     
     
     
    No blank line here.
  10. reviewboard/static/rb/css/reviews.less (Diff revision 5)
     
     
     
    Blank line.
  11. reviewboard/static/rb/css/reviews.less (Diff revision 5)
     
     
     
     
    So the issue with this is that it'll affect units for margins, padding, etc. of anything contained within. I'm concerned about that. I'd much rather we be explicit about the types we care about for now.
  12. 
      
  1. 
      
  2. reviewboard/attachments/mimetypes.py (Diff revision 6)
     
     
    I wonder, what's the trade off between capping the max_length while reading the file in and reading the whole file then parsing it?
    1. There is no trade off as I can see: capping the max_length while reading the file would be strictly better. +1 for pointing that out =) open an issue here and I'll fix it in the next update.
  3. reviewboard/attachments/mimetypes.py (Diff revision 6)
     
     
     
     
     
    Just curious, do you know when will the efficiency difference be not negligible when running functions like this? What will happen if it loads a very large troll file with no line break at all (or one at the very end)?
    1. I think that's a design decision based on the following 2 questions:
      - What is the likelihood that someone uploads a legitimate file with few lines, but each line having thousands (or more) of character columns? (In this case, imposing a limit on the length of each line would be undesired behaviour)
      - What is the likelihood that a user of Reviewboard uploads a very large troll file? (I think this has fairly low likelihood: I can't think of a scenario where a malicious agent is able to plant a large file on the local system of a Reviewboard user with the intent for it to be uploaded as a file attachment to a review-request. I suppose it might be done as a prank upon discovery of the processing logic here in RB. However, in the event that it does happen, this would have worst case performance of a normal large uncropped text-ish file with many lines: possibly several seconds of UI delay)
    2. It's entirely possible that someone could attempt to take down a server by having a line that large. It's also entirely possible it'd be accidental due to some auto-generated file. This happens in diffs. We should attempt to crop.
  4. reviewboard/attachments/mimetypes.py (Diff revision 6)
     
     
    Do you think it would be better if 200 is a variable instead? (like max_length) Should it be the same length for all text files supported?
    1. I was debating this, actually. Magic numbers are evil, but I pondered if implementing a variable for just these 2 places would be overkill and actually reduce code readability ("Hrm, what's max_length in this context?"). Thoughts, mentors?
    2. I'd prefer a variable.
  5. 
      
  1. 
      
  2. reviewboard/attachments/mimetypes.py (Diff revision 6)
     
     
     
     
    Should be moved above the 'from'. The imports should be grouped together and from statements should be grouped together (as Christian pointed out earlier)
    1. Not an issue -- spoke to Sampson about it. It's ordered by precedence:
      
      1st group: Python specific
      2nd group: 3rd party (including django, djblets)
      3rd group: RB specific
      
      This issue can be ignored and closed.
  3. 
      
chipx86
  1. So I'm noticing a lot of this code is just duplicated. We should really do something about that. Here's what I propose.
    
    TextMimetype says "Here's how to handle a text mimetype." That should really be the basis.
    
    The ReST and Markdown ones are text renderers that know specialized types of text files. So they should really subclass TextMimetype.
    
    TextMimetype should have a function that, given a blob of text, will returned the appropriate HTML. By default, it'd just return the text, escaped. The other handlers could override to do their thing.
    
    One thing that might trip you up. Right now, mimetype class finding is using __subclasses__ to get the mimetype, which we really don't want long-term, and fails here. So that will need to basically do what ReviewUIs do, with the register/unregister functions.
    
    I'd recommend making a separate change before this goes in that does exactly that. Removes the __subclasses__ iteration and instead introduces a _registered_mimetype_handlers list that it iterates. Then (un)register_mimetype_handler functions that manipulate. And something just like reviews/ui/__init__.py in attachments/__init__.py to do the registration. You should be able to just do that separately (in a new branch off master, without these changes), test it with text mimetypes, and then post for review.
    
    You can also move this branch on top of that one until it goes in, so you can consolidate these classes and use the new registration functions.
  2. reviewboard/attachments/mimetypes.py (Diff revision 9)
     
     
     
     
    This is all done below. It should only be returned once.
    
    However.. "file is closed" is pretty useless to all users. They'll just go, "what?"
    
    This is an error. It should be logged, with enough context for us to know what failed and where, and the default thumbnail should be used instead.
  3. reviewboard/attachments/mimetypes.py (Diff revision 9)
     
     
    That doesn't really describe what the function does.
    
    """Returns the HTML for a thumbnail preview for a ReST file."""
    
    would be more what we'd want to see.
    
    Imagine you were viewing API docs, and only had function names and summaries. Write what you'd need to be able to easily find the appropriate function for your needs.
  4. reviewboard/attachments/mimetypes.py (Diff revision 9)
     
     
    Not really a useful comment. That's pretty clear from the code.
  5. reviewboard/attachments/mimetypes.py (Diff revision 9)
     
     
    self.FILE_CROP_CHAR_LIMIT
  6. reviewboard/attachments/mimetypes.py (Diff revision 9)
     
     
     
     
     
     
     
    This should all be returned only once.
    
    Same comment as above about logging and a default.
  7. reviewboard/attachments/mimetypes.py (Diff revision 9)
     
     
    Same comments for this file.
  8. 
      
chipx86
  1. 
      
  2. reviewboard/attachments/__init__.py (Diff revision 10)
     
     
     
     
     
     
     
    You should import all these in one go, not with different "from ..." lines.
  3. reviewboard/attachments/__init__.py (Diff revision 10)
     
     
     
     
    Two blank lines.
  4. reviewboard/attachments/__init__.py (Diff revision 10)
     
     
     
     
     
     
    It'd be nice to keep this alphabetical.
    
    Is the MimetypeHandler registration needed?
    1. Regarding MimetypeHandler: from looking at the code itself, the class currently seems to act as a fail-safe default in case no good match can be found for the mimetype. Its description seems to corroborate: "This class also acts as a generic handler for mimetypes not matched explicitly by any handler. Note that this is not the same as '*/*'."
      
      I think it's safer overall to leave it in; thoughts?
  5. reviewboard/attachments/mimetypes.py (Diff revision 10)
     
     
    Can you prefix with a _ ? That indicates it's not a public function.
    
    Same with all others introduced that aren't part of the ReviewUI API or something that needs to be overridden in a subclass.
    
    Also, this function should be more specifically named. As it is, I have no idea what it's a cache key for, and in theory, we might want multiple keys. As far as I can tell, it's only used in one place, so many just keep that logic there unless a subclass has to override this.
  6. reviewboard/attachments/mimetypes.py (Diff revision 10)
     
     
    This doesn't need to go into specifics on how it's built, and probably shouldn't. Just say what it does.
    
    Also, "&" shouldn't be used, and it should terminate in a period.
  7. reviewboard/attachments/mimetypes.py (Diff revision 10)
     
     
    It's a bit.. verbose. How about just generate_preview_html?
  8. reviewboard/attachments/mimetypes.py (Diff revision 10)
     
     
     
    Should probably be on the class itself.
  9. reviewboard/attachments/mimetypes.py (Diff revision 10)
     
     
    Use splitlines.
  10. reviewboard/attachments/mimetypes.py (Diff revision 10)
     
     
     
    One statement.
  11. 
      
david
  1. 
      
  2. reviewboard/attachments/mimetypes.py (Diff revision 13)
     
     
     
    Can we call these TEXT_CROP_NUM_LINES and TEXT_CROP_LINE_LENGTH? I think it would be a little more clear.
  3. reviewboard/attachments/mimetypes.py (Diff revision 13)
     
     
    Whitespace
  4. reviewboard/attachments/mimetypes.py (Diff revision 13)
     
     
    This comment really doesn't add much.
  5. reviewboard/attachments/mimetypes.py (Diff revision 13)
     
     
    Can you say "file extensions"? I had a moment of confusion where I thought "reviewboard extensions"
  6. reviewboard/settings.py (Diff revision 13)
     
     
     
     
     
    Is this really something that we expect to ever want to change/override? It seems like it would be better to just inline it in the code where it's used.
    1. That's what Django looks for when using the restructuredtext filter, which does some security checks for us.
    2. (That is, that'll be used if my recommendations are applied.)
  7. 
      
chipx86
  1. 
      
  2. reviewboard/attachments/mimetypes.py (Diff revision 13)
     
     
    We always import from django.conf.settings. It should be:
    
    from django.conf import settings
    
    However, you don't want this. See my comment below.
  3. reviewboard/attachments/mimetypes.py (Diff revision 13)
     
     
    This doesn't really fit our other logs. Instead, do:
    
    "Failed to read from file attachment %s: %s' % (self.attachment.pk, e)
    
    Also note that you don't need to do str(...), since %s does that for you.
  4. reviewboard/attachments/mimetypes.py (Diff revision 13)
     
     
    You should be able to just provide self._generate_thumbnail as the function without a lambda.
  5. reviewboard/attachments/mimetypes.py (Diff revision 13)
     
     
     
     
     
    Rather than call this yourself, I think you should probably use Django's restructuredtext filter. You can import it from django.contrib.markup.templatetags.
    
    That'll reference RESTRUCTUREDTEXT_FILTER_SETTINGS itself, and do all the other right things.
    
    One thing though is that that will return parts["fragment"], not parts["html_body"]. I'm curious what that looks like and how it's different.
    1. See issue below with markdown and the comments about discussion regarding this with David.
  6. reviewboard/attachments/mimetypes.py (Diff revision 13)
     
     
     
    You never need to supply \ when it's in parens.
    
    Also, you have some indentation problems.
  7. reviewboard/attachments/mimetypes.py (Diff revision 13)
     
     
     
    Indentation problem.
  8. reviewboard/attachments/mimetypes.py (Diff revision 13)
     
     
     
    This should absolutely use Django's markdown filter (importable from the same place as above). That does all sorts of very important security checks that we do not. Certain versions of markdown have different vulnerabilities, and Django has a lot of work to check for those. We want this.
    1. I wasn't sure exactly how to approach this so I talked about it with David in IRC. After discussion, I think the best solution we decided on was to simply re-implement the markdown version checks in mimetypes.py (See IRC log Nov. 23rd - 18:00:00 - 19:00:00 for additional details). 
  9. reviewboard/settings.py (Diff revision 13)
     
     
     
     
    Should be:
    
    BLAHBLAH = {
        'file_insertion_enabled': 0,
        ...
    }
    
    There should be a trailing comma on all entries.
    
    I also think this should be up higher, out of the way. Maybe right under WEB_API_ENCODERS.
  10. 
      
chipx86
  1. 
      
  2. reviewboard/attachments/mimetypes.py (Diff revision 14)
     
     
     
    Given that we're going to call publish_parts ourselves, we should just inline these settings and force them, rather than have them in the settings file.
  3. reviewboard/attachments/mimetypes.py (Diff revision 14)
     
     
     
     
    What we probably want to do is just take the very latest version of markdown that's out right now and just make that a required dependency in setup.py. Django has to be more flexible because it doesn't depend on setup.py, but we can ensure we have exactly what we're expecting. That'll let us get rid of all these version checks.
  4. 
      
david
  1. 
      
  2. reviewboard/attachments/mimetypes.py (Diff revision 15)
     
     
     
     
     
    The formatting here could use some improvement. I'd also like to pull the settings dict inline (it's only used in one place):
    
    Also, in the settings dict, should the 0s and 1s be False and True?
    
    settings = {
        'file_insertion_enabled': 0,
        'raw_enabled': 0,
        '_disable_config': 1,
    }
    return docutils.core.publish_parts(
        source=smart_str(data_string),
        writer_name='html4css1',
        settings_overrides=settings)['html_body']
    1. Fixed style issues; 
      
      On the docutils docs page regarding the same input sanitation issue, they used 0's and 1's: http://docutils.sourceforge.net/docs/howto/security.html Should I leave it as it is?
    2. It looks like in the code that uses things like file_insertion_enabled, it's checking truthiness, so I think we should use True/False (assuming it works).
    3. Changed to True/False and did a quick test with both malicious and non-malicious .rst file attachment - it works =)
  3. 
      
david
  1. Ship It!
    1. Actually, I was about to push this change, but it breaks a couple tests:
      
      ======================================================================
      FAIL: Testing matching of factory method for mimetype handlers.
      ----------------------------------------------------------------------
      Traceback (most recent call last):
        File "/Users/david/Projects/reviewboard/reviewboard/attachments/tests.py", line 83, in test_handler_factory
          self.assertEqual(self._handler_for("test/abc"), TestAbcMimetype)
      AssertionError: None != <class 'reviewboard.attachments.tests.TestAbcMimetype'>
      
      ======================================================================
      FAIL: Testing precedence of factory method for mimetype handlers.
      ----------------------------------------------------------------------
      Traceback (most recent call last):
        File "/Users/david/Projects/reviewboard/reviewboard/attachments/tests.py", line 95, in test_handler_factory_precedence
          self.assertEqual(self._handler_for("test2/def"), StarDefMimetype)
      AssertionError: None != <class 'reviewboard.attachments.tests.StarDefMimetype'>
      
      ----------------------------------------------------------------------
      Ran 575 tests in 138.740s
      
      FAILED (SKIP=35, failures=2)
    2. Fixed!
      
      ...
      255 static files copied.
      Creating test database for alias 'default'...
      Testing uploading a file attachment. ... ok
      Testing matching of factory method for mimetype handlers. ... ok
      Testing precedence of factory method for mimetype handlers. ... ok
      
      ----------------------------------------------------------------------
      Ran 3 tests in 0.251s
      
      OK
      Destroying test database for alias 'default'...
      
  2. 
      
david
  1. Ship It!
  2. 
      
Review request changed

Status: Closed (submitted)

Change Summary:

Pushed to release-1.7.x (b898885). Thanks!
  1. 
      
  2. reviewboard/attachments/mimetypes.py (Diff revision 18)
     
     
     
    This is probably too late since this is already merged in but "enable_attributes" default to False when safe_mode is enabled.
    
    Docs: http://packages.python.org/Markdown/reference.html#enable_attributes
    1. I actually looked into this exact issue at the time. The reference pointed to by ChipX86 here: http://security.stackexchange.com/questions/14664/how-do-i-use-markdown-securely says the recommended usage is to set enable_attributes=False explicitly because of Markdown versioning issues. While atm we make markdown>=2.1 a required dependency and it sets enabled_attributes=False by default when setting safe_mode=True, I thought it was safer to just set enable_attributes=False explicitly anyway in case that changes in future Markdown versions.
  3. 
      
Loading...