Normalize all named entities when parsing Markdown HTML.

Review Request #11059 — Created June 26, 2020 and submitted

Information

Djblets
release-1.0.x

Reviewers

When attempting to parse HTML rendered by Python Markdown, we would
crash any time xml.dom.minidom encountered a named entity it didn't
recognize. This has been found to happen when Python Markdown processes
an e-mail address containing a Unicode character that has a named HTML
entity. Python Markdown will optimistically emit the named entity
instead of a numeric character reference (&#...;), but the XML parser
doesn't know about HTML's named entities.

Overriding the XML parser is fragile and error-prone, so it's not worth
doing. Instead, we simply undo what Python Markdown does by applying a
regex that converts all known named entities to numeric character
references. These are safe to feed to the XML parser, and give us the
resulting strings that we want.
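
The approach described above can be sketched roughly as follows. This is
a minimal illustration only, not the code from this change: the names
normalize_named_entities and NAMED_ENTITY_RE are hypothetical, and it
assumes Python 3's html.entities.name2codepoint as the table of known
named entities.

```python
import re
from html.entities import name2codepoint
from xml.dom.minidom import parseString

# Hypothetical pattern: matches things shaped like a named entity (&name;).
NAMED_ENTITY_RE = re.compile(r'&([A-Za-z][A-Za-z0-9]*);')


def normalize_named_entities(html):
    """Replace known named entities with numeric character references.

    Names that aren't in name2codepoint are left untouched.
    """
    def _replace(match):
        codepoint = name2codepoint.get(match.group(1))

        if codepoint is None:
            # Not a known HTML entity name. Leave it exactly as found.
            return match.group(0)

        return '&#%d;' % codepoint

    return NAMED_ENTITY_RE.sub(_replace, html)


# Illustrative input (not taken from the change): a named entity in an
# e-mail address, which minidom can't parse as-is.
rendered = '<p>user&ouml;@example.com</p>'
normalized = normalize_named_entities(rendered)

# The numeric reference is something the XML parser understands.
doc = parseString(normalized)
text = doc.documentElement.firstChild.data
```

Converting XML's own predefined entities (such as &amp;) to numeric
references is harmless here, since the parser resolves both forms to the
same character.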

Unit tests pass.

Summary ID
Normalize all named entities when parsing Markdown HTML.
d08b9a090acff89691dfa6f514d7026dd44a414b
david
  1. Ship It!
chipx86
Review request changed

Status: Closed (submitted)

Change Summary:

Pushed to release-1.0.x (c0fb79f)