Normalize all named entities when parsing Markdown HTML.
Review Request #11059 — Created June 26, 2020 and submitted
When attempting to parse HTML rendered by Python Markdown, we would
crash any time an unknown named entity was encountered by
xml.dom.minidom. This has been found to happen when Python Markdown
processes an e-mail address containing a Unicode character that has a
known named HTML entity. Optimistically, Python Markdown will use the
named entity instead of a character reference (&#...;), but the XML
parser doesn't know that entity.

Overriding the XML parser is fragile and error-prone, so it's not worth
doing. Instead, we simply undo what Python Markdown does by applying a
regex to convert all known named entities to character references.
These are safe to feed to the XML parser, and give us the resulting
strings that we want.
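
As a rough illustration of the approach described above (not the code from this change), the conversion could be sketched as follows. The names `normalize_named_entities`, `NAMED_ENTITY_RE`, and `XML_PREDEFINED` are hypothetical, and leaving XML's five predefined entities untouched is an assumption made for this sketch.

```python
import re
from html.entities import name2codepoint
from xml.dom.minidom import parseString

# Assumption for this sketch: XML already defines these five entities,
# so they are left as-is rather than converted.
XML_PREDEFINED = {'amp', 'lt', 'gt', 'quot', 'apos'}

# Matches named entities such as &eacute; but not numeric ones like &#233;.
NAMED_ENTITY_RE = re.compile(r'&([A-Za-z][A-Za-z0-9]*);')


def normalize_named_entities(html):
    """Convert known named HTML entities to numeric character references."""
    def _replace(m):
        name = m.group(1)

        if name in XML_PREDEFINED or name not in name2codepoint:
            # Leave XML-safe and unrecognized entities untouched.
            return m.group(0)

        return '&#%d;' % name2codepoint[name]

    return NAMED_ENTITY_RE.sub(_replace, html)


# The normalized string can be handed to xml.dom.minidom without the
# parser tripping over an undefined entity.
doc = parseString(normalize_named_entities('<p>caf&eacute;</p>'))
text = doc.documentElement.firstChild.data
```

Entities that `name2codepoint` doesn't list are passed through unchanged in this sketch, so they would still cause a parse error.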
Unit tests pass.