Summary

Normalize all named entities when parsing Markdown HTML.

Review Request #11059 — Created June 26, 2020 and submitted June 29, 2020, 8:48 p.m. — Latest diff uploaded June 26, 2020, 5:21 p.m.

Information

Owner

chipx86

Repository

Djblets

Branch

release-1.0.x

Bugs

Depends On

Reviewers

Groups

djblets

People

Description

When parsing HTML rendered by Python Markdown, we would crash any time
xml.dom.minidom encountered a named entity it didn't recognize. This has
been found to happen when Python Markdown processes an e-mail address
containing a Unicode character that has a known named HTML entity.
Python Markdown optimistically emits the named entity instead of a
numeric character reference (&#...;), but the XML parser doesn't know
about HTML's named entities.

Overriding the XML parser is fragile and error-prone, so it's not worth
doing. Instead, we simply undo what Python Markdown does by applying a
regex that converts all known named entities to character references.
These are safe to feed to the XML parser, and give us the resulting
strings that we want.
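
The approach described above can be sketched roughly as follows. This is
an illustration only, not the code from this change: the function name
normalize_named_entities, the regex, and the XML_PREDEFINED set are
hypothetical, and it assumes Python 3's html.entities.name2codepoint as
the table of known entities.

```python
import re
from html.entities import name2codepoint
from xml.dom.minidom import parseString

# Entities predefined by XML itself. The XML parser already understands
# these, so they're left alone. (Hypothetical helper for this sketch.)
XML_PREDEFINED = {'amp', 'lt', 'gt', 'quot', 'apos'}

# Matches anything shaped like a named entity, such as "&eacute;".
NAMED_ENTITY_RE = re.compile(r'&([A-Za-z][A-Za-z0-9]*);')


def normalize_named_entities(html):
    """Convert known named HTML entities to numeric character references.

    Names that aren't in the known entity table are left untouched.
    """
    def _replace(match):
        name = match.group(1)

        if name in XML_PREDEFINED or name not in name2codepoint:
            return match.group(0)

        return '&#%d;' % name2codepoint[name]

    return NAMED_ENTITY_RE.sub(_replace, html)


# Usage: normalize first, then hand the result to the XML parser.
html = '<p>caf&eacute;@example.com</p>'
doc = parseString(normalize_named_entities(html))
text = doc.documentElement.firstChild.data
```

Note that in this sketch, a name missing from the entity table is passed
through unchanged, so it would still be rejected by the XML parser.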

Testing Done

Unit tests pass.
