Summary

Normalize all named entities when parsing Markdown HTML.

Review Request #11059 — Created June 26, 2020 and submitted June 29, 2020, 8:48 p.m.

Information

Owner

chipx86

Repository

Djblets

Branch

release-1.0.x

Bugs

Depends On

Reviewers

Groups

djblets

People

Description

When attempting to parse HTML rendered by Python Markdown, we would
crash any time xml.dom.minidom encountered a named entity it didn't
recognize. This has been found to happen when Python Markdown processes
an e-mail address containing a Unicode character that has a named HTML
entity. Python Markdown optimistically emits the named entity instead
of a numeric character reference (&#...;), but the XML parser doesn't
know about HTML's named entities.

Overriding the XML parser is fragile and error-prone, so it's not worth
doing. Instead, we simply undo what Python Markdown does by applying a
regex to convert all known named entities to character references.
These are safe to feed to the XML parser, and give us the resulting
strings that we want.
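
For illustration, the approach described above could look roughly like the following. This is a minimal sketch, not the code in the actual commit: the names `normalize_named_entities`, `NAMED_ENTITY_RE`, and `XML_ENTITIES` are hypothetical, and leaving XML's five predefined entities untouched is an assumption made here for simplicity. It relies only on the standard library's `html.entities.name2codepoint` table.

```python
import re
from html.entities import name2codepoint
from xml.dom.minidom import parseString

# Hypothetical names for this sketch; not taken from the Djblets source.
# XML predefines only these five named entities, so they can stay as-is.
XML_ENTITIES = {'amp', 'lt', 'gt', 'quot', 'apos'}

# Matches a named entity such as "&eacute;" and captures the name.
NAMED_ENTITY_RE = re.compile(r'&([A-Za-z][A-Za-z0-9]*);')


def normalize_named_entities(html):
    """Convert known HTML named entities to numeric character references."""
    def _replace(match):
        name = match.group(1)

        # Leave XML's own entities and anything unrecognized untouched.
        if name in XML_ENTITIES or name not in name2codepoint:
            return match.group(0)

        return '&#%d;' % name2codepoint[name]

    return NAMED_ENTITY_RE.sub(_replace, html)


# Feeding '<p>caf&eacute;</p>' directly to parseString() would fail with
# an undefined-entity error; the normalized form parses cleanly.
doc = parseString(normalize_named_entities('<p>caf&eacute;</p>'))
text = doc.documentElement.firstChild.data
```

Numeric character references are part of XML itself, so the parser needs no extra entity definitions once the named forms have been rewritten.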

Testing Done

Unit tests pass.

Commits

Summary	ID
Normalize all named entities when parsing Markdown HTML.	d08b9a090acff89691dfa6f514d7026dd44a414b

flake8 passed.

JSHint passed.

Ship it!

Status: Completed
Change Summary: Pushed to release-1.0.x (c0fb79f)