Fix issues with multi-byte encodings in content sections.

Review Request #11714 — Created July 10, 2021 and submitted

chipx86
DiffX
master
diffx

The original implementations of the reader and writer didn't properly
handle multi-byte encodings, such as UTF-16 or UTF-32 strings.

The reason for this came down to newlines. We were looking for and
adding newlines based on their single-byte representation, assuming that
all content would end in that byte, or that we could append it to the end
of each line while splitting. That assumption doesn't hold for multi-byte
encodings, where a newline is encoded as two or four bytes, so reading
and writing such content would fail.
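
As a rough illustration of the failure mode (a minimal sketch, not code from this change; the `can_decode` helper and the variable names are made up for the example), splitting UTF-16 data on the single byte `b'\n'` leaves half of each encoded newline behind:

```python
text = u'line 1\nline 2\n'
data = text.encode('utf-16-le')

# A UTF-16-LE newline is the two bytes 0x0A 0x00, so the content does not
# end in the single byte b'\n'.
ends_in_byte_newline = data.endswith(b'\n')
ends_in_encoded_newline = data.endswith(u'\n'.encode('utf-16-le'))

# Splitting on the single byte leaves the 0x00 half of each newline at the
# start of the next chunk, so every later chunk is misaligned.
chunks = data.split(b'\n')


def can_decode(chunk, encoding='utf-16-le'):
    """Return whether a chunk decodes cleanly (illustrative helper)."""
    try:
        chunk.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


decodable = [can_decode(chunk) for chunk in chunks]
```

Only the first chunk still decodes here. The rest begin with a stray `0x00` byte and have an odd length, which is the kind of breakage described above.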

A few things had to be done to address this:

  1. Some design flaws in the text utilities, which made assumptions about
    lengths, offsets, and the presence of newline characters, had to be
    fixed.

  2. We no longer store our newline constants as byte strings, but rather
    as Unicode strings. They're then encoded as needed to ensure that
    they're in a compatible format. BOMs are stripped from these, to
    avoid further corruption of content.

  3. The order in which we encode, decode, and process strings for content
    sections has changed a bit. We add on a missing newline before
    encoding, rather than after. We encode or decode newlines at the same
    time as the content strings.
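
Points 2 and 3 could be sketched roughly as follows. This is an assumption-laden illustration rather than the code in this change: `BOMS`, `strip_bom`, `encode_newline`, and `encode_content` are hypothetical names, and the real utilities may be structured differently.

```python
import codecs

# UTF-32 BOMs are checked before UTF-16 ones, because the UTF-32-LE BOM
# begins with the same two bytes as the UTF-16-LE BOM.
BOMS = (
    codecs.BOM_UTF8,
    codecs.BOM_UTF32_LE,
    codecs.BOM_UTF32_BE,
    codecs.BOM_UTF16_LE,
    codecs.BOM_UTF16_BE,
)


def strip_bom(data):
    """Return data without a leading byte order mark, if it has one."""
    for bom in BOMS:
        if data.startswith(bom):
            return data[len(bom):]

    return data


def encode_newline(newline, encoding):
    """Encode a Unicode newline for use inside already-encoded content.

    Encodings such as 'utf-16' prepend a BOM to every encoded string, so
    it is stripped to avoid planting a BOM in the middle of the content.
    """
    return strip_bom(newline.encode(encoding))


def encode_content(text, encoding, newline=u'\n'):
    """Add a missing trailing newline, then encode everything at once."""
    if not text.endswith(newline):
        text += newline

    return text.encode(encoding)
```

Under these assumptions, `encode_newline(u'\n', 'utf-16-le')` yields the two-byte sequence `b'\n\x00'`, and `encode_content(u'abc', 'utf-16-le')` ends with that same sequence because the newline was appended while the content was still a Unicode string.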

Unit tests were updated to check different multi-byte encodings for all
the content sections, making sure they are written and parsed correctly.

All unit tests pass on Python 2 and 3.

Description                                              From
E101 indentation contains mixed spaces and tabs          reviewbot
W191 indentation contains tabs                           reviewbot
E131 continuation line unaligned for hanging indent      reviewbot
E101 indentation contains mixed spaces and tabs          reviewbot
E101 indentation contains mixed spaces and tabs          reviewbot
W191 indentation contains tabs                           reviewbot
E131 continuation line unaligned for hanging indent      reviewbot
E101 indentation contains mixed spaces and tabs          reviewbot
E101 indentation contains mixed spaces and tabs          reviewbot
W191 indentation contains tabs                           reviewbot
E131 continuation line unaligned for hanging indent      reviewbot
E101 indentation contains mixed spaces and tabs          reviewbot
E101 indentation contains mixed spaces and tabs          reviewbot
W191 indentation contains tabs                           reviewbot
E131 continuation line unaligned for hanging indent      reviewbot
E101 indentation contains mixed spaces and tabs          reviewbot
F821 undefined name 'line_endings'                       reviewbot
F821 undefined name 'newline'                            reviewbot
E501 line too long (80 > 79 characters)                  reviewbot
E501 line too long (81 > 79 characters)                  reviewbot
F821 undefined name 'newline'                            reviewbot
F821 undefined name 'line_endings'                       reviewbot

Checks run (1 failed, 1 succeeded)
flake8 failed.
JSHint passed.


chipx86
Review request changed

Change Summary:

Fixed tabs that snuck in from some copy/pastes while building strings for unit tests.

Commits:

Summary
- Fix issues with multi-byte encodings in content sections.
+ Fix issues with multi-byte encodings in content sections.

Diff:

Revision 2 (+872 -160)


Checks run (1 failed, 1 succeeded)

flake8 failed.
JSHint passed.


chipx86
david
  Ship It!
chipx86
Review request changed

Status: Closed (submitted)

Change Summary:

Pushed to master (cec15fe)