Summary

Fix issues with multi-byte encodings in content sections.

Review Request #11714 — Created July 10, 2021 and submitted July 20, 2021, 2:43 p.m.

Information

Owner: chipx86
Repository: DiffX
Branch: master
Reviewers: diffx (group)

Description

The original implementations of the reader and writer didn't properly
handle multi-byte encodings, such as UTF-16 or UTF-32 strings.

The reason for this came down to newlines. We were looking for and
adding newlines based on their 8-bit representation, assuming that all
content would end in those, or that we could add them to the end of each
line while splitting. That assumption breaks for multi-byte encodings,
where a newline is encoded as two bytes (UTF-16) or four bytes (UTF-32),
so reading and writing such content would fail.
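
The failure mode can be reproduced in a few lines. This is a minimal
sketch, not code from DiffX: the sample text and the helper name
decodes_cleanly are made up for illustration, and it assumes Python 3.

```python
text = "first\nsecond\n"
data = text.encode("utf-16-le")


def decodes_cleanly(chunk, encoding="utf-16-le"):
    """Return True if the chunk is valid data for the encoding."""
    try:
        chunk.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


# Naive approach: treat the newline as the single byte 0x0A. Each newline
# is really two bytes here, so its trailing 0x00 is left at the front of
# the next chunk and the chunks no longer decode.
naive_parts = data.split(b"\n")
naive_ok = all(decodes_cleanly(part) for part in naive_parts)

# Encoding-aware approach: encode the newline the same way as the content.
newline = "\n".encode("utf-16-le")
aware_parts = [part.decode("utf-16-le") for part in data.split(newline)]
```

A plain byte search for the encoded newline can still match across the
boundary between two characters, so a robust reader also has to check
that each match sits on a code-unit boundary.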

A few things had to be done to address this:

1. Some design flaws in the text utilities, which made assumptions on
   lengths, offsets, and the presence of newline characters, had to be
   fixed.

2. We no longer store our newline constants as byte strings, but rather
   Unicode strings. They're then encoded as needed to ensure that
   they're in a compatible format. BOMs are stripped from these, to
   avoid further corruption of content.

3. The order in which we encode, decode, and process strings for content
   sections has changed a bit. We add on a missing newline before
   encoding, rather than after. We encode or decode newlines at the same
   time as the content strings.
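
The newline handling described above might look roughly like this. It is
a sketch under assumptions, not the DiffX implementation: the names
NEWLINE, encode_newline, and encode_content are hypothetical, and only
the standard codecs BOM constants are relied on.

```python
import codecs

# Newline constant kept as Unicode text (hypothetical name), not bytes.
NEWLINE = "\n"

# UTF-32 BOMs are checked first, because BOM_UTF32_LE begins with the
# same two bytes as BOM_UTF16_LE.
_BOMS = (
    codecs.BOM_UTF32_LE,
    codecs.BOM_UTF32_BE,
    codecs.BOM_UTF8,
    codecs.BOM_UTF16_LE,
    codecs.BOM_UTF16_BE,
)


def encode_newline(encoding):
    """Return the newline encoded for `encoding`, without any BOM."""
    data = NEWLINE.encode(encoding)

    for bom in _BOMS:
        if data.startswith(bom):
            return data[len(bom):]

    return data


def encode_content(text, encoding):
    """Add a missing trailing newline, then encode the whole string."""
    if not text.endswith(NEWLINE):
        text += NEWLINE

    return text.encode(encoding)
```

With these helpers, encode_newline("utf-16-le") gives b"\n\x00", while
encode_newline("utf-16") gives the two newline bytes in the machine's
native byte order with the BOM removed. Appending the newline before
encoding means it always goes through the same codec as the content.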

Unit tests were updated to check different multi-byte encodings for all
the content sections, making sure they are written and parsed correctly.
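
A round-trip check in the spirit of these tests could be sketched as
below. The helper split_encoded_lines and the sample lines are
hypothetical stand-ins rather than the actual DiffX reader or test
cases, and the sketch assumes Python 3 and BOM-less encodings. The
second sample line is chosen so that its UTF-16-LE and UTF-32-LE bytes
contain a misaligned newline byte sequence.

```python
ENCODINGS = ("utf-8", "utf-16-le", "utf-16-be", "utf-32-le", "utf-32-be")


def split_encoded_lines(data, encoding):
    """Split encoded bytes on newlines found at code-unit boundaries."""
    newline = "\n".encode(encoding)
    unit = len(newline)  # 1, 2, or 4 bytes for the encodings above
    lines = []
    start = 0
    pos = data.find(newline)

    while pos != -1:
        if pos % unit == 0:
            # Aligned match: this is a real newline.
            lines.append(data[start:pos].decode(encoding))
            start = pos + unit
            pos = data.find(newline, start)
        else:
            # Misaligned match spanning two characters: keep searching.
            pos = data.find(newline, pos + 1)

    if start < len(data):
        # Trailing content without a final newline.
        lines.append(data[start:].decode(encoding))

    return lines


sample_lines = ["plain", "\u0a41\u4200", "caf\u00e9"]

for encoding in ENCODINGS:
    encoded = "".join(line + "\n" for line in sample_lines).encode(encoding)
    assert split_encoded_lines(encoded, encoding) == sample_lines
```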

Testing Done

All unit tests pass on Python 2 and 3.

Commits

Summary	ID
Fix issues with multi-byte encodings in content sections.	13fdb6033d6fe51870fceda32efafbbc1e9fabe6

Issues

Description	From	Last Updated
E101 indentation contains mixed spaces and tabs	reviewbot	July 10, 2021, 4:34 p.m.
W191 indentation contains tabs	reviewbot	July 10, 2021, 4:34 p.m.
E131 continuation line unaligned for hanging indent	reviewbot	July 10, 2021, 4:34 p.m.
E101 indentation contains mixed spaces and tabs	reviewbot	July 10, 2021, 4:35 p.m.
E101 indentation contains mixed spaces and tabs	reviewbot	July 10, 2021, 4:35 p.m.
W191 indentation contains tabs	reviewbot	July 10, 2021, 4:35 p.m.
E131 continuation line unaligned for hanging indent	reviewbot	July 10, 2021, 4:35 p.m.
E101 indentation contains mixed spaces and tabs	reviewbot	July 10, 2021, 4:35 p.m.
E101 indentation contains mixed spaces and tabs	reviewbot	July 10, 2021, 4:35 p.m.
W191 indentation contains tabs	reviewbot	July 10, 2021, 4:35 p.m.
E131 continuation line unaligned for hanging indent	reviewbot	July 10, 2021, 4:35 p.m.
E101 indentation contains mixed spaces and tabs	reviewbot	July 10, 2021, 4:35 p.m.
E101 indentation contains mixed spaces and tabs	reviewbot	July 10, 2021, 4:35 p.m.
W191 indentation contains tabs	reviewbot	July 10, 2021, 4:35 p.m.
E131 continuation line unaligned for hanging indent	reviewbot	July 10, 2021, 4:35 p.m.
E101 indentation contains mixed spaces and tabs	reviewbot	July 10, 2021, 4:35 p.m.
F821 undefined name 'line_endings'	reviewbot	July 10, 2021, 4:35 p.m.
F821 undefined name 'newline'	reviewbot	July 10, 2021, 4:35 p.m.
E501 line too long (80 > 79 characters)	reviewbot	July 10, 2021, 4:37 p.m.
E501 line too long (81 > 79 characters)	reviewbot	July 10, 2021, 4:37 p.m.
F821 undefined name 'newline'	reviewbot	July 10, 2021, 4:37 p.m.
F821 undefined name 'line_endings'	reviewbot	July 10, 2021, 4:37 p.m.

flake8 failed.

JSHint passed.

flake8

File	Issue	Status
python/diffx/tests/test_reader.py (Diff revision 1)	E101 indentation contains mixed spaces and tabs	Resolved
python/diffx/tests/test_reader.py (Diff revision 1)	W191 indentation contains tabs	Resolved
python/diffx/tests/test_reader.py (Diff revision 1)	E131 continuation line unaligned for hanging indent	Resolved
python/diffx/tests/test_reader.py (Diff revision 1)	E101 indentation contains mixed spaces and tabs	Resolved
python/diffx/tests/test_reader.py (Diff revision 1)	E101 indentation contains mixed spaces and tabs	Resolved
python/diffx/tests/test_reader.py (Diff revision 1)	W191 indentation contains tabs	Resolved
python/diffx/tests/test_reader.py (Diff revision 1)	E131 continuation line unaligned for hanging indent	Resolved
python/diffx/tests/test_reader.py (Diff revision 1)	E101 indentation contains mixed spaces and tabs	Resolved
python/diffx/tests/test_writer.py (Diff revision 1)	E101 indentation contains mixed spaces and tabs	Resolved
python/diffx/tests/test_writer.py (Diff revision 1)	W191 indentation contains tabs	Resolved
python/diffx/tests/test_writer.py (Diff revision 1)	E131 continuation line unaligned for hanging indent	Resolved
python/diffx/tests/test_writer.py (Diff revision 1)	E101 indentation contains mixed spaces and tabs	Resolved
python/diffx/tests/test_writer.py (Diff revision 1)	E101 indentation contains mixed spaces and tabs	Resolved
python/diffx/tests/test_writer.py (Diff revision 1)	W191 indentation contains tabs	Resolved
python/diffx/tests/test_writer.py (Diff revision 1)	E131 continuation line unaligned for hanging indent	Resolved
python/diffx/tests/test_writer.py (Diff revision 1)	E101 indentation contains mixed spaces and tabs	Resolved
python/diffx/utils/text.py (Diff revision 1)	F821 undefined name 'line_endings'	Resolved
python/diffx/utils/text.py (Diff revision 1)	F821 undefined name 'newline'	Resolved

Change Summary:

Fixed tabs that snuck in from some copy/pastes while building strings for unit tests.

Commits:

	Summary	ID
	Fix issues with multi-byte encodings in content sections.	665a0e757004b20d03716a5743cef3e941575c8c
	Fix issues with multi-byte encodings in content sections.	f68b4e116f265909781f1fcca153f335955aaf09

Diff:

Revision 2 (+872 -160)

	python/diffx/reader.py
	python/diffx/writer.py
	python/diffx/tests/test_reader.py
	python/diffx/tests/test_writer.py
	python/diffx/tests/testcases.py
	python/diffx/utils/text.py

Checks run (1 failed, 1 succeeded)

flake8 failed.

JSHint passed.

flake8

File	Issue	Status
python/diffx/tests/test_reader.py (Diff revision 2)	E501 line too long (80 > 79 characters)	Resolved
python/diffx/tests/test_reader.py (Diff revision 2)	E501 line too long (81 > 79 characters)	Resolved
python/diffx/utils/text.py (Diff revision 2)	F821 undefined name 'newline'	Resolved
python/diffx/utils/text.py (Diff revision 2)	F821 undefined name 'line_endings'	Resolved

Change Summary:

Fixed some long lines and a dead line of code.

Commits:

	Summary	ID
	Fix issues with multi-byte encodings in content sections.	f68b4e116f265909781f1fcca153f335955aaf09
	Fix issues with multi-byte encodings in content sections.	01e00fd8de9311f367ea7d7580c756c1b1987319

Diff:

Revision 3 (+872 -160)

	python/diffx/reader.py
	python/diffx/writer.py
	python/diffx/tests/test_reader.py
	python/diffx/tests/test_writer.py
	python/diffx/tests/testcases.py
	python/diffx/utils/text.py

Checks run (2 succeeded)

flake8 passed.

JSHint passed.

Change Summary:

Fixed an overzealous tab-to-spaces replacement in the unit tests. The affected test cases now use the \t escape instead of a literal tab character.

Commits:

	Summary	ID
	Fix issues with multi-byte encodings in content sections.	01e00fd8de9311f367ea7d7580c756c1b1987319
	Fix issues with multi-byte encodings in content sections.	13fdb6033d6fe51870fceda32efafbbc1e9fabe6

Diff:

Revision 4 (+868 -160)

	python/diffx/reader.py
	python/diffx/writer.py
	python/diffx/tests/test_reader.py
	python/diffx/tests/test_writer.py
	python/diffx/tests/testcases.py
	python/diffx/utils/text.py

Checks run (2 succeeded)

flake8 passed.

JSHint passed.

Ship it!

Status: Completed
Change Summary: Pushed to master (cec15fe)
