    Fix issues with multi-byte encodings in content sections.

    Review Request #11714 — Created July 10, 2021 and submitted

    Information

    Repository: DiffX
    Branch: master

    The original implementations of the reader and writer didn't properly
    handle multi-byte encodings, such as UTF-16 or UTF-32 strings.

    The reason for this came down to newlines. We were looking for and
    adding newlines based on their 8-bit representation, assuming that all
    content would end in those, or that we could add them to the end of each
    line while splitting. This didn't work for multi-byte encodings, where a
    newline is encoded as 2 bytes (UTF-16) or 4 bytes (UTF-32), and would
    result in failures when reading or writing such content.
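
The problem can be illustrated with a few lines of plain Python. This is only a sketch of the failure mode described above, not code from the change itself, and the variable names are made up for the example.

```python
# Illustration only: why an 8-bit newline breaks multi-byte encodings.
text = u'abc\ndef\n'
encoded = text.encode('utf-16-le')

# In UTF-16-LE, each of these characters takes two bytes, so the
# newline is b'\n\x00' rather than b'\n'.
newline_8bit = b'\n'
newline_utf16 = u'\n'.encode('utf-16-le')

# The encoded content does not end with the 8-bit newline, so a check
# like this would wrongly conclude that a newline is missing.
ends_with_8bit = encoded.endswith(newline_8bit)
ends_with_utf16 = encoded.endswith(newline_utf16)

# Splitting on the 8-bit newline cuts a 2-byte code unit in half. The
# second half (a NUL byte) ends up at the start of the next chunk,
# which then has an odd number of bytes and is no longer valid UTF-16.
naive_lines = encoded.split(newline_8bit)
```

Here `ends_with_8bit` is false even though the text ends in a newline, and the second entry of `naive_lines` has an odd length.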

    A few things had to be done to address this:

    1. Some design flaws in the text utilities, which made assumptions about
      lengths, offsets, and the presence of newline characters, had to be
      fixed.

    2. We no longer store our newline constants as byte strings, but rather
      Unicode strings. They're then encoded as needed to ensure that
      they're in a compatible format. BOMs are stripped from these, to
      avoid further corruption of content.

    3. The order in which we encode, decode, and process strings for content
      sections has changed a bit. We add on a missing newline before
      encoding, rather than after. We encode or decode newlines at the same
      time as the content strings.
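
Points 2 and 3 can be sketched roughly as follows. This is a minimal sketch under assumptions, not the code from this change: `NEWLINE`, `encode_newline()`, and `encode_content()` are hypothetical names chosen for illustration. Only the `codecs` BOM constants come from the Python standard library.

```python
import codecs

# Hypothetical constant: the newline is kept as a Unicode string and
# only encoded when needed.
NEWLINE = u'\n'

# BOMs an encoder may prepend (for example, the generic 'utf-16' codec).
# UTF-32 BOMs are checked first because the UTF-32-LE BOM begins with
# the same two bytes as the UTF-16-LE BOM.
_BOMS = (
    codecs.BOM_UTF32_LE,
    codecs.BOM_UTF32_BE,
    codecs.BOM_UTF16_LE,
    codecs.BOM_UTF16_BE,
    codecs.BOM_UTF8,
)


def encode_newline(encoding, newline=NEWLINE):
    """Return the newline encoded in ``encoding``, without any BOM."""
    data = newline.encode(encoding)

    for bom in _BOMS:
        if data.startswith(bom):
            return data[len(bom):]

    return data


def encode_content(text, encoding, newline=NEWLINE):
    """Add a missing trailing newline, then encode the whole string."""
    if not text.endswith(newline):
        # The newline is added while the content is still Unicode, so
        # it is encoded together with the rest of the content.
        text += newline

    return text.encode(encoding)
```

With this approach, `encode_newline('utf-16')` yields a 2-byte newline with no BOM in front of it, and `encode_content(u'abc', 'utf-16-le')` ends in the same bytes that `encode_newline('utf-16-le')` returns.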

    Unit tests were updated to check different multi-byte encodings for all
    the content sections, making sure they are written and parsed correctly.

    All unit tests pass on Python 2 and 3.

    Summary ID
    Fix issues with multi-byte encodings in content sections.
    13fdb6033d6fe51870fceda32efafbbc1e9fabe6
    Description                                             From

    E101 indentation contains mixed spaces and tabs         reviewbot
    W191 indentation contains tabs                          reviewbot
    E131 continuation line unaligned for hanging indent     reviewbot
    E101 indentation contains mixed spaces and tabs         reviewbot
    E101 indentation contains mixed spaces and tabs         reviewbot
    W191 indentation contains tabs                          reviewbot
    E131 continuation line unaligned for hanging indent     reviewbot
    E101 indentation contains mixed spaces and tabs         reviewbot
    E101 indentation contains mixed spaces and tabs         reviewbot
    W191 indentation contains tabs                          reviewbot
    E131 continuation line unaligned for hanging indent     reviewbot
    E101 indentation contains mixed spaces and tabs         reviewbot
    E101 indentation contains mixed spaces and tabs         reviewbot
    W191 indentation contains tabs                          reviewbot
    E131 continuation line unaligned for hanging indent     reviewbot
    E101 indentation contains mixed spaces and tabs         reviewbot
    F821 undefined name 'line_endings'                      reviewbot
    F821 undefined name 'newline'                           reviewbot
    E501 line too long (80 > 79 characters)                 reviewbot
    E501 line too long (81 > 79 characters)                 reviewbot
    F821 undefined name 'newline'                           reviewbot
    F821 undefined name 'line_endings'                      reviewbot

    Checks run (1 failed, 1 succeeded)
    flake8 failed.
    JSHint passed.


    chipx86
    Review request changed
    Change Summary:

    Fixed tabs that snuck in from some copy/pastes while building strings for unit tests.

    Commits:
    Summary ID
    Fix issues with multi-byte encodings in content sections.
    665a0e757004b20d03716a5743cef3e941575c8c
    Fix issues with multi-byte encodings in content sections.
    f68b4e116f265909781f1fcca153f335955aaf09

    Checks run (1 failed, 1 succeeded)

    flake8 failed.
    JSHint passed.


    chipx86
    chipx86
    david
    Ship It!
    chipx86
    Review request changed
    Status:
    Completed
    Change Summary:
    Pushed to master (cec15fe)