Summary

Add a streaming parser for DiffX files.

Review Request #11712 — Created July 7, 2021 and submitted July 20, 2021, 2:43 p.m.

Information

Owner

chipx86

Repository

DiffX

Branch

master

Bugs

Depends On

Reviewers

Groups

diffx

People

Description

This introduces diffx.reader.DiffXReader, a streaming parser for the
DiffX file format. It is a low-level interface that can read from a
file stream (such as a local file, HTTP response, or memory-backed
stream), parsing and returning each section of the DiffX file according
to the specification.

The parser acts as a generator, and will provide dictionaries containing
information on each section of the DiffX file. This includes options,
metadata or text/diff content, the section ID, section type, and section
hierarchy level. Consumers can build upon this to work with the data as
it comes in, or to easily convert it into another representation.
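
As a rough illustration of the generator-of-dictionaries shape described above, here is a minimal, self-contained sketch. It is not the DiffXReader implementation: the `iter_sections` function, the dictionary keys, and the simplified header handling are all assumptions made for this example, and it performs none of the validation the real parser does.

```python
import io


def iter_sections(fp):
    """Yield one dict per section from a DiffX-like byte stream.

    Hypothetical helper for illustration only; not part of diffx.reader.
    """
    while True:
        line = fp.readline()

        if not line:
            return  # End of stream.

        line = line.rstrip(b'\r\n')

        if not line:
            continue  # Skip blank lines between sections.

        # A header is assumed to look like "#..meta: format=json, length=12".
        marker, _, raw_options = line.partition(b':')
        options = {}

        for pair in raw_options.split(b','):
            pair = pair.strip()

            if pair:
                key, _, value = pair.partition(b'=')
                options[key.decode('ascii')] = value.decode('ascii')

        section = {
            'section': marker.lstrip(b'#.').decode('ascii'),
            'level': marker.count(b'.'),  # Number of dots after "#".
            'options': options,
        }

        # Content sections are assumed to carry their size in "length".
        if 'length' in options:
            section['content'] = fp.read(int(options['length']))

        yield section


stream = io.BytesIO(
    b'#diffx: encoding=utf-8, version=1.0\n'
    b'#.change:\n'
    b'#..preamble: length=6\n'
    b'Hello\n'
    b'#..file:\n'
    b'#...diff: length=10\n'
    b'-old\n+new\n'
)
sections = list(iter_sections(stream))
```

Because each dictionary is yielded as soon as its section has been read, a consumer could just as well loop over `iter_sections(stream)` directly and never hold more than one section in memory.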

As a streaming parser, this does not keep much state around between
sections. This makes it ideal for working with very large DiffX files,
or reading from a stream that may incrementally provide new content.
However, it also means that a parsing error may not manifest until later
in the stream, after the consumer has already handled some content. An
object-based implementation is in the works that will address this.
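
This trade-off applies to any generator-based parser. The sketch below uses a hypothetical `parse_lengths` generator (a stand-in, not the DiffX parser) that fails on the third record. A consumer that acts on each item as it arrives has already handled two records by the time the error surfaces; a consumer that needs all-or-nothing behavior can drain the generator into a list first, at the cost of holding everything in memory.

```python
def parse_lengths(lines):
    """Yield each line as an int; raise ValueError on a malformed line.

    Hypothetical stand-in for a streaming parser.
    """
    for number, line in enumerate(lines, start=1):
        if not line.isdigit():
            raise ValueError('line %d: invalid length %r' % (number, line))

        yield int(line)


records = ['10', '20', 'oops', '40']

# Streaming consumer: work happens before the error is seen.
handled = []
error = None

try:
    for value in parse_lengths(records):
        handled.append(value)
except ValueError as e:
    error = str(e)

# Validating consumer: nothing is handled unless everything parses.
try:
    validated = list(parse_lengths(records))
except ValueError:
    validated = []
```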

As this is a reference implementation, the parser strictly enforces the
specification's rules for header structure, characters in header
options, lengths, and valid option values. It does tolerate different
newline formats and ignores extra newlines before or after sections
(including at the beginning or end of the file).
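
To make the strictness concrete, the following sketch validates a single section header with a regular expression while accepting either LF or CRLF line endings. The pattern, the `parse_header` function, and the exact character rules are assumptions for illustration; the specification and diffx.reader remain the authority on what is actually allowed.

```python
import re

# Assumed header shape: "#", zero or more ".", a lowercase section name,
# ":", then optional "key=value" pairs separated by ", ", then a newline.
HEADER_RE = re.compile(
    br'^#(?P<level>\.*)(?P<name>[a-z]+):'
    br'(?: (?P<options>[A-Za-z0-9_.-]+=[A-Za-z0-9_.-]+'
    br'(?:, [A-Za-z0-9_.-]+=[A-Za-z0-9_.-]+)*))?'
    br'(?:\r\n|\n)\Z'
)


def parse_header(line):
    """Return (level, name, options) or raise ValueError.

    Hypothetical helper; not the function used by DiffXReader.
    """
    m = HEADER_RE.match(line)

    if m is None:
        raise ValueError('malformed section header: %r' % (line,))

    options = {}

    if m.group('options'):
        for pair in m.group('options').split(b', '):
            key, value = pair.split(b'=')
            options[key.decode('ascii')] = value.decode('ascii')

    return len(m.group('level')), m.group('name').decode('ascii'), options
```

With this pattern, `b'#..meta: format=json, length=12\r\n'` parses cleanly, while a header with a trailing space or a missing space after a comma is rejected rather than silently repaired.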

At this stage, DiffXReader should be usable for any production use of
the DiffX format. An accompanying DiffXWriter, and implementations
representing DiffX through an object model, are in the works.

Testing Done

Unit tests pass on Python 2 and 3.

Put the reader through its paces by testing a bunch of valid and invalid
DiffX files generated by hand and by the upcoming DiffXWriter.

Commits

Summary	ID
Add a streaming parser for DiffX files.	c6d5c1090e22fba90f0a63753dc6ff935983e1b7

Issues

Description	From	Last Updated
E501 line too long (81 > 79 characters)	reviewbot	July 7, 2021, 5:42 p.m.
E302 expected 2 blank lines, found 1	reviewbot	July 9, 2021, 6:03 p.m.
These attributes don't exist. Should these be Section.MAIN, etc?	david	July 20, 2021, 1:32 a.m.
It's just one element, but maybe do this before the list comprehension? Seems weird to process first and then throw …	david	July 20, 2021, 1:32 a.m.

Checks run (1 failed, 1 succeeded)

flake8 failed.

JSHint passed.

flake8

python/diffx/tests/test_reader.py (Diff revision 1)
The issue has been resolved.
```
E501 line too long (81 > 79 characters)
```

Change Summary:


Split off all section definitions into a new diffx.sections module.
Fixed a line length issue.
Fixed the type description for the fp argument in the DiffXReader constructor.

Commits:

	Summary	ID
	Add a streaming parser for DiffX files.	3fcf7ffe99cb7e5ffc57859ab8a1885a1c141d61
	Add a streaming parser for DiffX files.	713961b42b3530abeb748e37913e0e87e507b24b

Diff:

Revision 2 (+3894)

	python/setup.py
	python/diffx/errors.py
	python/diffx/reader.py
	python/diffx/sections.py
	python/diffx/tests/test_reader.py
	python/diffx/utils/__init__.py
	python/diffx/utils/text.py

Checks run (2 succeeded)

flake8 passed.

JSHint passed.

Change Summary:

Removed an extra blank line.

Commits:

	Summary	ID
	Add a streaming parser for DiffX files.	713961b42b3530abeb748e37913e0e87e507b24b
	Add a streaming parser for DiffX files.	cea01bd63d8449d4435e43ffa7d77d5dd0d243b5

Diff:

Revision 3 (+3892)

	python/setup.py
	python/diffx/errors.py
	python/diffx/reader.py
	python/diffx/sections.py
	python/diffx/tests/test_reader.py
	python/diffx/utils/__init__.py
	python/diffx/utils/text.py

Checks run (1 failed, 1 succeeded)

flake8 failed.

JSHint passed.

flake8

python/diffx/sections.py (Diff revision 3)
The issue has been resolved.
```
E302 expected 2 blank lines, found 1
```

Change Summary:

Added a missing unicode_literals import.

Commits:

	Summary	ID
	Add a streaming parser for DiffX files.	cea01bd63d8449d4435e43ffa7d77d5dd0d243b5
	Add a streaming parser for DiffX files.	7ef5816a9e0fff74a44308ec9802def1577bd2ec

Diff:

Revision 4 (+3898)

	python/setup.py
	python/diffx/errors.py
	python/diffx/reader.py
	python/diffx/sections.py
	python/diffx/tests/test_reader.py
	python/diffx/utils/__init__.py
	python/diffx/utils/text.py

Checks run (2 succeeded)

flake8 passed.

JSHint passed.

python/diffx/reader.py (Diff revision 4)

The issue has been resolved.

These attributes don't exist. Should these be Section.MAIN, etc?

chipx86 July 12, 2021, 3:54 p.m.

Yep. Got a fix in my tree. I meant to update the diff.

python/diffx/utils/text.py (Diff revision 4)

The issue has been dropped.

It's just one element, but maybe do this before the list comprehension? Seems weird to process first and then throw away data second.

chipx86 July 12, 2021, 3:54 p.m.

This function gets some big enough changes in /r/11714 (the current version has some faults to it), so I'll investigate doing it in that change.

Change Summary:


Fixed references to section IDs in the docstrings.
Added a missing description of the diff field in section results.

Commits:

	Summary	ID
	Add a streaming parser for DiffX files.	7ef5816a9e0fff74a44308ec9802def1577bd2ec
	Add a streaming parser for DiffX files.	c6d5c1090e22fba90f0a63753dc6ff935983e1b7

Diff:

Revision 5 (+3912)

	python/setup.py
	python/diffx/errors.py
	python/diffx/reader.py
	python/diffx/sections.py
	python/diffx/tests/test_reader.py
	python/diffx/utils/__init__.py
	python/diffx/utils/text.py

Checks run (2 succeeded)

flake8 passed.

JSHint passed.

Ship it!

Status: Completed
Change Summary: Pushed to master (13eaec0)