Refactor encode_multipart_formdata and tests to be easier to read and maintain

Review Request #9653 — Created Feb. 16, 2018 and discarded

Information

RBTools
master

Reviewers

Refactor encode_multipart_formdata and tests to be easier to read and maintain.

Will update with the remaining fixes.

Description From Last Updated

This should be part of the same import group.

chipx86chipx86

Blank line between statements/blocks and other blocks.

chipx86chipx86

Docstrings for functions must meet a standard format described at https://www.notion.so/reviewboard/Writing-Codebase-Documentation-e16312b5f061437cb73cbfa369ac3cb5

chipx86chipx86

BytesIO() was the correct way to go. Concatenating strings in Python is slow and bad for garbage collection. You almost …

chipx86chipx86

Blank line between statements and comments.

chipx86chipx86

Blank line between these. There's more in the file. Can you go through and fix those up? If there's others …

chipx86chipx86

We always want to use explicit encodings in calls to encode() and decode().

chipx86chipx86

We use % instead of .format().

chipx86chipx86

This is not Python 2.7-compatible. All code must work on Python 2.7 as well, and we want to see testing …

chipx86chipx86

This is not efficient. This should be reverted.

chipx86chipx86
solarmist
chipx86
  1. Going through this, I noticed that you're introducing code that's specifically for Python 3. We won't be switching exclusively to Python 3.

    For every single change, we're going to want to see testing for Python 2.7, 3.4, 3.5, and 3.6, both manual and unit tests. I know the unit tests may not run on Python 3.x versions as of certain points in the change, but every single change must be able to be landed without the changes that follow it without breaking anything in RBTools for Python 2.7.

    1. That was an issue in my env. Python2 was somehow incorrectly linked to python3.6.

  2. rbtools/api/request.py (Diff revision 1)
     
     
     
     
    Show all issues

    This should be part of the same import group.

  3. rbtools/api/request.py (Diff revision 1)
     
     
     
    Show all issues

    Blank line between statements/blocks and other blocks.

  4. rbtools/api/request.py (Diff revision 1)
     
     
     
     
     
     
     
     
     
     
     
     
     
    Show all issues

    Docstrings for functions must meet a standard format described at https://www.notion.so/reviewboard/Writing-Codebase-Documentation-e16312b5f061437cb73cbfa369ac3cb5

  5. rbtools/api/request.py (Diff revision 1)
     
     
     
     
     
     
     
     
     
     
     
     
     
     
     
     
     
     
     
     
     
     
     
     
     
     
     
     
     
     
    Show all issues

    BytesIO() was the correct way to go. Concatenating strings in Python is slow and bad for garbage collection. You almost never want to concatenate strings if you're going to do more than a few. Instead, you always want to use something like BytesIO(), StringIO(), or add them to a list and join them back into a string.

    1. I'm dropping this as a moot point. As a cli tool that makes a network call any string processing done will be minscule compared to a single network call.

    2. The problem has to do when working with large amounts of data. File attachment uploads, for instance, will take a lot more memory with string concatenation, resulting in multiple copies of the file attachment data in memory at one time, just due to the way that Python works.

      Say we upload two files that's 5MB and 10MB in size, respectively. During the loop in which we set the first file, we'll end up with a resulting string containing all the previous form-data content and the 5MB file. The next string concatenation will allocate another string containing the size of the original string + the new content, meaning another ~5MB string in memory, alongside the original. The reference then gets set to the new one, and the old one is flagged for garbage collection. We now have ~10MB in memory. The next bit of data that gets added will allocate another ~5MB string, giving us ~15MB in memory. And then ~20MB. When we start building the payload for the new file, that begins to grow further.

      At some point during this, garbage collection is likely to kick in and start freeing up older strings, but given the sizes, this isn't exactly instantaneous, and it still means the process has allocated much more memory than needed.

      Going the BytesIO route, we have an efficient way of storing this data without all the extra allocations. It's faster and more memory-efficient.

      Because of that, the original mechanism for building these payloads must remain.

      Also, while not as big a deal, it's probably not worth defining a new function inline. This has to be generated every time we call encode_multipart_formdata, and I don't think it's needed.

      I'd really like to use the original code for this function, converted for BytesIO, instead of the new mechanism. The old method is well-tested.

  6. rbtools/api/request.py (Diff revision 1)
     
     
     
    Show all issues

    Blank line between statements and comments.

  7. rbtools/api/request.py (Diff revision 1)
     
     
     
    Show all issues

    Blank line between these.

    There's more in the file. Can you go through and fix those up? If there's others like this in other files, those will also need to be fixed.

  8. rbtools/api/request.py (Diff revision 1)
     
     
    Show all issues

    We always want to use explicit encodings in calls to encode() and decode().

  9. rbtools/api/request.py (Diff revision 1)
     
     
    Show all issues

    We use % instead of .format().

    1. That's fine, but what's the reasoning behind that?

      As noted in the docs https://docs.python.org/3/library/stdtypes.html#string-formatting

      Note The formatting operations described here exhibit a variety of quirks that lead to a number of common errors (such as failing to display 
      tuples and dictionaries correctly). Using the newer formatted string literals or the str.format() interface helps avoid these errors. These 
      alternatives also provide more powerful, flexible and extensible approaches to formatting text.
      

      And from python 3.6 on .format() with f-strings is a much better choice for most applications.

    2. It's just what we do. Partially for historical reasons, partially because .format() is just more wordy. I know this is debatable, but it's how we format strings.

      We can't rely on Python 3.6. It won't work with Python 2.7. We're likely to support Python 2.7 in RBTools long past the end-of-life date, as we're still trying to get companies off of Python 2.4 (enterprises tend to move very slow, and we have support contracts with them mandating support).

    3. Fair enough. I'll see what I can do, but I was specifically having an issue with printf style formatting for the basic auth header and maintaining compatibility between python 2 and 3.

    4. I can help with that. There's some gotchas, definitely. If you can show me what data's going into it and what you're getting out, I can help figure out what's going on. Also, what version of Python 3 are you testing with? (Ultimately we're probably going to start with 3.4 through 3.6 compatibility.)

  10. rbtools/api/request.py (Diff revision 1)
     
     
    Show all issues

    This is not Python 2.7-compatible.

    All code must work on Python 2.7 as well, and we want to see testing that assures this.

    1. A bug in my venv linked to python2 incorrectly when I installed the python 3 venv first.

      It's fixed now.

  11. rbtools/api/request.py (Diff revision 1)
     
     
    Show all issues

    This is not efficient. This should be reverted.

    1. Also dropping this as a moot point. I could do this a million times and still have time to spare compared to a single network call.

      Also this was directly re-used from the more recent versions of the standard libraries where this prevents python2/3 edge cases that cause a literal 'b' to be formatted into the string.

    2. That's true, but it's how we work with strings. We want to keep consistency, because it helps avoid errors with new changes, and we also don't want changes to regress logic without needing to. I'm going to ask to revert the string building to use the format string here.

    3. As I mentioned in the previous comment I'll see what I can do, but I adopted this code after trying several variations of printf formatting and getting string type labels 'u' in py2 or 'b' in py3.

  12. 
      
solarmist
solarmist
Review request changed
Status:
Discarded