post-review: UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 4931: ordinal not in range(128)

Review Request #1456 — Created March 5, 2010 and discarded




The issue is with:

   return content_type, content.encode('utf-8')

content contains the diffs for all the files in the change set.  Python defaults to an ascii codec when doing the conversion to utf-8, and chokes on the i-trema in Loïc Minier:

  UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 54556: ordinal not in range(128)

I forced the codec to utf-8 with an explicit .decode().  Python would then choke on the 0xA0 character (NBSP) as it is not in utf-8 format (note that vi does not display it correctly either).  

  UnicodeDecodeError: 'utf8' codec can't decode byte 0xa0 in position 140283: unexpected code byte

The solution is to use 'replace' so that invalid characters are transformed into a ?.  See sample below.

Note: since utf-8 is a superset of ascii, this change is backward compatible.
Tried on sample diffs as shown in the sceenshot.  Verified the complete (and very large) diffs from user could be uploaded on the staging server.
Created review on the production server using my local version of post-review.
  1. I don't really like transforming the diff. It wouldn't necessarily even apply.
    The encode problem is causing several issues. I would be much happier if we removed the encode('utf-8') and fixed the size computation issue separately. It's clearly breaking too much.
    1. I was not too concerned about modifying the diffs, as this is not the source code, but I see your point about applying.  (I was thinking about doing the same transformation on the base code before patching, but it may take us too far).
      Taking a step back, I agree we should remove the "encode('utf-8')".  Is there an issue raised for the size computation?  I guess you refer to 'Content-Length': str(len(body))
    2. A modified diff has to apply against the source code, and that could fail if transformed even a little bit.
      Right, the Content-Length computation. I'd really like to see the diff that broke that which we needed the encode('utf-8') for. For now, I'm just going to revert that change.
    3. Since the change in post-review is reverted, I'm discarding this review.