post-review: UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 4931: ordinal not in range(128)

Review Request #1456 — Created March 5, 2010 and discarded — Latest diff uploaded

Information

RBTools

Reviewers

The issue is with:

   return content_type, content.encode('utf-8')

content contains the diffs for all the files in the change set.  It is already a byte string, so Python first decodes it implicitly with the default ascii codec before encoding it to utf-8, and chokes on the ï (i with diaeresis) in Loïc Minier:

  UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 54556: ordinal not in range(128)
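A minimal sketch of the failure, written for Python 3, where the decode that Python 2 performed implicitly has to be spelled out. The sample line and the helper name implicit_ascii_step are made up for illustration and are not part of post-review:

```python
# Python 3 sketch. In Python 2, calling .encode('utf-8') on a byte string
# first ran a hidden ASCII decode; here that step is written explicitly.

# Hypothetical sample: one diff line holding a UTF-8 encoded "ï" (0xc3 0xaf).
diff_bytes = "Author: Lo\u00efc Minier".encode("utf-8")


def implicit_ascii_step(data):
    """Mimic the hidden ASCII decode; return the error text, or None on success."""
    try:
        data.decode("ascii")
    except UnicodeDecodeError as exc:
        return str(exc)
    return None


error = implicit_ascii_step(diff_bytes)
print(error)
```

Pure ASCII input passes through this step untouched, which is why the problem only shows up on change sets that contain non-ASCII bytes.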

I forced the codec to utf-8 with an explicit .decode().  Python then choked on a 0xA0 byte (NBSP in Latin-1), which is not a valid utf-8 sequence on its own (note that vi does not display it correctly either).

  UnicodeDecodeError: 'utf8' codec can't decode byte 0xa0 in position 140283: unexpected code byte
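To illustrate why the lone 0xA0 byte is rejected, here is a small Python 3 sketch. The sample byte strings and the helper is_valid_utf8 are assumptions for the example, not code from the patch:

```python
# A lone 0xA0 byte is how Latin-1 stores NBSP. In UTF-8 the same character
# (U+00A0) is the two-byte sequence 0xC2 0xA0, so 0xA0 alone is invalid.
latin1_nbsp = b"foo\xa0bar"                  # hypothetical Latin-1 content
utf8_nbsp = "foo\u00a0bar".encode("utf-8")   # same text, properly UTF-8 encoded


def is_valid_utf8(data):
    """Return True if the byte string decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(is_valid_utf8(latin1_nbsp))
print(is_valid_utf8(utf8_nbsp))
```

A change set that mixes files in different encodings can therefore never be decoded strictly with a single codec.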

The solution is to pass 'replace' as the error handler so that invalid bytes are transformed into a replacement character (displayed as a ?) instead of raising an exception.  See sample below.

Note: since utf-8 is a superset of ascii, this change is backward compatible.
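A sketch of the approach in Python 3, assuming the fix amounts to decoding with the 'replace' error handler before re-encoding. The function name to_utf8_lossy and the sample bytes are hypothetical; the actual change is in the diff:

```python
def to_utf8_lossy(content):
    """Return UTF-8 bytes; undecodable bytes become U+FFFD instead of raising."""
    return content.decode("utf-8", "replace").encode("utf-8")


# Hypothetical mixed input: valid UTF-8 "ï" followed by a lone Latin-1 NBSP.
mixed = "Lo\u00efc".encode("utf-8") + b" \xa0 Minier"

result = to_utf8_lossy(mixed)
print(result)

# Plain ASCII is valid UTF-8, so it round-trips unchanged.
print(to_utf8_lossy(b"plain ascii diff"))
```

The valid ï survives intact, only the stray 0xA0 byte is substituted, and ASCII-only diffs come out byte for byte identical.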
Tried on sample diffs as shown in the screenshot.  Verified that the complete (and very large) diffs from the user could be uploaded on the staging server.
Created review on the production server using my local version of post-review.