I don't really like transforming the diff. It wouldn't necessarily even apply. The encode problem is causing several issues. I would be much happier if we removed the encode('utf-8') and fixed the size computation issue separately. It's clearly breaking too much.
post-review: UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 4931: ordinal not in range(128)
Review Request #1456 — Created March 5, 2010 and discarded
The issue is with:

    return content_type, content.encode('utf-8')

content contains the diffs for all the files in the change set. It is a byte string, so Python first decodes it implicitly with the ascii codec before encoding it to utf-8, and chokes on the i-tréma (ï) in Loïc Minier:

    UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 54556: ordinal not in range(128)

I forced the codec to utf-8 with an explicit .decode(). Python then choked on the 0xA0 byte (NBSP in Latin-1), which is not valid utf-8 on its own (note that vi does not display it correctly either):

    UnicodeDecodeError: 'utf8' codec can't decode byte 0xa0 in position 140283: unexpected code byte

The solution is to use 'replace' so that invalid characters are transformed into a ?. See the sample in the screenshot.

Note: since utf-8 is a superset of ascii, this change is backward compatible.
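A minimal sketch of the decode-with-'replace' approach described above, written for Python 3 and not taken from the actual post-review patch. The helper name to_utf8_bytes and the sample bytes are hypothetical. Note that on decode, Python's 'replace' handler substitutes U+FFFD (the replacement character, often rendered as a question mark) rather than a literal "?".

```python
# Hypothetical helper for illustration only; not the real post-review code.
def to_utf8_bytes(content):
    """Return UTF-8 bytes, substituting U+FFFD for bytes that are not valid UTF-8."""
    if isinstance(content, bytes):
        # Decode explicitly as UTF-8 instead of relying on an implicit
        # ASCII decode; 'replace' keeps invalid bytes from raising.
        content = content.decode('utf-8', 'replace')
    return content.encode('utf-8')


# Valid UTF-8 (the two bytes for "ï") followed by a stray 0xA0 byte.
sample = b'Lo\xc3\xafc Minier\xa0end'
result = to_utf8_bytes(sample)
print(result.decode('utf-8'))
```

The valid "ï" sequence passes through unchanged and only the stray 0xA0 byte is replaced, so the diff content itself is altered at that position, which is the concern raised in the review comment about whether a transformed diff would still apply.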
Tried on sample diffs as shown in the screenshot. Verified that the user's complete (and very large) diffs could be uploaded to the staging server. Created a review on the production server using my local version of post-review.