Summary

Fix processing of non-UTF-8-encoded files and diffs

Review Request #872 — Created May 20, 2009 and submitted

Information

Owner

morozov

Repository

Review Board SVN (deprecated)

Branch

Bugs

Depends On

Reviewers

Groups

reviewboard

People

Description

When files in a repository are encoded with a non-ASCII, non-UTF-8 encoding, a special configuration option, the repository encoding, is required. However, even if this option is provided, files are still processed incorrectly by the diffviewer.

convert_to_utf8() correctly returns unicode strings for byte strings that can be decoded as UTF-8 (i.e. ASCII and actual UTF-8), and further processing (e.g. by pygments) assumes unicode strings as parameters. However, for non-UTF-8 strings the function returned byte strings, which effectively breaks pygments.

The patch:
1. renames convert_to_utf8() to convert_to_unicode() to reflect its real purpose :)
2. returns unicode instead of str for strings in a user-specified encoding
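
For illustration only, here is a rough sketch of the behaviour described above. It is not the code from the patch. It is written for Python 3, where `bytes` plays the role of the old `str` and `str` plays the role of the old `unicode`. The two-argument signature and the comma-separated `encoding_list` parameter are assumptions made for this example.

```python
def convert_to_unicode(data, encoding_list):
    """Return text for `data`, never a byte string.

    Hypothetical sketch: UTF-8 is tried first (this also covers ASCII),
    then each encoding named in the comma-separated `encoding_list`.
    """
    if isinstance(data, str):
        # Already decoded text, nothing to do.
        return data

    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        pass

    for name in encoding_list.split(","):
        name = name.strip()
        if not name:
            continue
        try:
            # The key point: return decoded text here, not the raw bytes.
            return data.decode(name)
        except (UnicodeDecodeError, LookupError):
            # Wrong encoding for this data, or an unknown codec name.
            continue

    raise ValueError("could not decode data with any configured encoding")


# Usage: Shift-JIS bytes are not valid UTF-8, so the fallback list is used.
sjis_bytes = "日本語".encode("shift_jis")
text = convert_to_unicode(sjis_bytes, "sjis,euc-jp")
assert isinstance(text, str)
assert text == "日本語"
```

With this shape, later stages such as syntax highlighting always receive decoded text, whichever encoding the repository uses.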

Testing Done

Would you mind providing some test cases that handle this conversion, having tests that broke in the old code and are working in the new code? I want to make sure there aren't regressions.

it00h

Oct. 16, 2009, 12:22 p.m.

I'm trying out Review Board 1.0. I'm using the Japanese Shift-JIS character encoding and have set the repository encoding setting to 'sjis,Shift_JIS,CP932,euc-jp'. But this does not work and I get mojibake.

To get it to work correctly I need the patch in this report.

I'm not sure how Python works with character encodings, but I want this fixed.
Can I provide some kind of data?

chipx86

Oct. 16, 2009, 4:34 p.m.

Review Board expects and practically requires the database and browser to be UTF-8, regardless of the repository encodings.

More specific information on what's wrong and how it's manifesting would help a lot. Though, this is best done in a bug report.
