Message 196398 - Python tracker


Author	valhallasw
Recipients	valhallasw
Date	2013-08-28.18:18:21
Message-id	<1377713902.12.0.966249722349.issue18870@psf.upfronthosting.co.za>
In-reply-to

Content

Steps to reproduce:
-------------------
>>> eval("u'ä'")
# in a UTF-8 console, so this is equivalent to
>>> eval("u'\xc3\xa4'")

Actual result:
----------------
u'\xc3\xa4'
# i.e.: u'Ã¤'

Expected result:
-----------------
SyntaxError: Non-ASCII character '\xc3' in file <string> on line 1, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details
(which is what would happen if it was in a source file)

Or, alternatively:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 2: ordinal not in range(128)
(which is what results from decoding the str with sys.getdefaultencoding())

Instead, the string is interpreted as latin-1. The same happens with ast.literal_eval, and even when calling compile() directly.
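
The latin-1 interpretation can be reproduced by hand. This is only a minimal sketch in Python 3, where bytes and text are separate types; it is not the code path the Python 2 interpreter takes, and the names `raw`, `as_latin1` and `as_utf8` are illustrative.

```python
# The two bytes a UTF-8 console sends for 'ä'.
raw = b"\xc3\xa4"

# Decoding as latin-1 maps each byte to the code point with the same value,
# which yields the two-character result reported above.
as_latin1 = raw.decode("latin-1")

# Decoding as UTF-8 combines both bytes into the single intended character.
as_utf8 = raw.decode("utf-8")

print(ascii(as_latin1))  # '\xc3\xa4'
print(ascii(as_utf8))    # '\xe4'
```

The first value has length 2 and the second has length 1, which is the difference between the actual result and the one a UTF-8 reading would give.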

In Python 3.2, this is the result, because UTF-8 is used as the default source encoding:
>>> eval(b"'\xc3\xa4'")
'ä'

Workarounds
----------
>>> eval("# encoding: utf-8\nu'\xc3\xa4'")
u'\xe4'
>>> eval("u'\xc3\xa4'".decode('utf-8'))
u'\xe4'
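
The second workaround, decoding before evaluating, could be wrapped in a small helper. The sketch below is written for Python 3 and assumes the input is a literal encoded as UTF-8; `literal_eval_utf8` is a hypothetical name, not part of the standard library.

```python
import ast


def literal_eval_utf8(source):
    """Evaluate a Python literal given as UTF-8 bytes (hypothetical helper)."""
    # Decode explicitly so the encoding is never guessed; invalid input
    # raises UnicodeDecodeError instead of being misread silently.
    text = source.decode("utf-8")
    return ast.literal_eval(text)


result = literal_eval_utf8(b"'\xc3\xa4'")
print(ascii(result))  # '\xe4'
```

Using ast.literal_eval rather than eval keeps the helper restricted to literals, and a byte sequence that is not valid UTF-8 fails loudly, which is close to the UnicodeDecodeError alternative described under "Expected result".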


I understand this might be considered a WONTFIX, as a fix would change behavior some people might depend on. Nonetheless, documenting this explicitly seems a sensible thing to do.

History
Date	User	Action	Args
2013-08-28 18:18:22	valhallasw	set	recipients: + valhallasw
2013-08-28 18:18:22	valhallasw	set	messageid: <1377713902.12.0.966249722349.issue18870@psf.upfronthosting.co.za>
2013-08-28 18:18:22	valhallasw	link	issue18870 messages
2013-08-28 18:18:21	valhallasw	create