classification
Title: eval() uses latin-1 to decode str
Type: behavior Stage: resolved
Components: Documentation Versions: Python 2.7
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: docs@python Nosy List: benjamin.peterson, docs@python, pitrou, python-dev, r.david.murray, serhiy.storchaka, valhallasw, vstinner
Priority: normal Keywords: patch

Created on 2013-08-28 18:18 by valhallasw, last changed 2013-09-01 23:06 by python-dev. This issue is now closed.

Files
File name Uploaded Description
doc_18870.diff valhallasw, 2013-08-29 20:25
Messages (10)
msg196398 - Author: Merlijn van Deen (valhallasw) * Date: 2013-08-28 18:18
Steps to reproduce:
-------------------
>>> eval("u'ä'")
# in a UTF-8 console, so this is equivalent to
>>> eval("u'\xc3\xa4'")

Actual result:
----------------
u'\xc3\xa4'
# i.e.: u'Ã¤'

Expected result:
-----------------
SyntaxError: Non-ASCII character '\xc3' in file <string> on line 1, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details
(which is what would happen if it was in a source file)

Or, alternatively:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 2: ordinal not in range(128)
(which is what results from decoding the str with sys.getdefaultencoding())

Instead, the string is interpreted as latin-1. The same happens with ast.literal_eval(), and even when calling compile() directly.

In Python 3.2, this is the result, as UTF-8 is used as the default source encoding:
>>> eval(b"'\xc3\xa4'")
'ä'

Workarounds
----------
>>> eval("# encoding: utf-8\nu'\xc3\xa4'")
u'\xe4'
>>> eval("u'\xc3\xa4'".decode('utf-8'))
u'\xe4'


I understand this might be considered a WONTFIX, as it would change behavior some people might depend on. Nonetheless, documenting this explicitly seems a sensible thing to do.
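
To make the difference concrete, here is a minimal sketch of what the same two bytes turn into under each of the three codecs discussed above, followed by the decode-first workaround. It only uses bytes.decode() and eval(); the variable names are illustrative and not part of any patch.

```python
# The two bytes a UTF-8 console sends for a-umlaut (U+00E4).
raw = b"\xc3\xa4"

# UTF-8: both bytes together form a single character.
as_utf8 = raw.decode("utf-8")

# latin-1: every byte maps to its own code point, so two characters come out.
as_latin1 = raw.decode("latin-1")

# ASCII: the first byte is already outside range(128).
try:
    raw.decode("ascii")
except UnicodeDecodeError as exc:
    ascii_error = exc
else:
    ascii_error = None

# Workaround: decode the source explicitly before handing it to eval().
source = b"u'\xc3\xa4'"
evaluated = eval(source.decode("utf-8"))

print(repr(as_utf8), repr(as_latin1), repr(evaluated))
print(ascii_error)
```

Decoding first keeps the choice of encoding in the caller's hands instead of relying on whatever the parser assumes for a byte string.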
msg196407 - Author: R. David Murray (r.david.murray) * (Python committer) Date: 2013-08-28 19:45
I don't think it is even "won't fix".  Your "workarounds" are just the way you need to feed non-latin1 text into Python2.  Since the default source encoding in python2 is latin-1, and that is documented, I'm not sure what additional documentation you want?
msg196413 - Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2013-08-28 20:20
> Since the default source encoding in python2 is latin-1

Mmh, really? According to PEP 263:

    Python will default to ASCII as standard encoding if no other
    encoding hints are given.

And indeed when trying Merlijn's code in a .py file rather than an eval() call, I get:

SyntaxError: Non-ASCII character '\xc3' in file tcc.py on line 1, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details
msg196414 - Author: STINNER Victor (vstinner) * (Python committer) Date: 2013-08-28 20:23
Where / how do you type the eval() instruction? In IDLE, in the interactive prompt (>>>) or in a script (test.py)? What is your OS and what is your locale encoding?

Example:

$ python -m platform
Linux-3.9.4-200.fc18.x86_64-x86_64-with-fedora-18-Spherical_Cow
$ python -c 'import locale; print(locale.getpreferredencoding())'
UTF-8
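
The same information can be gathered from inside the interpreter. This is a rough sketch, assuming only the standard platform, locale and sys modules; encoding_report is a hypothetical helper name used for illustration.

```python
import locale
import platform
import sys


def encoding_report():
    """Collect the OS and encoding settings asked about above into one dict."""
    return {
        "platform": platform.platform(),
        "locale_encoding": locale.getpreferredencoding(),
        "default_encoding": sys.getdefaultencoding(),
        # stdin may have been replaced or closed, so look the attribute up defensively.
        "stdin_encoding": getattr(sys.stdin, "encoding", None),
    }


for key, value in sorted(encoding_report().items()):
    print("%s: %s" % (key, value))
```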
msg196415 - Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2013-08-28 20:24
$ python
Python 2.7.4 (default, Apr 19 2013, 18:28:01) 
[GCC 4.7.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> eval("u'ä'")
u'\xc3\xa4'
>>> import locale
>>> locale.getpreferredencoding()
'UTF-8'
msg196429 - Author: R. David Murray (r.david.murray) * (Python committer) Date: 2013-08-28 21:53
Heh.  Obviously I've forgotten some things about python2.  I could have sworn the default was latin-1, but perhaps that was just the stdlib standard for coding cookies?  I don't use python2 much any more...
msg196432 - Author: Benjamin Peterson (benjamin.peterson) * (Python committer) Date: 2013-08-28 22:46
Yeah, this is an obnoxious bug. Too late to fix in Python 2, though. I would take a doc patch.
msg196486 - Author: Merlijn van Deen (valhallasw) * Date: 2013-08-29 20:25
On the lowest level, this affects exec, eval(), compile() and input() (!). On a higher level, more modules are affected:

modules ast, codeop, compiler, cProfile, dis, distutils (not sure), doctest, idlelib, ihooks, pdb, pkgutil, plat-mac, py_compile, rexec, runpy and timeit all call compile()

modules bdb, compiler, gettext, idlelib, lib2to3, lib-tk.turtle, logging, mhlib, pdb, plat-irix5, plat-mac, rexec, rlcompleter and warnings all call eval()

and modules Bastion, bdb, code, collections, cProfile, distutils, doctest, idlelib, ihooks, imputil, pdb, plat-irix5, plat-irix6, plat-mac, profile, rexec, site, timeit and trace all call exec.

Not all of them necessarily take user-supplied code - I haven't checked that.


After checking tests/test_pep263.py, it seems the behavior is a bit more complicated than I initially thought: a str parameter is considered latin-1 unless either
 a) a UTF-8 BOM is present, in which case it is considered utf-8
 b) an # encoding: XXX  line is present, in which case it is considered
    to be in that encoding

In any case, I have attached a doc patch for exec, eval(), compile(), and ast.literal_eval(), because I think these are the most widely used. I think input() does not need a doc change because it explicitly refers to eval().

I ignored the subtleties noted above for the doc patch, simplifying to 'pass either a Unicode or a latin-1 encoded string'.
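
The rules listed above can be sketched as a small function. This is a simplified model of the behavior described in this message, not CPython's actual implementation; assumed_source_encoding is a hypothetical name, and the cookie pattern is the one given in PEP 263, searched for in the first two lines only.

```python
import codecs
import re

# Coding-cookie pattern from PEP 263, applied to a byte string.
_COOKIE = re.compile(br"^[ \t\f]*#.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+)")


def assumed_source_encoding(source):
    """Return the encoding a Python 2 str source is treated as, per the rules above."""
    # Rule (a): a UTF-8 BOM means the source is considered UTF-8.
    if source.startswith(codecs.BOM_UTF8):
        return "utf-8"
    # Rule (b): an encoding declaration in the first two lines wins.
    for line in source.splitlines()[:2]:
        match = _COOKIE.match(line)
        if match:
            return match.group(1).decode("ascii")
    # Otherwise the bytes are interpreted as latin-1.
    return "latin-1"


print(assumed_source_encoding(b"u'\xc3\xa4'"))
print(assumed_source_encoding(b"# encoding: utf-8\nu'\xc3\xa4'"))
print(assumed_source_encoding(codecs.BOM_UTF8 + b"u'\xc3\xa4'"))
```

The sketch ignores details such as a BOM that conflicts with a declared encoding, in line with the simplification used for the doc patch.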
msg196493 - Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-08-29 21:06
See also issue15809.
msg196749 - Author: Roundup Robot (python-dev) (Python triager) Date: 2013-09-01 23:06
New changeset 869cbcabb934 by Benjamin Peterson in branch '2.7':
document that various functions that parse from source will interpret things as latin-1 (closes #18870)
http://hg.python.org/cpython/rev/869cbcabb934
History
Date                 User               Action  Args
2013-09-01 23:06:59  python-dev         set     status: open -> closed; nosy: + python-dev; messages: + msg196749; resolution: fixed; stage: resolved
2013-08-29 21:06:07  serhiy.storchaka   set     nosy: + serhiy.storchaka; messages: + msg196493
2013-08-29 20:25:53  valhallasw         set     files: + doc_18870.diff; keywords: + patch; messages: + msg196486
2013-08-28 22:46:30  benjamin.peterson  set     messages: + msg196432; components: + Documentation, - Interpreter Core
2013-08-28 21:53:22  r.david.murray     set     messages: + msg196429
2013-08-28 20:24:39  pitrou             set     messages: + msg196415
2013-08-28 20:23:29  vstinner           set     nosy: + vstinner; messages: + msg196414
2013-08-28 20:20:31  pitrou             set     nosy: + pitrou, benjamin.peterson; messages: + msg196413; components: + Interpreter Core, - Documentation; type: behavior
2013-08-28 19:45:07  r.david.murray     set     nosy: + docs@python, r.david.murray; messages: + msg196407; assignee: docs@python; components: + Documentation
2013-08-28 18:18:22  valhallasw         create