Title: eval() uses latin-1 to decode str
Messages (10)
msg196398 - (view) Author: Merlijn van Deen (valhallasw) * Date: 2013-08-28 18:18
Steps to reproduce:
>>> eval("u'ä'")
# in a UTF-8 console, so this is equivalent to
>>> eval("u'\xc3\xa4'")

Actual result:
u'\xc3\xa4'
# i.e.: u'Ã¤'

Expected result:
SyntaxError: Non-ASCII character '\xc3' in file <string> on line 1, but no encoding declared; see for details
(which is what would happen if it was in a source file)

Or, alternatively:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 2: ordinal not in range(128)
(which is what results from decoding the str with sys.getdefaultencoding())

Instead, the string is interpreted as latin-1. The same happens for ast.literal_eval - even calling compile() directly.
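
To make the three candidate interpretations concrete, here is a minimal sketch of what each codec does with the two bytes in question. It is written for Python 3, where bytes literals are explicit, and the variable names are made up for the example.

```python
# The two bytes a UTF-8 console sends for the character U+00E4.
raw = b"\xc3\xa4"

# Latin-1 maps every byte to the code point with the same value,
# so decoding never fails and yields two characters.
as_latin1 = raw.decode("latin-1")

# UTF-8 combines the two bytes into a single character.
as_utf8 = raw.decode("utf-8")

# ASCII rejects any byte >= 0x80, so this raises UnicodeDecodeError.
try:
    raw.decode("ascii")
    ascii_error = None
except UnicodeDecodeError as exc:
    ascii_error = exc

print(ascii(as_latin1), len(as_latin1))
print(ascii(as_utf8), len(as_utf8))
print(ascii_error)
```

The latin-1 branch is the one that never raises, which is why the misinterpretation goes unnoticed.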

In Python 3.2, this gives the expected result, because UTF-8 is the default source encoding:
>>> eval(b"'\xc3\xa4'")

Workarounds:
>>> eval("# encoding: utf-8\nu'\xc3\xa4'")
>>> eval("u'\xc3\xa4'".decode('utf-8'))

I understand this might be considered a WONTFIX, as it would change behavior some people might depend on. Nonetheless, documenting this explicitly seems a sensible thing to do.
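
For comparison, a small sketch of the Python 3 side of this report. It assumes a Python 3 interpreter, where eval() accepts a bytes object and, absent a BOM or coding declaration, decodes it as UTF-8. The second call mirrors the decode-first approach: convert the bytes to text explicitly, then evaluate. The variable names are illustrative only.

```python
# A quoted string literal whose content is the UTF-8 encoding of U+00E4.
source_bytes = b"'\xc3\xa4'"

# Python 3: bytes source without a coding declaration is read as UTF-8.
from_bytes = eval(source_bytes)

# Decode-first approach: make the encoding explicit before evaluating.
from_text = eval(source_bytes.decode("utf-8"))

print(ascii(from_bytes), ascii(from_text))
```

Decoding explicitly keeps the choice of encoding in the caller's hands instead of relying on whatever the parser assumes.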
msg196407 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2013-08-28 19:45
I don't think it is even "won't fix".  Your "workarounds" are just the way you need to feed non-latin1 text into Python2.  Since the default source encoding in python2 is latin-1, and that is documented, I'm not sure what additional documentation you want?
msg196413 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2013-08-28 20:20
> Since the default source encoding in python2 is latin-1

Mmh, really? According to PEP 263:

    Python will default to ASCII as standard encoding if no other
    encoding hints are given.

And indeed when trying Merlijn's code in a .py file rather than an eval() call, I get:

SyntaxError: Non-ASCII character '\xc3' in file on line 1, but no encoding declared; see for details
msg196414 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2013-08-28 20:23
Where / how do you type the eval() instruction? In IDLE, in the interactive prompt (>>>) or in a script? What is your OS and what is your locale encoding?


$ python -m platform
$ python -c 'import locale; print(locale.getpreferredencoding())'
msg196415 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2013-08-28 20:24
$ python
Python 2.7.4 (default, Apr 19 2013, 18:28:01) 
[GCC 4.7.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> eval("u'ä'")
>>> import locale
>>> locale.getpreferredencoding()
msg196429 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2013-08-28 21:53
Heh.  Obviously I've forgotten some things about python2.  I could have sworn the default was latin-1, but perhaps that was just the stdlib standard for coding cookies?  I don't use python2 much any more...
msg196432 - (view) Author: Benjamin Peterson (benjamin.peterson) * (Python committer) Date: 2013-08-28 22:46
Yeah, this is an obnoxious bug. Too late to fix in Python 2, though. I would take a doc patch.
msg196486 - (view) Author: Merlijn van Deen (valhallasw) * Date: 2013-08-29 20:25
On the lowest level, this affects exec, eval(), compile() and input() (!). On a higher level, more modules are affected:

modules ast, codeop, compiler, cProfile, dis, distutils (not sure), doctest, idlelib, ihooks, pdb, pkgutil, plat-mac, py_compile, rexec, runpy and timeit all call compile()

modules bdb, compiler, gettext, idlelib, lib2to3, lib-tk.turtle, logging, mhlib, pdb, plat-irix5, plat-mac, rexec, rlcompleter and warnings all call eval()

and modules Bastion, bdb, code, collections, cProfile, distutils, doctest, idlelib, ihooks, imputil, pdb, plat-irix5, plat-irix6, plat-mac, profile, rexec, site, timeit and trace all call exec.

Not all of them necessarily take user-supplied code - I haven't checked that.

After checking tests/, it seems the behavior is a bit more complicated than I initially thought: a str parameter is considered latin-1 unless either
 a) a UTF-8 BOM is present, in which case it is considered UTF-8
 b) a # encoding: XXX line is present, in which case it is considered
    to be in that encoding
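
The rules above can be sketched as a small decision function. This is only a model of the behavior described in this message, not CPython's actual implementation: the helper name assumed_source_encoding is hypothetical, the regular expression follows the declaration form given in PEP 263, and limiting the search to the first two lines is likewise an assumption taken from PEP 263. The sketch targets Python 3.

```python
import codecs
import re

# Pattern for a coding declaration, following the form given in PEP 263.
CODING_RE = re.compile(rb"^[ \t\f]*#.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+)")


def assumed_source_encoding(source: bytes) -> str:
    """Return the encoding the rules above would apply to a byte string.

    Hypothetical helper: it models the described behavior only.
    """
    # a) A UTF-8 BOM means the source is treated as UTF-8.
    if source.startswith(codecs.BOM_UTF8):
        return "utf-8"
    # b) A coding declaration on the first or second line (PEP 263).
    for line in source.splitlines()[:2]:
        match = CODING_RE.match(line)
        if match:
            return match.group(1).decode("ascii")
    # Otherwise the byte string is treated as latin-1.
    return "latin-1"


print(assumed_source_encoding(b"u'\xc3\xa4'"))
print(assumed_source_encoding(b"# encoding: utf-8\nu'\xc3\xa4'"))
print(assumed_source_encoding(codecs.BOM_UTF8 + b"u'\xc3\xa4'"))
```

The fall-through to latin-1 on the last line is the case this issue is about.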

In any case, I have attached a doc patch for exec, eval(), compile(), and ast.literal_eval(), because I think these are the most widely used. I think input() does not need a doc change because it explicitly refers to eval().

I ignored the subtleties noted above for the doc patch, simplifying to 'pass either a Unicode or a latin-1 encoded string'.
msg196493 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-08-29 21:06
See also issue15809.
msg196749 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2013-09-01 23:06
New changeset 869cbcabb934 by Benjamin Peterson in branch '2.7':
document that various functions that parse from source will interpret things as latin-1 (closes #18870)
