classification
Title: built-in compile() should take encoding option.
Type: enhancement Stage: test needed
Components: Interpreter Core, Unicode Versions: Python 3.2
process
Status: closed Resolution: rejected
Dependencies: Superseder:
Assigned To: Nosy List: BreamoreBoy, Trundle, benjamin.peterson, facundobatista, naoki
Priority: normal Keywords: patch

Created on 2009-05-03 01:55 by naoki, last changed 2010-08-25 04:12 by benjamin.peterson. This issue is now closed.

Files
File name Uploaded Description
compile_with_encoding.patch naoki, 2009-10-03 16:23 add encoding option and test.
Messages (6)
msg86994 - (view) Author: INADA Naoki (naoki) * Date: 2009-05-03 01:55
The built-in compile() expects the source to be encoded in UTF-8.
This behavior makes it harder to implement alternative shells
like IDLE and IPython. (http://bugs.python.org/issue1542677 and
https://bugs.launchpad.net/ipython/+bug/339642 are related bugs.)

Below is the current compile() behavior.

# Python's interactive shell in Windows cp932 console.
>>> "あ"
'\x82\xa0'
>>> u"あ"
u'\u3042'

# compile() fails to decode str.
>>> code = compile('u"あ"', '__interactive__', 'single')
>>> exec code
u'\x82\xa0'  # u'\u3042' expected.

# compile() encodes unicode to utf-8.
>>> code = compile(u'"あ"', '__interactive__', 'single')
>>> exec code
'\xe3\x81\x82' # '\x82\xa0' (cp932) wanted, but I get utf-8.

Currently, a PEP 263 coding declaration like the one below is needed to
compile code in the expected encoding.

>>> code = compile('# coding: cp932\n%s' % ('"あ"',), '__interactive__', 'single')
>>> exec code
'\x82\xa0'
>>> code = compile('# coding: cp932\n%s' % ('u"あ"',), '__interactive__', 'single')
>>> exec code
u'\u3042'

But I feel that using compile() with a PEP 263 declaration is a bit of a dirty hack.
I think adding an 'encoding' argument that defaults to 'utf-8' to
compile() is a cleaner way, and it doesn't break backward compatibility.

The following example describes the behavior of compile() with the encoding option.

# coding: utf-8 (in utf-8 context)
code = compile('"あ"', '__foo.py', 'single')
exec code #=> '\xe3\x81\x82'

code = compile('"あ"', '__foo.py', 'single', encoding='cp932') #=> UnicodeDecodeError

code = compile(u'"あ"', '__foo.py', 'single')
exec code #=> '\xe3\x81\x82'

code = compile(u'"あ"', '__foo.py', 'single', encoding='cp932')
exec code #=> '\x82\xa0'
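
As a rough sketch of the semantics proposed above, written for Python 3 rather than the Python 2 sessions shown in this message: the helper name `compile_with_encoding` is hypothetical (it is neither the uploaded patch nor a real built-in), and it only covers byte input, which it decodes with the requested encoding before handing text to compile().

```python
def compile_with_encoding(source, filename, mode, encoding="utf-8"):
    """Hypothetical helper: decode byte source explicitly, then compile it.

    This is an illustration of the requested behavior, not the patch
    attached to this issue.
    """
    if isinstance(source, (bytes, bytearray)):
        # Decode with the caller's encoding instead of assuming UTF-8.
        source = bytes(source).decode(encoding)
    return compile(source, filename, mode)


# Source text as it would arrive from a cp932 console.
raw = '"あ"'.encode("cp932")  # b'"\x82\xa0"'

code = compile_with_encoding(raw, "<interactive>", "eval", encoding="cp932")
result = eval(code)

# With the default encoding (utf-8) the same bytes cannot be decoded.
try:
    compile_with_encoding(raw, "<interactive>", "eval")
    wrong_encoding_failed = False
except UnicodeDecodeError:
    wrong_encoding_failed = True
```

Decoding in the caller keeps compile() itself unchanged, which is one way to avoid touching the public API.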
msg93501 - (view) Author: INADA Naoki (naoki) * Date: 2009-10-03 16:23
Added a sample implementation.
msg93897 - (view) Author: Benjamin Peterson (benjamin.peterson) * (Python committer) Date: 2009-10-12 14:55
The patch as it currently stands is unacceptable because it changes
public APIs.
msg114814 - (view) Author: Mark Lawrence (BreamoreBoy) Date: 2010-08-24 20:12
Anyone interested in producing an updated patch?
msg114879 - (view) Author: INADA Naoki (naoki) * Date: 2010-08-25 03:11
This problem is not serious on Python 3,
because Python 3's byte string literals can't contain non-ASCII characters directly.
So passing a unicode string to compile() is good enough for all cases I can imagine.
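
A small Python 3 sketch of the point made here, assuming an interactive shell that already holds its input as str; the filename `<interactive>` and the variable names are only illustrative.

```python
# A str source compiles directly; no encoding guess is involved.
code = compile('"あ"', "<interactive>", "eval")
value = eval(code)

# A bytes literal with a non-ASCII character is rejected at compile time,
# so the Python 2 ambiguity about byte strings does not arise.
try:
    compile('b"あ"', "<interactive>", "eval")
    bytes_literal_error = None
except SyntaxError as exc:
    bytes_literal_error = exc
```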
msg114880 - (view) Author: Benjamin Peterson (benjamin.peterson) * (Python committer) Date: 2010-08-25 04:12
I'll close this then.
History
Date  User  Action  Args
2010-08-25 04:12:08  benjamin.peterson  set  status: open -> closed; resolution: rejected; messages: + msg114880
2010-08-25 03:11:15  naoki  set  messages: + msg114879
2010-08-24 20:12:47  BreamoreBoy  set  nosy: + BreamoreBoy; messages: + msg114814; versions: + Python 3.2, - Python 2.7
2009-11-24 22:45:22  Trundle  set  nosy: + Trundle
2009-10-12 14:55:23  benjamin.peterson  set  nosy: + benjamin.peterson; messages: + msg93897
2009-10-12 13:03:16  facundobatista  set  nosy: + facundobatista
2009-10-03 16:23:04  naoki  set  files: + compile_with_encoding.patch; keywords: + patch; messages: + msg93501
2009-05-08 19:08:41  ajaksu2  link  issue1542677 dependencies
2009-05-08 19:08:11  ajaksu2  set  priority: normal; versions: - Python 2.6; components: + Interpreter Core, Unicode, - None; stage: test needed
2009-05-03 01:55:23  naoki  create