classification
Title: compiler module doesn't support unicode characters in literals
Type:
Stage:
Components: Interpreter Core
Versions: Python 2.4
process
Status: closed
Resolution: out of date
Dependencies:
Superseder:
Assigned To: nascheme
Nosy List: BreamoreBoy, dcjim, jhylton, lemburg, mwh, nascheme, nnorwitz
Priority: normal
Keywords:

Created on 2004-07-28 14:00 by dcjim, last changed 2010-08-19 15:38 by BreamoreBoy. This issue is now closed.

Messages (7)
msg21835 - (view) Author: Jim Fulton (dcjim) (Python triager) Date: 2004-07-28 14:00
I'm not positive that this is a bug.  The built-in
compile function accepts unicode with non-ascii text in
literals:

>>> text = u"print u'''\u0442\u0435\u0441\u0442'''"
>>> exec compile(text, 's', 'exec')
тест
>>> import compiler
>>> exec compiler.compile(text, 's', 'exec')
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "/usr/local/python/2.3.4/lib/python2.3/compiler/pycodegen.py", line 64, in compile
    gen.compile()
  File "/usr/local/python/2.3.4/lib/python2.3/compiler/pycodegen.py", line 111, in compile
    tree = self._get_tree()
  File "/usr/local/python/2.3.4/lib/python2.3/compiler/pycodegen.py", line 77, in _get_tree
    tree = parse(self.source, self.mode)
  File "/usr/local/python/2.3.4/lib/python2.3/compiler/transformer.py", line 50, in parse
    return Transformer().parsesuite(buf)
  File "/usr/local/python/2.3.4/lib/python2.3/compiler/transformer.py", line 120, in parsesuite
    return self.transform(parser.suite(text))
UnicodeEncodeError: 'ascii' codec can't encode characters in position 10-13: ordinal not in range(128)
>>> 
msg21836 - (view) Author: Jim Fulton (dcjim) (Python triager) Date: 2004-07-28 14:02
Also in 2.3
msg21837 - (view) Author: Michael Hudson (mwh) (Python committer) Date: 2004-07-29 11:19
the immediate problem is that the parser module does not support
unicode:

>>> import parser
>>> parser.suite(u"print u'''\u0442\u0435\u0441\u0442'''")
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeEncodeError: 'ascii' codec can't encode characters in position 10-13: ordinal not in range(128)

there may well be more bugs lurking in Lib/compiler wrt this 
issue, but this is the first... I don't know how easy this will be to 
fix (looking at what the builtin compile() function does with 
unicode might be a good start).
msg21838 - (view) Author: Michael Hudson (mwh) (Python committer) Date: 2004-07-29 11:30
thinking about this a little harder, doing a proper job probably 
involves mucking around in the depths of python to support
source-as-unicode throughout.  the vile solution is this sort of 
thing:

>>> parser.suite('# coding: utf-8\n' + u"print u'''\u0442\u0435\u0441\u0442'''".encode('utf-8'))
<parser.st object at 0x107770>
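
The workaround above can be wrapped in a small helper. What follows is only a minimal sketch, written for a current Python 3 interpreter (where the compiler and parser modules are gone) and using the built-in compile() on bytes instead of parser.suite(). The helper name compile_unicode_source is hypothetical, and the sample source assigns to a variable instead of printing so the result can be inspected.

```python
def compile_unicode_source(text, filename, mode="exec"):
    """Compile a str by passing UTF-8 bytes preceded by a coding line.

    Hypothetical helper illustrating the workaround from this thread.
    """
    # Prepending a line shifts every reported line number by one.
    data = b"# coding: utf-8\n" + text.encode("utf-8")
    return compile(data, filename, mode)


# Same non-ASCII literal as in the report.
text = "result = '''\u0442\u0435\u0441\u0442'''"
namespace = {}
exec(compile_unicode_source(text, "s"), namespace)
print(namespace["result"])
```

The cost of this approach is the one already hinted at: the source is altered before compilation, so tracebacks point one line below the line the caller wrote.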
msg21839 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2004-07-29 11:38
Note that the tokenizer converts the input string into UTF-8
(transcoding it as necessary if a source code encoding shebang
is found) and the compiler will assume this encoding when
creating Unicode literals.

I'm not sure whether the compiler package is up-to-date w/r to
these internal changes in the C-based compiler.
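
The transcoding step described here can be approximated in pure Python. This is only a sketch under assumptions: it uses tokenize.detect_encoding from the Python 3 standard library, not the C tokenizer discussed in this thread, and the helper name transcode_to_utf8 and the KOI8-R sample source are made up for illustration.

```python
import io
import tokenize


def transcode_to_utf8(source_bytes):
    """Return (declared_encoding, source re-encoded as UTF-8).

    Hypothetical helper: finds the coding declaration (or BOM) in the
    first two lines, decodes with it, and re-encodes as UTF-8.
    """
    readline = io.BytesIO(source_bytes).readline
    encoding, _ = tokenize.detect_encoding(readline)
    return encoding, source_bytes.decode(encoding).encode("utf-8")


# Sample source stored in KOI8-R with a matching coding declaration.
koi8_source = (
    "# -*- coding: koi8-r -*-\n"
    "s = '\u0442\u0435\u0441\u0442'\n"
).encode("koi8-r")

encoding, utf8_source = transcode_to_utf8(koi8_source)
print(encoding)
print(utf8_source.decode("utf-8"))
```

Note that the re-encoded bytes still carry the original coding line, so they should not be fed back to a compiler as raw bytes without first updating or removing that declaration.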
msg21840 - (view) Author: Neal Norwitz (nnorwitz) * (Python committer) Date: 2006-02-25 22:00
FYI
msg114368 - (view) Author: Mark Lawrence (BreamoreBoy) * Date: 2010-08-19 15:38
The compiler package has been removed from py3k.
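
For readers on py3k, a rough sketch of the same round trip using the ast module follows. Treating ast.parse as the stand-in for compiler.parse is an assumption made here, not something stated in this thread, and the sample source assigns to a variable instead of printing.

```python
import ast

# Same non-ASCII literal as in the original report, as a Python 3 str.
text = "s = '''\u0442\u0435\u0441\u0442'''"

# Parse to a syntax tree first, then compile the tree to a code object.
tree = ast.parse(text, filename="s", mode="exec")
code = compile(tree, "s", "exec")

namespace = {}
exec(code, namespace)
print(namespace["s"])
```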
History
Date                 User         Action  Args
2010-08-19 15:38:49  BreamoreBoy  set     status: open -> closed; nosy: + BreamoreBoy; messages: + msg114368; resolution: out of date
2009-02-07 01:00:38  nascheme     set     assignee: jhylton -> nascheme; nosy: + nascheme
2004-07-28 14:00:11  dcjim        create