classification
Title: shlex.split() converts unicode input to UCS-4 output
Type: enhancement Stage:
Components: Library (Lib) Versions: Python 2.6, Python 2.5
process
Status: closed Resolution: out of date
Dependencies: Superseder:
Assigned To: Nosy List: amaury.forgeotdarc, eric.araujo, ezio.melotti, fenner, lemburg, pitrou, terry.reedy
Priority: normal Keywords:

Created on 2009-09-24 13:54 by fenner, last changed 2011-10-22 23:12 by eric.araujo. This issue is now closed.

Messages (10)
msg93074 - (view) Author: Bill Fenner (fenner) Date: 2009-09-24 13:54
In Python 2.5, shlex handled unicode input fine:

Python 2.5.1 (r251:54863, Jun 15 2008, 18:24:51) 
[GCC 4.3.0 20080428 (Red Hat 4.3.0-8)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import shlex
>>> shlex.split( u'Hello, World!' )
['Hello,', 'World!']

In Python 2.6, shlex turns unicode input into UCS-4 output, thus utterly
confusing execl:

Python 2.6 (r26:66714, Jun  8 2009, 16:07:29)
[GCC 4.4.0 20090506 (Red Hat 4.4.0-4)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import shlex
>>> shlex.split( u'Hello, World' )
['H\x00\x00\x00e\x00\x00\x00l\x00\x00\x00l\x00\x00\x00o\x00\x00\x00,\x00\x00\x00',
'\x00\x00\x00W\x00\x00\x00o\x00\x00\x00r\x00\x00\x00l\x00\x00\x00d\x00\x00\x00']

Even weirder, the two return strings have different byte order (see
'H\x00\x00\x00' vs. '\x00\x00\x00W'!)
msg93075 - (view) Author: Bill Fenner (fenner) Date: 2009-09-24 14:00
A colleague pointed out that the bad behavior was introduced in 2.5.2:

Python 2.5.2 (r252:60911, Sep 30 2008, 15:42:03) 
[GCC 4.3.2 20080917 (Red Hat 4.3.2-4)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import shlex
>>> shlex.split( u"Hello, World!" )
['H\x00\x00\x00e\x00\x00\x00l\x00\x00\x00l\x00\x00\x00o\x00\x00\x00,\x00\x00\x00',
'\x00\x00\x00W\x00\x00\x00o\x00\x00\x00r\x00\x00\x00l\x00\x00\x00d\x00\x00\x00!\x00\x00\x00']
msg93079 - (view) Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) Date: 2009-09-24 16:12
I'll take the opposite point of view:
the bad behavior was introduced with 2.5.1 (issue1548891, r52302), and
reverted for 2.5.2 because "it broke backwards compatibility with
arbitrary read buffers" (issue1730114, r53831)

The difference is in cStringIO:

>>> from cStringIO import StringIO
>>> StringIO(u"Hello, World!").read()
'H\x00\x00\x00e\x00\x00\x00l\x00\x00\x00l\x00\x00\x00o\x00\x00\x00,\x00\x00\x00 \x00\x00\x00W\x00\x00\x00o\x00\x00\x00r\x00\x00\x00l\x00\x00\x00d\x00\x00\x00!\x00\x00\x00'
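
To see where the zero bytes in such a buffer end up once it is cut at the space, here is a minimal sketch. It assumes a wide (UCS-4) build on a little-endian machine and uses the `utf-32-le` codec as a stand-in for the raw buffer; it does not go through cStringIO or shlex itself.

```python
# Stand-in for the raw UCS-4 buffer that cStringIO exposes on a
# little-endian wide build (an assumption made for illustration).
raw = u"Hello, World!".encode("utf-32-le")

# Cut the buffer at the single space byte. The three padding zeros that
# belonged to the space character stay at the front of the second piece.
tokens = raw.split(b" ")

print(tokens)
```

Both pieces use the same byte order; the second one simply starts with the padding left over from the space.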

The byte order is not different in the two strings: but u" " becomes 
" \x00\x00\x00" and the three zeros are copied into the second item.
msg93080 - (view) Author: Bill Fenner (fenner) Date: 2009-09-24 17:21
So, just to be clear, your position is that the output of
shlex.split( u'Hello, World!' ) is *supposed* to be
['H\x00\x00\x00e\x00\x00\x00l\x00\x00\x00l\x00\x00\x00o\x00\x00\x00,\x00\x00\x00',
'\x00\x00\x00W\x00\x00\x00o\x00\x00\x00r\x00\x00\x00l\x00\x00\x00d\x00\x00\x00']?
msg93082 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009-09-24 17:34
Hm, while the StringIO behaviour supposedly cannot be changed for
backwards-compatibility reasons, we can probably improve shlex behaviour
with unicode strings.
msg93083 - (view) Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) Date: 2009-09-24 17:48
(Presented this way, "my opinion" becomes difficult to stand...
OTOH the docs say that the module does not support Unicode, so it's not
strictly a bug)
http://docs.python.org/library/shlex.html

Yes, shlex could be improved and encode unicode strings to ascii.
msg93084 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2009-09-24 18:17
Amaury Forgeot d'Arc wrote:
> 
> Amaury Forgeot d'Arc <amauryfa@gmail.com> added the comment:
> 
> (Presented this way, "my opinion" becomes difficult to stand...
> OTOH the docs say that the module does not support Unicode, so it's not
> strictly a bug)
> http://docs.python.org/library/shlex.html
> 
> Yes, shlex could be improved and encode unicode strings to ascii.

I'd suggest converting Unicode input to a byte string using an
optional encoding parameter which defaults to 'utf-8' (most
shells nowadays default to UTF-8).

This is only a compromise, though, albeit a practical one.
POSIX has the notion of a portable character set:

http://www.opengroup.org/onlinepubs/000095399/basedefs/xbd_chap06.html#tagtcjh_3

which is pretty much the same as ASCII. Any ASCII compatible
encoding is then allowed via variable length encodings (see
further down on that page).
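
A rough sketch of what that suggestion could look like as a wrapper around `shlex.split()`. The name `split_encoded` and its `encoding` parameter are hypothetical; nothing like this exists in the shlex module, and this is only one way to read the proposal.

```python
import shlex
import sys

PY2 = sys.version_info[0] == 2
text_type = type(u"")  # unicode on 2.x, str on 3.x


def split_encoded(command, encoding="utf-8"):
    """Hypothetical helper: split a command line into byte strings."""
    if PY2:
        # Encode first so that cStringIO never sees a unicode object.
        if isinstance(command, text_type):
            command = command.encode(encoding)
        return shlex.split(command)
    # On 3.x shlex only accepts str, so split first and encode each token.
    if isinstance(command, bytes):
        command = command.decode(encoding)
    return [token.encode(encoding) for token in shlex.split(command)]


print(split_encoded(u"Hello, World!"))
```

With an ASCII-compatible encoding such as UTF-8, the resulting byte strings contain no embedded NULs, so they can be passed on to the exec family.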
msg93085 - (view) Author: Bill Fenner (fenner) Date: 2009-09-24 18:24
Sorry, I didn't read the web documentation, only the module
documentation, which doesn't mention Unicode.  I'd agree that since it's
a documented behavior, this bug can become:

- an RFE for shlex to handle Unicode
- meanwhile, if there will be any releases before that happens, an RFE
for the module documentation to mention the lack of Unicode support
msg112705 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2010-08-03 22:08
The discussion pretty much says this was a feature request, which is obsolete for 2.x. Not an issue for 3.x:
>>> import shlex
>>> shlex.split('Hello, World!' )
['Hello,', 'World!']
msg146200 - (view) Author: Éric Araujo (eric.araujo) * (Python committer) Date: 2011-10-22 23:12
$ ./python 
Python 2.7.2+ (2.7:27ae7d4e1983+, Oct 23 2011, 00:09:06) 
[GCC 4.6.1] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import shlex
>>> shlex.split(u'Hello, World!')
['Hello,', 'World!']

This was fixed indirectly by a StringIO fix in 27ae7d4e1983, for #1548891.
History
Date User Action Args
2011-10-22 23:12:11 eric.araujo set nosy: + eric.araujo
    messages: + msg146200
2010-08-03 22:08:52 terry.reedy set status: open -> closed
    nosy: + terry.reedy
    messages: + msg112705
    type: enhancement
    resolution: out of date
2009-09-25 05:41:20 ezio.melotti set priority: normal
    nosy: + ezio.melotti
2009-09-24 18:24:57 fenner set messages: + msg93085
    title: shlex.split() converts unicode input to UCS-4 output with varying byte order -> shlex.split() converts unicode input to UCS-4 output
2009-09-24 18:17:58 lemburg set nosy: + lemburg
    title: shlex.split() converts unicode input to UCS-4 output with varying byte order -> shlex.split() converts unicode input to UCS-4 output with varying byte order
    messages: + msg93084
2009-09-24 17:48:47 amaury.forgeotdarc set resolution: wont fix -> (no value)
    messages: + msg93083
2009-09-24 17:34:39 pitrou set nosy: + pitrou
    messages: + msg93082
2009-09-24 17:21:18 fenner set status: pending -> open
    messages: + msg93080
2009-09-24 16:12:15 amaury.forgeotdarc set status: open -> pending
    nosy: + amaury.forgeotdarc
    messages: + msg93079
    resolution: wont fix
2009-09-24 14:00:34 fenner set messages: + msg93075
    versions: + Python 2.5
2009-09-24 13:54:07 fenner create