shlex.split() converts unicode input to UCS-4 output #51237

fenner · 2009-09-24T13:54:07Z

BPO	6988
Nosy	@malemburg, @terryjreedy, @amauryfa, @pitrou, @ezio-melotti, @merwok

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


GitHub fields:

assignee = None
closed_at = <Date 2010-08-03.22:08:52.810>
created_at = <Date 2009-09-24.13:54:07.279>
labels = ['type-feature', 'library']
title = 'shlex.split() converts unicode input to UCS-4 output'
updated_at = <Date 2011-10-22.23:12:11.806>
user = 'https://bugs.python.org/fenner'

bugs.python.org fields:

activity = <Date 2011-10-22.23:12:11.806>
actor = 'eric.araujo'
assignee = 'none'
closed = True
closed_date = <Date 2010-08-03.22:08:52.810>
closer = 'terry.reedy'
components = ['Library (Lib)']
creation = <Date 2009-09-24.13:54:07.279>
creator = 'fenner'
dependencies = []
files = []
hgrepos = []
issue_num = 6988
keywords = []
message_count = 10.0
messages = ['93074', '93075', '93079', '93080', '93082', '93083', '93084', '93085', '112705', '146200']
nosy_count = 7.0
nosy_names = ['lemburg', 'terry.reedy', 'amaury.forgeotdarc', 'pitrou', 'fenner', 'ezio.melotti', 'eric.araujo']
pr_nums = []
priority = 'normal'
resolution = 'out of date'
stage = None
status = 'closed'
superseder = None
type = 'enhancement'
url = 'https://bugs.python.org/issue6988'
versions = ['Python 2.6', 'Python 2.5']

fenner · 2009-09-24T13:54:06Z

In Python 2.5, shlex handled unicode input fine:

Python 2.5.1 (r251:54863, Jun 15 2008, 18:24:51) 
[GCC 4.3.0 20080428 (Red Hat 4.3.0-8)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import shlex
>>> shlex.split( u'Hello, World!' )
['Hello,', 'World!']

In Python 2.6, shlex turns unicode input into UCS-4 output, thus utterly
confusing execl:

Python 2.6 (r26:66714, Jun  8 2009, 16:07:29)
[GCC 4.4.0 20090506 (Red Hat 4.4.0-4)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import shlex
>>> shlex.split( u'Hello, World' )
['H\x00\x00\x00e\x00\x00\x00l\x00\x00\x00l\x00\x00\x00o\x00\x00\x00,\x00\x00\x00',
'\x00\x00\x00W\x00\x00\x00o\x00\x00\x00r\x00\x00\x00l\x00\x00\x00d\x00\x00\x00']

Even weirder, the two return strings have different byte order (see
'H\x00\x00\x00' vs. '\x00\x00\x00W'!)

fenner · 2009-09-24T14:00:34Z

A colleague pointed out that the bad behavior was introduced in 2.5.2:

Python 2.5.2 (r252:60911, Sep 30 2008, 15:42:03) 
[GCC 4.3.2 20080917 (Red Hat 4.3.2-4)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import shlex
>>> shlex.split( u"Hello, World!" )
['H\x00\x00\x00e\x00\x00\x00l\x00\x00\x00l\x00\x00\x00o\x00\x00\x00,\x00\x00\x00',
'\x00\x00\x00W\x00\x00\x00o\x00\x00\x00r\x00\x00\x00l\x00\x00\x00d\x00\x00\x00!\x00\x00\x00']

amauryfa · 2009-09-24T16:12:15Z

I'll take the opposite point of view:
the bad behavior was introduced with 2.5.1 (bpo-1548891, r52302), and
reverted for 2.5.2 because "it broke backwards compatibility with
arbitrary read buffers" (bpo-1730114, r53831)

The difference is in cStringIO:

>>> from cStringIO import StringIO
>>> StringIO(u"Hello, World!").read()
'H\x00\x00\x00e\x00\x00\x00l\x00\x00\x00l\x00\x00\x00o\x00\x00\x00,\x00\x00\x00 \x00\x00\x00W\x00\x00\x00o\x00\x00\x00r\x00\x00\x00l\x00\x00\x00d\x00\x00\x00!\x00\x00\x00'

The byte order is not different in the two strings. Rather, u" " becomes
" \x00\x00\x00", shlex splits on the space byte, and the three trailing
zeros end up at the start of the second item.

fenner · 2009-09-24T17:21:18Z

So, just to be clear, your position is that the output of shlex.split(
u'Hello, World!' ) is *supposed* to be
['H\x00\x00\x00e\x00\x00\x00l\x00\x00\x00l\x00\x00\x00o\x00\x00\x00,\x00\x00\x00',
'\x00\x00\x00W\x00\x00\x00o\x00\x00\x00r\x00\x00\x00l\x00\x00\x00d\x00\x00\x00']?

pitrou · 2009-09-24T17:34:40Z

Hm, while the StringIO behaviour supposedly cannot be changed for
backwards-compatibility reasons, we can probably improve shlex behaviour
with unicode strings.

amauryfa · 2009-09-24T17:48:47Z

(Presented this way, "my opinion" becomes difficult to defend...
OTOH the docs say that the module does not support Unicode, so it's not
strictly a bug)
http://docs.python.org/library/shlex.html

Yes, shlex could be improved and encode unicode strings to ascii.

malemburg · 2009-09-24T18:17:58Z

Amaury Forgeot d'Arc wrote:

> Yes, shlex could be improved and encode unicode strings to ascii.

I'd suggest converting Unicode input to a string using an
optional encoding parameter which defaults to 'utf-8' (most
shells nowadays default to UTF-8).

This is only a compromise, though, albeit a practical one.
POSIX has the notion of a portable character set:

http://www.opengroup.org/onlinepubs/000095399/basedefs/xbd_chap06.html#tagtcjh_3

which is pretty much the same as ASCII. Any ASCII compatible
encoding is then allowed via variable length encodings (see
further down on that page).
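
A rough sketch of what such an encoding parameter could look like, written for Python 3. The helper name `split_encoded` and its signature are hypothetical and not part of shlex. Because Python 3's shlex only accepts `str`, the sketch emulates Python 2 byte strings by round-tripping the encoded bytes through latin-1, which only works for ASCII-compatible encodings such as UTF-8.

```python
import shlex


def split_encoded(text, encoding="utf-8"):
    """Hypothetical helper: split `text`, returning encoded byte-string tokens."""
    # Encode first, as the proposal suggests.
    data = text.encode(encoding)
    # Python 3's shlex needs str, so map each byte to one code point
    # (latin-1 is a lossless byte <-> code point mapping) and lex byte-wise.
    tokens = shlex.split(data.decode("latin-1"))
    # Map the tokens back to the original bytes.
    return [token.encode("latin-1") for token in tokens]


print(split_encoded("héllo 'wörld x'"))
```

With an ASCII-compatible encoding, quotes, backslashes, and whitespace keep their single-byte ASCII values, so the lexer never splits inside a multi-byte sequence.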

fenner · 2009-09-24T18:24:58Z

Sorry, I didn't read the web documentation, only the module
documentation, which doesn't mention Unicode. I'd agree that since it's
a documented behavior, this bug can become:

1. an RFE for shlex to handle Unicode
2. meanwhile, if there will be any releases before that happens, an RFE
   for the module documentation to mention the lack of Unicode support

terryjreedy · 2010-08-03T22:08:53Z

The discussion pretty much says this was a feature request, which is obsolete for 2.x. Not an issue for 3.x:
>>> import shlex
>>> shlex.split('Hello, World!' )
['Hello,', 'World!']
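
For illustration, a minimal Python 3 sketch with non-ASCII input. The command line and the file name `notes.txt` are made up for the example.

```python
import shlex

# On Python 3, str is Unicode and shlex operates on it directly,
# so non-ASCII characters pass through unchanged.
args = shlex.split('grep "naïve café" notes.txt')
print(args)
```

The resulting list holds ordinary `str` tokens, which is the form that `subprocess` and the `os.exec*` functions accept.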

merwok · 2011-10-22T23:12:12Z

$ ./python 
Python 2.7.2+ (2.7:27ae7d4e1983+, Oct 23 2011, 00:09:06) 
[GCC 4.6.1] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import shlex
>>> shlex.split(u'Hello, World!')
['Hello,', 'World!']

This was fixed indirectly by a StringIO fix in 27ae7d4e1983, for bpo-1548891.

fenner mannequin added the stdlib Python modules in the Lib dir label Sep 24, 2009

malemburg changed the title ~~shlex.split() converts unicode input to UCS-4 output with varying byte order~~ shlex.split() converts unicode input to UCS-4 output with varying byte order Sep 24, 2009

fenner mannequin changed the title ~~shlex.split() converts unicode input to UCS-4 output with varying byte order~~ shlex.split() converts unicode input to UCS-4 output Sep 24, 2009

terryjreedy closed this as completed Aug 3, 2010

terryjreedy added the type-feature A feature request or enhancement label Aug 3, 2010

ezio-melotti transferred this issue from another repository Apr 10, 2022
