shlex.split() converts unicode input to UCS-4 output #51237

Closed
fenner mannequin opened this issue Sep 24, 2009 · 10 comments
Labels
stdlib Python modules in the Lib dir type-feature A feature request or enhancement

Comments

@fenner
Mannequin

fenner mannequin commented Sep 24, 2009

BPO 6988
Nosy @malemburg, @terryjreedy, @amauryfa, @pitrou, @ezio-melotti, @merwok

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


GitHub fields:

assignee = None
closed_at = <Date 2010-08-03.22:08:52.810>
created_at = <Date 2009-09-24.13:54:07.279>
labels = ['type-feature', 'library']
title = 'shlex.split() converts unicode input to UCS-4 output'
updated_at = <Date 2011-10-22.23:12:11.806>
user = 'https://bugs.python.org/fenner'

bugs.python.org fields:

activity = <Date 2011-10-22.23:12:11.806>
actor = 'eric.araujo'
assignee = 'none'
closed = True
closed_date = <Date 2010-08-03.22:08:52.810>
closer = 'terry.reedy'
components = ['Library (Lib)']
creation = <Date 2009-09-24.13:54:07.279>
creator = 'fenner'
dependencies = []
files = []
hgrepos = []
issue_num = 6988
keywords = []
message_count = 10.0
messages = ['93074', '93075', '93079', '93080', '93082', '93083', '93084', '93085', '112705', '146200']
nosy_count = 7.0
nosy_names = ['lemburg', 'terry.reedy', 'amaury.forgeotdarc', 'pitrou', 'fenner', 'ezio.melotti', 'eric.araujo']
pr_nums = []
priority = 'normal'
resolution = 'out of date'
stage = None
status = 'closed'
superseder = None
type = 'enhancement'
url = 'https://bugs.python.org/issue6988'
versions = ['Python 2.6', 'Python 2.5']

@fenner
Mannequin Author

fenner mannequin commented Sep 24, 2009

In Python 2.5, shlex handled Unicode input fine:

Python 2.5.1 (r251:54863, Jun 15 2008, 18:24:51) 
[GCC 4.3.0 20080428 (Red Hat 4.3.0-8)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import shlex
>>> shlex.split( u'Hello, World!' )
['Hello,', 'World!']

In Python 2.6, shlex turns Unicode input into UCS-4 output, thus utterly
confusing execl:

Python 2.6 (r26:66714, Jun  8 2009, 16:07:29)
[GCC 4.4.0 20090506 (Red Hat 4.4.0-4)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import shlex
>>> shlex.split( u'Hello, World' )
['H\x00\x00\x00e\x00\x00\x00l\x00\x00\x00l\x00\x00\x00o\x00\x00\x00,\x00\x00\x00',
'\x00\x00\x00W\x00\x00\x00o\x00\x00\x00r\x00\x00\x00l\x00\x00\x00d\x00\x00\x00']

Even weirder, the two return strings have different byte order (see
'H\x00\x00\x00' vs. '\x00\x00\x00W'!)

@fenner fenner mannequin added the stdlib Python modules in the Lib dir label Sep 24, 2009
@fenner
Mannequin Author

fenner mannequin commented Sep 24, 2009

A colleague pointed out that the bad behavior was introduced in 2.5.2:

Python 2.5.2 (r252:60911, Sep 30 2008, 15:42:03) 
[GCC 4.3.2 20080917 (Red Hat 4.3.2-4)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import shlex
>>> shlex.split( u"Hello, World!" )
['H\x00\x00\x00e\x00\x00\x00l\x00\x00\x00l\x00\x00\x00o\x00\x00\x00,\x00\x00\x00',
'\x00\x00\x00W\x00\x00\x00o\x00\x00\x00r\x00\x00\x00l\x00\x00\x00d\x00\x00\x00!\x00\x00\x00']

@amauryfa
Member

I'll take the opposite point of view:
the bad behavior was introduced with 2.5.1 (bpo-1548891, r52302), and
reverted for 2.5.2 because "it broke backwards compatibility with
arbitrary read buffers" (bpo-1730114, r53831)

The difference is in cStringIO:

>>> from cStringIO import StringIO
>>> StringIO(u"Hello, World!").read()
'H\x00\x00\x00e\x00\x00\x00l\x00\x00\x00l\x00\x00\x00o\x00\x00\x00,\x00\x00\x00 \x00\x00\x00W\x00\x00\x00o\x00\x00\x00r\x00\x00\x00l\x00\x00\x00d\x00\x00\x00!\x00\x00\x00'

The byte order is not different in the two strings: but u" " becomes
" \x00\x00\x00" and the three zeros are copied into the second item.
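For readers without a Python 2 interpreter at hand, the effect can be approximated in Python 3. This is only a sketch under two assumptions that are not spelled out in the thread: that the affected build stores unicode as little-endian UCS-4 (emulated here with an explicit `utf-32-le` encode), and that splitting on the single space byte is a fair stand-in for what shlex does to this particular input.

```python
# Sketch only: emulate the raw buffer a UCS-4, little-endian Python 2 build
# exposed through cStringIO by encoding the text as UTF-32-LE explicitly.
text = "Hello, World!"
raw = text.encode("utf-32-le")  # every character becomes 4 bytes

# The space character is b" \x00\x00\x00". Splitting on the space byte alone
# leaves its three trailing zero bytes at the start of the second token.
tokens = raw.split(b" ")

for token in tokens:
    print(token)
```

Both tokens use the same byte order; the second one merely starts with the three zero bytes that belonged to the space.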

@fenner
Mannequin Author

fenner mannequin commented Sep 24, 2009

So, just to be clear, your position is that the output of shlex.split(
u'Hello, World!' ) is *supposed* to be
['H\x00\x00\x00e\x00\x00\x00l\x00\x00\x00l\x00\x00\x00o\x00\x00\x00,\x00\x00\x00',
'\x00\x00\x00W\x00\x00\x00o\x00\x00\x00r\x00\x00\x00l\x00\x00\x00d\x00\x00\x00']?

@pitrou
Member

pitrou commented Sep 24, 2009

Hm, while the StringIO behaviour supposedly cannot be changed for
backwards-compatibility reasons, we can probably improve shlex behaviour
with unicode strings.

@amauryfa
Member

(Presented this way, "my opinion" becomes difficult to stand...
OTOH the docs say that the module does not support Unicode, so it's not
strictly a bug)
http://docs.python.org/library/shlex.html

Yes, shlex could be improved and encode unicode strings to ascii.

@malemburg
Member

Amaury Forgeot d'Arc wrote:

> Amaury Forgeot d'Arc <amauryfa@gmail.com> added the comment:
>
> (Presented this way, "my opinion" becomes difficult to stand...
> OTOH the docs say that the module does not support Unicode, so it's not
> strictly a bug)
> http://docs.python.org/library/shlex.html
>
> Yes, shlex could be improved and encode unicode strings to ascii.

I'd suggest converting Unicode input to a string using an
optional encoding parameter which defaults to 'utf-8' (most
shells nowadays default to UTF-8).

This is only a compromise, though, albeit a practical one.
POSIX has the notion of a portable character set:

http://www.opengroup.org/onlinepubs/000095399/basedefs/xbd_chap06.html#tagtcjh_3

which is pretty much the same as ASCII. Any ASCII compatible
encoding is then allowed via variable length encodings (see
further down on that page).
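The idea of an encoding parameter defaulting to 'utf-8' could be sketched roughly as follows. This is a Python 3 illustration, not anything shlex provides: `split_encoded` and its `encoding` argument are hypothetical names, and the choice to return byte strings (the form an exec-style call ultimately needs) is an assumption made for the example.

```python
import shlex


def split_encoded(command, encoding="utf-8"):
    """Hypothetical helper: split a command line and return byte-string tokens.

    Accepts either text or bytes; bytes are decoded with `encoding` first.
    """
    if isinstance(command, bytes):
        command = command.decode(encoding)
    # shlex.split works on text; each token is then encoded on its own,
    # so no stray padding bytes can leak from one token into the next.
    return [token.encode(encoding) for token in shlex.split(command)]


print(split_encoded("Hello, World!"))
print(split_encoded("caf\u00e9 'a b'"))
```

Encoding after tokenizing, rather than tokenizing an already encoded buffer, is what keeps the tokens clean here. An ASCII-compatible encoding such as UTF-8 also leaves the portable character set unchanged.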

@malemburg malemburg changed the title shlex.split() converts unicode input to UCS-4 output with varying byte order shlex.split() converts unicode input to UCS-4 output with varying byte order Sep 24, 2009
@fenner
Mannequin Author

fenner mannequin commented Sep 24, 2009

Sorry, I didn't read the web documentation, only the module
documentation, which doesn't mention Unicode. I'd agree that since it's
a documented behavior, this bug can become:

  • an RFE for shlex to handle Unicode
  • meanwhile, if there will be any releases before that happens, an RFE
    for the module documentation to mention the lack of Unicode support

@fenner fenner mannequin changed the title from "shlex.split() converts unicode input to UCS-4 output with varying byte order" to "shlex.split() converts unicode input to UCS-4 output" Sep 24, 2009
@terryjreedy
Member

The discussion pretty much says this was a feature request, which is obsolete for 2.x. Not an issue for 3.x:
>>> import shlex
>>> shlex.split('Hello, World!' )
['Hello,', 'World!']

@terryjreedy terryjreedy added the type-feature A feature request or enhancement label Aug 3, 2010
@merwok
Member

merwok commented Oct 22, 2011

$ ./python 
Python 2.7.2+ (2.7:27ae7d4e1983+, Oct 23 2011, 00:09:06) 
[GCC 4.6.1] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import shlex
>>> shlex.split(u'Hello, World!')
['Hello,', 'World!']

This was fixed indirectly by a StringIO fix in 27ae7d4e1983, for bpo-1548891.

@ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022