shlex.split() converts unicode input to UCS-4 output #51237
Comments
In Python 2.5, shlex handled unicode input fine:
Python 2.5.1 (r251:54863, Jun 15 2008, 18:24:51)
[GCC 4.3.0 20080428 (Red Hat 4.3.0-8)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import shlex
>>> shlex.split( u'Hello, World!' )
['Hello,', 'World!']
In Python 2.6, shlex turns unicode input into UCS-4 output, thus utterly
Python 2.6 (r26:66714, Jun 8 2009, 16:07:29)
[GCC 4.4.0 20090506 (Red Hat 4.4.0-4)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import shlex
>>> shlex.split( u'Hello, World' )
['H\x00\x00\x00e\x00\x00\x00l\x00\x00\x00l\x00\x00\x00o\x00\x00\x00,\x00\x00\x00',
'\x00\x00\x00W\x00\x00\x00o\x00\x00\x00r\x00\x00\x00l\x00\x00\x00d\x00\x00\x00']
Even weirder, the two return strings have different byte order (see |
A colleague pointed out that the bad behavior was introduced in 2.5.2:
Python 2.5.2 (r252:60911, Sep 30 2008, 15:42:03)
[GCC 4.3.2 20080917 (Red Hat 4.3.2-4)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import shlex
>>> shlex.split( u"Hello, World!" )
['H\x00\x00\x00e\x00\x00\x00l\x00\x00\x00l\x00\x00\x00o\x00\x00\x00,\x00\x00\x00',
'\x00\x00\x00W\x00\x00\x00o\x00\x00\x00r\x00\x00\x00l\x00\x00\x00d\x00\x00\x00!\x00\x00\x00'] |
I'll take the opposite point of view: The difference is in cStringIO:
>>> from cStringIO import StringIO
>>> StringIO(u"Hello, World!").read()
'H\x00\x00\x00e\x00\x00\x00l\x00\x00\x00l\x00\x00\x00o\x00\x00\x00,\x00\x00\x00 \x00\x00\x00W\x00\x00\x00o\x00\x00\x00r\x00\x00\x00l\x00\x00\x00d\x00\x00\x00!\x00\x00\x00'
The byte order is not different in the two strings: but u" " becomes |
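A minimal sketch of why the second token looks byte-swapped, assuming a little-endian UCS-4 build (which the dumps in this thread suggest). It is written for Python 3, where `str.encode("utf-32-le")` stands in for the raw buffer that cStringIO read out of the unicode object, and a plain split on the space byte stands in for shlex's tokenizer. Both stand-ins are assumptions made for illustration, not the code path shlex actually takes.

```python
# Emulate the raw UCS-4 (little-endian) buffer of u"Hello, World!".
text = "Hello, World!"
raw = text.encode("utf-32-le")

# Approximate the tokenizer by splitting on the single space byte.
# The space character is b" \x00\x00\x00": the split consumes only b" ",
# so its three NUL bytes are left at the front of the second token.
parts = raw.split(b" ")

for part in parts:
    print(repr(part))
```

Under these assumptions both tokens keep the same byte order. The second one simply starts with the three padding bytes that belonged to the space character, which makes it look big-endian at first glance.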
so, just to be clear, your position is that the output of shlex.split( |
Hm, while the StringIO behaviour supposedly cannot be changed for |
(Presented this way, "my opinion" becomes difficult to stand... Yes, shlex could be improved and encode unicode strings to ascii. |
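For illustration, the "encode unicode strings to ascii" idea can be sketched as a small caller-side wrapper. The helper name `split_text` and its `encoding` parameter are hypothetical and not part of shlex. The encoding step only runs on Python 2, where the problem existed. On Python 3 the input is passed through unchanged.

```python
import shlex
import sys


def split_text(s, encoding="ascii"):
    """Hypothetical helper: encode unicode input before handing it to shlex.

    On Python 2, a unicode object is encoded to a byte string first, so
    shlex never sees the raw UCS-2/UCS-4 buffer. Non-ASCII input raises
    UnicodeEncodeError with the default encoding. On Python 3, str is
    already handled correctly and is passed through as-is.
    """
    if sys.version_info[0] == 2 and isinstance(s, unicode):  # noqa: F821
        s = s.encode(encoding)
    return shlex.split(s)


print(split_text(u"Hello, World!"))
```

This is a workaround on the caller's side, not a description of how the standard library resolved the issue.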
Amaury Forgeot d'Arc wrote:
I'd suggest to convert Unicode input to a string using an This is only a compromise, though, albeit a practical one. http://www.opengroup.org/onlinepubs/000095399/basedefs/xbd_chap06.html#tagtcjh_3 which is pretty much the same as ASCII. Any ASCII compatible |
Sorry, I didn't read the web documentation, only the module
|
The discussion pretty much says this was a feature request, which is obsolete for 2.x. Not an issue for 3.x:
>>> import shlex
>>> shlex.split('Hello, World!' )
['Hello,', 'World!'] |
$ ./python
Python 2.7.2+ (2.7:27ae7d4e1983+, Oct 23 2011, 00:09:06)
[GCC 4.6.1] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import shlex
>>> shlex.split(u'Hello, World!')
['Hello,', 'World!']
This was fixed indirectly by a StringIO fix in 27ae7d4e1983, for bpo-1548891. |