Issue6988
Created on 2009-09-24 13:54 by fenner, last changed 2009-09-25 05:41 by ezio.melotti.
|
msg93074 |
Author: Bill Fenner (fenner) |
Date: 2009-09-24 13:54 |
|
In Python 2.5, shlex handled unicode input fine:
Python 2.5.1 (r251:54863, Jun 15 2008, 18:24:51)
[GCC 4.3.0 20080428 (Red Hat 4.3.0-8)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import shlex
>>> shlex.split( u'Hello, World!' )
['Hello,', 'World!']
In Python 2.6, shlex turns unicode input into UCS-4 output, which utterly
confuses execl:
Python 2.6 (r26:66714, Jun 8 2009, 16:07:29)
[GCC 4.4.0 20090506 (Red Hat 4.4.0-4)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import shlex
>>> shlex.split( u'Hello, World' )
['H\x00\x00\x00e\x00\x00\x00l\x00\x00\x00l\x00\x00\x00o\x00\x00\x00,\x00\x00\x00',
'\x00\x00\x00W\x00\x00\x00o\x00\x00\x00r\x00\x00\x00l\x00\x00\x00d\x00\x00\x00']
Even weirder, the two return strings have different byte order (see
'H\x00\x00\x00' vs. '\x00\x00\x00W'!)
|
|
msg93075 |
Author: Bill Fenner (fenner) |
Date: 2009-09-24 14:00 |
|
A colleague pointed out that the bad behavior was introduced in 2.5.2:
Python 2.5.2 (r252:60911, Sep 30 2008, 15:42:03)
[GCC 4.3.2 20080917 (Red Hat 4.3.2-4)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import shlex
>>> shlex.split( u"Hello, World!" )
['H\x00\x00\x00e\x00\x00\x00l\x00\x00\x00l\x00\x00\x00o\x00\x00\x00,\x00\x00\x00',
'\x00\x00\x00W\x00\x00\x00o\x00\x00\x00r\x00\x00\x00l\x00\x00\x00d\x00\x00\x00!\x00\x00\x00']
|
|
msg93079 |
Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) |
Date: 2009-09-24 16:12 |
|
I'll take the opposite point of view:
the bad behavior was introduced with 2.5.1 (issue1548891, r52302), and
reverted for 2.5.2 because "it broke backwards compatibility with
arbitrary read buffers" (issue1730114, r53831)
The difference is in cStringIO:
>>> from cStringIO import StringIO
>>> StringIO(u"Hello, World!").read()
'H\x00\x00\x00e\x00\x00\x00l\x00\x00\x00l\x00\x00\x00o\x00\x00\x00,\x00\x00\x00 \x00\x00\x00W\x00\x00\x00o\x00\x00\x00r\x00\x00\x00l\x00\x00\x00d\x00\x00\x00!\x00\x00\x00'
The byte order is the same in both strings: u" " becomes
" \x00\x00\x00", shlex splits on the space byte, and the three trailing
zero bytes end up at the start of the second item.
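The effect described above can be reproduced without a Python 2 interpreter. The following is a minimal sketch in Python 3, assuming a little-endian UCS-4 build, so that the internal buffer of the unicode object matches the `utf-32-le` encoding. Splitting on the single space byte stands in for the byte-by-byte reading that shlex performs; it is an illustration, not the shlex code path itself.

```python
# Simulate the raw buffer that cStringIO exposes for a unicode object on a
# little-endian UCS-4 build (assumption: 4 bytes per character, low byte first).
raw = u"Hello, World!".encode("utf-32-le")

# Byte-oriented splitting treats only the 0x20 byte as the separator, so the
# three zero bytes that belonged to the space stay at the start of the next token.
tokens = raw.split(b" ")

first, second = tokens
print(repr(first))
print(repr(second))

# Dropping the three stray leading bytes recovers a well-formed encoding of "World!".
assert second[3:].decode("utf-32-le") == u"World!"
```

The second token only looks byte-swapped: it is the same little-endian data, shifted by the three bytes left over from the space character.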
|
|
msg93080 |
Author: Bill Fenner (fenner) |
Date: 2009-09-24 17:21 |
|
So, just to be clear, your position is that the output of shlex.split(
u'Hello, World!' ) is *supposed* to be
['H\x00\x00\x00e\x00\x00\x00l\x00\x00\x00l\x00\x00\x00o\x00\x00\x00,\x00\x00\x00',
'\x00\x00\x00W\x00\x00\x00o\x00\x00\x00r\x00\x00\x00l\x00\x00\x00d\x00\x00\x00!\x00\x00\x00']?
|
|
msg93082 |
Author: Antoine Pitrou (pitrou) |
Date: 2009-09-24 17:34 |
|
Hm, while the StringIO behaviour supposedly cannot be changed for
backwards-compatibility reasons, we can probably improve shlex behaviour
with unicode strings.
|
|
msg93083 |
Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) |
Date: 2009-09-24 17:48 |
|
(Presented this way, "my opinion" becomes difficult to defend...
OTOH the docs say that the module does not support Unicode, so it's not
strictly a bug.)
http://docs.python.org/library/shlex.html
Yes, shlex could be improved and encode unicode strings to ascii.
|
|
msg93084 |
Author: Marc-Andre Lemburg (lemburg) |
Date: 2009-09-24 18:17 |
|
I'd suggest converting Unicode input to a string using an
optional encoding parameter which defaults to 'utf-8' (most
shells nowadays default to UTF-8).
This is only a compromise, though, albeit a practical one.
POSIX has the notion of a portable character set:
http://www.opengroup.org/onlinepubs/000095399/basedefs/xbd_chap06.html#tagtcjh_3
which is pretty much the same as ASCII. Any ASCII compatible
encoding is then allowed via variable length encodings (see
further down on that page).
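The suggestion above, encoding Unicode input before splitting, could look roughly like the sketch below. The helper name `split_text` and its `encoding` parameter are hypothetical and are not part of shlex; this is one possible caller-side workaround under those assumptions, not the change adopted for this issue.

```python
import shlex
import sys

PY2 = sys.version_info[0] == 2


def split_text(text, encoding="utf-8"):
    """Hypothetical helper: split `text` with shlex, encoding unicode first on Python 2."""
    # On Python 2, shlex reads its input through cStringIO, which exposes the
    # raw buffer of a unicode object, so encode to a byte string beforehand.
    # The name `unicode` is only evaluated on Python 2 (short-circuit `and`).
    if PY2 and isinstance(text, unicode):  # noqa: F821
        text = text.encode(encoding)
    # On Python 3, shlex.split accepts str directly and no encoding is needed.
    return shlex.split(text)


print(split_text(u"Hello, World!"))
```

On Python 2 the tokens come back as byte strings in the chosen encoding; a caller that needs unicode objects would have to decode each token again.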
|
|
msg93085 |
Author: Bill Fenner (fenner) |
Date: 2009-09-24 18:24 |
|
Sorry, I didn't read the web documentation, only the module
documentation, which doesn't mention Unicode. I'd agree that since it's
a documented behavior, this bug can become:
- an RFE for shlex to handle Unicode
- meanwhile, if there will be any releases before that happens, an RFE
for the module documentation to mention the lack of Unicode support
|
|
| Date | User | Action | Args |
| 2009-09-25 05:41:20 | ezio.melotti | set | priority: normal; nosy: + ezio.melotti |
| 2009-09-24 18:24:57 | fenner | set | messages: + msg93085; title: shlex.split() converts unicode input to UCS-4 output with varying byte order -> shlex.split() converts unicode input to UCS-4 output |
| 2009-09-24 18:17:58 | lemburg | set | nosy: + lemburg; title: shlex.split() converts unicode input to UCS-4 output with varying byte order -> shlex.split() converts unicode input to UCS-4 output with varying byte order; messages: + msg93084 |
| 2009-09-24 17:48:47 | amaury.forgeotdarc | set | resolution: wont fix ->; messages: + msg93083 |
| 2009-09-24 17:34:39 | pitrou | set | nosy: + pitrou; messages: + msg93082 |
| 2009-09-24 17:21:18 | fenner | set | status: pending -> open; messages: + msg93080 |
| 2009-09-24 16:12:15 | amaury.forgeotdarc | set | status: open -> pending; nosy: + amaury.forgeotdarc; messages: + msg93079; resolution: wont fix |
| 2009-09-24 14:00:34 | fenner | set | messages: + msg93075; versions: + Python 2.5 |
| 2009-09-24 13:54:07 | fenner | create | |