classification
Title: shlex (or perhaps cStringIO) and unicode strings
Type:
Stage: resolved
Components: Library (Lib)
Versions: Python 2.7
process
Status: closed
Resolution: fixed
Dependencies:
Superseder:
Assigned To:
Nosy List: drylock, georg.brandl, pitrou, python-dev
Priority: normal
Keywords: patch

Created on 2006-08-29 21:16 by drylock, last changed 2011-10-23 02:38 by python-dev. This issue is now closed.

Files
File name Uploaded Description
cio.patch pitrou, 2011-10-21 20:35
Messages (9)
msg29709 - Author: Erwin S. Andreasen (drylock) Date: 2006-08-29 21:16
Python 2.5c1 (r25c1:51305, Aug 19 2006, 18:23:29)
[GCC 4.1.2 20060814 (prerelease) (Debian 4.1.1-11)] on linux2

(Also seen in 2.4)

shlex.split does not like unicode strings:

>>> shlex.split(u"foo")
['f\x00\x00\x00o\x00\x00\x00o\x00\x00\x00']

The shlex code IMO suggests that it should accept
unicode (as it checks whether the argument is an
instance of basestring).

Digging slightly into this, it seems to come down to a
difference between StringIO and cStringIO. While
cStringIO claims it accepts unicode as long as it
encodes to ASCII, it gives invalid results:

>>> sys.getdefaultencoding()
'ascii'


>>> cStringIO.StringIO(u'foo').getvalue()
'f\x00\x00\x00o\x00\x00\x00o\x00\x00\x00'

Perhaps cStringIO should .encode to ASCII
before consuming the input, as I can't imagine anyone
cares about the above result (which I guess is the
raw UCS-2 or UCS-4 character data).
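
The guess about UCS-2/UCS-4 can be checked by encoding the same text with fixed-width little-endian codecs. The sketch below is Python 3, not the Python 2 interpreter from the report. `raw_buffer` is a hypothetical helper, and it assumes a little-endian machine, where a wide (UCS-4) build would store four bytes per character and a narrow (UCS-2) build two.

```python
def raw_buffer(text, unit_bytes):
    """Return `text` laid out with `unit_bytes` bytes per code unit.

    Hypothetical helper for illustration; assumes little-endian byte order.
    """
    codec = {2: "utf-16-le", 4: "utf-32-le"}[unit_bytes]
    return text.encode(codec)


# Four bytes per character, as on a wide (UCS-4) build.
wide = raw_buffer("foo", 4)
# Two bytes per character, as on a narrow (UCS-2) build.
narrow = raw_buffer("foo", 2)

print(repr(wide))
print(repr(narrow))
```

The four-byte form matches the getvalue() output above, which is consistent with cStringIO consuming the unicode object's internal buffer rather than an encoded string.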

msg29710 - Author: Georg Brandl (georg.brandl) * (Python committer) Date: 2006-10-12 09:47

Thanks for your report, this is now fixed in rev. 52301,
52302 (2.5).
msg146126 - Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2011-10-21 20:12
Still happens on latest 2.7:

>>> from cStringIO import StringIO
>>> sio = StringIO(u"abc")
>>> sio.getvalue()
'a\x00b\x00c\x00'
msg146128 - Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2011-10-21 20:22
And unsurprisingly so, since the fix was reverted in r56830 by Georg.
msg146132 - Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2011-10-21 20:35
Georg, is this patch OK with you?
msg146162 - Author: Georg Brandl (georg.brandl) * (Python committer) Date: 2011-10-22 10:17
If you think it's fine to change this behavior, then yes :)
msg146184 - Author: Roundup Robot (python-dev) (Python triager) Date: 2011-10-22 19:31
New changeset 27ae7d4e1983 by Antoine Pitrou in branch '2.7':
Issue #1548891: The cStringIO.StringIO() constructor now encodes unicode
http://hg.python.org/cpython/rev/27ae7d4e1983
msg146185 - Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2011-10-22 19:32
> If you think it's fine to change this behavior, then yes :)

Well, the "documented" behaviour makes no sense.
Either it should raise TypeError or convert. Since write() converts, it's logical for the constructor to do so as well.
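
The "convert" option described here can be sketched as follows. This is a Python 3 analogy built on io.BytesIO, not the actual C change in cio.patch. `bytes_buffer` is a hypothetical name, and the choice of ASCII is an assumption based on the default encoding shown earlier in the thread.

```python
import io


def bytes_buffer(initial):
    """Build an in-memory byte buffer, converting text to ASCII first.

    Hypothetical helper: text is encoded explicitly instead of being read
    as raw memory, and anything that is neither text nor bytes is rejected.
    """
    if isinstance(initial, str):
        # Raises UnicodeEncodeError if the text is not pure ASCII.
        initial = initial.encode("ascii")
    elif not isinstance(initial, (bytes, bytearray)):
        raise TypeError("expected text or bytes, got %s" % type(initial).__name__)
    return io.BytesIO(initial)


print(repr(bytes_buffer("abc").getvalue()))
```

With this approach, ASCII-only text round-trips as plain bytes, while non-ASCII text fails loudly instead of producing padded output.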
msg146217 - Author: Roundup Robot (python-dev) (Python triager) Date: 2011-10-23 02:38
New changeset 0b39f2486314 by Éric Araujo in branch '2.7':
Note that the #1548891 fix indirectly fixes shlex (#6988, #1170)
http://hg.python.org/cpython/rev/0b39f2486314
History
Date User Action Args
2012-04-27 11:47:08 pitrou link issue2387 superseder
2011-10-23 02:38:02 python-dev set messages: + msg146217
2011-10-22 19:32:33 pitrou set status: open -> closed; resolution: fixed; messages: + msg146185; stage: needs patch -> resolved
2011-10-22 19:31:22 python-dev set nosy: + python-dev; messages: + msg146184
2011-10-22 10:17:31 georg.brandl set messages: + msg146162
2011-10-21 20:35:18 pitrou set files: + cio.patch; assignee: georg.brandl -> ; messages: + msg146132; keywords: + patch
2011-10-21 20:22:05 pitrou set messages: + msg146128
2011-10-21 20:12:19 pitrou set status: closed -> open; versions: + Python 2.7, - Python 2.5; nosy: + pitrou; messages: + msg146126; resolution: fixed -> (no value); stage: needs patch
2006-08-29 21:16:22 drylock create