shlex have problems with parsing unicode #45511
Comments
Feeding unicode to a shlex object created in POSIX compat mode causes a UnicodeDecodeError.
A quick paste to illustrate: the exception is raised only when unicode input is combined with POSIX mode:
dexen!muraena!~$ python
Python 2.5.1 (r251:54863, May 4 2007, 16:52:23)
[GCC 4.1.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from StringIO import StringIO
>>> import shlex
>>> lx = shlex.shlex( StringIO( unicode( "abc" ) ) )
>>> lx.get_token()
u'abc'
>>> lx = shlex.shlex( StringIO( unicode( "abc" ) ), None, True )
>>> lx.get_token()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python2.5/shlex.py", line 96, in get_token
raw = self.read_token()
File "/usr/lib/python2.5/shlex.py", line 150, in read_token
elif nextchar in self.wordchars:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xdf in position 63: ordinal not in range(128)
>>>
One remark on the previous message: BTW, that so-called POSIX mode would be more POSIX-ish if, instead of …
Patch to add Unicode support. Note: this patch recodes shlex.py from iso-8859-1 to utf-8, so it has …
Hello. I tried to patch my own shlex, and this doesn't seem to be … Are there any conversion steps missing?
mymachine$ python interp.py < exemplo.prg
Traceback (most recent call last):
File "interp.py", line 11, in <module>
tok = ss.get_token()
File "shlexutf.py", line 103, in get_token
raw = self.read_token()
File "shlexutf.py", line 139, in read_token
nextcategory = unicodedata.category(nextchar)
TypeError: category() argument 1 must be unicode, not str
OK, it worked after I found out I didn't know how to open unicode …
The patch needs tests before it can be applied. Additionally, I'm not …
shlex in 3.x works with Unicode strings. Is there still time to try to fix this bug for 2.7?
shlex may use unicode in py3k, but since the file still starts with a latin-1 coding cookie and the posix logic hasn't been changed, I suspect that it does not work correctly (i.e., does not correctly identify word characters, per msg55969). It's too late for 2.7, I think, but it seems there is still work to do in py3k.
I believe the e-mail thread that culminated in r32284, "Implemented posix-mode parsing support in shlex.py", was "shellwords" from April 2003. I scanned through the messages, but could not find a reference to the standard that was implemented.
Here is an illustration of the problem with a simple test case (the value of the posix flag doesn't make any difference):
Python 2.6.5 (r265:79063, Apr 16 2010, 13:09:56)
[GCC 4.4.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import shlex
>>> list(shlex.shlex('ab'))
['ab']
>>> list(shlex.shlex(u'ab', posix=True))
['a', '\x00', '\x00', '\x00', 'b', '\x00', '\x00', '\x00']
>>> list(shlex.shlex(u'ab', posix=False))
['a', '\x00', '\x00', '\x00', 'b', '\x00', '\x00', '\x00']
>>>
Fernando, is this a 2.7-only problem? In 3.2:
>>> list(shlex.shlex('ab'))
['ab']
and bytes are not supported:
>>> list(shlex.shlex(b'ab'))
Traceback (most recent call last):
...
AttributeError: 'bytes' object has no attribute 'read'
It is debatable whether either is a bug.
Yes, sorry that I failed to mention the example I gave applies only to 2.x, not to 3.x. |
On Tue, Jul 27, 2010 at 2:26 PM, Fernando Perez <report@bugs.python.org> wrote:
Why do you expect shlex to work with unicode in 2.x? The … What's your take on accepting bytes in 3.x?
On Tue, Jul 27, 2010 at 11:52, Alexander Belopolsky wrote:
Well, I didn't make the original report, just provided a short, …
Mmh... Not too sure. I'd think about it from the perspective of what … But take my opinion on 3.x with a big grain of salt; I have very …
Cheers, f
+1 on getting shlex to work better with Unicode. The core concepts of this module are general purpose and applicable to all kinds of text.
On Tue, Jul 27, 2010 at 3:04 PM, Raymond Hettinger wrote:
In 2.7.x? It more or less works in 3.x already.
Alexander: the "more or less" is on the "less" side when dealing with non-ASCII letters, I think. See my msg109292 and your own follow-ups.
David, what do you think about the attached patch? Would that be a change in the "more" direction?
I am adding MvL to nosy. Martin, I believe you are the ultimate authority on how to tokenize a unicode stream.
I don't like my patch anymore because it breaks code that manipulates the public wordchars attribute. Users may want to set it to their own alphabet or append additional characters to the default list. Maybe wordchars should always be the "non-posix" wordchars, and the posix-mode iswordchar test be c.isalnum() or c in self.wordchars?
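That suggested test could look like this as a standalone sketch (DEFAULT_WORDCHARS and iswordchar are illustrative names, not shlex API):

```python
# Sketch of the suggested posix-mode word-character test; the names
# DEFAULT_WORDCHARS and iswordchar are illustrative, not shlex API.
DEFAULT_WORDCHARS = set(
    "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789_"
)

def iswordchar(c, wordchars=DEFAULT_WORDCHARS, posix=True):
    # In posix mode, any Unicode alphanumeric counts as a word
    # character, in addition to whatever the user put in wordchars.
    if posix:
        return c.isalnum() or c in wordchars
    return c in wordchars

print(iswordchar('ß'))  # True: covered by isalnum()
print(iswordchar('-'))  # False unless explicitly added to wordchars
```

This keeps the public wordchars attribute meaningful: users can still extend it, while non-ASCII letters are handled by str.isalnum() instead of an exhaustive list.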
Adding bpo-10587 because we need to figure out the exact meaning of str.isspace() etc. first. It is possible that for proper operation shlex should consult unicodedata directly.
The key requirement to consider for POSIX-compatible mode is, well, POSIX compatibility, which is defined in http://www.opengroup.org/onlinepubs/009695399/utilities/xcu_chap02.html
Now, POSIX declares that what <blank> is depends on LC_CTYPE (character class blank). I'd argue that if the objective is to behave exactly like the shell, it really should be doing that (i.e. work in a locale-aware manner).
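For the non-locale-dependent parts, shlex.split in posix mode already applies the spec's basic quoting rules; a minimal illustration (note that <blank> remains hardwired to space and tab, so this is not locale-aware):

```python
import shlex

# POSIX-style field splitting: double quotes group words into a
# single field, per the Shell Command Language rules linked above.
print(shlex.split('cp "my file.txt" /tmp/'))  # ['cp', 'my file.txt', '/tmp/']
```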
I think I'm suffering from the same problem in some small programs that use shlex:
>>> text = "python and shlex"
>>> shlex.split(text)
['python', 'and', 'shlex']
>>> text = u"python and shlex"
>>> shlex.split(text)
['p\x00\x00\x00y\x00\x00\x00t\x00\x00\x00h\x00\x00\x00o\x00\x00\x00n\x00\x00\x00', '\x00\x00\x00a\x00\x00\x00n\x00\x00\x00d\x00\x00\x00', '\x00\x00\x00s\x00\x00\x00h\x00\x00\x00l\x00\x00\x00e\x00\x00\x00x\x00\x00\x00']
I'm currently using the following "basic" workaround (while assuming that my strings have only ascii chars):
>>> [x.replace("\0", "") for x in shlex.split(text)]
['python', 'and', 'shlex']
It would be very nice if shlex could work with unicode strings... Thanks.
This isn't going to get fixed in 2.x (shlex doesn't support unicode in 2.x, and doing so would be a new feature). In 3.x all strings are unicode, so the problem you are seeing doesn't exist. This issue is about the broader problem of what counts as a word character when more than ASCII is involved.
Right. Any program that needs to parse command lines containing filenames or other arguments with unicode characters will encounter this problem.
We all recognize that ASCII is very much limited and that the real way to work with strings is Unicode. However, here our hands are tied by our development process: shlex in 2.x does not support Unicode, adding that support would be a new feature, and 2.7 is closed to new features. If shlex was supposed to support Unicode, then this would be a bug that could be fixed in 2.7, but it's not. All we can do is improve the 2.7 doc to show how to work around that (splitting on bytes and then decoding each chunk, for example).
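The workaround mentioned here (split the UTF-8 bytes, then decode each chunk) can be illustrated with a sketch; split_unicode is a hypothetical helper, and a latin-1 round-trip stands in for a 2.x byte string so the example runs on Python 3:

```python
import shlex

def split_unicode(text):
    # Hypothetical helper illustrating the 2.x-era recipe: tokenize
    # the UTF-8 byte encoding, then decode each token back to
    # unicode. latin-1 maps every byte to exactly one code point, so
    # it emulates a 2.x byte string here; UTF-8 bytes for non-ASCII
    # characters are never whitespace, so token boundaries survive.
    raw = text.encode('utf-8').decode('latin-1')
    return [tok.encode('latin-1').decode('utf-8')
            for tok in shlex.split(raw)]

print(split_unicode('grüße an alle'))  # ['grüße', 'an', 'alle']
```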
Is unicode supported by shlex in 3.x already? It's curious that unicode support is considered a new feature rather than a bug. I understand wanting to allocate development resources carefully, though. If someone were to prepare a patch, would it even have a chance of being accepted in 2.7?
See http://bugs.python.org/issue1170#msg106424 and following.
It's not about allocating resources, it's about following process. The first part is that we don't add new features to stable releases; the second is that this is not considered a bug fix: the code pre-dates Unicode, was not updated to support it, and the docs say "The shlex module currently does not support Unicode input".
Proposed solution and patch to follow. Please let me know if I am posting it in the wrong place.
The main problem with shlex is that its interface is inadequate for unicode. Specifically, it is no longer feasible to provide a list of every possible character that the user could want to appear within a token. Suppose the user wants the ability to parse words in simplified Chinese. If I understand correctly, they would currently have to set "self.wordchars" to a string (or some other container) of 6000 (unicode) characters, and this enormous string would need to be searched each time a new character is read. This was a problem with shlex from the beginning, but it became more acute when support for unicode was added.
Generally, in some cases it is much more convenient to specify a short list of characters you -don't- want to appear in a word (word delimiters) than to list all the characters you do. An obvious (although perhaps not optimal) solution is to add an additional data member to shlex, consisting of the characters which terminate the reading of a token. (In other words, the set-inverse of wordchars.) In the attached example code, I call it "self.wordterminators". To remain backwards-compatible with shlex, self.wordterminators is empty by default, but if non-empty, it overrides self.wordchars.
I've been distributing a customized version of shlex with my own software which implements this modest change (shlex_wt). (See attachment.) It is otherwise identical to the version of shlex.py that ships with python 3.2.2. (It has been further modified only slightly to be compatible with both python 2.7 and python 3.) It's not beautiful code, but it seems to be a successful kluge for this particular issue. I don't know if it makes a worthy patch, but perhaps somebody out there finds it useful. To make it easy to spot the changes, each of the lines I changed ends in a comment "#WORDTERMINATORS". (There are only 15 of these lines.)
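The set-inverse idea can be sketched independently of shlex (tokenize and wordterminators are illustrative names; this is not the attached patch):

```python
def tokenize(text, wordterminators=" \t\n()"):
    # A word is any run of characters NOT in wordterminators, i.e.
    # the set-inverse of shlex's wordchars. Non-whitespace
    # terminators (here, parentheses) are returned as their own
    # tokens; whitespace terminators are discarded.
    tokens, word = [], []
    for ch in text:
        if ch in wordterminators:
            if word:
                tokens.append(''.join(word))
                word = []
            if not ch.isspace():
                tokens.append(ch)
        else:
            word.append(ch)
    if word:
        tokens.append(''.join(word))
    return tokens

print(tokenize('(add 北京 42)'))  # ['(', 'add', '北京', '42', ')']
```

Chinese words tokenize without a 6000-character wordchars string: only the handful of delimiter characters need to be listed.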
Not to get side-tracked, but on a related note: it would be nice if there were a python module which defined sets of unicode characters corresponding to different categories (similar to the categories listed here: http://www.fileformat.info/info/unicode/category/index.htm), so that one could write, e.g., self.wordterminators = unicode_math + unicode_punctuation (the + meaning set union). If somebody tried to specify all of them manually, this would be painful; there are hundreds of punctuation symbols in unicode, for example. (I suppose most of the time one does not need to be so thorough. This feature is not really necessary for getting shlex to work, but I think it would be an easy feature to add.)
That can be done programmatically using the unicodedata module. The regex module (which will hopefully be included in 3.3) is also able to match characters that belong to specific categories.
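For example, the punctuation set wished for above can be built with unicodedata (a sketch; the name punctuation is ours):

```python
import sys
import unicodedata

# Collect every code point whose general category starts with 'P'
# (Pc, Pd, Ps, Pe, Pi, Pf, Po): the Unicode punctuation classes.
punctuation = {
    chr(cp) for cp in range(sys.maxunicode + 1)
    if unicodedata.category(chr(cp)).startswith('P')
}

print(',' in punctuation, '。' in punctuation, 'a' in punctuation)  # True True False
```

Building the set once at import time keeps per-character lookups O(1), which addresses the "enormous string searched for every character" concern raised earlier.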
Andrew: Thanks for your contribution, but your patch cannot go into 2.7, as we don't add new features in stable versions (re-read the whole thread if you need more info). This report is still open because we need a doc patch to explain how to work around that.
Ezio: Thanks. (New to me, actually.) Is this what you mean? …
Eric: That's fine. I just posted here because this page currently gets the top hit when searching for "shlex unicode". If you think it's appropriate to repost my message for python version 3.4, let me know. The issue with shlex.wordchars that I raised is valid for any version of python. I'm not sure my solution is optimal. (I like the regex idea.)
Andrew: Ezio means http://docs.python.org/2.7/library/unicodedata
Ezio, I don't see any indication in this ticket that this bug was actually *fixed* in 3.x. Unicode doesn't cause immediate errors in 3.x, but it isn't recognized as wordchars, etc. Am I missing something?
I haven't looked at the shlex code (yet); my comment was just about the idea of adding constants with chars that belong to different Unicode categories.
$ ./python
Python 2.7.2+ (2.7:27ae7d4e1983+, Oct 23 2011, 00:09:06)
[GCC 4.6.1] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import shlex
>>> shlex.split(u'Hello, World!')
['Hello,', 'World!']
This bug was fixed indirectly by a StringIO fix in 27ae7d4e1983, for bpo-1548891. BTW, this report was a duplicate of bpo-6988, closed a year ago. Python 2.7.3 will finally support unicode in shlex, so the doc change requested in this report is outdated.
However, I still want to do something for this. I've noticed that shlex.split's argument can be a file-like object, and I wonder if passing a StringIO.StringIO(my_unicode_string) wouldn't work. If such a short recipe works, I'm all for including it in the 2.7 docs for users of older versions. If a longer recipe is needed, then ActiveState's Python Cookbook would be more appropriate, and I'll add a link to the docs. If it's very long and requires a PyPI project, then I'm willing to link to that.
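For what it's worth, the file-like recipe works unchanged in Python 3 with io.StringIO (a quick sketch, not proposed 2.7 doc text):

```python
import io
import shlex

# shlex accepts any file-like object with a read() method; wrapping
# a str in io.StringIO mirrors the 2.x StringIO.StringIO recipe.
lex = shlex.shlex(io.StringIO('Hello, Wörld!'), posix=True)
lex.whitespace_split = True
tokens = list(lex)
print(tokens)  # ['Hello,', 'Wörld!']
```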
The second message in this page reports that StringIO.StringIO works, but when I pass a unicode string with non-ASCII chars there's a method call that fails because of implicit unicode-to-str conversion:
Traceback (most recent call last):
File "/usr/lib/python2.7/shlex.py", line 150, in read_token
elif nextchar in self.wordchars:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xdf in position 63: ordinal not in range(128)
I'll try to create a shlex instance, replace self.wordchars with a decoded version and try again.
This issue is 12 years old and has 3 patches: it's far from being "newcomer friendly", so I am removing the "Easy" label.
I cannot reproduce this with the current 3.* version. Did anybody reproduce it with 3.5? Otherwise, I suggest closing this as a 2.*-only bug.
This issue has been fixed in Python 3 by using Unicode rather than bytes in shlex. Python 2 users: it's time to upgrade to Python 3 ;-)
The error messages may have gone away, but the underlying unicode limitations I mentioned remain: suppose you wanted to use shlex to build a parser for Chinese text. Would you have to set "wordchars" to a string containing every possible Chinese character?
I myself wrote a parser for a crude language where words can contain any character except for whitespace and parentheses, so I needed a way to specify the characters which cannot belong to a word. (That's how I solved the problem: I modified shlex.py and added a "wordterminators" member. If "wordterminators" was left blank, then "wordchars" was used instead. This was a trivial change to shlex.py, and it added a lot of functionality.)
I would like to suggest making this change (or something similar) to the official version of shlex.py. Would sending an email to "python-ideas@python.org" be a good place to make this proposal?
Yes, python-ideas is a good place to start discussing such an idea. This issue is closed; if you discuss it here, you will get a limited audience.
After posting that, I noticed that the second example I listed in my previous post (a language where words contain any non-whitespace, non-parenthesis character) can now be implemented in the current version of shlex.py by setting "whitespace_split" and "punctuation_chars". (Sorry, it's been a while since I looked at shlex.py, and it has gained some useful new features.)
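With current Python 3, that second language can indeed be expressed directly; a sketch using the punctuation_chars parameter (added in 3.6) together with whitespace_split (compatible with punctuation_chars since 3.8):

```python
import shlex

# Words are any run of non-whitespace, non-parenthesis characters;
# parentheses come back as standalone tokens, with no need to
# enumerate every allowed word character.
lex = shlex.shlex('(add 北京 42)', posix=True, punctuation_chars='()')
lex.whitespace_split = True
tokens = list(lex)
print(tokens)  # ['(', 'add', '北京', '42', ')']
```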
Alright. I'll think about it a little more and post my suggestion there, perhaps. Thanks, Victor.