shlex have problems with parsing unicode #45511
Comments
Feeding unicode to a shlex object created in POSIX compat mode causes a UnicodeDecodeError.
A quick paste to illustrate: the exception is raised only when unicode input is combined with POSIX mode:
dexen!muraena!~$ python
Python 2.5.1 (r251:54863, May 4 2007, 16:52:23)
[GCC 4.1.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from StringIO import StringIO
>>> import shlex
>>> lx = shlex.shlex( StringIO( unicode( "abc" ) ) )
>>> lx.get_token()
u'abc'
>>> lx = shlex.shlex( StringIO( unicode( "abc" ) ), None, True )
>>> lx.get_token()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python2.5/shlex.py", line 96, in get_token
raw = self.read_token()
File "/usr/lib/python2.5/shlex.py", line 150, in read_token
elif nextchar in self.wordchars:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xdf in position 63: ordinal not in range(128)
>>>
One remark on the previous message: BTW, that so-called POSIX mode would be more POSIX-ish if, instead of …
Patch to add Unicode support. Note: this patch recodes shlex.py from iso-8859-1 to utf-8, so it has …
Hello. I tried to patch my own shlex, and this doesn't seem to be … Are there any conversion steps missing?
mymachine$ python interp.py < exemplo.prg
Traceback (most recent call last):
File "interp.py", line 11, in <module>
tok = ss.get_token()
File "shlexutf.py", line 103, in get_token
raw = self.read_token()
File "shlexutf.py", line 139, in read_token
nextcategory = unicodedata.category(nextchar)
TypeError: category() argument 1 must be unicode, not str
OK, it worked after I found out I didn't know how to open unicode …
The patch needs tests before it can be applied. Additionally, I'm not …
shlex in 3.x works with Unicode strings. Is there still time to try to fix this bug for 2.7?
shlex may use unicode in py3k, but since the file still starts with a latin-1 coding cookie and the posix logic hasn't been changed, I suspect that it does not work correctly (i.e., does not correctly identify word characters, per msg55969). It's too late for 2.7, I think, but it seems there is still work to do in py3k.
I believe the e-mail thread that culminated in r32284, "Implemented posix-mode parsing support in shlex.py", was "shellwords" from April 2003. I scanned through the messages, but could not find a reference to the standard that was implemented.
Here is an illustration of the problem with a simple test case (the value of the posix flag doesn't make any difference):
Python 2.6.5 (r265:79063, Apr 16 2010, 13:09:56)
[GCC 4.4.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import shlex
>>> list(shlex.shlex('ab'))
['ab']
>>> list(shlex.shlex(u'ab', posix=True))
['a', '\x00', '\x00', '\x00', 'b', '\x00', '\x00', '\x00']
>>> list(shlex.shlex(u'ab', posix=False))
['a', '\x00', '\x00', '\x00', 'b', '\x00', '\x00', '\x00']
>>>
Fernando, is this a 2.7-only problem? In 3.2:
>>> list(shlex.shlex('ab'))
['ab']
and bytes are not supported:
>>> list(shlex.shlex(b'ab'))
Traceback (most recent call last):
...
AttributeError: 'bytes' object has no attribute 'read'
It is debatable whether either is a bug.
Yes, sorry that I failed to mention the example I gave applies only to 2.x, not to 3.x. |
On Tue, Jul 27, 2010 at 2:26 PM, Fernando Perez <report@bugs.python.org> wrote:
Why do you expect shlex to work with unicode in 2.x? The … What's your take on accepting bytes in 3.x?
On Tue, Jul 27, 2010 at 11:52, Alexander Belopolsky wrote:
Well, I didn't make the original report, just provided a short, …
Mmh... Not too sure. I'd think about it from the perspective of what … But take my opinion on 3.x with a big grain of salt; I have very …
Cheers, f
+1 on getting shlex to work better with Unicode. The core concepts of this module are general purpose and applicable to all kinds of text.
On Tue, Jul 27, 2010 at 3:04 PM, Raymond Hettinger wrote:
In 2.7.x? It more or less works in 3.x already.
Alexander: the "more or less" is on the "less" side when dealing with non-ASCII letters, I think. See my msg109292 and your own follow-ups.
David, what do you think about the attached patch? Would that be a change in the "more" direction?
I am adding MvL to nosy. Martin, I believe you are the ultimate authority on how to tokenize a unicode stream.
I don't like my patch anymore because it breaks code that manipulates the public wordchars attribute. Users may want to set it to their own alphabet or append additional characters to the default list. Maybe wordchars should always be the "non-posix" wordchars, and the posix-mode iswordchar test be c.isalnum() or c in self.wordchars?
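That suggested test could look like this as a standalone sketch (DEFAULT_WORDCHARS and iswordchar are illustrative names, not shlex API):

```python
# Sketch of the suggested posix-mode word-character test; the names
# DEFAULT_WORDCHARS and iswordchar are illustrative, not shlex API.
DEFAULT_WORDCHARS = set(
    "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789_"
)

def iswordchar(c, wordchars=DEFAULT_WORDCHARS, posix=True):
    # In posix mode, any Unicode alphanumeric counts as a word
    # character, in addition to whatever the user put in wordchars.
    if posix:
        return c.isalnum() or c in wordchars
    return c in wordchars

print(iswordchar('ß'))  # True: covered by isalnum()
print(iswordchar('-'))  # False unless explicitly added to wordchars
```

This keeps the public wordchars attribute meaningful: users can still extend it, while non-ASCII letters are handled by str.isalnum() instead of an exhaustive list.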
Adding bpo-10587 because we need to figure out the exact meaning of str.isspace() etc. first. It is possible that for proper operation shlex should consult unicodedata directly.
The key requirement to consider for POSIX-compatible mode is, well, POSIX compatibility, which is defined in http://www.opengroup.org/onlinepubs/009695399/utilities/xcu_chap02.html
Now, POSIX declares that what <blank> is depends on LC_CTYPE (character class blank). I'd argue that if the objective is to behave exactly like the shell, it really should be doing that (i.e. work in a locale-aware manner).
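For the non-locale-dependent parts, shlex.split in posix mode already applies the spec's basic quoting rules; a minimal illustration (note that <blank> remains hardwired to space and tab, so this is not locale-aware):

```python
import shlex

# POSIX-style field splitting: double quotes group words into a
# single field, per the Shell Command Language rules linked above.
print(shlex.split('cp "my file.txt" /tmp/'))  # ['cp', 'my file.txt', '/tmp/']
```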
I think I'm suffering from the same problem in some small programs that use shlex:
>>> text = "python and shlex"
>>> shlex.split(text)
['python', 'and', 'shlex']
>>> text = u"python and shlex"
>>> shlex.split(text)
['p\x00\x00\x00y\x00\x00\x00t\x00\x00\x00h\x00\x00\x00o\x00\x00\x00n\x00\x00\x00', '\x00\x00\x00a\x00\x00\x00n\x00\x00\x00d\x00\x00\x00', '\x00\x00\x00s\x00\x00\x00h\x00\x00\x00l\x00\x00\x00e\x00\x00\x00x\x00\x00\x00']
I'm currently using the following "basic" workaround (while assuming that my strings have only ascii chars):
>>> [x.replace("\0", "") for x in shlex.split(text)]
['python', 'and', 'shlex']
It would be very nice if shlex could work with unicode strings... Thanks.
This isn't going to get fixed in 2.x (shlex doesn't support unicode in 2.x, and doing so would be a new feature). In 3.x all strings are unicode, so the problem you are seeing doesn't exist. This issue is about the broader problem of what counts as a word character when more than ASCII is involved.
Right. Any program that needs to parse command lines containing filenames or other arguments with unicode characters will encounter this problem.
We all recognize that ASCII is very much limited and that the real way to work with strings is Unicode. However, here our hands are tied by our development process: shlex in 2.x does not support Unicode, adding that support would be a new feature, and 2.7 is closed to new features. If shlex was supposed to support Unicode, then this would be a bug that could be fixed in 2.7, but it's not. All we can do is improve the 2.7 doc to show how to work around that (splitting on bytes and then decoding each chunk, for example).
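The workaround mentioned here (split the UTF-8 bytes, then decode each chunk) can be illustrated with a sketch; split_unicode is a hypothetical helper, and a latin-1 round-trip stands in for a 2.x byte string so the example runs on Python 3:

```python
import shlex

def split_unicode(text):
    # Hypothetical helper illustrating the 2.x-era recipe: tokenize
    # the UTF-8 byte encoding, then decode each token back to
    # unicode. latin-1 maps every byte to exactly one code point, so
    # it emulates a 2.x byte string here; UTF-8 bytes for non-ASCII
    # characters are never whitespace, so token boundaries survive.
    raw = text.encode('utf-8').decode('latin-1')
    return [tok.encode('latin-1').decode('utf-8')
            for tok in shlex.split(raw)]

print(split_unicode('grüße an alle'))  # ['grüße', 'an', 'alle']
```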
Is unicode supported by shlex in 3.x already? It's curious that unicode support is considered a new feature rather than a bug. I understand wanting to allocate development resources carefully, though. If someone were to prepare a patch, would it even have a chance of being accepted in 2.7?
See http://bugs.python.org/issue1170#msg106424 and following.
It's not about allocating resources, it's about following process. The first part is that we don't add new features to stable releases; the second is that this is not considered a bug fix: the code pre-dates Unicode, was not updated to support it, and the docs say "The shlex module currently does not support Unicode input".
Proposed solution and patch to follow. Please let me know if I am posting it in the wrong place.
The main problem with shlex is that its interface is inadequate for unicode. Specifically, it is no longer feasible to provide a list of every possible character that the user could want to appear within a token. Suppose the user wants the ability to parse words in simplified Chinese. If I understand correctly, they would currently have to set "self.wordchars" to a string (or some other container) of 6000 (unicode) characters, and this enormous string would need to be searched each time a new character is read. This was a problem with shlex from the beginning, but it became more acute when support for unicode was added.
Generally, in some cases it is much more convenient to specify a short list of characters you -don't- want to appear in a word (word delimiters) than to list all the characters you do. An obvious (although perhaps not optimal) solution is to add an additional data member to shlex, consisting of the characters which terminate the reading of a token. (In other words, the set-inverse of wordchars.) In the attached example code, I call it "self.wordterminators". To remain backwards-compatible with shlex, self.wordterminators is empty by default, but if non-empty, it overrides self.wordchars.
I've been distributing a customized version of shlex with my own software which implements this modest change (shlex_wt). (See attachment.) It is otherwise identical to the version of shlex.py that ships with python 3.2.2. (It has been further modified only slightly to be compatible with both python 2.7 and python 3.) It's not beautiful code, but it seems to be a successful kluge for this particular issue. I don't know if it makes a worthy patch, but perhaps somebody out there finds it useful. To make it easy to spot the changes, each of the lines I changed ends in a comment "#WORDTERMINATORS". (There are only 15 of these lines.)
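The set-inverse idea can be sketched independently of shlex (tokenize and wordterminators are illustrative names; this is not the attached patch):

```python
def tokenize(text, wordterminators=" \t\n()"):
    # A word is any run of characters NOT in wordterminators, i.e.
    # the set-inverse of shlex's wordchars. Non-whitespace
    # terminators (here, parentheses) are returned as their own
    # tokens; whitespace terminators are discarded.
    tokens, word = [], []
    for ch in text:
        if ch in wordterminators:
            if word:
                tokens.append(''.join(word))
                word = []
            if not ch.isspace():
                tokens.append(ch)
        else:
            word.append(ch)
    if word:
        tokens.append(''.join(word))
    return tokens

print(tokenize('(add 北京 42)'))  # ['(', 'add', '北京', '42', ')']
```

Chinese words tokenize without a 6000-character wordchars string: only the handful of delimiter characters need to be listed.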
Not to get side-tracked, but on a related note: it would be nice if there were a python module which defined sets of unicode characters corresponding to different categories (similar to the categories listed here: http://www.fileformat.info/info/unicode/category/index.htm), so that one could write, e.g., self.wordterminators = unicode_math + unicode_punctuation (the + meaning set union). If somebody tried to specify all of them manually, this would be painful; there are hundreds of punctuation symbols in unicode, for example. (I suppose most of the time one does not need to be so thorough. This feature is not really necessary for getting shlex to work, but I think it would be an easy feature to add.)
That can be done programmatically using the unicodedata module. The regex module (which will hopefully be included in 3.3) is also able to match characters that belong to specific categories.
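For example, the punctuation set wished for above can be built with unicodedata (a sketch; the name punctuation is ours):

```python
import sys
import unicodedata

# Collect every code point whose general category starts with 'P'
# (Pc, Pd, Ps, Pe, Pi, Pf, Po): the Unicode punctuation classes.
punctuation = {
    chr(cp) for cp in range(sys.maxunicode + 1)
    if unicodedata.category(chr(cp)).startswith('P')
}

print(',' in punctuation, '。' in punctuation, 'a' in punctuation)  # True True False
```

Building the set once at import time keeps per-character lookups O(1), which addresses the "enormous string searched for every character" concern raised earlier.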
Andrew: Thanks for your contribution, but your patch cannot go into 2.7, as we don't add new features in stable versions (re-read the whole thread if you need more info). This report is still open because we need a doc patch to explain how to work around that.
Ezio: Thanks. (New to me, actually.) Is this what you mean? …
Eric: That's fine. I just posted here because this page currently gets the top hit when searching for "shlex unicode". If you think it's appropriate to repost my message for python version 3.4, let me know. The issue with shlex.wordchars that I raised is valid for any version of python. I'm not sure my solution is optimal. (I like the regex idea.)
Andrew: Ezio means http://docs.python.org/2.7/library/unicodedata
Ezio, I don't see any indication in this ticket that this bug was actually *fixed* in 3.x. Unicode doesn't cause immediate errors in 3.x, but it isn't recognized as wordchars, etc. Am I missing something?
I haven't looked at the shlex code (yet); my comment was just about the idea of adding constants with chars that belong to different Unicode categories.
$ ./python
Python 2.7.2+ (2.7:27ae7d4e1983+, Oct 23 2011, 00:09:06)
[GCC 4.6.1] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import shlex
>>> shlex.split(u'Hello, World!')
['Hello,', 'World!']
This bug was fixed indirectly by a StringIO fix in 27ae7d4e1983, for bpo-1548891. BTW, this report was a duplicate of bpo-6988, closed a year ago. Python 2.7.3 will finally support unicode in shlex, so the doc change requested in this report is outdated.
However, I still want to do something for this. I've noticed that shlex.split's argument can be a file-like object, and I wonder if passing a StringIO.StringIO(my_unicode_string) wouldn't work. If such a short recipe works, I'm all for including it in the 2.7 docs for users of older versions. If a longer recipe is needed, then ActiveState's Python Cookbook would be more appropriate, and I'll add a link to the docs. If it's very long and requires a PyPI project, then I'm willing to link to that.
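For what it's worth, the file-like recipe works unchanged in Python 3 with io.StringIO (a quick sketch, not proposed 2.7 doc text):

```python
import io
import shlex

# shlex accepts any file-like object with a read() method; wrapping
# a str in io.StringIO mirrors the 2.x StringIO.StringIO recipe.
lex = shlex.shlex(io.StringIO('Hello, Wörld!'), posix=True)
lex.whitespace_split = True
tokens = list(lex)
print(tokens)  # ['Hello,', 'Wörld!']
```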
The second message in this page reports that StringIO.StringIO works, but when I pass a unicode string with non-ASCII chars there's a method call that fails because of implicit unicode-to-str conversion:
Traceback (most recent call last):
File "/usr/lib/python2.7/shlex.py", line 150, in read_token
elif nextchar in self.wordchars:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xdf in position 63: ordinal not in range(128)
I'll try to create a shlex instance, replace self.wordchars with a decoded version and try again.
This issue is 12 years old and has 3 patches: it's far from being "newcomer friendly", so I am removing the "Easy" label.
I cannot reproduce this with the current 3.* version. Did anybody reproduce it with 3.5? Otherwise, I suggest closing this as a 2.*-only bug.
This issue has been fixed in Python 3 by using Unicode rather than bytes in shlex. Python 2 users: it's time to upgrade to Python 3 ;-)
The error messages may have gone away, but the underlying unicode limitations I mentioned remain: suppose you wanted to use shlex to build a parser for Chinese text. Would you have to set "wordchars" to a string containing every possible Chinese character?
I myself wrote a parser for a crude language where words can contain any character except for whitespace and parentheses, so I needed a way to specify the characters which cannot belong to a word. (That's how I solved the problem: I modified shlex.py and added a "wordterminators" member. If "wordterminators" was left blank, then "wordchars" was used instead. This was a trivial change to shlex.py, and it added a lot of functionality.)
I would like to suggest making this change (or something similar) to the official version of shlex.py. Would sending an email to "python-ideas@python.org" be a good place to make this proposal?
Yes, python-ideas is a good place to start discussing such an idea. This issue is closed; if you discuss it here, you will get a limited audience.
After posting that, I noticed that the second example I listed in my previous post (a language where words contain any non-whitespace, non-parenthesis character) can now be implemented in the current version of shlex.py by setting "whitespace_split" and "punctuation_chars". (Sorry, it's been a while since I looked at shlex.py, and it has gained some useful new features.)
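With current Python 3, that second language can indeed be expressed directly; a sketch using the punctuation_chars parameter (added in 3.6) together with whitespace_split (compatible with punctuation_chars since 3.8):

```python
import shlex

# Words are any run of non-whitespace, non-parenthesis characters;
# parentheses come back as standalone tokens, with no need to
# enumerate every allowed word character.
lex = shlex.shlex('(add 北京 42)', posix=True, punctuation_chars='()')
lex.whitespace_split = True
tokens = list(lex)
print(tokens)  # ['(', 'add', '北京', '42', ')']
```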
Alright. I'll think about it a little more and post my suggestion there, perhaps. Thanks, Victor.