Message 144074 - Python tracker


Author	wombat
Recipients	Santiago.Romero, belopolsky, benjamin.peterson, cgwalters, dexen, doughellmann, eric.araujo, ezio.melotti, fperez, loewis, mark.dickinson, mcepl, nwerneck, orsenthil, r.david.murray, rhettinger, vstinner, wombat
Date	2011-09-15.11:25:12
SpamBayes Score	2.4987235e-12
Marked as misclassified	No
Message-id	<1316085914.5.0.437382682382.issue1170@psf.upfronthosting.co.za>
In-reply-to

Content

Proposed solution and patch to follow. Please let me know if I am posting it in the wrong place.

The main problem with shlex is that its interface is inadequate for handling Unicode. Specifically, it is no longer feasible to list every character that the user might want to appear within a token. Suppose the user wants to parse words in simplified Chinese. If I understand correctly, they would currently have to set "self.wordchars" to a string (or some other container) of 6000 (Unicode) characters, and this enormous string would need to be searched each time a new character is read. This was a problem with shlex from the beginning, but it became more acute when Unicode support was added. In some cases, it is much more convenient to specify a short list of the characters you -don't- want to appear in a word (word delimiters) than to list all the characters you do.
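
To make the limitation concrete, here is a small sketch using the standard shlex module in its default (non-POSIX) mode. The sample string is purely illustrative. It shows that characters outside wordchars come back one per token, and that the only fix through wordchars is to enumerate every character by hand:

```python
import shlex

text = "你好 世界"  # illustrative sample input, not taken from the report

# In non-POSIX mode the default wordchars covers only ASCII letters,
# digits and "_", so each Chinese character becomes its own token.
default_tokens = list(shlex.shlex(text))

# Workaround through the existing interface: add every character that
# may appear inside a word to wordchars.
lexer = shlex.shlex(text)
lexer.wordchars += "你好世界"
extended_tokens = list(lexer)

print(default_tokens)
print(extended_tokens)
```

For four characters this is manageable; for a whole script it means building and searching a string with thousands of entries.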

An obvious (although perhaps not optimal) solution is to add a data member to shlex that holds the characters which terminate the reading of a token. (In other words, the set complement of wordchars.) In the attached example code, I call it "self.wordterminators". To remain backwards-compatible with shlex, self.wordterminators is empty by default. If it is non-empty, self.wordterminators overrides self.wordchars.
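
The rule can be sketched with a small standalone tokenizer. This is not the attached shlex_wt code: the function name split_words, its parameters, and the choice to treat whitespace as always terminating a word are assumptions made for illustration only.

```python
def split_words(text, wordchars="", wordterminators="", whitespace=" \t\r\n"):
    """Hypothetical sketch of the wordterminators rule (not shlex_wt)."""
    # Sets give constant-time membership tests regardless of size.
    wordchars = frozenset(wordchars)
    wordterminators = frozenset(wordterminators)

    def is_word_char(ch):
        if wordterminators:
            # Non-empty terminators override wordchars: anything that is
            # neither a terminator nor whitespace (assumption) is a word char.
            return ch not in wordterminators and ch not in whitespace
        # Empty terminators: fall back to the existing wordchars behaviour.
        return ch in wordchars

    tokens = []
    current = ""
    for ch in text:
        if is_word_char(ch):
            current += ch
            continue
        if current:
            tokens.append(current)
            current = ""
        if ch not in whitespace:
            # Like shlex, emit other characters as single-character tokens.
            tokens.append(ch)
    if current:
        tokens.append(current)
    return tokens


print(split_words("你好=世界; x", wordterminators="=;"))
print(split_words("ab c!", wordchars="abc"))
```

With a short terminator list, words in any script are accepted without listing their characters, while a call that leaves wordterminators empty behaves as it would with wordchars alone.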

I've been distributing a customized version of shlex (shlex_wt) with my own software, which implements this modest change. (See attachment.) It is otherwise identical to the version of shlex.py that ships with Python 3.2.2. (It has been further modified only slightly to be compatible with both Python 2.7 and Python 3.) It's not beautiful code, but it seems to be a successful kluge for this particular issue. I don't know if it makes a worthy patch, but perhaps somebody out there will find it useful. To make the changes easy to spot, each line I changed ends with the comment "#WORDTERMINATORS". (There are only 15 of these lines.)
-Andrew Jewett

History
Date	User	Action	Args
2011-09-15 11:25:15	wombat	set	recipients: + wombat, loewis, rhettinger, mark.dickinson, belopolsky, orsenthil, vstinner, dexen, benjamin.peterson, cgwalters, mcepl, ezio.melotti, eric.araujo, doughellmann, r.david.murray, nwerneck, fperez, Santiago.Romero
2011-09-15 11:25:14	wombat	set	messageid: <1316085914.5.0.437382682382.issue1170@psf.upfronthosting.co.za>
2011-09-15 11:25:13	wombat	link	issue1170 messages
2011-09-15 11:25:13	wombat	create