This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

classification
Title: shlex.shlex should not augment wordchars
Type: behavior Stage: resolved
Components: Library (Lib) Versions: Python 3.8
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: Andrey.Kislyuk, Gustavo Goretkin, cvrebert, eric.araujo, eric.smith, evan_, ezio.melotti, ned.deily, python-dev, r.david.murray, robodan, vinay.sajip
Priority: normal Keywords: patch

Created on 2016-11-03 09:42 by evan_, last changed 2022-04-11 14:58 by admin. This issue is now closed.

Files
File name Uploaded Description Edit
without_augmenting_chars.diff evan_, 2016-11-03 09:42 Proposed changes to shlex.shlex for 3.6 review
shlex.diff evan_, 2017-01-02 13:01 Proposed changes to shlex.shlex for 3.7 review
shlex2.diff evan_, 2017-01-19 10:05 Proposed changes for 3.7, updated to master review
Pull Requests
URL Status Linked Edit
PR 2071 merged evan_, 2017-06-10 06:49
Messages (13)
msg279980 - (view) Author: Evan Andrews (evan_) * Date: 2016-11-03 09:42
The changes to shlex due to land in 3.6 use a predefined set of characters to "augment" wordchars, however this set is incomplete. For example, 'foo,bar' should be parsed as a single token, but it is split on the comma:

$ echo foo,bar
foo,bar

>>> import shlex
>>> list(shlex.shlex('foo,bar', punctuation_chars=True))
['foo', ',', 'bar']

(For context on where this was encountered, see https://github.com/kislyuk/argcomplete/issues/161)

Instead of trying to enumerate all possible wordchars, I think a more robust solution is to use whitespace_split to include *all* characters not otherwise considered special.

Ideally this would be fixed before 3.6 is released to avoid needing to maintain backwards compatibility with the current behaviour, although I understand the timeline may make this difficult.

I've attached a patch with proposed changes, including updates to the tests to demonstrate the effective difference. I can make the corresponding documentation changes if we want this merged.

(I've added everyone to the nosy list from http://bugs.python.org/issue1521950 where these changes originated.)
msg280734 - (view) Author: Evan Andrews (evan_) * Date: 2016-11-14 06:01
I have some additional concerns with the changes introduced in http://bugs.python.org/issue1521950:

1. The fact that posix defaults to False in shlex.shlex (but not in shlex.split) is a huge beginner trap. If users are expecting to use this for "compatibility with the parsing performed by common Unix shells like bash, dash, and sh", they must also remember to set posix=True. The documentation for punctuation_chars makes no mention of this. The examples use non-POSIX mode parsing rules.

2. Even with posix=True, there is no mechanism to escape characters that are special inside double quotes, like $. This is a separate unaddressed incompatibility.

For 1, this could be fixed by making posix default to True if punctuation_chars is specified. Longer term, perhaps the default could be changed to True in all cases. I'm not sure what to do about 2.

(Please let me know if I should split these out into separate issues.)
msg280939 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2016-11-16 12:32
Per our conversation on IRC: since this is not a regression, it is unlikely to be addressed at this stage of the 3.6 release cycle.  I'm adding the release manager as nosy just in case, but I don't expect him to want this to be addressed now.

IMO if the behavior does not match "the posix shell", then it is a bug, and can be fixed in 3.6.1.  (I'd prefer not to wait longer than that, though).  So proposing a patch with tests and doc changes would be beneficial.

As for the flag defaults, that's a backward compatibility issue.  I don't like the idea of changing the default based on the value of another flag, but it could be discussed.  The 'longer term' would be a several release deprecation cycle...I think that has been discussed previously but I can't find the discussion in a quick search.  The documentation, however, could be improved.  Please open a separate issue for the doc change request at least.
msg281613 - (view) Author: Evan Andrews (evan_) * Date: 2016-11-24 07:47
I've created issue28784 to capture the documentation fixes. When I have more spare time, I'll work on a more complete patch.
msg284482 - (view) Author: Evan Andrews (evan_) * Date: 2017-01-02 13:01
I've attached a more complete patch. This takes the conservative approach of retaining all functionality of 3.6 and treating this as a new feature for 3.7. I don't think this is suitable for 3.6.1 given this new behavior contradicts what is currently written in the documentation (though this can be discussed further).

I've also run into a couple of other bugs that I've made separate issues for:

* The example in the documentation mixes up 'Old' and 'New' (issue29133)
* The updated example contains a bug (the '>abc' token should actually be two tokens) (issue29132)

These should both be fixed in 3.6.1 regardless of where this goes.
msg285771 - (view) Author: Evan Andrews (evan_) * Date: 2017-01-19 10:05
Attaching an updated patch now that the two linked issues are resolved.
msg285929 - (view) Author: Gustavo Goretkin (Gustavo Goretkin) Date: 2017-01-21 00:45
>Instead of trying to enumerate all possible wordchars, I think a more robust solution is to use whitespace_split to include *all* characters not otherwise considered special.

I agree with that approach.

Also note that dash/hyphen gets incorrectly tokenized.

>>> import shlex
>>> list(shlex.shlex("mkdir -p somepath"))
['mkdir', '-', 'p', 'somepath']

White listing all valid word characters is not good, because the surrogateescape mechanism can include all sorts of "characters".

In bash:

$ echo mkdir $(echo -ne "Bad\xffButLegalPath")
mkdir Bad?ButLegalPath

the path is one token.

However currently in shlex, it gets broken into multiple tokens:

>>> list(shlex.shlex(b"mkdir Bad\ffButLegalPath".decode("utf-8", "surrogoateescape")))
['mkdir', 'Bad', '\x0c', 'fButLegalPath']
msg285930 - (view) Author: Gustavo Goretkin (Gustavo Goretkin) Date: 2017-01-21 01:19
Sorry, I typo'd that last example pretty badly. Should be

>>> list(shlex.shlex(b"mkdir Bad\xffButLegalPath".decode("utf-8", "surrogateescape")))
['mkdir', 'Bad', '\udcff', 'ButLegalPath']
msg285932 - (view) Author: Evan Andrews (evan_) * Date: 2017-01-21 02:13
Unfortunately shlex.shlex's defaults are probably going to remain that way for a long time in order to avoid breaking backwards compatibility. Presumably shlex.split was added so you didn't have to remember to set posix and whitespace_split yourself.

The particular problem I'm addressing in this issue is that the new punctuation_chars argument doesn't currently work with whitespace_split.

>>> def split(text, ws=False, pc=False):
...     s = shlex.shlex(text, posix=True, punctuation_chars=pc)
...     s.whitespace_split = ws
...     return list(s)
...
>>> split('foo,bar>baz')
['foo', ',', 'bar', '>', 'baz']
>>> split('foo,bar>baz', ws=True)
['foo,bar>baz']
>>> split('foo,bar>baz', pc=True)
['foo', ',', 'bar', '>', 'baz']
>>> split('foo,bar>baz', ws=True, pc=True)
['foo,bar>baz']

With my patch, the last example outputs ['foo,bar', '>', 'baz'].

Before the release of 3.6 I was arguing that punctuation_chars should not attempt to augment wordchars at all, since the idea of wordchars is inherently incorrect as you point out. Now I think it's too late to change, hence my patch treats this as a new feature in 3.7.
msg293611 - (view) Author: Evan Andrews (evan_) * Date: 2017-05-13 13:36
Do I need to create a pull request for this?
msg293646 - (view) Author: Vinay Sajip (vinay.sajip) * (Python committer) Date: 2017-05-14 13:11
> Do I need to create a pull request for this?

Yes, creating a PR is part of the current workflow since we moved to GitHub. Though that's not the reason why it's not been progressed - on my part, at least, it's due to not having had time to think through the implications of this change.
msg295617 - (view) Author: Evan Andrews (evan_) * Date: 2017-06-10 07:00
Thanks, Vinay. I understand you're busy, and I'm in no particular rush to have this looked at, so feel free to come back to it when you have more time. I've submitted the changes as a PR.

I also have an additional concern - even with this change, there is no way to tell whether a token should be considered special or text:

    >>> import shlex
    >>> def split(line):
    ...     s = shlex.shlex(line, posix=True, punctuation_chars=True)
    ...     s.whitespace_split = True
    ...     return list(s)
    ...
    >>> split('a&&b')
    ['a', '&&', 'b']
    >>> split('a "&&" b')
    ['a', '&&', 'b']

I feel this can be addressed after this as a separate issue but wanted to mention it now so you're aware.
msg344201 - (view) Author: Vinay Sajip (vinay.sajip) * (Python committer) Date: 2019-06-01 19:09
New changeset 56624a99a916fd27152d5b23364303acc0d707de by Vinay Sajip (Evan) in branch 'master':
bpo-28595: Allow shlex whitespace_split with punctuation_chars (GH-2071)
https://github.com/python/cpython/commit/56624a99a916fd27152d5b23364303acc0d707de
History
Date User Action Args
2022-04-11 14:58:39adminsetgithub: 72781
2019-06-19 14:34:27vinay.sajipsetstatus: open -> closed
stage: resolved
resolution: fixed
versions: + Python 3.8, - Python 3.6, Python 3.7
2019-06-01 19:09:25vinay.sajipsetmessages: + msg344201
2017-06-10 07:00:33evan_setmessages: + msg295617
2017-06-10 06:49:09evan_setpull_requests: + pull_request2134
2017-05-14 13:11:22vinay.sajipsetmessages: + msg293646
2017-05-13 13:36:42evan_setmessages: + msg293611
2017-01-21 02:13:30evan_setmessages: + msg285932
2017-01-21 01:19:54Gustavo Goretkinsetmessages: + msg285930
2017-01-21 00:45:01Gustavo Goretkinsetnosy: + Gustavo Goretkin
messages: + msg285929
2017-01-19 10:05:27evan_setfiles: + shlex2.diff

messages: + msg285771
2017-01-02 13:01:22evan_setfiles: + shlex.diff

messages: + msg284482
versions: + Python 3.7
2016-11-24 07:47:55evan_setmessages: + msg281613
2016-11-16 12:32:21r.david.murraysetnosy: + ned.deily
messages: + msg280939
2016-11-14 06:01:06evan_setmessages: + msg280734
2016-11-04 11:59:27evan_settitle: shlex.split should not augment wordchars -> shlex.shlex should not augment wordchars
2016-11-03 09:42:16evan_create