Rietveld Code Review Tool

Unified Diff: Doc/library/shlex.rst

Issue 1521950: shlex.split() does not tokenize like the shell
Patch Set: Created 11 months, 3 weeks ago
--- a/Doc/library/shlex.rst Sat Jun 02 18:22:31 2012 +0200
+++ b/Doc/library/shlex.rst Sun Jun 03 18:46:34 2012 +0100
@@ -71,7 +71,7 @@
The :mod:`shlex` module defines the following class:
-.. class:: shlex(instream=None, infile=None, posix=False)
+.. class:: shlex(instream=None, infile=None, posix=False, punctuation_chars=False)
A :class:`shlex` instance or subclass instance is a lexical analyzer object.
The initialization argument, if present, specifies where to read characters
@@ -81,10 +81,22 @@
string, which sets the initial value of the :attr:`infile` attribute. If the
*instream* argument is omitted or equal to ``sys.stdin``, this second
argument defaults to "stdin". The *posix* argument defines the operational
- mode: when *posix* is not true (default), the :class:`shlex` instance will
+ mode: when *posix* is false (the default), the :class:`shlex` instance will
operate in compatibility mode. When operating in POSIX mode, :class:`shlex`
- will try to be as close as possible to the POSIX shell parsing rules.
+ will try to be as close as possible to the POSIX shell parsing rules. The
+ *punctuation_chars* argument provides a way to make the behaviour even
+ closer to how real shells parse. This can take a number of values: the
+ default value, ``False``, preserves the behaviour seen under Python 3.2 and
+ earlier. If set to ``True``, then parsing of the characters ``();<>|&`` is
+ changed: any run of these characters (considered punctuation characters) is
+ returned as a single token. If set to a non-empty string of characters,
+ those characters will be used as the punctuation characters. Any characters
+ in the :attr:`wordchars` attribute that appear in *punctuation_chars* will
+ be removed from :attr:`wordchars`. See :ref:`improved-shell-compatibility`
+ for more information.
+ .. versionchanged:: 3.3
+ The *punctuation_chars* parameter was added.
.. seealso::
@@ -186,7 +198,13 @@
.. attribute:: shlex.wordchars
The string of characters that will accumulate into multi-character tokens. By
- default, includes all ASCII alphanumerics and underscore.
+ default, includes all ASCII alphanumerics and underscore. In POSIX mode, the
+ accented characters in the Latin-1 set are also included. If
+ :attr:`punctuation_chars` is not empty, the characters ``~-./*?=``, which can
+ appear in filename specifications and command line parameters, will also be
+ included in this attribute, and any characters which appear in
+ ``punctuation_chars`` will be removed from ``wordchars`` if they are present
+ there.
.. attribute:: shlex.whitespace
@@ -217,9 +235,11 @@
.. attribute:: shlex.whitespace_split
- If ``True``, tokens will only be split in whitespaces. This is useful, for
- example, for parsing command lines with :class:`shlex`, getting tokens in a
- similar way to shell arguments.
+ If ``True``, tokens will only be split on whitespace, and
+ :attr:`punctuation_chars` will have no effect. When using
+ :attr:`punctuation_chars`, which is intended to provide parsing closer to
+ that implemented by shells, it is advisable to leave ``whitespace_split``
+ as ``False`` (the default value).
.. attribute:: shlex.infile
@@ -268,6 +288,16 @@
(``''``), in non-POSIX mode, and to ``None`` in POSIX mode.
+.. attribute:: shlex.punctuation_chars
+
+ Characters that will be considered punctuation. Runs of punctuation
+ characters will be returned as a single token. However, note that no
+ semantic validity checking will be performed: for example, ``>>>`` could be
+ returned as a token, even though it may not be recognised as such by shells.
+
+ .. versionadded:: 3.3
+
+
.. _shlex-parsing-rules:
Parsing Rules
@@ -317,3 +347,62 @@
* EOF is signaled with a :const:`None` value;
* Quoted empty strings (``''``) are allowed.
+
+.. _improved-shell-compatibility:
+
+Improved Compatibility with Shells
+----------------------------------
+
+.. versionadded:: 3.3
+
+The :class:`shlex` class provides compatibility with the parsing performed by
+common Unix shells like ``bash``, ``dash``, and ``sh``. To take advantage of
+this compatibility, specify the ``punctuation_chars`` argument in the
+constructor. This defaults to ``False``, which preserves pre-3.3 behaviour.
+However, if it is set to ``True``, then parsing of the characters ``();<>|&``
+is changed: any run of these characters is returned as a single token. While
+this is short of a full parser for shells (which would be out of scope for the
+standard library, given the multiplicity of shells out there), it does allow
+you to perform processing of command lines more easily than you could
+otherwise. To illustrate, you can see the difference in the following snippet::
+
+    import shlex
+
+    for punct in (False, True):
+        if punct:
+            message = 'New'  # punctuation_chars=True: runs become one token
+        else:
+            message = 'Old'  # punctuation_chars=False: pre-3.3 behaviour
+        text = "a && b; c && d || e; f >'abc'; (def \"ghi\")"
+        s = shlex.shlex(text, punctuation_chars=punct)
+        print('%s: %s' % (message, list(s)))
+
+which prints out::
+
+    Old: ['a', '&', '&', 'b', ';', 'c', '&', '&', 'd', '|', '|', 'e', ';', 'f', '>', "'abc'", ';', '(', 'def', '"ghi"', ')']
+    New: ['a', '&&', 'b', ';', 'c', '&&', 'd', '||', 'e', ';', 'f', '>', "'abc'", ';', '(', 'def', '"ghi"', ')']
+
+Of course, tokens will be returned which are not valid for shells, and you'll
+need to implement your own error checks on the returned tokens.
+
+Instead of passing ``True`` as the value for the ``punctuation_chars`` parameter,
+you can pass a string with specific characters, which will be used to determine
+which characters constitute punctuation. For example::
+
+    >>> import shlex
+    >>> s = shlex.shlex("a && b || c", punctuation_chars="|")
+    >>> list(s)
+    ['a', '&', '&', 'b', '||', 'c']
+
+.. note:: When ``punctuation_chars`` is specified, the :attr:`~shlex.wordchars`
+ attribute is augmented with the characters ``~-./*?=``. That is because these
+ characters can appear in file names (including wildcards) and command-line
+ arguments (e.g. ``--color=auto``). Hence::
+
+      >>> import shlex
+      >>> s = shlex.shlex('~/a && b-c --color=auto || d *.py?',
+      ...                 punctuation_chars=True)
+      >>> list(s)
+      ['~/a', '&&', 'b-c', '--color=auto', '||', 'd', '*.py?']
+
+
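The patched documentation says that callers must implement their own error checks on the returned tokens, since runs such as ``>>>`` are returned without any semantic validation. The following is a minimal sketch of one possible check, not part of the patch. The helper name ``find_invalid_operators`` and the ``KNOWN_OPERATORS`` set are hypothetical, and the operator set is deliberately small and non-exhaustive. It assumes an interpreter in which ``shlex.shlex`` accepts the ``punctuation_chars`` argument described in this patch.

```python
import shlex

# Assumption: a small, non-exhaustive set of operators accepted by common
# POSIX-style shells. Adjust it for the shell you actually target.
KNOWN_OPERATORS = {"&&", "||", "|", "&", ";", ";;", "(", ")", "<", ">", ">>", "<<"}


def find_invalid_operators(command):
    """Return punctuation tokens in *command* that are not known operators.

    Hypothetical helper: it only inspects tokens made up entirely of
    punctuation characters and compares them against KNOWN_OPERATORS.
    """
    lexer = shlex.shlex(command, punctuation_chars=True)
    punct = set(lexer.punctuation_chars)
    invalid = []
    for token in lexer:
        # A token built only from punctuation characters is an operator run.
        if token and set(token) <= punct and token not in KNOWN_OPERATORS:
            invalid.append(token)
    return invalid


print(find_invalid_operators("a >>> b"))  # ['>>>']
```

Because any run of punctuation characters comes back as a single token, adjacent but individually valid operators (for example ``);``) would also be flagged by this sketch. A real checker would need to split such runs before validating them.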