Title: shlex behaves unexpected if newlines are not whitespace
Type: behavior
Stage: test needed
Components: Library (Lib)
Versions: Python 3.11, Python 3.10, Python 3.9
Status: open
Resolution:
Dependencies:
Superseder:
Assigned To:
Nosy List: elliotta, ferringb, ggenellina, iritkatriel, jjdmol2, kanru
Priority: normal
Keywords: patch

Created on 2009-10-09 08:11 by jjdmol2, last changed 2021-06-14 20:18 by iritkatriel.

File name                             Uploaded                   Description
                                      jjdmol2, 2009-10-09 08:11
lexer-newline-tokens.patch            jjdmol2, 2009-10-09 08:25
lexer-newline-tokens-patch-2.0.patch  jjdmol2, 2009-12-31 08:36  improved patch, includes test cases
Messages (6)
msg93776 - Author: Jan David Mol (jjdmol2) Date: 2009-10-09 08:11
The shlex module does not function as expected in the presence of
comments when newlines are not whitespace. An example (attached):

>>> from shlex import shlex
>>> lexer = shlex("a \n b")
>>> print ",".join(lexer)
a,b
>>> lexer = shlex("a # comment \n b")
>>> print ",".join(lexer)
a,b
>>> lexer = shlex("a \n b")
>>> lexer.whitespace=" "
>>> print ",".join(lexer)
a,
,b
>>> lexer = shlex("a # comment \n b")
>>> lexer.whitespace=" "
>>> print ",".join(lexer)
a,b

Now where did my newline go? The comment ate it! Even though the docs
seem to indicate the newline is not part of the comment itself:

    The string of characters that are recognized as comment beginners.
    All characters from the comment beginner to end of line are ignored.
    Includes just '#' by default.
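For readers on Python 3, the four cases in the session above can be sketched as a small script. The helper name `tokens` is made up for this illustration and is not part of shlex. The last case is the one this issue is about: the report says the newline after the comment disappears, so its result may differ on an interpreter where the comment handling has been changed.

```python
import shlex


def tokens(text, whitespace=None):
    """Tokenize text with shlex, optionally overriding the whitespace set.

    `tokens` is a helper invented for this illustration, not part of shlex.
    """
    lexer = shlex.shlex(text)
    if whitespace is not None:
        lexer.whitespace = whitespace
    return list(lexer)


# Default whitespace includes the newline, so it never becomes a token.
default_plain = tokens("a \n b")
default_comment = tokens("a # comment \n b")

# With only " " as whitespace, the newline is returned as its own token ...
space_plain = tokens("a \n b", whitespace=" ")
# ... but the comment handling reads to the end of the line, which is why
# this issue reports the newline after the comment as missing here.
space_comment = tokens("a # comment \n b", whitespace=" ")

for result in (default_plain, default_comment, space_plain, space_comment):
    print(result)
```

Comparing the third and fourth printed lists shows whether the newline token survives a preceding comment on the interpreter at hand.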
msg93778 - Author: Jan David Mol (jjdmol2) Date: 2009-10-09 08:25
Attached is a patch which fixes this for me. It basically does a
fall-through using '\n' when encountering a comment. So that may be a
bit of a hack (who says '\n' is the only newline char in there, and not
'\r'?) but I'll leave the more intricate stuff to you experts.
msg93820 - Author: Gabriel Genellina (ggenellina) Date: 2009-10-10 03:15
If you could add some tests to Lib/test/, this patch would have a
better chance of being accepted.

Also, consider the case when the comment is on the last line of input 
and there is no \n ending character.
msg97080 - Author: Jan David Mol (jjdmol2) Date: 2009-12-31 08:36
As there seems to be some interest, I've continued working on patching
this issue.

Attached is an improved version of the patch, including additions to
the tests. It is improved in the sense that newlines after a comment are
not considered to be actually part of the comment (according to POSIX),
which makes a difference when newlines are tokens.

To accomplish this, I had to add an ungetc buffer to shlex, in order to
push back any newlines read by the readline() routine used when a
comment is encountered.

@Gabriel: the test case of no newline at the end of the file after a
comment is addressed.

Relevant POSIX sections are
Shell & Utilities 2.3(10)
Rationale C.2.3
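
The pushback idea described in this message can be approximated from the outside, without modifying shlex. The following is only a sketch under stated assumptions and is not the attached patch: `KeepNewlineStringIO` and `tokens_keeping_newlines` are hypothetical names, and the approach relies on shlex calling `readline()` on its input stream when it meets a comment character, as described earlier in this thread.

```python
import io
import shlex


class KeepNewlineStringIO(io.StringIO):
    """Hypothetical stream whose readline() leaves the line terminator unread."""

    def readline(self, size=-1):
        line = super().readline(size)
        if line.endswith("\n"):
            # "Unget" the newline so the lexer sees it on its next read(1).
            self.seek(self.tell() - 1)
            line = line[:-1]
        return line


def tokens_keeping_newlines(text, whitespace=" "):
    """Tokenize text from the wrapper stream (helper invented for this sketch)."""
    lexer = shlex.shlex(KeepNewlineStringIO(text))
    lexer.whitespace = whitespace
    return list(lexer)


# The newline after the comment is pushed back and becomes a token.
with_comment = tokens_keeping_newlines("a # comment \n b")
# A comment on the last line, with no terminating newline, adds nothing.
no_final_newline = tokens_keeping_newlines("a # comment")

print(with_comment)
print(no_final_newline)
```

One side effect to be aware of: shlex increments `lineno` once after the comment's `readline()` and again when it reads the pushed-back newline, so line numbers drift with this wrapper. A fix inside shlex itself would not have to accept that trade-off.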
msg141486 - Author: Ann Elliott (elliotta) Date: 2011-08-01 00:58
This error still occurs in version 3.3.0a0.
msg395847 - Author: Irit Katriel (iritkatriel) (Python committer) Date: 2021-06-14 20:18
I've reproduced on 3.11.
Date                 User          Action  Args
2021-06-14 20:18:08  iritkatriel   set     nosy: + iritkatriel
                                           messages: + msg395847
2021-06-14 20:17:01  iritkatriel   set     versions: + Python 3.9, Python 3.10, Python 3.11, - Python 2.6, Python 3.1, Python 2.7, Python 3.2, Python 3.3
2011-08-01 00:58:26  elliotta      set     nosy: + elliotta
                                           messages: + msg141486
                                           versions: + Python 3.3
2009-12-31 08:37:00  jjdmol2       set     files: + lexer-newline-tokens-patch-2.0.patch
                                           messages: + msg97080
2009-12-31 08:00:04  ezio.melotti  set     priority: normal
                                           nosy: + ferringb
                                           versions: - Python 2.5, Python 2.4, Python 3.0
                                           stage: test needed
2009-12-31 03:17:37  kanru         set     nosy: + kanru
2009-10-10 03:15:30  ggenellina    set     nosy: + ggenellina
                                           messages: + msg93820
2009-10-09 09:41:02  jjdmol2       set     components: + Library (Lib)
2009-10-09 08:25:20  jjdmol2       set     files: + lexer-newline-tokens.patch
                                           keywords: + patch
                                           messages: + msg93778
2009-10-09 08:11:52  jjdmol2       create