Issue 24869: shlex lineno inaccurate with certain inputs

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/69057

classification

Title:	shlex lineno inaccurate with certain inputs
Type:	behavior	Stage:
Components:	Library (Lib)	Versions:	Python 3.6, Python 3.5

process

Status:	open	Resolution:
Dependencies:		Superseder:
Assigned To:		Nosy List:	gdr@garethrees.org, hoadlck, petri.lehtinen, rescrv, vinay.sajip
Priority:	normal	Keywords:	patch

Created on 2015-08-14 17:06 by rescrv, last changed 2022-04-11 14:58 by admin.

Files
File name	Uploaded	Description
badlex.py	rescrv, 2015-08-14 17:06	Example of described behavior.
ambigious_shlex.py	hoadlck, 2016-06-12 12:09	Example of ambiguous token stream
issue24869.patch	gdr@garethrees.org, 2017-02-01 14:28

Pull Requests
URL	Status	Linked
PR 2799	open	gdr@garethrees.org, 2017-07-21 10:21

Messages (6)
msg248596	Author: Robert Escriva (rescrv)	Date: 2015-08-14 17:06
The line numbers reported by the shlex module are inaccurate for certain inputs that contain inline comments. I've attached a simple script that illustrates the problem. My assumption is that lineno is supposed to correspond to the line of the current token. I'm trying to use changes in lineno to group tokens into commands. This may not be an intended use case.
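A minimal way to observe the behavior, assuming the default (non-POSIX) shlex.shlex settings. This is not the attached badlex.py, whose contents are not shown in this thread, and the helper name tokens_with_lineno is only illustrative. It samples lineno immediately after each get_token() call.

```python
import shlex

def tokens_with_lineno(text):
    """Return (token, lexer.lineno) pairs, sampling lineno after each token."""
    lexer = shlex.shlex(text)
    pairs = []
    while True:
        token = lexer.get_token()
        if token == lexer.eof:
            break
        pairs.append((token, lexer.lineno))
    return pairs

# "a" is on line 1, but the lexer has already consumed the newline that
# ended it, so lineno has moved on to 2 by the time the token is returned.
print(tokens_with_lineno("a\nb"))   # [('a', 2), ('b', 2)]

# With a space as the terminator, no newline has been consumed yet.
print(tokens_with_lineno("a b"))    # [('a', 1), ('b', 1)]
```

The same token "a" on the same line is reported with a different lineno depending only on which character happens to follow it.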
msg268368	Author: Christopher Hoadley (hoadlck) *	Date: 2016-06-12 12:09
This problem makes it impossible to use shlex to parse commands where a newline is intended to separate commands.

In the attached sample script, I created two input strings with the same tokens in the same order: the only differences are newlines and spaces. In the first string, each token is on its own line. In the second string, the first two tokens are on the same line and the third is on its own line. If you look at the lineno associated with each token, it is identical between the two strings. But the two strings have completely different meanings, and I have no way to distinguish between them.

If I want to use the shlex feature that automatically includes other command files, then I can't just sanitize the input before passing it on. As it is, the only way I can see to use shlex is for my command language to use some other symbol (e.g. ";") as a command separator. Since I am defining my own command language, I can do that, but it adds needless complication for the users.
msg268454	Author: Gareth Rees (gdr@garethrees.org) *	Date: 2016-06-13 16:18
Just to restate the problem: the use case is that when emitting an error message for a token, we want to include the number of the line containing the token (or the number of the line where the token started, if the token spans multiple lines, as it might if it's a string containing newlines). But there is no way to satisfy this use case given the features of the shlex module. In particular, shlex.lineno (which looks as if it ought to help) is actually the line number of the first character that has not yet been consumed by the lexer, and in general this is not the same as the line number of the previous (or the next) token.

I can think of two alternatives that would satisfy the use case:

1. Instead of returning tokens as str objects, return them as instances of a subclass of str that has a property that gives the line number of the first character of the token. (Maybe it should also have properties for the column number of the first character, and the line and column number of the last character too? These properties would support better error messages.)
2. Add new methods that return tuples giving the token and its line number (and possibly column number etc. as in alternative 1).

My preference would be for alternative (1), but I suppose there is a very tiny risk of breaking some code that relied upon get_token returning an instance of str exactly rather than an instance of a subclass of str.
msg268458	Author: Gareth Rees (gdr@garethrees.org) *	Date: 2016-06-13 17:53
A third alternative: 3. Add a method whose effect is to consume comments and whitespace, but which does not yield a token. You could then call this method, and then look at shlex.lineno, which will be the line number of the first character of the next token (if there is a next token).
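A rough sketch of what alternative 3 could look like when done from outside the module, as a subclass. The names LineTrackingShlex, skip_blank and tokens_with_lines are hypothetical and are not part of shlex or of any patch on this issue. The sketch assumes a seekable input stream (a str argument is wrapped in a StringIO, which qualifies) and does not try to handle punctuation_chars or file inclusion via source.

```python
import shlex

class LineTrackingShlex(shlex.shlex):
    """Hypothetical subclass; not stdlib API and not the patch from this issue."""

    def skip_blank(self):
        """Consume whitespace and comments, stopping just before the next token."""
        if self.pushback:
            # A pushed-back token will be returned next; nothing to skip.
            return
        while True:
            pos = self.instream.tell()
            ch = self.instream.read(1)
            if not ch:
                return                      # end of input
            if ch in self.whitespace:
                if ch == "\n":
                    self.lineno += 1
            elif ch in self.commenters:
                self.instream.readline()    # discard the rest of the comment
                self.lineno += 1
            else:
                self.instream.seek(pos)     # un-read the first token character
                return

    def tokens_with_lines(self):
        """Yield (token, line number of the token's first character)."""
        while True:
            self.skip_blank()
            line = self.lineno
            token = self.get_token()
            if token == self.eof:
                return
            yield token, line

lexer = LineTrackingShlex("a b\n# note\nc")
print(list(lexer.tokens_with_lines()))   # [('a', 1), ('b', 1), ('c', 3)]
```

Because lineno is read after the leading whitespace and comments are gone but before the token itself is consumed, it reflects the line on which the token starts.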
msg286637	Author: Gareth Rees (gdr@garethrees.org) *	Date: 2017-02-01 14:28
Here's a patch that implements my proposal (1) -- under this patch, tokens read from an input stream belong to a subtype of str with startline and endline attributes giving the line numbers of the first and last character of the token. This allows the accurate reporting of error messages relating to a token. I updated the documentation and added a test case.
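For illustration only, a str subtype of the kind described might look like the sketch below. This is not the code from issue24869.patch; only the attribute names startline and endline come from the message above, and the class name Token is an assumption.

```python
class Token(str):
    """A str that also records the lines it spans (hypothetical class name)."""

    def __new__(cls, value, startline, endline):
        self = super().__new__(cls, value)
        self.startline = startline   # line of the token's first character
        self.endline = endline       # line of the token's last character
        return self

tok = Token("echo", startline=3, endline=3)
print(tok == "echo", tok.startline)   # True 3
print(isinstance(tok, str))           # True
print(type(tok) is str)               # False
print(type(tok.upper()) is str)       # True
```

The token still compares and behaves as a string, so most callers would be unaffected. Code that checks for exactly str would see a difference, and results of str methods such as upper() are plain str objects without the line attributes.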
msg298792	Author: Gareth Rees (gdr@garethrees.org) *	Date: 2017-07-21 10:24
I've made a pull request. (Not because I expect it to be merged as-is, but to provide a starting point for discussion.)

History
Date	User	Action	Args
2022-04-11 14:58:19	admin	set	github: 69057
2017-07-21 10:24:17	gdr@garethrees.org	set	nosy: + vinay.sajip, petri.lehtinen messages: + msg298792
2017-07-21 10:21:56	gdr@garethrees.org	set	pull_requests: + pull_request2849
2017-02-01 14:28:13	gdr@garethrees.org	set	files: + issue24869.patch keywords: + patch messages: + msg286637
2016-12-27 13:19:31	hoadlck	set	type: behavior versions: + Python 3.6
2016-06-13 17:53:11	gdr@garethrees.org	set	messages: + msg268458
2016-06-13 16:18:22	gdr@garethrees.org	set	nosy: + gdr@garethrees.org messages: + msg268454
2016-06-12 12:09:02	hoadlck	set	files: + ambigious_shlex.py versions: + Python 3.5, - Python 3.4 nosy: + hoadlck messages: + msg268368
2015-08-14 17:06:03	rescrv	create