
classification
Title: shlex lineno inaccurate with certain inputs
Type: behavior Stage:
Components: Library (Lib) Versions: Python 3.6, Python 3.5
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: gdr@garethrees.org, hoadlck, petri.lehtinen, rescrv, vinay.sajip
Priority: normal Keywords: patch

Created on 2015-08-14 17:06 by rescrv, last changed 2022-04-11 14:58 by admin.

Files
File name Uploaded Description
badlex.py rescrv, 2015-08-14 17:06 Example of described behavior.
ambigious_shlex.py hoadlck, 2016-06-12 12:09 Example of ambiguous token stream
issue24869.patch gdr@garethrees.org, 2017-02-01 14:28
Pull Requests
URL Status Linked
PR 2799 open gdr@garethrees.org, 2017-07-21 10:21
Messages (6)
msg248596 - Author: Robert Escriva (rescrv) Date: 2015-08-14 17:06
The line numbers calculated by the shlex module are inaccurate for certain inputs with inline comments.  I've attached a simple script that illustrates the problem.

My assumption here is that the lineno is supposed to match a line related to the current token.  I'm trying to use changes in the lineno to aggregate tokens into commands.  This may not be an intended use case.
msg268368 - Author: Christopher Hoadley (hoadlck) * Date: 2016-06-12 12:09
This problem makes it impossible to use shlex to parse commands where a newline is intended to separate commands.
 
In the attached sample script, I created two input strings with the same tokens in the same order: the only difference is newlines and spaces. In the first string, each token is on its own line, and in the second string the first 2 tokens are on the same line, and the third is on its own.
 
If you look at the lineno associated with each token, it is identical between the two strings.  But the two strings have completely different meanings! I have no way to tell them apart.
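
A minimal sketch of this kind of ambiguity follows. The two input strings and the helper `tokens_with_lineno` are illustrative assumptions and are not the contents of the attached ambigious_shlex.py; they are simply one pair of inputs for which a default (non-POSIX) `shlex.shlex` lexer reports the same `lineno` after each token even though the tokens are laid out on different lines.

```python
import shlex

def tokens_with_lineno(text):
    """Return (token, lexer.lineno) pairs, sampling lineno right after each token."""
    lexer = shlex.shlex(text)
    pairs = []
    while True:
        token = lexer.get_token()
        if token == lexer.eof:  # '' in the default non-POSIX mode
            break
        pairs.append((token, lexer.lineno))
    return pairs

# Hypothetical inputs: same tokens, different layout.
one_per_line = "load \nfile \nrun"   # one token on each of lines 1, 2, 3
two_on_first = "load file\n\nrun"    # "load file" on line 1, "run" on line 3

print(tokens_with_lineno(one_per_line))
print(tokens_with_lineno(two_on_first))
```

Both calls report the same pairs, so a caller that groups tokens by changes in lineno cannot tell that "load" and "file" share a line in the second input.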
 
If I want to use the feature of shlex where it will automatically include other command files, then I can't just sanitize the input before sending it on.
 
As it is, the only way I can see to use shlex is if my command language uses some other symbol (e.g. ";") as a command separator.  Since I am defining my own command language, I can do that, but it adds needless complication for the users.
msg268454 - Author: Gareth Rees (gdr@garethrees.org) * (Python triager) Date: 2016-06-13 16:18
Just to restate the problem:

The use case is that when emitting an error message for a token, we want to include the number of the line containing the token (or the number of the line where the token started, if the token spans multiple lines, as it might if it's a string containing newlines).

But there is no way to satisfy this use case given the features of the shlex module. In particular, shlex.lineno (which looks as if it ought to help) is actually the line number of the first character that has not yet been consumed by the lexer, and in general this is not the same as the line number of the previous (or the next) token.
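
The "first character not yet consumed" behaviour can be seen with a small sketch. The helper name `lineno_after_first_token` is an assumption for illustration; the lexer is a default non-POSIX `shlex.shlex`. In both inputs the token "a" sits on line 1, but the reported lineno depends on which character terminated the token.

```python
import shlex

def lineno_after_first_token(text):
    """Read one token and report the lexer's lineno immediately afterwards."""
    lexer = shlex.shlex(text)
    token = lexer.get_token()
    return token, lexer.lineno

# "a" is terminated by a space, so no newline has been consumed yet.
print(lineno_after_first_token("a b"))    # ('a', 1)

# "a" is terminated by a newline, which the lexer consumes before returning.
print(lineno_after_first_token("a\nb"))   # ('a', 2)
```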

I can think of two alternatives that would satisfy the use case:

1. Instead of returning tokens as str objects, return them as instances of a subclass of str that has a property that gives the line number of the first character of the token. (Maybe it should also have properties for the column number of the first character, and the line and column number of the last character too? These properties would support better error messages.)

2. Add new methods that return tuples giving the token and its line number (and possibly column number etc. as in alternative 1).

My preference would be for alternative (1), but I suppose there is a very tiny risk of breaking some code that relied upon get_token returning an instance of str exactly rather than an instance of a subclass of str.
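
As a rough illustration of alternative (1) and of the compatibility risk just mentioned, here is a sketch of a str subclass that carries a line number. The class name `Token` and the attribute name `line` are hypothetical and are not part of shlex; the token is constructed by hand here rather than produced by a lexer.

```python
class Token(str):
    """A str that also remembers the line its first character was on."""

    def __new__(cls, text, line):
        self = super().__new__(cls, text)
        self.line = line  # hypothetical attribute name
        return self

tok = Token("frobnicate", 3)

# The token can be used anywhere a str is expected, including in error messages.
print(f"line {tok.line}: unknown command {tok!r}")

# Equality and isinstance checks keep working; only exact type checks differ.
print(tok == "frobnicate")   # True
print(isinstance(tok, str))  # True
print(type(tok) is str)      # False
```

Code that tests `type(token) is str` would behave differently with such tokens, and str methods such as `upper()` return a plain str, so the line number does not survive transformations.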
msg268458 - Author: Gareth Rees (gdr@garethrees.org) * (Python triager) Date: 2016-06-13 17:53
A third alternative:

3. Add a method whose effect is to consume comments and whitespace, but which does not yield a token. You could then call this method, and then look at shlex.lineno, which will be the line number of the first character of the next token (if there is a next token).
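
A sketch of how alternative (3) might look when written as a subclass. The names `SkippingShlex`, `skip_ignored` and `tokens_with_start_line` are hypothetical and do not exist in shlex. The sketch assumes a seekable input stream (such as the StringIO that shlex creates for a string argument) and ignores pushback, source inclusion and quoted tokens spanning several lines.

```python
import shlex

class SkippingShlex(shlex.shlex):
    """Hypothetical subclass; skip_ignored() is not part of the shlex API."""

    def skip_ignored(self):
        """Consume whitespace and comments without yielding a token."""
        while True:
            pos = self.instream.tell()
            ch = self.instream.read(1)
            if not ch:                      # end of input
                return
            if ch in self.commenters:
                self.instream.readline()    # discard the rest of the comment
                self.lineno += 1
            elif ch in self.whitespace:
                if ch == "\n":
                    self.lineno += 1
            else:
                self.instream.seek(pos)     # put back the start of the token
                return

def tokens_with_start_line(text):
    """Return (token, line of the token's first character) pairs."""
    lexer = SkippingShlex(text)
    pairs = []
    while True:
        lexer.skip_ignored()
        line = lexer.lineno                 # now refers to the next token
        token = lexer.get_token()
        if token == lexer.eof:
            break
        pairs.append((token, line))
    return pairs

print(tokens_with_start_line("load file\n# a comment\n\nrun\n"))
# [('load', 1), ('file', 1), ('run', 4)]
```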
msg286637 - Author: Gareth Rees (gdr@garethrees.org) * (Python triager) Date: 2017-02-01 14:28
Here's a patch that implements my proposal (1) -- under this patch, tokens read from an input stream belong to a subtype of str with startline and endline attributes giving the line numbers of the first and last character of the token. This allows the accurate reporting of error messages relating to a token. I updated the documentation and added a test case.
msg298792 - Author: Gareth Rees (gdr@garethrees.org) * (Python triager) Date: 2017-07-21 10:24
I've made a pull request. (Not because I expect it to be merged as-is, but to provide a starting point for discussion.)
History
Date User Action Args
2022-04-11 14:58:19 admin set github: 69057
2017-07-21 10:24:17 gdr@garethrees.org set nosy: + vinay.sajip, petri.lehtinen; messages: + msg298792
2017-07-21 10:21:56 gdr@garethrees.org set pull_requests: + pull_request2849
2017-02-01 14:28:13 gdr@garethrees.org set files: + issue24869.patch; keywords: + patch; messages: + msg286637
2016-12-27 13:19:31 hoadlck set type: behavior; versions: + Python 3.6
2016-06-13 17:53:11 gdr@garethrees.org set messages: + msg268458
2016-06-13 16:18:22 gdr@garethrees.org set nosy: + gdr@garethrees.org; messages: + msg268454
2016-06-12 12:09:02 hoadlck set files: + ambigious_shlex.py; versions: + Python 3.5, - Python 3.4; nosy: + hoadlck; messages: + msg268368
2015-08-14 17:06:03 rescrv create