classification
Title: tokenize module appears to treat unterminated single and double-quoted strings inconsistently
Type: behavior
Stage: resolved
Components: Documentation
Versions: Python 3.2, Python 3.3, Python 2.7
process
Status: closed
Resolution: fixed
Dependencies:
Superseder:
Assigned To: docs@python
Nosy List: Devin Jeanpierre, amandine, dhaffner, docs@python, petri.lehtinen, python-dev, r.david.murray
Priority: normal
Keywords: easy, patch

Created on 2011-05-12 14:19 by Devin Jeanpierre, last changed 2014-06-08 00:56 by python-dev. This issue is now closed.

Files
File name: issue12063.patch
Uploaded: amandine, 2014-06-07 22:40
Messages (5)
msg135836 - (view) Author: Devin Jeanpierre (Devin Jeanpierre) * Date: 2011-05-12 14:19
Tokenizing `' 1 2 3` versus `''' 1 2 3` yields different results.

Tokenizing `' 1 2 3` gives:

1,0-1,1:	ERRORTOKEN	"'"
1,2-1,3:	NUMBER	'1'
1,4-1,5:	NUMBER	'2'
1,6-1,7:	NUMBER	'3'
2,0-2,0:	ENDMARKER	''

while tokenizing `''' 1 2 3` yields:

Traceback (most recent call last):
  File "prog.py", line 4, in <module>
    tokenize.tokenize(iter(["''' 1 2 3"]).next)
  File "/usr/lib/python2.6/tokenize.py", line 169, in tokenize
    tokenize_loop(readline, tokeneater)
  File "/usr/lib/python2.6/tokenize.py", line 175, in tokenize_loop
    for token_info in generate_tokens(readline):
  File "/usr/lib/python2.6/tokenize.py", line 296, in generate_tokens
    raise TokenError, ("EOF in multi-line string", strstart)
tokenize.TokenError: ('EOF in multi-line string', (1, 0))


Apparently tokenize decides to re-tokenize after the erroneous quote in the case of a single-quote, but not a triple-quote. I guess that this is because retokenizing the rest of the file after an unclosed triple-quote would be expensive; however, I've also been told it's very strange and possibly wrong for tokenize to be inconsistent this way.

If this is the right behavior, I guess I'd like it if it were documented. This sort of thing is confusing / potentially misleading for users of the tokenize module. Or at least, when I saw how single quotes were handled, I assumed incorrectly that all quotes were handled that way.
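
For readers who want to reproduce the comparison on Python 3, here is a minimal sketch. The helper name `token_summary` is hypothetical and not part of the tokenize module; it relies only on `tokenize.generate_tokens` and `io.StringIO`. The ERRORTOKEN outcome for the single-quote input is what was reported for the versions listed on this issue; other Python versions may raise TokenError for both inputs, so the sketch prints whatever happens rather than assuming a result.

```python
import io
import tokenize


def token_summary(source):
    """Return (token names seen, TokenError or None) for a source string.

    Hypothetical helper for illustration; not part of the tokenize module.
    """
    names = []
    try:
        for tok in tokenize.generate_tokens(io.StringIO(source).readline):
            names.append(tokenize.tok_name[tok.type])
    except tokenize.TokenError as exc:
        # Tokens collected before the error are still returned.
        return names, exc
    return names, None


# The two inputs from the report: an unterminated single-quoted string
# and an unterminated triple-quoted string.
for source in ("' 1 2 3", "''' 1 2 3"):
    names, error = token_summary(source)
    print(repr(source), names, repr(error))
```

The triple-quoted input is expected to end in a TokenError, matching the traceback above; whether the single-quoted input produces an ERRORTOKEN or an error depends on the interpreter version.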
msg137028 - (view) Author: Petri Lehtinen (petri.lehtinen) * (Python committer) Date: 2011-05-27 07:07
tokenize processes one line at a time, so noticing that a closing triple quote is missing would mean reading the whole file in the worst case. As tokenize works in a generator-like fashion, it's probably not desirable to cache all the input just to be able to restart from some previous line.

So, I'd go with documenting the behavior.
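
The line-at-a-time behavior described here can be observed directly: tokens from lines before the unterminated triple quote are yielded first, and the error only surfaces once the input runs out. The following is a minimal sketch under stated assumptions; the helper name `tokens_before_error` and the sample input are invented for illustration.

```python
import io
import tokenize


def tokens_before_error(source):
    """Collect token names until tokenize finishes or raises TokenError.

    Hypothetical helper for illustration; not part of the tokenize module.
    """
    seen = []
    error = None
    try:
        for tok in tokenize.generate_tokens(io.StringIO(source).readline):
            seen.append(tokenize.tok_name[tok.type])
    except tokenize.TokenError as exc:
        error = exc
    return seen, error


# Invented sample: one valid line, then a triple-quoted string that is
# never closed and therefore runs to the end of the input.
source = "x = 1\n''' 1 2 3\n"
seen, error = tokens_before_error(source)
print(seen)   # tokens produced before the error was raised
print(error)  # the TokenError raised at end of input
```

A caller that wants to keep the tokens produced so far can therefore catch TokenError around the loop, as the helper does, instead of expecting tokenize to resume after the bad quote.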
msg138261 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2011-06-13 16:18
I agree with Petri, so I'm setting this to a doc issue.
msg219991 - (view) Author: Amandine Lee (amandine) Date: 2014-06-07 22:40
I confirmed that tokenize behaves as described. I added a patch documenting the behavior, built the docs with the patch, and visually confirmed that the docs look appropriate.

Ready for review!
msg220008 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2014-06-08 00:56
New changeset 188e5f42d4aa by Benjamin Peterson in branch '2.7':
document TokenError and unclosed expression behavior (closes #12063)
http://hg.python.org/cpython/rev/188e5f42d4aa

New changeset ddc174c4c7e5 by Benjamin Peterson in branch '3.4':
document TokenError and unclosed expression behavior (closes #12063)
http://hg.python.org/cpython/rev/ddc174c4c7e5

New changeset 3f2f1ffc3ce2 by Benjamin Peterson in branch 'default':
merge 3.4 (#12063)
http://hg.python.org/cpython/rev/3f2f1ffc3ce2
History
Date User Action Args
2014-06-08 00:56:03 python-dev set status: open -> closed
nosy: + python-dev
messages: + msg220008
resolution: fixed
stage: needs patch -> resolved
2014-06-07 22:40:03 amandine set files: + issue12063.patch
nosy: + amandine
messages: + msg219991
keywords: + patch
2011-07-26 21:30:59 dhaffner set nosy: + dhaffner
2011-07-24 17:57:02 petri.lehtinen set keywords: + easy
2011-06-13 16:18:46 r.david.murray set assignee: docs@python
type: behavior
components: + Documentation
nosy: + docs@python, r.david.murray
messages: + msg138261
stage: needs patch
2011-05-27 07:11:55 petri.lehtinen set versions: + Python 2.7, Python 3.2, Python 3.3
2011-05-27 07:07:53 petri.lehtinen set nosy: + petri.lehtinen
messages: + msg137028
2011-05-12 14:19:29 Devin Jeanpierre create