Message 135836 - Python tracker

Author	Devin Jeanpierre
Recipients	Devin Jeanpierre
Date	2011-05-12.14:19:29
Message-id	<1305209970.17.0.31373709531.issue12063@psf.upfronthosting.co.za>
In-reply-to

Content
Tokenizing `' 1 2 3` versus `''' 1 2 3` yields different results.

Tokenizing `' 1 2 3` gives:

1,0-1,1:	ERRORTOKEN	"'"
1,2-1,3:	NUMBER	'1'
1,4-1,5:	NUMBER	'2'
1,6-1,7:	NUMBER	'3'
2,0-2,0:	ENDMARKER	''

while tokenizing `''' 1 2 3` yields:

Traceback (most recent call last):
  File "prog.py", line 4, in <module>
    tokenize.tokenize(iter(["''' 1 2 3"]).next)
  File "/usr/lib/python2.6/tokenize.py", line 169, in tokenize
    tokenize_loop(readline, tokeneater)
  File "/usr/lib/python2.6/tokenize.py", line 175, in tokenize_loop
    for token_info in generate_tokens(readline):
  File "/usr/lib/python2.6/tokenize.py", line 296, in generate_tokens
    raise TokenError, ("EOF in multi-line string", strstart)
tokenize.TokenError: ('EOF in multi-line string', (1, 0))


Apparently tokenize recovers from an unterminated single-quoted string (it emits an ERRORTOKEN for the quote and carries on tokenizing the rest of the line), but not from an unterminated triple-quoted string, where it raises TokenError instead. I guess this is because re-tokenizing the rest of the file after an unclosed triple quote would be expensive; however, I've also been told that it's very strange, and possibly wrong, for tokenize to be inconsistent in this way.

If this is the intended behavior, I'd like it to be documented. As it stands, it is confusing and potentially misleading for users of the tokenize module: when I saw how single quotes were handled, I incorrectly assumed that all quotes were handled that way.
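
For anyone who wants to compare the two cases side by side, here is a minimal sketch. It is written for Python 3 (the report above used Python 2.6), and `try_tokenize` is a hypothetical helper name, not part of the tokenize module. It collects whatever tokens were produced before any failure, so both inputs can be inspected the same way. The exact tokens and error messages may differ between Python versions, so no particular output is assumed here.

```python
import io
import tokenize


def try_tokenize(source):
    """Tokenize `source` and return (tokens, error).

    `tokens` holds (type name, string) pairs produced before any failure;
    `error` is the exception that stopped tokenizing, or None.
    Hypothetical helper for illustration only.
    """
    tokens = []
    error = None
    readline = io.StringIO(source).readline
    try:
        for tok in tokenize.generate_tokens(readline):
            tokens.append((tokenize.tok_name[tok.type], tok.string))
    except (tokenize.TokenError, SyntaxError) as exc:
        # Keep the tokens seen so far and record why tokenizing stopped.
        error = exc
    return tokens, error


# The two inputs from the report: an unclosed single quote and an
# unclosed triple quote.
for src in ("' 1 2 3", "''' 1 2 3"):
    tokens, error = try_tokenize(src)
    print(repr(src))
    for name, string in tokens:
        print("   ", name, repr(string))
    print("    error:", repr(error))
```

Because the helper never lets the exception escape, a caller can decide for itself whether an ERRORTOKEN in the token list and a raised error should be treated the same way.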

History
Date	User	Action	Args
2011-05-12 14:19:30	Devin Jeanpierre	set	recipients: + Devin Jeanpierre
2011-05-12 14:19:30	Devin Jeanpierre	set	messageid: <1305209970.17.0.31373709531.issue12063@psf.upfronthosting.co.za>
2011-05-12 14:19:29	Devin Jeanpierre	link	issue12063 messages
2011-05-12 14:19:29	Devin Jeanpierre	create