Message 209965 - Python tracker

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

Author	terry.reedy
Recipients	Arfrever, jaraco, terry.reedy
Date	2014-02-02.10:08:36
SpamBayes Score	-1.0
Marked as misclassified	Yes
Message-id	<1391335717.21.0.611324167521.issue20387@psf.upfronthosting.co.za>
In-reply-to

Content

I think the problem is with untokenize.

import io
from tokenize import tokenize
s = b"if False:\n\tx=3\n\ty=3\n"
t = tokenize(io.BytesIO(s).readline)
for i in t: print(i)

produces a token stream that seems correct.

TokenInfo(type=56 (ENCODING), string='utf-8', start=(0, 0), end=(0, 0), line='')
TokenInfo(type=1 (NAME), string='if', start=(1, 0), end=(1, 2), line='if False:\n')
TokenInfo(type=1 (NAME), string='False', start=(1, 3), end=(1, 8), line='if False:\n')
TokenInfo(type=52 (OP), string=':', start=(1, 8), end=(1, 9), line='if False:\n')
TokenInfo(type=4 (NEWLINE), string='\n', start=(1, 9), end=(1, 10), line='if False:\n')
TokenInfo(type=5 (INDENT), string='\t', start=(2, 0), end=(2, 1), line='\tx=3\n')
TokenInfo(type=1 (NAME), string='x', start=(2, 1), end=(2, 2), line='\tx=3\n')
TokenInfo(type=52 (OP), string='=', start=(2, 2), end=(2, 3), line='\tx=3\n')
TokenInfo(type=2 (NUMBER), string='3', start=(2, 3), end=(2, 4), line='\tx=3\n')
TokenInfo(type=4 (NEWLINE), string='\n', start=(2, 4), end=(2, 5), line='\tx=3\n')
TokenInfo(type=1 (NAME), string='y', start=(3, 1), end=(3, 2), line='\ty=3\n')
TokenInfo(type=52 (OP), string='=', start=(3, 2), end=(3, 3), line='\ty=3\n')
TokenInfo(type=2 (NUMBER), string='3', start=(3, 3), end=(3, 4), line='\ty=3\n')
TokenInfo(type=4 (NEWLINE), string='\n', start=(3, 4), end=(3, 5), line='\ty=3\n')
TokenInfo(type=6 (DEDENT), string='', start=(4, 0), end=(4, 0), line='')
TokenInfo(type=0 (ENDMARKER), string='', start=(4, 0), end=(4, 0), line='')

The problem with untokenize and indents is this: in the old untokenize function for duples (2-tuples), now called 'compat', INDENT strings were added to a list and popped by the corresponding DEDENT. While compat has the minor problems of returning a string instead of bytes (which is actually as I think it should be) and adding extraneous spaces within and at the end of lines, it correctly handles tabs in your example and this:

s =b"if False:\n\tx=1\n\t\ty=2\n\t\t\tz=3\n"
t = tokenize(io.BytesIO(s).readline)
print(untokenize(i[:2] for i in t).encode())
>>> 
b'if False :\n\tx =1 \n\t\ty =2 \n\t\t\tz =3 \n'
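The list-of-indents idea can be sketched roughly as follows. This is a simplified illustration, not the stdlib code: the function name rebuild_with_indent_stack is made up for this example, and the spacing rule (a trailing space after every NAME and NUMBER) is a crude stand-in for what compat does.

```python
import io
import tokenize


def rebuild_with_indent_stack(source: bytes) -> str:
    """Rebuild text from (type, string) pairs, keeping INDENT strings on a stack.

    Hypothetical helper for illustration only; it ignores comments,
    adjacent strings and other cases the real code has to handle.
    """
    indents = []           # active INDENT strings, innermost last
    pieces = []
    at_line_start = False
    for tok in tokenize.tokenize(io.BytesIO(source).readline):
        toknum, tokval = tok[:2]   # positions are deliberately ignored
        if toknum in (tokenize.ENCODING, tokenize.ENDMARKER):
            continue
        if toknum == tokenize.INDENT:
            indents.append(tokval)  # remember the exact whitespace, tabs included
            continue
        if toknum == tokenize.DEDENT:
            indents.pop()           # the matching DEDENT drops it again
            continue
        if toknum in (tokenize.NEWLINE, tokenize.NL):
            pieces.append(tokval)
            at_line_start = True
            continue
        if at_line_start and indents:
            pieces.append(indents[-1])  # re-emit the current indent verbatim
        at_line_start = False
        if toknum in (tokenize.NAME, tokenize.NUMBER):
            tokval += " "               # crude spacing, hence the extra spaces
        pieces.append(tokval)
    return "".join(pieces)


if __name__ == "__main__":
    s = b"if False:\n\tx=1\n\t\ty=2\n\t\t\tz=3\n"
    print(repr(rebuild_with_indent_stack(s)))
```

Because the indent text is replayed from the stack rather than recomputed from column numbers, each nested line keeps its original tabs, at the cost of the extraneous spaces mentioned above.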

When tokenize was changed to produce 5-tuples, untokenize was changed to use the start and end coordinates, but all special processing of indents was cut in favor of .add_space(). So this issue is a regression due to inadequate testing.
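To show why coordinates alone are not enough, here is a minimal sketch of a position-only rebuild. It is an assumption-laden stand-in for the behavior described, not the actual untokenize code: rebuild_from_positions is a hypothetical name, and it only handles single-line tokens. Any gap between the previous token's end column and the next token's start column is filled with spaces.

```python
import io
import tokenize


def rebuild_from_positions(source: bytes) -> str:
    """Rebuild text using only token strings and start/end coordinates.

    Hypothetical helper for illustration; column gaps are padded with
    spaces, and no special handling is given to INDENT or DEDENT.
    """
    pieces = []
    prev_row, prev_col = 1, 0
    for tok in tokenize.tokenize(io.BytesIO(source).readline):
        if tok.type == tokenize.ENCODING:
            continue
        row, col = tok.start
        if row != prev_row:
            prev_col = 0                       # new physical line starts at column 0
        pieces.append(" " * (col - prev_col))  # the gap is always filled with spaces
        pieces.append(tok.string)
        prev_row, prev_col = tok.end
    return "".join(pieces)


if __name__ == "__main__":
    s = b"if False:\n\tx=3\n\ty=3\n"
    print(repr(rebuild_from_positions(s)))
```

In the token stream above, the first indented line is preceded by an INDENT token carrying '\t', so its tab survives. The second indented line has no INDENT token, only NAME 'y' starting at column 1, so the one-column gap is padded with a space and the two lines of the block no longer have matching indentation.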

History
Date	User	Action	Args
2014-02-02 10:08:37	terry.reedy	set	recipients: + terry.reedy, jaraco, Arfrever
2014-02-02 10:08:37	terry.reedy	set	messageid: <1391335717.21.0.611324167521.issue20387@psf.upfronthosting.co.za>
2014-02-02 10:08:37	terry.reedy	link	issue20387 messages
2014-02-02 10:08:36	terry.reedy	create