classification
Title: tokenize.generate_tokens treats '\f' symbol as the end of file (when reading in unicode)
Type: behavior Stage:
Components: Library (Lib) Versions: Python 2.7
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: Alexey.Umnov, barry, r.david.murray
Priority: normal Keywords:

Created on 2013-09-16 14:12 by Alexey.Umnov, last changed 2015-03-02 14:27 by barry.

Files
File name Uploaded Description
tokens.txt Alexey.Umnov, 2013-09-16 14:12
Messages (3)
msg197899 - Author: Alexey Umnov (Alexey.Umnov) Date: 2013-09-16 14:12
I execute the following code on the attached file 'text.txt':


import tokenize
import codecs

with open('text.txt', 'r') as f:
    reader = codecs.getreader('utf-8')(f)
    tokens = tokenize.generate_tokens(reader.readline)
    for token in tokens:   # generate_tokens is lazy; iterate to run the tokenizer
        print(token)


The file 'text.txt' has the following structure: a first line with some text, then a '\f' character (0x0c) on the second line, and then some more text on the last line. The result is that 'generate_tokens' ignores everything after the '\f'.

I did some debugging and found the following. If the file is read without codecs (in ASCII mode), it is split into 3 lines: 'text1\n', '\f\n', 'text2\n'. In unicode mode, however, it is split into 4 lines: 'text1\n', '\f', '\n', 'text2\n'. I guess this has been the intended behaviour since 2.7.x, but it causes a bug in the tokenize module.
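The difference in line splitting can be reproduced without the attached file. The following is a minimal sketch written for Python 3; the sample bytes are an assumption that stands in for the file structure described above, and the names `data`, `byte_lines`, `text_lines` and `reader_lines` are made up for illustration.

```python
import codecs
import io

# Assumed stand-in for the attached file: text, a lone form feed line, text.
data = b'text1\n\x0c\ntext2\n'   # '\x0c' is the form feed character '\f'

# Byte strings are split only at '\n', '\r' and '\r\n'.
byte_lines = data.splitlines(True)

# Unicode strings additionally treat the form feed as a line boundary.
text_lines = data.decode('utf-8').splitlines(True)

# Lines handed out by a codecs stream reader, collected until it returns ''.
reader = codecs.getreader('utf-8')(io.BytesIO(data))
reader_lines = list(iter(reader.readline, ''))

print(byte_lines)
print(text_lines)
print(reader_lines)
```

The byte-level split should give the three lines listed above, while both the unicode split and the codecs reader should give four, with '\f' as a line of its own that has no trailing newline.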

Consider lines 317-329 of tokenize.py:

...
column = 0
while pos < max:                   # measure leading whitespace
    if line[pos] == ' ':
        column += 1
    elif line[pos] == '\t':
        column = (column//tabsize + 1)*tabsize
    elif line[pos] == '\f':
        column = 0
    else:
        break
    pos += 1
if pos == max:
    break
...

The last 'break' exits the main parsing loop, so parsing stops. Thus a line that consists only of ' ', '\t' and '\f' characters and does not end with '\n' is treated as the end of the file.
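To make the end-of-file check concrete, here is a small standalone sketch of the whitespace-measuring loop quoted above. It is not the tokenize module itself: `measure_leading_whitespace` is a hypothetical helper, and its second return value simply reports whether the `pos == max` condition would be true for a given line.

```python
def measure_leading_whitespace(line, tabsize=8):
    """Hypothetical helper mirroring the quoted loop from tokenize.py.

    Returns (column, looks_like_eof), where looks_like_eof is True when the
    whole line was consumed as whitespace, i.e. the 'pos == max' case.
    """
    column = 0
    pos, end = 0, len(line)
    while pos < end:                     # measure leading whitespace
        ch = line[pos]
        if ch == ' ':
            column += 1
        elif ch == '\t':
            column = (column // tabsize + 1) * tabsize
        elif ch == '\f':
            column = 0
        else:
            break                        # first non-whitespace character
        pos += 1
    return column, pos == end


for line in ['\f\n', '\f', '  \t', '    x = 1\n']:
    column, looks_like_eof = measure_leading_whitespace(line)
    print(repr(line), column, looks_like_eof)
```

In this sketch a line such as '\f\n' stops at the newline and is not mistaken for the end of input, whereas a bare '\f' (what the unicode reader produces) is consumed completely and would trigger the final 'break'.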
msg197910 - Author: R. David Murray (r.david.murray) (Python committer) Date: 2013-09-16 15:28
I suspect this isn't the only place where the change in what is considered a (unicode) line ending character between 2.6 and 2.7/python3 is an issue.  As you observe, it causes very subtle bugs.  I'm going to have to go trolling through the python3 email package looking for places where this could break things :(.
msg237044 - Author: Barry A. Warsaw (barry) (Python committer) Date: 2015-03-02 14:27
Ha!  Apparently this bug broke coverage for the Mailman 3 source code:

https://bitbucket.org/ned/coveragepy/issue/360/html-reports-get-confused-by-l-in-the-code
History
Date                 User            Action  Args
2015-03-02 14:27:38  barry           set     messages: + msg237044
2015-03-02 14:26:48  barry           set     nosy: + barry
2013-09-16 15:28:49  r.david.murray  set     nosy: + r.david.murray; messages: + msg197910
2013-09-16 14:12:54  Alexey.Umnov    create