classification
Title: Trailing whitespace makes tokenize.generate_tokens pathological
Type: behavior Stage: resolved
Components: Library (Lib) Versions: Python 3.4, Python 3.3, Python 3.2, Python 2.7
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: ezio.melotti Nosy List: ezio.melotti, haypo, jcea, nedbat, python-dev
Priority: normal Keywords: patch

Created on 2012-10-06 21:09 by nedbat, last changed 2012-11-03 15:53 by ezio.melotti. This issue is now closed.

Files
File name            Uploaded                   Description
bug16152_v33.patch   nedbat, 2012-10-07 00:47   Patch for 3.3
bug16152_v27.patch   nedbat, 2012-10-07 00:48   Patch for 2.7
Messages (7)
msg172244 - (view) Author: Ned Batchelder (nedbat) * Date: 2012-10-06 21:09
When tokenizing with tokenize.generate_tokens, if the code ends with whitespace and no final newline, the tokenizer produces an ERRORTOKEN for each space.  Additionally, each regex match that fails to find a token in those spaces takes time linear in the number of spaces, so the overall performance is O(n**2).

I found this while tokenizing code samples uploaded to a public web site.  One sample for some reason ended with 40,000 spaces, which was taking two hours to tokenize.

Demonstration:

{{{
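import token
import tokenize

# Python 2 has cStringIO; Python 3 only has io.StringIO.
try:
    from cStringIO import StringIO
except ImportError:
    from io import StringIO

# One real token followed by trailing spaces and no final newline.
code = "@" + " " * 10000
code_reader = StringIO(code).readline

# Print each token's index, type name, and text.
for num, (ttyp, ttok, _, _, _) in enumerate(tokenize.generate_tokens(code_reader)):
    print("%5d %15s %r" % (num, token.tok_name[ttyp], ttok))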
}}}
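To check whether a given interpreter is affected, a small timing harness along these lines may help. It is only a sketch for Python 3: the helper name `measure` and the input sizes are illustrative choices, not part of the report or the patch. On an affected version the elapsed time should grow much faster than the input size and the ERRORTOKEN count should track the number of spaces; on a fixed version the count is expected to be zero.

```python
import io
import time
import token
import tokenize


def measure(n_spaces):
    """Tokenize '@' followed by n_spaces spaces (no newline).

    Returns (elapsed_seconds, error_token_count). `measure` is a
    hypothetical helper written for this illustration.
    """
    code = "@" + " " * n_spaces
    errors = 0
    start = time.perf_counter()
    try:
        for tok in tokenize.generate_tokens(io.StringIO(code).readline):
            if tok[0] == token.ERRORTOKEN:
                errors += 1
    except (tokenize.TokenError, SyntaxError):
        # Some versions may reject the input outright; still report the timing.
        pass
    return time.perf_counter() - start, errors


# Doubling sizes make super-linear growth easy to spot.
results = [(n,) + measure(n) for n in (500, 1000, 2000)]
for n, seconds, errors in results:
    print("%6d spaces: %.4fs, %d ERRORTOKENs" % (n, seconds, errors))
```

The sizes are kept small so the check finishes quickly even on an unpatched interpreter.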
msg172246 - (view) Author: Ned Batchelder (nedbat) * Date: 2012-10-06 21:15
Here's a patch for 3.3.

I would like to also fix 2.7...
msg172276 - (view) Author: Ned Batchelder (nedbat) * Date: 2012-10-07 00:49
Updated with new (better) patches for 2.7 and 3.3.  They are the same except for the test.
msg172546 - (view) Author: Jesús Cea Avión (jcea) * (Python committer) Date: 2012-10-10 01:04
Ned, could you possibly send a Contributor Agreement form? http://www.python.org/psf/contrib/
msg172548 - (view) Author: Ned Batchelder (nedbat) * Date: 2012-10-10 02:05
Jesús, done!
msg174640 - (view) Author: Roundup Robot (python-dev) Date: 2012-11-03 15:51
New changeset eb7ea51e658e by Ezio Melotti in branch '2.7':
#16152: fix tokenize to ignore whitespace at the end of the code when no newline is found.  Patch by Ned Batchelder.
http://hg.python.org/cpython/rev/eb7ea51e658e

New changeset 3ffff1798ed5 by Ezio Melotti in branch '3.2':
#16152: fix tokenize to ignore whitespace at the end of the code when no newline is found.  Patch by Ned Batchelder.
http://hg.python.org/cpython/rev/3ffff1798ed5

New changeset 1fdeddabddda by Ezio Melotti in branch '3.3':
#16152: merge with 3.2.
http://hg.python.org/cpython/rev/1fdeddabddda

New changeset ed091424f230 by Ezio Melotti in branch 'default':
#16152: merge with 3.3.
http://hg.python.org/cpython/rev/ed091424f230
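For code that must still run on an interpreter without these changesets, one possible workaround is to drop the trailing whitespace before handing the source to the tokenizer. The following is a minimal Python 3 sketch under that assumption; it is not the committed patch, and the helper names `tokenize_stripped` and `count_error_tokens` are made up for this example.

```python
import io
import token
import tokenize


def tokenize_stripped(source):
    """Tokenize `source` after removing trailing spaces, tabs and form feeds.

    Newlines are not in the strip set, so a source that already ends with
    a newline is passed through unchanged. Hypothetical helper, not stdlib.
    """
    cleaned = source.rstrip(" \t\f")
    return list(tokenize.generate_tokens(io.StringIO(cleaned).readline))


def count_error_tokens(tokens):
    """Count ERRORTOKEN entries in a list of token tuples."""
    return sum(1 for tok in tokens if tok[0] == token.ERRORTOKEN)


# A valid statement followed by a long run of spaces and no newline.
sample = "x = 1" + " " * 10000
tokens = tokenize_stripped(sample)
error_count = count_error_tokens(tokens)
print("tokens: %d, ERRORTOKENs: %d" % (len(tokens), error_count))
```

Because only the tail of the source is removed, the positions reported for the remaining tokens are unaffected.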
msg174641 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2012-11-03 15:53
Fixed, thanks for the report and the patch!
History
Date User Action Args
2012-11-03 15:53:58 ezio.melotti set status: open -> closed
type: behavior
assignee: ezio.melotti
versions: + Python 3.2
nosy: + ezio.melotti
messages: + msg174641
resolution: fixed
stage: patch review -> resolved
2012-11-03 15:51:38 python-dev set nosy: + python-dev
messages: + msg174640
2012-10-10 02:05:40 nedbat set messages: + msg172548
2012-10-10 01:07:46 jcea set versions: + Python 3.4
2012-10-10 01:04:02 jcea set nosy: + jcea
messages: + msg172546
2012-10-07 00:49:02 nedbat set messages: + msg172276
2012-10-07 00:48:04 nedbat set files: + bug16152_v27.patch
2012-10-07 00:47:40 nedbat set files: + bug16152_v33.patch
2012-10-07 00:44:39 nedbat set files: - bug16152.patch
2012-10-06 21:15:50 nedbat set files: + bug16152.patch
keywords: + patch
messages: + msg172246
2012-10-06 21:12:12 haypo set nosy: + haypo
2012-10-06 21:09:21 nedbat create