Message 172244 - Python tracker


Author	nedbat
Recipients	nedbat
Date	2012-10-06.21:09:19
Message-id	<1349557761.41.0.537472367724.issue16152@psf.upfronthosting.co.za>

Content
When tokenizing with tokenize.generate_tokens, if the code ends with whitespace and no trailing newline, the tokenizer produces an ERRORTOKEN for each space.  Additionally, each failed regex match over those spaces takes time linear in the number of spaces, so the overall performance is O(n**2).

I found this while tokenizing code samples uploaded to a public web site.  One sample for some reason ended with 40,000 spaces, which was taking two hours to tokenize.
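One way to check the growth empirically is to time the tokenizer on inputs of increasing length. This is only a rough sketch: `time_tokenize` is a hypothetical helper, the sizes are arbitrary, it assumes Python 3, and interpreters that no longer have this behavior may show no quadratic growth at all.

```python
import io
import time
import tokenize


def time_tokenize(n_spaces):
    """Return seconds spent tokenizing "@" followed by n_spaces spaces."""
    # Hypothetical helper for illustration; not part of the tokenize module.
    code = "@" + " " * n_spaces
    readline = io.StringIO(code).readline
    start = time.perf_counter()
    try:
        for _ in tokenize.generate_tokens(readline):
            pass
    except (tokenize.TokenError, SyntaxError):
        # Assumption: some Python versions may reject this input
        # instead of emitting ERRORTOKENs, so errors are ignored here.
        pass
    return time.perf_counter() - start


# Doubling the input should roughly quadruple the time if the cost is O(n**2).
timings = [(n, time_tokenize(n)) for n in (500, 1000, 2000)]
for n, seconds in timings:
    print("%6d spaces: %.4f s" % (n, seconds))
```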

Demonstration:

{{{
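import token
import tokenize

# cStringIO exists only on Python 2; fall back to io.StringIO on Python 3.
try:
    from cStringIO import StringIO
except ImportError:
    from io import StringIO

# One "@" followed by 10,000 spaces and no trailing newline.
code = "@" + " " * 10000
code_reader = StringIO(code).readline

# Print the index, type name, and text of every token produced.
for num, (ttyp, ttok, _, _, _) in enumerate(tokenize.generate_tokens(code_reader)):
    print("%5d %15s %r" % (num, token.tok_name[ttyp], ttok))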
}}}
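Until the tokenizer itself is fixed, a possible workaround is to normalize the end of the source before tokenizing it. The sketch below assumes Python 3 and assumes the trailing whitespace carries no meaning (it would not be safe for code that ends inside an unterminated string, for example). `normalize_tail` and `token_names` are hypothetical helper names, not part of the tokenize module.

```python
import io
import token
import tokenize


def normalize_tail(code):
    """Drop trailing whitespace and end the source with a single newline."""
    # Hypothetical helper; assumes trailing whitespace is not significant.
    return code.rstrip() + "\n"


def token_names(code):
    """Return the token type names produced for the normalized source."""
    readline = io.StringIO(normalize_tail(code)).readline
    return [token.tok_name[tok[0]] for tok in tokenize.generate_tokens(readline)]


# The 10,000 trailing spaces are removed before the tokenizer ever sees them.
names = token_names("x = 1" + " " * 10000)
print(names)
```

With the padding stripped, the tokenizer only sees a short, newline-terminated line, so the per-space ERRORTOKENs and the repeated failed matches should not occur.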

History
Date	User	Action	Args
2012-10-06 21:09:21	nedbat	set	recipients: + nedbat
2012-10-06 21:09:21	nedbat	set	messageid: <1349557761.41.0.537472367724.issue16152@psf.upfronthosting.co.za>
2012-10-06 21:09:21	nedbat	link	issue16152 messages
2012-10-06 21:09:19	nedbat	create