classification
Title: tokenize: add support for tokenizing 'str' objects
Type: enhancement
Stage: patch review
Components: Library (Lib)
Versions: Python 3.4
process
Status: open
Resolution:
Dependencies:
Superseder:
Assigned To:
Nosy List: ark3, martin.panter, meador.inge, michael.foord, ncoghlan, takluyver, vstinner
Priority: normal
Keywords: patch

Created on 2010-09-28 13:17 by meador.inge, last changed 2018-05-17 20:48 by takluyver.

Files
File name Uploaded Description
issue9969.patch meador.inge, 2011-09-05 02:11 Patch against tip (3.3.0a0)
Messages (11)
msg117516 - Author: Meador Inge (meador.inge) * (Python committer) Date: 2010-09-28 13:17
Currently with 'py3k' only 'bytes' objects are accepted for tokenization:

>>> import io
>>> import tokenize
>>> tokenize.tokenize(io.StringIO("1+1").readline)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/minge/Code/python/py3k/Lib/tokenize.py", line 360, in tokenize
    encoding, consumed = detect_encoding(readline)
  File "/Users/minge/Code/python/py3k/Lib/tokenize.py", line 316, in detect_encoding
    if first.startswith(BOM_UTF8):
TypeError: Can't convert 'bytes' object to str implicitly
>>> tokenize.tokenize(io.BytesIO(b"1+1").readline)
<generator object _tokenize at 0x1007566e0>

In a discussion on python-dev (http://www.mail-archive.com/python-dev@python.org/msg52107.html) it was generally considered to be a good idea to add support for tokenizing 'str' objects as well.
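For reference, a minimal sketch of what the two calls above do once the generator is actually consumed. The helper name `token_strings` is illustrative only and not part of the tokenize module; the exact wording of the TypeError message varies between Python versions, so only the exception type is captured here.

```python
import io
import tokenize

def token_strings(source_bytes):
    """Return (token type name, token string) pairs for a bytes source."""
    readline = io.BytesIO(source_bytes).readline
    return [(tokenize.tok_name[tok.type], tok.string)
            for tok in tokenize.tokenize(readline)]

# bytes input: the first token reports the detected encoding,
# the remaining tokens describe the source itself.
pairs = token_strings(b"1+1")
for name, string in pairs:
    print(name, repr(string))

# str input: encoding detection compares the first line against a bytes BOM,
# which fails with a TypeError.
str_error = None
try:
    list(tokenize.tokenize(io.StringIO("1+1").readline))
except TypeError as exc:
    str_error = exc
print(type(str_error).__name__)
```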
msg117523 - Author: Michael Foord (michael.foord) * (Python committer) Date: 2010-09-28 14:04
Note from Nick Coghlan from the Python-dev discussion:

A very quick scan of _tokenize suggests it is designed to support
detect_encoding returning None to indicate the line iterator will
return already decoded lines. This is confirmed by the fact the
standard library uses it that way (via generate_tokens).

An API that accepts a string, wraps a StringIO around it, then calls
_tokenize with an encoding of None would appear to be the answer here.
A feature request on the tracker is the best way to make that happen.
msg117554 - Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2010-09-28 21:54
Possible approach (untested):

import io
# _tokenize is a private helper in py3k's Lib/tokenize.py
from tokenize import tokenize, _tokenize

def get_tokens(source):
    if hasattr(source, "encode"):
        # Already decoded, so bypass encoding detection
        return _tokenize(io.StringIO(source).readline, None)
    # Otherwise attempt to detect the correct encoding
    return tokenize(io.BytesIO(source).readline)
msg117571 - Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-09-29 01:06
See also issue #4626 which introduced PyCF_IGNORE_COOKIE and PyPARSE_IGNORE_COOKIE flags to support unicode string for the builtin compile() function.
msg117652 - Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2010-09-29 20:46
As per Antoine's comment on #9873, requiring a real string via isinstance(source, str) to trigger the string IO version is likely to be cleaner than attempting to duck-type this. Strings are an area where we make so many assumptions about the way their internals work that duck-typing generally isn't all that effective.
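A rough sketch of what the isinstance-based variant could look like. Two assumptions are made here: the name `get_tokens` is a hypothetical helper (not a stdlib API), and the text path goes through `tokenize.generate_tokens` rather than the private `_tokenize` helper, since the private helper's signature is not something to rely on.

```python
import io
import tokenize

def get_tokens(source):
    """Tokenize a str or bytes object (hypothetical helper, not a stdlib API)."""
    if isinstance(source, str):
        # Already decoded: skip encoding detection entirely.
        return tokenize.generate_tokens(io.StringIO(source).readline)
    # bytes-like input: let tokenize() detect the encoding (BOM or coding cookie).
    return tokenize.tokenize(io.BytesIO(source).readline)

str_tokens = list(get_tokens("1+1"))
bytes_tokens = list(get_tokens(b"1+1"))
```

With this sketch the only difference between the two results should be the leading ENCODING token, which appears for the bytes input and not for the str input.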
msg121712 - Author: Abhay Saxena (ark3) Date: 2010-11-20 18:43
If the goal is tokenize(...) accepting a text I/O readline, we already have the (undocumented) generate_tokens(readline).
msg121843 - Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2010-11-21 02:54
The idea is to bring the API up a level, and also to take care of wrapping the file-like object around the source string/byte sequence.
msg143506 - Author: Meador Inge (meador.inge) * (Python committer) Date: 2011-09-05 02:11
Attached is a first cut at a patch.
msg252299 - Author: Martin Panter (martin.panter) * (Python committer) Date: 2015-10-05 03:27
I left some comments. Also, it would be nice to use the new function in the documentation example, which currently suggests tunnelling through UTF-8 but not adding an encoding comment. And see the patch for Issue 12486, which highlights a couple of other places that would benefit from this function.
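One reading of the concern with tunnelling through UTF-8, sketched under assumptions: if the str being tokenized happens to carry its own coding cookie, encoding it to UTF-8 and handing the bytes to tokenize() lets the cookie decide how those bytes are decoded. The sample source and the helper `string_literals` below are made up for illustration.

```python
import io
import tokenize

# Hypothetical source text: a str that happens to carry a coding cookie.
source = "# -*- coding: latin-1 -*-\ns = 'é'\n"

def string_literals(tokens):
    """Collect the text of every STRING token."""
    return [tok.string for tok in tokens if tok.type == tokenize.STRING]

# Tunnelling: encode to UTF-8, then let tokenize() detect the encoding.
# The cookie is honoured, so the UTF-8 bytes are decoded as Latin-1.
tunnelled = string_literals(
    tokenize.tokenize(io.BytesIO(source.encode("utf-8")).readline))

# Text path: no encoding detection, the str is tokenized as-is.
direct = string_literals(
    tokenize.generate_tokens(io.StringIO(source).readline))

print(tunnelled, direct)
```

A str-accepting entry point avoids the mismatch because the text is never re-decoded.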
msg252303 - Author: Martin Panter (martin.panter) * (Python committer) Date: 2015-10-05 05:14
Actually maybe Issue 12486 is good enough to fix this too. With the patch proposed there, tokenize_basestring("source") would just be equivalent to

tokenize(StringIO("source").readline)
msg316983 - Author: Thomas Kluyver (takluyver) * Date: 2018-05-17 20:48
I've opened a PR for issue #12486, which would make the existing but undocumented 'generate_tokens' function public:

https://github.com/python/cpython/pull/6957

I agree that it would be good to design a nicer API for this, but the perfect shouldn't be the enemy of the good.
History
Date User Action Args
2018-05-17 20:48:30 takluyver set messages: + msg316983
2015-10-05 05:14:30 martin.panter set messages: + msg252303
2015-10-05 03:27:26 martin.panter set nosy: + martin.panter; messages: + msg252299
2012-10-15 13:28:08 serhiy.storchaka set versions: + Python 3.4, - Python 3.2, Python 3.3
2011-09-05 02:11:52 meador.inge set files: + issue9969.patch; keywords: + patch; messages: + msg143506; stage: needs patch -> patch review
2011-05-31 18:16:30 takluyver set nosy: + takluyver
2010-11-21 02:54:03 ncoghlan set messages: + msg121843
2010-11-20 18:43:03 ark3 set nosy: + ark3; messages: + msg121712
2010-09-29 20:46:41 ncoghlan set messages: + msg117652
2010-09-29 01:06:53 vstinner set nosy: + vstinner; messages: + msg117571
2010-09-28 21:54:16 ncoghlan set nosy: + ncoghlan; messages: + msg117554
2010-09-28 14:04:57 michael.foord set messages: + msg117523
2010-09-28 13:34:51 michael.foord set nosy: + michael.foord
2010-09-28 13:18:28 meador.inge set components: + Library (Lib)
2010-09-28 13:17:17 meador.inge create