tokenize: add support for tokenizing 'str' objects #54178
Comments
Currently with 'py3k' only 'bytes' objects are accepted for tokenization:

```
>>> import io
>>> import tokenize
>>> tokenize.tokenize(io.StringIO("1+1").readline)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/minge/Code/python/py3k/Lib/tokenize.py", line 360, in tokenize
    encoding, consumed = detect_encoding(readline)
  File "/Users/minge/Code/python/py3k/Lib/tokenize.py", line 316, in detect_encoding
    if first.startswith(BOM_UTF8):
TypeError: Can't convert 'bytes' object to str implicitly
>>> tokenize.tokenize(io.BytesIO(b"1+1").readline)
<generator object _tokenize at 0x1007566e0>
```

In a discussion on python-dev (http://www.mail-archive.com/python-dev@python.org/msg52107.html) it was generally considered to be a good idea to add support for tokenizing 'str' objects as well.
Note from Nick Coghlan from the python-dev discussion: a very quick scan of _tokenize suggests it is designed to support this; an API that accepts a string, wraps a StringIO around it, then calls _tokenize with an encoding of None should do the trick.
Possible approach (untested):

```python
# Sketch intended to live in Lib/tokenize.py, where io, _tokenize,
# and tokenize are all already in scope.
def get_tokens(source):
    if hasattr(source, "encode"):
        # A str is already decoded, so bypass encoding detection
        return _tokenize(io.StringIO(source).readline, None)
    # Otherwise (bytes) attempt to detect the correct encoding
    return tokenize(io.BytesIO(source).readline)
```
See also issue bpo-4626, which introduced the PyCF_IGNORE_COOKIE and PyPARSE_IGNORE_COOKIE flags to support unicode strings in the builtin compile() function.
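For comparison, a minimal demonstration of the compile() behaviour that work enabled: compile() accepts an already-decoded str directly, and any coding cookie in the text is ignored (the latin-1 cookie below is purely illustrative):

```python
# compile() takes str source directly; since the text is already
# decoded, the coding cookie on the first line has no effect.
code = compile("# -*- coding: latin-1 -*-\nresult = 1 + 1\n", "<string>", "exec")
namespace = {}
exec(code, namespace)
```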
As per Antoine's comment on bpo-9873, requiring a real string via isinstance(source, str) to trigger the string-IO version is likely to be cleaner than attempting to duck-type this. Strings are an area where we make so many assumptions about the way their internals work that duck typing generally isn't all that effective.
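A sketch of that isinstance-based dispatch, using the public generate_tokens() as a stand-in for the internal _tokenize() call (the name get_tokens comes from the proposal above and is hypothetical):

```python
import io
import tokenize

def get_tokens(source):
    # Hypothetical wrapper: dispatch on the concrete type rather
    # than duck-typing on an "encode" attribute.
    if isinstance(source, str):
        # Already decoded text: no encoding detection needed
        return tokenize.generate_tokens(io.StringIO(source).readline)
    # Raw bytes: let tokenize() detect the encoding from a BOM or cookie
    return tokenize.tokenize(io.BytesIO(source).readline)

str_tokens = [tok.string for tok in get_tokens("1+1")]
byte_tokens = [tok.string for tok in get_tokens(b"1+1")]
```

Apart from the leading ENCODING token on the bytes path, both calls yield the same token stream.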
If the goal is tokenize(...) accepting a text I/O readline, we already have the (undocumented) generate_tokens(readline).
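That existing text-mode path can be exercised as follows; generate_tokens() takes a readline callable returning str, so no encoding detection runs, and untokenize() round-trips the result:

```python
import io
import tokenize

# Tokenize already-decoded text via the undocumented text-mode API
tokens = list(tokenize.generate_tokens(io.StringIO("x = 42\n").readline))
names = [tokenize.tok_name[tok.type] for tok in tokens]
# A full token sequence round-trips back to the original source
roundtrip = tokenize.untokenize(tokens)
```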
The idea is to bring the API up a level, and also take care of wrapping the file-like object around the source string/byte sequence.
Attached is a first cut at a patch.
I left some comments. It would also be nice to use the new function in the documentation example, which currently suggests tunnelling through UTF-8 without adding an encoding comment. And see the patch for bpo-12486, which highlights a couple of other places that would benefit from this function.
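For reference, the UTF-8 tunnelling that the current docs example amounts to looks like this: the str is encoded to bytes purely so the bytes-only tokenize() entry point can consume it, and a synthetic ENCODING token comes back first:

```python
import io
import tokenize

# Tunnel a str through UTF-8 to satisfy the bytes-only API
source = "x = 42\n"
readline = io.BytesIO(source.encode("utf-8")).readline
tokens = list(tokenize.tokenize(readline))
```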
Actually, maybe bpo-12486 is good enough to fix this too. With the patch proposed there, tokenize_basestring("source") would just be equivalent to tokenize(StringIO("source").readline).