tokenize: add support for tokenizing 'str' objects #54178

Open
meadori opened this issue Sep 28, 2010 · 11 comments
Labels
stdlib Python modules in the Lib dir type-feature A feature request or enhancement

Comments

@meadori
Member

meadori commented Sep 28, 2010

BPO 9969
Nosy @ncoghlan, @vstinner, @voidspace, @meadori, @takluyver, @vadmium
Files
  • issue9969.patch: Patch against tip (3.3.0a0)
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = None
    closed_at = None
    created_at = <Date 2010-09-28.13:17:17.746>
    labels = ['type-feature', 'library']
    title = "tokenize: add support for tokenizing 'str' objects"
    updated_at = <Date 2018-05-17.20:48:30.205>
    user = 'https://github.com/meadori'

    bugs.python.org fields:

    activity = <Date 2018-05-17.20:48:30.205>
    actor = 'takluyver'
    assignee = 'none'
    closed = False
    closed_date = None
    closer = None
    components = ['Library (Lib)']
    creation = <Date 2010-09-28.13:17:17.746>
    creator = 'meador.inge'
    dependencies = []
    files = ['23099']
    hgrepos = []
    issue_num = 9969
    keywords = ['patch']
    message_count = 11.0
    messages = ['117516', '117523', '117554', '117571', '117652', '121712', '121843', '143506', '252299', '252303', '316983']
    nosy_count = 7.0
    nosy_names = ['ncoghlan', 'vstinner', 'michael.foord', 'meador.inge', 'ark3', 'takluyver', 'martin.panter']
    pr_nums = []
    priority = 'normal'
    resolution = None
    stage = 'patch review'
    status = 'open'
    superseder = None
    type = 'enhancement'
    url = 'https://bugs.python.org/issue9969'
    versions = ['Python 3.4']

    @meadori
    Member Author

    meadori commented Sep 28, 2010

    Currently with 'py3k' only 'bytes' objects are accepted for tokenization:

    >>> import io
    >>> import tokenize
    >>> tokenize.tokenize(io.StringIO("1+1").readline)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/Users/minge/Code/python/py3k/Lib/tokenize.py", line 360, in tokenize
        encoding, consumed = detect_encoding(readline)
      File "/Users/minge/Code/python/py3k/Lib/tokenize.py", line 316, in detect_encoding
        if first.startswith(BOM_UTF8):
    TypeError: Can't convert 'bytes' object to str implicitly
    >>> tokenize.tokenize(io.BytesIO(b"1+1").readline)
    <generator object _tokenize at 0x1007566e0>

    In a discussion on python-dev (http://www.mail-archive.com/python-dev@python.org/msg52107.html) it was generally considered to be a good idea to add support for tokenizing 'str' objects as well.
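
    (For illustration only, not part of the original report: under the current bytes-only API the usual workaround is to encode the string yourself and tokenize the resulting bytes, i.e. tunnel through UTF-8.)

    import io
    import tokenize

    source = "1 + 1"
    # Encode to UTF-8 and let tokenize() run its normal encoding detection
    # (with no BOM or coding cookie it falls back to UTF-8).
    for tok in tokenize.tokenize(io.BytesIO(source.encode("utf-8")).readline):
        print(tok)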

    @meadori meadori added the type-feature and stdlib labels Sep 28, 2010
    @voidspace
    Contributor

    Note from Nick Coghlan from the Python-dev discussion:

    A very quick scan of _tokenize suggests it is designed to support
    detect_encoding returning None to indicate the line iterator will
    return already decoded lines. This is confirmed by the fact that the
    standard library uses it that way (via generate_tokens).

    An API that accepts a string, wraps a StringIO around it, then calls
    _tokenize with an encoding of None would appear to be the answer here.
    A feature request on the tracker is the best way to make that happen.

    @ncoghlan
    Contributor

    Possible approach (untested):

    import io
    from tokenize import tokenize, _tokenize

    def get_tokens(source):
        if hasattr(source, "encode"):
            # Already decoded (a str), so bypass encoding detection and pass
            # encoding=None to the private helper, as generate_tokens() does
            return _tokenize(io.StringIO(source).readline, None)
        # Otherwise attempt to detect the correct encoding from the bytes
        return tokenize(io.BytesIO(source).readline)
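
    (Added note: with the imports filled in above, list(get_tokens("1+1")) and list(get_tokens(b"1+1")) would both yield token streams; the str branch simply skips detect_encoding().)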

    @vstinner
    Member

    See also issue bpo-4626, which introduced the PyCF_IGNORE_COOKIE and PyPARSE_IGNORE_COOKIE flags to support unicode strings in the builtin compile() function.

    @ncoghlan
    Contributor

    As per Antoine's comment on bpo-9873, requiring a real string via isinstance(source, str) to trigger the string IO version is likely to be cleaner than attempting to duck-type this. Strings are an area where we make so many assumptions about the way their internals work that duck-typing generally isn't all that effective.
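
    (Added sketch, not from the issue: the isinstance-based variant of the earlier get_tokens() idea, using the private _tokenize(readline, encoding) helper referred to above.)

    import io
    from tokenize import tokenize, _tokenize

    def get_tokens(source):
        if isinstance(source, str):
            # A real str: wrap it in StringIO and pass encoding=None,
            # meaning "the lines are already decoded".
            return _tokenize(io.StringIO(source).readline, None)
        # Anything else is treated as bytes and goes through encoding detection.
        return tokenize(io.BytesIO(source).readline)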

    @ark3
    Mannequin

    ark3 mannequin commented Nov 20, 2010

    If the goal is tokenize(...) accepting a text I/O readline, we already have the (undocumented) generate_tokens(readline).
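
    (For reference, a minimal usage sketch of that existing path; generate_tokens() was undocumented at the time but takes a text-mode readline, so a plain str works directly.)

    import io
    from tokenize import generate_tokens

    # generate_tokens() expects a readline callable that returns str lines.
    for tok in generate_tokens(io.StringIO("1 + 1").readline):
        print(tok)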

    @ncoghlan
    Contributor

    The idea is to bring the API up a level, and also to take care of wrapping the file-like object around the source string/byte sequence.

    @meadori
    Member Author

    meadori commented Sep 5, 2011

    Attached is a first cut at a patch.

    @vadmium
    Member

    vadmium commented Oct 5, 2015

    I left some comments. Also, it would be nice to use the new function in the documentation example, which currently suggests tunnelling through UTF-8 but not adding an encoding comment. And see the patch for bpo-12486, which highlights a couple of other places that would benefit from this function.

    @vadmium
    Member

    vadmium commented Oct 5, 2015

    Actually, maybe bpo-12486 is good enough to fix this too. With the patch proposed there, tokenize_basestring("source") would just be equivalent to:

    tokenize(StringIO("source").readline)

    @takluyver
    Mannequin

    takluyver mannequin commented May 17, 2018

    I've opened a PR for issue bpo-12486, which would make the existing but undocumented 'generate_tokens' function public:

    #6957

    I agree that it would be good to design a nicer API for this, but the perfect shouldn't be the enemy of the good.

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022