tokenize module w/ coding cookie #38293

Closed
warsaw opened this issue Apr 11, 2003 · 18 comments

@warsaw
Member

warsaw commented Apr 11, 2003

BPO 719888
Nosy @loewis, @warsaw, @mdickinson, @tpn, @voidspace
Files
  • test_tokenize_patch.tar: test_tokenizer patches and supporting text files.
  • tokenize.zip: Changes to tokenize.py with tests for Python 3.
  • tokenize_patch.diff: Patch for tokenize, tests and standard library and tools usage of tokenize.
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    GitHub fields:

    assignee = 'https://github.com/tpn'
    closed_at = <Date 2008-04-22.19:09:53.864>
    created_at = <Date 2003-04-11.19:24:37.000>
    labels = ['expert-unicode']
    title = 'tokenize module w/ coding cookie'
    updated_at = <Date 2008-04-22.19:09:53.862>
    user = 'https://github.com/warsaw'

    bugs.python.org fields:

    activity = <Date 2008-04-22.19:09:53.862>
    actor = 'trent'
    assignee = 'trent'
    closed = True
    closed_date = <Date 2008-04-22.19:09:53.864>
    closer = 'trent'
    components = ['Unicode']
    creation = <Date 2003-04-11.19:24:37.000>
    creator = 'barry'
    dependencies = []
    files = ['9686', '9735', '9741']
    hgrepos = []
    issue_num = 719888
    keywords = ['patch']
    message_count = 18.0
    messages = ['15444', '15445', '63612', '63618', '63619', '63620', '63949', '63951', '63953', '63955', '63957', '63959', '63980', '63982', '63990', '63998', '64006', '65679']
    nosy_count = 5.0
    nosy_names = ['loewis', 'barry', 'mark.dickinson', 'trent', 'michael.foord']
    pr_nums = []
    priority = 'normal'
    resolution = None
    stage = None
    status = 'closed'
    superseder = None
    type = None
    url = 'https://bugs.python.org/issue719888'
    versions = ['Python 3.0']

    @warsaw
    Member Author

    warsaw commented Apr 11, 2003

    The tokenize module should honor the coding cookie in a
    file, probably so that it returns Unicode strings with
    decoded characters.
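
    For context, a minimal sketch of what honouring the cookie looks like
    with the detect_encoding() helper that comes up later in this thread
    (current Python 3 API; it did not exist when this report was filed):

        import io
        import tokenize

        # A source fragment with a PEP 263 coding cookie on its first line.
        source = b"# -*- coding: latin-1 -*-\nname = 'Andr\xe9'\n"

        # detect_encoding() reads at most the first two lines and returns the
        # normalized encoding name plus the raw byte lines it consumed.
        encoding, consumed = tokenize.detect_encoding(io.BytesIO(source).readline)
        print(encoding)   # 'iso-8859-1'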

    @loewis
    Mannequin

    loewis mannequin commented Apr 12, 2006

    I don't think I will do anything about this anytime soon, so
    unassigning myself.

    @mdickinson
    Member

    This issue is currently causing test_tokenize failures in Python 3.0.
    There are other ways to fix the test failures, but making tokenize honor
    the source file encoding seems like the right thing to do to me.

    Does this still seem like a good idea to everyone?

    @loewis
    Mannequin

    loewis mannequin commented Mar 17, 2008

    In 3k, the tokenize module should definitely return strings, and, in
    doing so, it should definitely consider the encoding declaration (and
    also the default encoding in absence of the encoding declaration).

    For 2.6, I wouldn't mind if it were changed incompatibly so that it
    returns Unicode strings, or else that it parses in Unicode, and then
    encodes back to the source encoding before returning anything.
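
    A small sketch of the "default encoding in the absence of the encoding
    declaration" point, using the behaviour that eventually landed in
    Python 3 (illustrative only):

        import io
        import tokenize

        # No BOM and no coding cookie: detection falls back to the UTF-8
        # default source encoding (PEP 3120).
        no_cookie = b"x = 1\n"
        encoding, _ = tokenize.detect_encoding(io.BytesIO(no_cookie).readline)
        print(encoding)   # 'utf-8'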

    @tpn
    Member

    tpn commented Mar 17, 2008

    I've attached a patch to test_tokenize.py and a bunch of text files
    (which should be dropped into Lib/test) that highlight this issue a
    *lot* better than the current state of affairs.

    The existing implementation defines roundup() in the doctest and then
    defines it again in the code body. The last for loop in the doctest
    fails every so often, and what it is failing on isn't at all clear,
    because a) ten random files are selected out of the 332 in Lib/test,
    and b) there's no way to figure out which files are causing the
    failure short of hacking another method into the test case that
    replicates what the doctest does, with some additional print
    statements. (That's the approach I took, only to get bitten by the
    fact that roundup() was resolving to the bogus definition in the code
    body rather than the working one in the doctest, which made the
    behaviour even more misleading.)

    FWIW, the file that causes the exception is test_doctest2.py as it
    contains encoded characters.

    So, the approach this patch takes is to drop the 'pick ten random test
    files and untokenize/tokenize' approach and add a class that
    specifically tests for the tokenizer's compliance with PEP-0263.

    I'll move on to a patch for tokenize.py next, but this patch is OK to
    commit now -- at the very least it will clean up the misleading errors
    currently being reported by the plethora of red 3.0 buildbots.
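
    A hypothetical sketch, in the spirit of the compliance tests described
    above rather than the attached patch, of what such a PEP 263 check can
    look like against the bytes-based API in current Python 3:

        import io
        import tokenize
        import unittest

        class PEP263ComplianceTest(unittest.TestCase):
            def test_latin1_coding_cookie(self):
                # b'\xe9' is an e-acute in the declared latin-1 encoding.
                source = b"# -*- coding: latin-1 -*-\nname = 'Andr\xe9'\n"
                tokens = list(tokenize.tokenize(io.BytesIO(source).readline))
                # The first token reports the detected (normalized) encoding...
                self.assertEqual(tokens[0].type, tokenize.ENCODING)
                self.assertEqual(tokens[0].string, "iso-8859-1")
                # ...and string tokens come back as decoded str, not bytes.
                strings = [t.string for t in tokens if t.type == tokenize.STRING]
                self.assertEqual(strings, ["'Andr\xe9'"])

        if __name__ == "__main__":
            unittest.main()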

    @tpn
    Member

    tpn commented Mar 17, 2008

    Hmm, I take it multiple file uploads aren't supported. I don't want to
    use svn diff for the text files, as it looks like it butchers the BOM
    encodings, so tar it is! (Untar in the root py3k/ directory.)

    @voidspace
    Contributor

    Made quite extensive changes to tokenize.py (with tests) for Py3k. This
    migrates it to a 'bytes' API so that it can correctly decode Python
    source files following PEP-0263.
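
    A short usage sketch of that bytes API as it exists in current Python 3
    ("example_module.py" is a hypothetical file name):

        import tokenize

        # The file must be opened in binary mode: readline has to yield bytes
        # so that tokenize can detect and apply the PEP 263 coding cookie
        # itself before handing back decoded str tokens.
        with open("example_module.py", "rb") as f:
            for tok in tokenize.tokenize(f.readline):
                print(tok)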

    @mdickinson
    Member

    Michael, is the disappearance of the generate_tokens function in the new
    version of tokenize.py intentional?

    @voidspace
    Contributor

    That was 'by discussion with wiser heads than I'. The existing module
    has an old backwards compatibility interface called 'tokenize'. That can
    be deprecated in 2.6.

    As 'tokenize' is really the ideal name for the main entry point for the
    module, 'generate_tokens' became tokenize for Py3.

    @mdickinson
    Member

    Is it worth keeping generate_tokens as an alias for tokenize, just
    to avoid gratuitous 2-to-3 breakage? Maybe not---I guess they're
    different beasts, in that one wants a string-valued iterator and the
    other wants a bytes-valued iterator.

    So if I understand correctly, the readline argument to tokenize would
    have to return bytes instances. Would it be worth adding a check for
    this, to catch possible misuse? You could put the check in
    detect_encoding, so that it just checks that the first one or two
    lines yielded by readline have the correct type, and assume the rest
    are okay.

    @mdickinson
    Member

    Sorry---ignore the last comment; if readline() doesn't supply bytes
    then the line.decode('ascii') will fail with an AttributeError. So
    there won't be silent failure.

    I'll try thinking first and posting later next time.

    @tpn
    Member

    tpn commented Mar 18, 2008

    Tested patch on Win x86/x64 2k8, XP & FreeBSD 6.2, +1.

    @tpn assigned tpn Mar 18, 2008
    @mdickinson
    Member

    With the patch,

    ./python.exe Lib/test/regrtest.py test_tokenize

    fails for me with the following output:

    Macintosh-2:py3k dickinsm$ ./python.exe Lib/test/regrtest.py test_tokenize
    test_tokenize
    test test_tokenize produced unexpected output:


    *** lines 2-5 of actual output doesn't appear in expected output after line 1:
    + testing: /Users/dickinsm/python_source/py3k/Lib/test/tokenize_tests-latin1-coding-cookie-and-utf8-bom-sig.txt
    + testing: /Users/dickinsm/python_source/py3k/Lib/test/tokenize_tests-no-coding-cookie-and-utf8-bom-sig-only.txt
    + testing: /Users/dickinsm/python_source/py3k/Lib/test/tokenize_tests-utf8-coding-cookie-and-utf8-bom-sig.txt
    + testing: /Users/dickinsm/python_source/py3k/Lib/test/tokenize_tests-utf8-coding-cookie-and-utf8-bom-sig.txt


    1 test failed:
    test_tokenize
    [65880 refs]

    I get something similar on Linux.

    @voidspace
    Contributor

    If you remove the following line from the tests (which generates
    spurious additional output on stdout) then the problem goes away:

            print('testing: %s' % path, end='\n')

    @voidspace
    Contributor

    *Full* patch (excluding the new dependent test text files) for Python 3.
    Includes fixes for standard library and tools usage of tokenize.

    If it breaks anything blame Trent... ;-)

    @mdickinson
    Member

    All tests pass for me on OS X 10.5.2 and SuSE Linux 10.2 (32-bit)!

    @loewis
    Mannequin

    loewis mannequin commented Mar 18, 2008

    > Is it worth keeping generate_tokens as an alias for tokenize, just
    > to avoid gratuitous 2-to-3 breakage? Maybe not---I guess they're
    > different beasts, in that one wants a string-valued iterator and the
    > other wants a bytes-valued iterator.

    Exactly so - that was the primary rationale for renaming it. It
    shouldn't "silently" return something else, but there should be
    an explicit clue that you need to port actively.
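
    For what it's worth, in current Python 3 both entry points ended up
    existing: tokenize() takes a bytes-yielding readline and applies the
    PEP 263 cookie itself, while generate_tokens() takes a str-yielding
    readline for source the caller has already decoded. A small
    illustrative sketch (not from this issue's patch):

        import io
        import tokenize

        src_bytes = b"# -*- coding: latin-1 -*-\nname = 'Andr\xe9'\n"
        src_text = src_bytes.decode("latin-1")

        # Bytes API: detects and applies the coding cookie, and emits a
        # leading ENCODING token.
        byte_tokens = list(tokenize.tokenize(io.BytesIO(src_bytes).readline))

        # Text API: assumes the caller has already decoded the source.
        text_tokens = list(tokenize.generate_tokens(io.StringIO(src_text).readline))

        # Apart from the leading ENCODING token, the two streams agree.
        assert ([(t.type, t.string) for t in byte_tokens[1:]]
                == [(t.type, t.string) for t in text_tokens])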

    @tpn
    Member

    tpn commented Apr 22, 2008

    This was fixed in trunk in r61573, and merged to py3k in r61982.

    @tpn closed this as completed Apr 22, 2008
    @ezio-melotti transferred this issue from another repository Apr 9, 2022