Python tokenizer rewriting #69829

Closed
serhiy-storchaka opened this issue Nov 17, 2015 · 7 comments
Labels: 3.7 (EOL) end of life; interpreter-core (Objects, Python, Grammar, and Parser dirs); type-bug (An unexpected behavior, bug, or error)

Comments

@serhiy-storchaka
Member

BPO 25643
Nosy @brettcannon, @vstinner, @serhiy-storchaka, @1st1, @matrixise, @DimitrisJim, @pablogsal
PRs
  • bpo-25643: Refactor the C tokenizer #25050
  • bpo-25643: Fix tokenizer error when raw decoding null bytes #25080
Dependencies
  • bpo-26581: Double coding cookie
Files
  • tokenize_input.patch
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = 'https://github.com/serhiy-storchaka'
    closed_at = <Date 2021-03-28.22:49:06.587>
    created_at = <Date 2015-11-17.01:27:32.904>
    labels = ['interpreter-core', 'type-bug', '3.7']
    title = 'Python tokenizer rewriting'
    updated_at = <Date 2021-03-29.21:53:38.609>
    user = 'https://github.com/serhiy-storchaka'

    bugs.python.org fields:

    activity = <Date 2021-03-29.21:53:38.609>
    actor = 'pablogsal'
    assignee = 'serhiy.storchaka'
    closed = True
    closed_date = <Date 2021-03-28.22:49:06.587>
    closer = 'pablogsal'
    components = ['Interpreter Core']
    creation = <Date 2015-11-17.01:27:32.904>
    creator = 'serhiy.storchaka'
    dependencies = ['26581']
    files = ['41058']
    hgrepos = []
    issue_num = 25643
    keywords = ['patch']
    message_count = 7.0
    messages = ['254778', '255082', '255355', '262091', '376742', '389654', '389692']
    nosy_count = 8.0
    nosy_names = ['brett.cannon', 'vstinner', 'python-dev', 'serhiy.storchaka', 'yselivanov', 'matrixise', 'Jim Fasarakis-Hilliard', 'pablogsal']
    pr_nums = ['25050', '25080']
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue25643'
    versions = ['Python 3.7']

    @serhiy-storchaka
    Member Author

    Here is a preliminary patch that refactors the lowest level of the Python tokenizer: reading and decoding. It splits the code into smaller, simpler functions, decreases the source size by 37 lines, and fixes several bugs: bpo-14811, bpo-18961, and a number of others. I added tests for most of the fixed bugs (except leaks and others that are hard to reproduce). Fixing the remaining bugs may be harder, especially the issues with null bytes (bpo-1105770, bpo-20115).

    Many bugs could be fixed easily if the whole Python file were read into memory instead of being read line by line. I don't know if that is acceptable.
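
To make the trade-off above concrete, here is a minimal sketch in Python. It is not the C tokenizer code touched by the patch; it uses the standard-library `tokenize.detect_encoding` as a stand-in for the decoding step, and the helper names `decode_whole` and `decode_by_line` are hypothetical, chosen only for illustration.

```python
import io
import tokenize


def decode_whole(data: bytes) -> str:
    """Read everything into memory, then decode the buffer in one step."""
    # detect_encoding() inspects a BOM and at most the first two lines.
    encoding, _ = tokenize.detect_encoding(io.BytesIO(data).readline)
    return data.decode(encoding)


def decode_by_line(data: bytes):
    """Decode one physical line at a time, as a line-oriented reader would."""
    stream = io.BytesIO(data)
    encoding, _ = tokenize.detect_encoding(stream.readline)
    stream.seek(0)  # rewind: detection already consumed the first lines
    for raw_line in stream:
        yield raw_line.decode(encoding)


# Illustrative input: a coding cookie followed by one non-ASCII byte.
sample = b"# -*- coding: latin-1 -*-\nname = '\xe9'\n"
assert decode_whole(sample) == "".join(decode_by_line(sample))
```

For well-formed input both approaches yield the same text. The line-oriented version has to carry state (the detected encoding, the stream position) across calls, which is the kind of bookkeeping that disappears when the whole buffer is available at once.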

    @serhiy-storchaka serhiy-storchaka self-assigned this Nov 17, 2015
    @serhiy-storchaka serhiy-storchaka added interpreter-core (Objects, Python, Grammar, and Parser dirs) type-bug An unexpected behavior, bug, or error labels Nov 17, 2015
    @matrixise
    Member

    Hi Serhiy,

    Just for your information, though I think you already know: the tests pass ;-)

    [398/399] test_multiprocessing_spawn (138 sec) -- running: test_tools (108 sec)
    [399/399] test_tools (121 sec)
    385 tests OK.
    3 tests altered the execution environment:
    test___all__ test_site test_warnings
    11 tests skipped:
    test_devpoll test_kqueue test_msilib test_ossaudiodev
    test_startfile test_tix test_tk test_ttk_guionly test_winreg
    test_winsound test_zipfile64

    I am interested in this part of CPython. I am not an expert in
    lexing and parsing, but how can I help you? I am a novice in this
    domain.

    Stephane

    @vstinner
    Member

    "especially for issues with null byte"

    I don't think that we should put too much energy into correctly handling NUL bytes. I see NUL bytes in code as bugs in the code, not in the Python parser. We *might* try to give warnings or better error messages to the user, that's all.
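
As a rough illustration of the "better error message" idea, the sketch below shows how the built-in `compile()` reacts to a NUL byte in source text. The helper name `compile_error_type` is hypothetical. The exception type is an assumption that depends on the Python version: older releases raise `ValueError`, newer ones raise `SyntaxError`, so the sketch catches both.

```python
def compile_error_type(source: str):
    """Return the name of the exception raised by compile(), or None on success."""
    try:
        compile(source, "<example>", "exec")
    except (ValueError, SyntaxError) as exc:
        # Which of the two is raised for a NUL byte depends on the Python version.
        return type(exc).__name__
    return None


clean = "x = 1\n"
with_nul = "x = 1\x00\n"  # the same statement with an embedded NUL byte

print(compile_error_type(clean))
print(compile_error_type(with_nul))
```

The clean source compiles, while the source containing a NUL byte is rejected outright instead of being silently truncated or misparsed.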

    @python-dev
    Mannequin

    python-dev mannequin commented Mar 20, 2016

    New changeset 23a7481eafd4 by Serhiy Storchaka in branch 'default':
    Issues bpo-25643, bpo-26581: Added new tests for detecting Python source code encoding.
    https://hg.python.org/cpython/rev/23a7481eafd4

    @serhiy-storchaka serhiy-storchaka added the 3.7 (EOL) end of life label Mar 14, 2017
    @brettcannon
    Member

    @serhiy: did you still want to commit this?

    @pablogsal
    Member

    New changeset 261a452 by Pablo Galindo in branch 'master':
    bpo-25643: Refactor the C tokenizer into smaller, logical units (GH-25050)
    261a452

    @vstinner
    Member

    Oh, 6 years to fix this bug. Better late than never ;-) Thanks for reporting and for fixing it!

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022