Python tokenizer rewriting #69829

serhiy-storchaka · 2015-11-17T01:27:33Z

BPO	25643
Nosy	@brettcannon, @vstinner, @serhiy-storchaka, @1st1, @matrixise, @DimitrisJim, @pablogsal
PRs	bpo-25643: Refactor the C tokenizer #25050; bpo-25643: Fix tokenizer error when raw decoding null bytes #25080
Dependencies	bpo-26581: Double coding cookie
Files	tokenize_input.patch

^{Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.}

GitHub fields:

assignee = 'https://github.com/serhiy-storchaka'
closed_at = <Date 2021-03-28.22:49:06.587>
created_at = <Date 2015-11-17.01:27:32.904>
labels = ['interpreter-core', 'type-bug', '3.7']
title = 'Python tokenizer rewriting'
updated_at = <Date 2021-03-29.21:53:38.609>
user = 'https://github.com/serhiy-storchaka'

bugs.python.org fields:

activity = <Date 2021-03-29.21:53:38.609>
actor = 'pablogsal'
assignee = 'serhiy.storchaka'
closed = True
closed_date = <Date 2021-03-28.22:49:06.587>
closer = 'pablogsal'
components = ['Interpreter Core']
creation = <Date 2015-11-17.01:27:32.904>
creator = 'serhiy.storchaka'
dependencies = ['26581']
files = ['41058']
hgrepos = []
issue_num = 25643
keywords = ['patch']
message_count = 7.0
messages = ['254778', '255082', '255355', '262091', '376742', '389654', '389692']
nosy_count = 8.0
nosy_names = ['brett.cannon', 'vstinner', 'python-dev', 'serhiy.storchaka', 'yselivanov', 'matrixise', 'Jim Fasarakis-Hilliard', 'pablogsal']
pr_nums = ['25050', '25080']
priority = 'normal'
resolution = 'fixed'
stage = 'resolved'
status = 'closed'
superseder = None
type = 'behavior'
url = 'https://bugs.python.org/issue25643'
versions = ['Python 3.7']

serhiy-storchaka · 2015-11-17T01:27:25Z

Here is a preliminary patch that refactors the lowest level of the Python tokenizer: reading and decoding. It splits the code into smaller, simpler functions, decreases the source size by 37 lines, and fixes several bugs, including bpo-14811 and bpo-18961. I added tests for most of the fixed bugs (except leaks and others that are hard to reproduce). Fixing the remaining bugs may be harder, especially the issues with null bytes (bpo-1105770, bpo-20115).

Many bugs could be fixed easily if the whole Python file were read into memory instead of being read line by line. I don't know if that is acceptable.
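To illustrate the whole-file idea, here is a minimal Python-level sketch. It is not the C tokenizer code from the patch; `decode_source` is a hypothetical helper that uses the standard `tokenize.detect_encoding` function to pick the encoding and then decodes the entire buffer in one step.

```python
import io
import tokenize


def decode_source(data: bytes) -> list[str]:
    """Hypothetical helper: decode a whole source buffer at once.

    The encoding is taken from a BOM or coding cookie if present,
    otherwise it defaults to UTF-8 (the behavior of detect_encoding).
    """
    encoding, _ = tokenize.detect_encoding(io.BytesIO(data).readline)
    text = data.decode(encoding)
    # Split only after decoding, so no multi-byte sequence is cut in half.
    return text.splitlines(keepends=True)


raw = "# -*- coding: latin-1 -*-\ns = 'é'\n".encode("latin-1")
for line in decode_source(raw):
    print(repr(line))
```

Decoding the full buffer before splitting it into lines avoids having to track partial lines and buffer boundaries, at the cost of holding the whole file in memory.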

matrixise · 2015-11-22T06:29:49Z

Hi Serhiy,

Just for your information, though I think you already know: the tests pass ;-)

[398/399] test_multiprocessing_spawn (138 sec) -- running: test_tools (108 sec)
[399/399] test_tools (121 sec)
385 tests OK.
3 tests altered the execution environment:
    test___all__ test_site test_warnings
11 tests skipped:
    test_devpoll test_kqueue test_msilib test_ossaudiodev
    test_startfile test_tix test_tk test_ttk_guionly test_winreg
    test_winsound test_zipfile64

I am interested in this part of CPython. I am not an expert in
lexing and parsing, but how can I help you? I am a novice in this
domain.

Stephane

vstinner · 2015-11-25T14:17:10Z

"especially for issues with null byte"

I don't think that we should put too much energy into handling NUL bytes correctly. I see NUL bytes in code as bugs in the code, not in the Python parser. We *might* try to give warnings or better error messages to the user, that's all.
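As a rough sketch of what a "better error message" could build on, the following locates the first NUL byte in a source buffer and reports its line and column. `find_nul` is a hypothetical helper written for illustration; it is not part of CPython or of the patch discussed here.

```python
def find_nul(data: bytes):
    """Hypothetical helper: return (line, column) of the first NUL byte.

    Lines are 1-based and columns are 0-based byte offsets within the line.
    Returns None when the buffer contains no NUL byte.
    """
    offset = data.find(b"\x00")
    if offset < 0:
        return None
    # Count the newlines before the NUL to get its line number.
    line = data.count(b"\n", 0, offset) + 1
    # Column is the distance from the start of that line.
    line_start = data.rfind(b"\n", 0, offset) + 1
    return line, offset - line_start


source = b"x = 1\ny = \x002\n"
pos = find_nul(source)
if pos is not None:
    print(f"source contains a NUL byte at line {pos[0]}, column {pos[1]}")
```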

python-dev · 2016-03-20T21:30:29Z

New changeset 23a7481eafd4 by Serhiy Storchaka in branch 'default':
Issues bpo-25643, bpo-26581: Added new tests for detecting Python source code encoding.
https://hg.python.org/cpython/rev/23a7481eafd4
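For readers unfamiliar with source encoding detection, this short sketch shows the kind of behavior such tests exercise, using the pure-Python `tokenize.detect_encoding` function rather than the C tokenizer. The `source_encoding` wrapper and the sample inputs are assumptions made for illustration; they are not taken from the changeset.

```python
import io
import tokenize


def source_encoding(data: bytes) -> str:
    """Hypothetical wrapper: return the encoding detect_encoding reports."""
    encoding, _lines = tokenize.detect_encoding(io.BytesIO(data).readline)
    return encoding


samples = [
    b"x = 1\n",                                          # no cookie, no BOM
    b"# -*- coding: latin-1 -*-\nx = 1\n",               # cookie on line 1
    b"#!/usr/bin/env python\n# coding: utf-8\nx = 1\n",  # cookie on line 2
    b"\xef\xbb\xbfx = 1\n",                              # UTF-8 BOM only
]
for sample in samples:
    print(source_encoding(sample))
```

Only the first two lines are searched for a cookie, which is why the second-line case above still works with a shebang on the first line.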

brettcannon · 2020-09-11T22:10:31Z

@serhiy: did you still want to commit this?

pablogsal · 2021-03-28T22:48:13Z

New changeset 261a452 by Pablo Galindo in branch 'master':
bpo-25643: Refactor the C tokenizer into smaller, logical units (GH-25050)
261a452

vstinner · 2021-03-29T12:28:06Z

Oh, 6 years to fix this bug. Better late than never ;-) Thanks for reporting and for fixing it!

serhiy-storchaka self-assigned this Nov 17, 2015

serhiy-storchaka added the interpreter-core (Objects, Python, Grammar, and Parser dirs) and type-bug (An unexpected behavior, bug, or error) labels Nov 17, 2015

serhiy-storchaka added the 3.7 (EOL) end of life label Mar 14, 2017

pablogsal closed this as completed Mar 28, 2021

ezio-melotti transferred this issue from another repository Apr 10, 2022
