This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python Developer Guide.

classification
Title: Python tokenizer rewriting
Type: behavior Stage: resolved
Components: Interpreter Core Versions: Python 3.7
process
Status: closed Resolution: fixed
Dependencies: 26581 Superseder:
Assigned To: serhiy.storchaka Nosy List: Jim Fasarakis-Hilliard, brett.cannon, matrixise, pablogsal, python-dev, serhiy.storchaka, vstinner, yselivanov
Priority: normal Keywords: patch

Created on 2015-11-17 01:27 by serhiy.storchaka, last changed 2022-04-11 14:58 by admin. This issue is now closed.

Files
File name Uploaded Description
tokenize_input.patch serhiy.storchaka, 2015-11-17 01:27
Pull Requests
URL Status Linked
PR 25050 merged pablogsal, 2021-03-28 04:12
PR 25080 merged pablogsal, 2021-03-29 21:53
Messages (7)
msg254778 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2015-11-17 01:27
Here is a preliminary patch that refactors the lowest level of the Python tokenizer: reading and decoding. It splits the code into smaller, simpler functions, decreases the source size by 37 lines, and fixes bugs: issue14811, issue18961, and a number of others. Tests are added for most of the fixed bugs (except leaks and others that are hard to reproduce). But fixing the other bugs can be harder, especially the issues with null bytes (issue1105770, issue20115).

Many bugs could be fixed easily by reading the whole Python file into memory instead of reading it line by line. I don't know whether that is acceptable.
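
For illustration only, here is a minimal sketch of the two reading strategies, written against the pure-Python `tokenize` module rather than the C tokenizer that the patch touches. The helper names (`tokens_line_by_line`, `tokens_whole_file`, `compare_strategies`) and the sample source are hypothetical; nothing here is taken from the patch itself.

```python
import io
import os
import tempfile
import tokenize


def tokens_line_by_line(path):
    """Tokenize by letting tokenize pull one line at a time from the file."""
    with open(path, "rb") as f:
        return [(tok.type, tok.string) for tok in tokenize.tokenize(f.readline)]


def tokens_whole_file(path):
    """Read the whole file into memory, decode it once, then tokenize."""
    with open(path, "rb") as f:
        data = f.read()
    # Detect the encoding from the in-memory bytes (BOM or coding cookie).
    encoding, _ = tokenize.detect_encoding(io.BytesIO(data).readline)
    text = data.decode(encoding)
    return [(tok.type, tok.string)
            for tok in tokenize.generate_tokens(io.StringIO(text).readline)]


def compare_strategies(source):
    """Write source (bytes) to a temporary file and tokenize it both ways."""
    fd, path = tempfile.mkstemp(suffix=".py")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(source)
        return tokens_line_by_line(path), tokens_whole_file(path)
    finally:
        os.remove(path)


by_line, in_memory = compare_strategies(b"def f(x):\n    return x + 1\n")
# tokenize.tokenize() emits an extra leading ENCODING token; the rest should match.
print(by_line[1:] == in_memory)
```

For well-formed input the two strategies should produce the same tokens; the difference the message points at lies in how much state the reader has to carry between lines, which this sketch does not model.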
msg255082 - (view) Author: Stéphane Wirtel (matrixise) * (Python committer) Date: 2015-11-22 06:29
Hi Serhiy,

Just for your information, though I think you already know: the tests pass ;-)

[398/399] test_multiprocessing_spawn (138 sec) -- running: test_tools (108 sec)
[399/399] test_tools (121 sec)
385 tests OK.
3 tests altered the execution environment:
    test___all__ test_site test_warnings
11 tests skipped:
    test_devpoll test_kqueue test_msilib test_ossaudiodev
    test_startfile test_tix test_tk test_ttk_guionly test_winreg
    test_winsound test_zipfile64

I am interested in this part of CPython. I am not an expert in
lexing and parsing, and I am a novice in this domain, but how can
I help you?

Stephane
msg255355 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2015-11-25 14:17
"especially for issues with null byte"

I don't think that we should put too much energy into handling NUL bytes correctly. I see NUL bytes in code as bugs in the code, not in the Python parser. We *might* try to give warnings or better error messages to the user, that's all.
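
As a rough illustration of how a NUL byte in source is reported at the Python level, here is a small sketch using the built-in `compile()`. The helper name `check_source` is hypothetical. The exact exception type is an assumption that depends on the interpreter version (to my knowledge `ValueError` in older Python 3 releases and `SyntaxError` in newer ones), so the sketch catches both and makes no claim about the message text.

```python
def check_source(source):
    """Return "ok" if source compiles, else the name of the exception raised."""
    try:
        compile(source, "<example>", "exec")
    except (ValueError, SyntaxError) as exc:
        # Which of the two is raised for a NUL byte varies between versions.
        return "rejected: " + type(exc).__name__
    return "ok"


print(check_source("x = 1\n"))
print(check_source("x = 1\x00\n"))
```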
msg262091 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2016-03-20 21:30
New changeset 23a7481eafd4 by Serhiy Storchaka in branch 'default':
Issues #25643, #26581: Added new tests for detecting Python source code encoding.
https://hg.python.org/cpython/rev/23a7481eafd4
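
A short sketch of what source encoding detection looks like from Python, using `tokenize.detect_encoding` from the standard library. This is not the test code from the changeset above; the helper name `source_encoding` and the sample inputs are made up for illustration.

```python
import io
import tokenize


def source_encoding(source):
    """Return the encoding that tokenize.detect_encoding reports for source (bytes)."""
    encoding, _lines_read = tokenize.detect_encoding(io.BytesIO(source).readline)
    return encoding


# No BOM and no coding cookie: the default applies.
print(source_encoding(b"x = 1\n"))
# Coding cookie on the first line.
print(source_encoding(b"# -*- coding: latin-1 -*-\nx = 1\n"))
# Coding cookie on the second line, after a shebang.
print(source_encoding(b"#!/usr/bin/env python\n# coding: cp1252\nx = 1\n"))
# UTF-8 byte order mark and no cookie.
print(source_encoding(b"\xef\xbb\xbfx = 1\n"))

# A UTF-8 BOM combined with a conflicting cookie is rejected.
try:
    source_encoding(b"\xef\xbb\xbf# coding: latin-1\nx = 1\n")
except SyntaxError as exc:
    print("conflict:", exc)
```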
msg376742 - (view) Author: Brett Cannon (brett.cannon) * (Python committer) Date: 2020-09-11 22:10
@serhiy: did you still want to commit this?
msg389654 - (view) Author: Pablo Galindo Salgado (pablogsal) * (Python committer) Date: 2021-03-28 22:48
New changeset 261a452a1300eeeae1428ffd6e6623329c085e2c by Pablo Galindo in branch 'master':
bpo-25643: Refactor the C tokenizer into smaller, logical units (GH-25050)
https://github.com/python/cpython/commit/261a452a1300eeeae1428ffd6e6623329c085e2c
msg389692 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2021-03-29 12:28
Oh, 6 years to fix this bug. Better late than never ;-) Thanks for reporting and for fixing it!
History
Date User Action Args
2022-04-11 14:58:23 admin set github: 69829
2021-04-13 17:07:04 vstinner link issue14811 superseder
2021-03-29 21:53:38 pablogsal set pull_requests: + pull_request23830
2021-03-29 12:28:05 vstinner set messages: + msg389692
2021-03-28 22:49:06 pablogsal set status: open -> closed
    resolution: fixed
    stage: patch review -> resolved
2021-03-28 22:48:13 pablogsal set messages: + msg389654
2021-03-28 04:12:55 pablogsal set keywords: + patch
    nosy: + pablogsal
    pull_requests: + pull_request23799
    stage: patch review
2020-09-11 22:10:31 brett.cannon set messages: + msg376742
2017-03-14 14:57:52 serhiy.storchaka set keywords: - patch
    versions: + Python 3.7, - Python 3.6
2017-03-14 14:29:12 Jim Fasarakis-Hilliard set nosy: + Jim Fasarakis-Hilliard
2017-03-14 13:52:27 serhiy.storchaka link issue3353 dependencies
2016-03-20 21:30:29 python-dev set nosy: + python-dev
    messages: + msg262091
2016-03-17 12:04:22 serhiy.storchaka set dependencies: + Double coding cookie
2015-11-25 14:17:10 vstinner set nosy: + vstinner
    messages: + msg255355
2015-11-22 06:29:50 matrixise set messages: + msg255082
2015-11-22 04:47:25 matrixise set nosy: + matrixise
2015-11-17 17:42:47 brett.cannon set nosy: + brett.cannon
2015-11-17 17:22:56 yselivanov set nosy: + yselivanov
2015-11-17 01:27:33 serhiy.storchaka create