classification
Title: [lib2to3] Synchronize token.py and tokenize.py with the standard library
Type: Stage: patch review
Components: 2to3 (2.x to 3.x conversion tool), Library (Lib) Versions: Python 3.8
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: lukasz.langa, serhiy.storchaka
Priority: normal Keywords: patch

Created on 2018-04-23 01:04 by lukasz.langa, last changed 2018-09-15 17:37 by monson.

Pull Requests
URL Status Linked Edit
PR 6572 open lukasz.langa, 2018-04-23 01:09
PR 6573 merged lukasz.langa, 2018-04-23 01:12
PR 8950 monson, 2018-09-15 17:37
Messages (5)
msg315639 - Author: Łukasz Langa (lukasz.langa) * (Python committer) Date: 2018-04-23 01:04
lib2to3's token.py and tokenize.py were initially copies of the respective
files from the standard library.  They were copied to allow Python 3 to read
Python 2's grammar.

Since 2006, lib2to3 has grown to be widely used as a Concrete Syntax Tree, also
for parsing Python 3 code.  Support for Python 3 grammar was added over time
but, sadly, the lib2to3 copies diverged from the main token.py and tokenize.py.

This change brings them back together, minimizing the differences to the bare
minimum that is in fact required by lib2to3.  Before this change, almost every
line in lib2to3/pgen2/tokenize.py was different from tokenize.py.  After this
change, the diff between the two files is only 175 lines long and is entirely
filled with relevant Python 2 compatibility bits.

Merging the implementations brings numerous fixes to the lib2to3 tokenizer:

+ docstrings made as similar as possible
+ ported `TokenInfo`
+ ported `tokenize.tokenize()` and `tokenize.open()`
+ removed Python 2-only implementation cruft
+ fixes Unicode identifier handling
+ fixes string prefix handling
+ fixes Ellipsis handling
+ Untokenizer backported bugfixes:
	- 5e6db313686c200da425a54d2e0c95fa40107b1d
	- 9dc3a36c849c15c227a8af218cfb215abe7b3c48
	- 5b8d2c3af76e704926cf5915ad0e6af59a232e61
	- e411b6629fb5f7bc01bec89df75737875ce6d8f5
	- BPO-2495
+ tokenizer doesn't crash on missing newline at the end of the
  stream (added \Z (end of string) to PseudoExtras) - BPO-16152
+ `find_cookie` includes file name in error messages, if available
+ `find_cookie` raises SyntaxError on invalid encodings: BPO-14990
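
For reference, several of the behaviors listed above can be observed in the standard library tokenizer, which is the target of the synchronization. The following is only a sketch against `Lib/tokenize.py`, not against lib2to3's copy; the helper names `token_pairs` and `cookie_error` are made up for illustration.

```python
import io
import tokenize


def token_pairs(source: bytes):
    """Tokenize bytes and return (exact token name, string) pairs."""
    readline = io.BytesIO(source).readline
    return [
        (tokenize.tok_name[tok.exact_type], tok.string)
        for tok in tokenize.tokenize(readline)  # yields TokenInfo tuples
    ]


# Input has a Unicode identifier, an Ellipsis literal, and no trailing newline.
pairs = token_pairs("é = ...".encode("utf-8"))
for name, string in pairs:
    print(f"{name:10} {string!r}")


def cookie_error(first_line: bytes):
    """Return the SyntaxError message for a bad coding cookie, or None."""
    try:
        tokenize.detect_encoding(io.BytesIO(first_line).readline)
    except SyntaxError as exc:
        return str(exc)
    return None


# An unknown codec in the coding cookie is reported as SyntaxError.
message = cookie_error(b"# -*- coding: no-such-codec -*-\n")
print(message)
```

The first token is the ENCODING token, and the stream still ends with ENDMARKER even though the input lacks a final newline.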

Improvements to lib2to3/pgen2/token.py:

+ taken from the current Lib/token.py
+ tokens renumbered to match Lib/token.py
+ `__all__` properly defined
+ ASYNC, AWAIT and BACKQUOTE exist under different numbers (100 + old number)
+ ELLIPSIS added
+ ENCODING added
msg315640 - Author: Łukasz Langa (lukasz.langa) * (Python committer) Date: 2018-04-23 01:05
### Diff between files

The unified diff between tokenize implementations is here:
https://gist.github.com/ambv/679018041d85dd1a7497e6d89c45fb86

It clocks in at 275 lines, but that's because it includes context. The actual
diff is 175 lines long.

To make it that small, I needed to move some insignificant bits in
Lib/tokenize.py.  This is what the other PR on this issue is about.
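
The gap between the two line counts comes from context lines in the unified diff. A small sketch with `difflib` on toy inputs (not the actual tokenize files; `diff_len` is a hypothetical helper) shows how the same change set gets longer once context is included:

```python
import difflib

# Toy "files": 20 lines, two of which differ.
old = [f"line {i}\n" for i in range(20)]
new = list(old)
new[5] = "changed line 5\n"
new[15] = "changed line 15\n"


def diff_len(a, b, context):
    """Count the lines of a unified diff with the given amount of context."""
    return len(list(difflib.unified_diff(a, b, "a.py", "b.py", n=context)))


with_context = diff_len(old, new, 3)     # default-style diff, 3 context lines
without_context = diff_len(old, new, 0)  # only headers and changed lines
print(with_context, without_context)
```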
msg315650 - Author: Łukasz Langa (lukasz.langa) * (Python committer) Date: 2018-04-23 08:07
New changeset c2d384dbd7c6ed9bdfaac45f05b463263c743ee7 by Łukasz Langa in branch 'master':
bpo-33338: [tokenize] Minor code cleanup (#6573)
https://github.com/python/cpython/commit/c2d384dbd7c6ed9bdfaac45f05b463263c743ee7
msg315802 - Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2018-04-26 14:50
It seems to me that regular expressions used in the lib2to3 version are more efficient but more complex.

$ ./python -m timeit -s 'import re; p = re.compile(r"0[bB](?:_?[01])+"); s = "0b"+"_0101"*16' 'p.match(s)'
100000 loops, best of 5: 2.45 usec per loop

$ ./python -m timeit -s 'import re; p = re.compile(r"0[bB]_?[01]+(?:_[01]+)*"); s = "0b"+"_0101"*16' 'p.match(s)'
200000 loops, best of 5: 1.08 usec per loop

$ ./python -m timeit -s 'import re; p = re.compile(r"0[xX](?:_?[0-9a-fA-F])+[lL]?"); s = "0x_0123_4567_89ab_cdef"' 'p.match(s)'
500000 loops, best of 5: 815 nsec per loop

$ ./python -m timeit -s 'import re; p = re.compile(r"0[xX]_?[\da-fA-F]+(?:_[\da-fA-F]+)*[lL]?"); s = "0x_0123_4567_89ab_cdef"' 'p.match(s)'
500000 loops, best of 5: 542 nsec per loop

Since the performance of lib2to3 is important, it is better to keep the current regexes.

But using \d in Python 3 is a bug: in str patterns it matches any Unicode decimal digit, not only ASCII 0-9, so it should be replaced with [0-9]. This also speeds up the regex:

$ ./python -m timeit -s 'import re; p = re.compile(r"0[xX]_?[0-9a-fA-F]+(?:_[0-9a-fA-F]+)*[lL]?"); s = "0x_0123_4567_89ab_cdef"' 'p.match(s)'
500000 loops, best of 5: 471 nsec per loop
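
To illustrate the \d issue, here is a minimal sketch comparing the two hex patterns from the timings above on a string that uses non-ASCII digits. The variable names are made up for this example; the patterns are the ones quoted in this message.

```python
import re

# Same patterns as in the timings above, differing only in \d vs [0-9].
unicode_hex = re.compile(r"0[xX]_?[\da-fA-F]+(?:_[\da-fA-F]+)*[lL]?")
ascii_hex = re.compile(r"0[xX]_?[0-9a-fA-F]+(?:_[0-9a-fA-F]+)*[lL]?")

# "0x" followed by ARABIC-INDIC DIGIT ONE and TWO: not an ASCII hex literal.
candidate = "0x\u0661\u0662"

accepted_by_backslash_d = unicode_hex.fullmatch(candidate) is not None
accepted_by_ascii_class = ascii_hex.fullmatch(candidate) is not None
print(accepted_by_backslash_d, accepted_by_ascii_class)

# Both patterns agree on an ordinary ASCII literal.
plain = "0x_0123_4567_89ab_cdef"
both_accept_plain = bool(unicode_hex.fullmatch(plain)) and bool(ascii_hex.fullmatch(plain))
```

Only the \d variant accepts the non-ASCII digits, which is why it is too permissive for a number pattern in a str-based tokenizer.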
msg315807 - Author: Łukasz Langa (lukasz.langa) * (Python committer) Date: 2018-04-26 17:01
I agree with you Serhiy, there are a number of things I want to make faster. But first I'd like to merge implementations so there is a clear one-way diff ("this is what we updated in lib2to3 to make it consistent with Lib/tokenize.py").  Then I want to optimize.
History
Date User Action Args
2018-09-15 17:37:33 monson set pull_requests: + pull_request8757
2018-04-26 17:01:08 lukasz.langa set messages: + msg315807
2018-04-26 14:50:37 serhiy.storchaka set nosy: + serhiy.storchaka
messages: + msg315802
2018-04-23 08:07:19 lukasz.langa set messages: + msg315650
2018-04-23 01:12:32 lukasz.langa set pull_requests: + pull_request6274
2018-04-23 01:09:05 lukasz.langa set keywords: + patch
stage: patch review
pull_requests: + pull_request6269
2018-04-23 01:05:31 lukasz.langa set messages: + msg315640
2018-04-23 01:04:56 lukasz.langa create