classification
Title:      Add f-string support to tokenize.py
Type:       behavior
Stage:      resolved
Components: Library (Lib)
Versions:   Python 3.6

process
Status:      closed
Resolution:  fixed
Dependencies:
Superseder:
Assigned To: eric.smith
Nosy List:   Nan Wu, eric.smith, martin.panter, python-dev, skrah
Priority:    normal
Keywords:    easy

Created on 2015-10-04 17:23 by skrah, last changed 2022-04-11 14:58 by admin. This issue is now closed.

Files
File name          Uploaded
tokenize.patch     Nan Wu, 2015-10-07 17:02
issue25311.diff    eric.smith, 2015-10-09 15:27
issue25311-1.diff  eric.smith, 2015-10-20 17:22
Messages (13)
msg252274 - Author: Stefan Krah (skrah) (Python committer) Date: 2015-10-04 17:23
I think tokenize.py needs to be updated to support f-strings.


BTW, the f-string implementation seems to be incredibly robust. Nice work!
msg252275 - Author: Eric V. Smith (eric.smith) (Python committer) Date: 2015-10-04 17:34
Thanks for noticing tokenize.py. And thanks for the kind note!
msg252295 - Author: Martin Panter (martin.panter) (Python committer) Date: 2015-10-05 00:42
I was just about to make the same bug report :) I guess it would be fine to tokenize f-strings the same way as other strings; it probably just needs an F added to the right regular expression.

$ ./python -btWall -m tokenize
"string"
1,0-1,8:            STRING         '"string"'     
1,8-1,9:            NEWLINE        '\n'           
b"string"
3,0-3,9:            STRING         'b"string"'    
3,9-3,10:           NEWLINE        '\n'           
f"string"
4,0-4,1:            NAME           'f'            
4,1-4,9:            STRING         '"string"'     
4,9-4,10:           NEWLINE        '\n'
msg252479 - Author: Nan Wu (Nan Wu) Date: 2015-10-07 17:02
Added 'f'/'F' to the StringPrefix regex and also updated the quote dictionary.
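
For illustration, a minimal sketch of the kind of change described above (the patterns here are simplified stand-ins, not the actual StringPrefix regex from Lib/tokenize.py):

import re

# Simplified before/after string-prefix patterns (illustrative only).
# Adding [fF] lets an f-string literal match as a single STRING token
# instead of falling apart into NAME + STRING.
OLD_PREFIX = r'(?:[bB][rR]?|[rR][bB]?|[uU])?'
NEW_PREFIX = r'(?:[bB][rR]?|[rR][bB]?|[fF]|[uU])?'

old_string = re.compile(OLD_PREFIX + r'"[^"\n]*"')
new_string = re.compile(NEW_PREFIX + r'"[^"\n]*"')

print(old_string.match('f"string"'))          # None: 'f' is not a known prefix
print(new_string.match('f"string"').group())  # f"string"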
msg252485 - Author: Martin Panter (martin.panter) (Python committer) Date: 2015-10-07 21:08
Thanks for the patch. Do you want to try adding a test case? See TokenizeTest.test_string() at /Lib/test/test_tokenize.py:189 for a guide, though I would suggest a new test_fstring() method.

Also, F-strings can be combined with the raw string syntax. I wonder if you need to add support for things like rf". . ." and FR'''. . .'''.
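
For reference, a standalone version of the check such a test would make (hypothetical helper shown here; an actual test should use the TokenizeTest harness in Lib/test/test_tokenize.py as suggested above):

import io
import tokenize

def assert_single_string_token(source):
    # With f-string support in place, the whole literal, prefix
    # included, should come back as one STRING token.
    tokens = list(tokenize.generate_tokens(io.StringIO(source).readline))
    strings = [tok for tok in tokens if tok.type == tokenize.STRING]
    assert len(strings) == 1, tokens
    assert strings[0].string == source.rstrip('\n'), strings

for literal in ('f"string"\n', 'rf". . ."\n', "FR'''. . .'''\n"):
    assert_single_string_token(literal)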
msg252522 - Author: Eric V. Smith (eric.smith) (Python committer) Date: 2015-10-08 09:28
Yes, both 'fr' and 'rf' need to be supported (and all upper/lower variants). And in the future, maybe 'fb' (and 'rfb', 'bfr', ...).

Unfortunately, the regex doesn't scale well for all of the combinations.
msg252610 - Author: Eric V. Smith (eric.smith) (Python committer) Date: 2015-10-09 13:34
I think the best way to approach this is to generate (in code) all of the places where string prefixes appear: StringPrefix, endpats, triple_quoted, and single_quoted.

With the currently valid combinations of f, b, r, and u, I count 24 combinations:
['B', 'BR', 'Br', 'F', 'FR', 'Fr', 'R', 'RB', 'RF', 'Rb', 'Rf', 'U', 'b', 'bR', 'br', 'f', 'fR', 'fr', 'r', 'rB', 'rF', 'rb', 'rf', 'u']

If I add "fb" strings (plus raw), I count 72 combinations:
['B', 'BFR', 'BFr', 'BR', 'BRF', 'BRf', 'BfR', 'Bfr', 'Br', 'BrF', 'Brf', 'F', 'FBR', 'FBr', 'FR', 'FRB', 'FRb', 'FbR', 'Fbr', 'Fr', 'FrB', 'Frb', 'R', 'RB', 'RBF', 'RBf', 'RF', 'RFB', 'RFb', 'Rb', 'RbF', 'Rbf', 'Rf', 'RfB', 'Rfb', 'U', 'b', 'bFR', 'bFr', 'bR', 'bRF', 'bRf', 'bfR', 'bfr', 'br', 'brF', 'brf', 'f', 'fBR', 'fBr', 'fR', 'fRB', 'fRb', 'fbR', 'fbr', 'fr', 'frB', 'frb', 'r', 'rB', 'rBF', 'rBf', 'rF', 'rFB', 'rFb', 'rb', 'rbF', 'rbf', 'rf', 'rfB', 'rfb', 'u']

Coding these combinations by hand seems insane.
msg252613 - Author: Eric V. Smith (eric.smith) (Python committer) Date: 2015-10-09 14:06
Oops, make that 80 combinations (I forgot the various 'fb' ones):

['B', 'BF', 'BFR', 'BFr', 'BR', 'BRF', 'BRf', 'Bf', 'BfR', 'Bfr', 'Br', 'BrF', 'Brf', 'F', 'FB', 'FBR', 'FBr', 'FR', 'FRB', 'FRb', 'Fb', 'FbR', 'Fbr', 'Fr', 'FrB', 'Frb', 'R', 'RB', 'RBF', 'RBf', 'RF', 'RFB', 'RFb', 'Rb', 'RbF', 'Rbf', 'Rf', 'RfB', 'Rfb', 'U', 'b', 'bF', 'bFR', 'bFr', 'bR', 'bRF', 'bRf', 'bf', 'bfR', 'bfr', 'br', 'brF', 'brf', 'f', 'fB', 'fBR', 'fBr', 'fR', 'fRB', 'fRb', 'fb', 'fbR', 'fbr', 'fr', 'frB', 'frb', 'r', 'rB', 'rBF', 'rBf', 'rF', 'rFB', 'rFb', 'rb', 'rbF', 'rbf', 'rf', 'rfB', 'rfb', 'u']


import itertools as _itertools

def _all_string_prefixes():
    # The valid string prefixes. Only contains the lower-case versions,
    #  and doesn't contain any permutations (includes 'fr', but not
    #  'rf'). The various permutations will be generated.
    _valid_string_prefixes = ['b', 'r', 'u', 'f', 'br', 'fr', 'fb', 'fbr']
    result = set()
    for prefix in _valid_string_prefixes:
        for t in _itertools.permutations(prefix):
            # create a list with upper and lower versions of each
            #  character
            for u in _itertools.product(*[(c, c.upper()) for c in t]):
                result.add(''.join(u))
    return result
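
A quick sanity check of the generator (driver lines added for illustration; they are not part of the message above):

# Should agree with the hand-counted list: 80 prefixes in total.
prefixes = _all_string_prefixes()
print(len(prefixes))           # 80
print(sorted(prefixes)[:6])    # ['B', 'BF', 'BFR', 'BFr', 'BR', 'BRF']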
msg252619 - Author: Eric V. Smith (eric.smith) (Python committer) Date: 2015-10-09 15:27
My first attempt. Many more tests are needed.

I'm going to need to spend some time trying to figure out how parts of tokenize.py actually work. I'm not sure, for example, that endpats is initialized correctly. There definitely aren't enough tests, since if I comment out parts of endpats the tests still pass.
msg253109 - Author: Eric V. Smith (eric.smith) (Python committer) Date: 2015-10-17 00:47
Multi-line string tests were added in changeset 91c44dc35dfd. That will make changes for this issue safer. Updated patch to come.
msg253236 - Author: Eric V. Smith (eric.smith) (Python committer) Date: 2015-10-20 17:22
This patch cleans up string matching in tokenize.py, and adds f-string support.
msg253461 - Author: Roundup Robot (python-dev) (Python triager) Date: 2015-10-26 08:38
New changeset 21f6c4378846 by Eric V. Smith in branch 'default':
Issue 25311: Add support for f-strings to tokenize.py. Also added some comments to explain what's happening, since it's not so obvious.
https://hg.python.org/cpython/rev/21f6c4378846
msg253463 - Author: Eric V. Smith (eric.smith) (Python committer) Date: 2015-10-26 08:44
I've fixed this particular problem, but the tokenize module definitely has some other issues. It recompiles regexes very often when it doesn't need to, it treats single- and triple-quoted strings differently (leading to some code bloat), and so on. I may open another issue to address some of these problems.

And I'll be adding more tests. tokenize is still woefully under-tested.
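
As a sketch of the recompilation point above (a hypothetical fix, not code from this issue's patch), the compiled patterns could simply be memoized:

import functools
import re

# Cache compiled patterns so repeated lookups of the same
# end-of-string pattern don't recompile it each time.
@functools.lru_cache(maxsize=None)
def _compile(pattern):
    return re.compile(pattern)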
History
Date                 User           Action  Args
2022-04-11 14:58:22  admin          set     github: 69498
2015-10-26 08:44:45  eric.smith     set     keywords: - patch; status: open -> closed; stage: patch review -> resolved
2015-10-26 08:44:04  eric.smith     set     resolution: fixed; messages: + msg253463
2015-10-26 08:38:11  python-dev     set     nosy: + python-dev; messages: + msg253461
2015-10-20 17:22:24  eric.smith     set     files: + issue25311-1.diff; messages: + msg253236
2015-10-17 00:47:53  eric.smith     set     messages: + msg253109
2015-10-09 20:38:16  @nkit          set     nosy: - @nkit
2015-10-09 20:17:01  @nkit          set     nosy: + @nkit
2015-10-09 15:27:08  eric.smith     set     files: + issue25311.diff; messages: + msg252619
2015-10-09 14:06:43  eric.smith     set     messages: + msg252613
2015-10-09 13:34:02  eric.smith     set     messages: + msg252610
2015-10-08 09:28:38  eric.smith     set     messages: + msg252522
2015-10-07 21:08:48  martin.panter  set     messages: + msg252485; stage: needs patch -> patch review
2015-10-07 17:02:48  Nan Wu         set     files: + tokenize.patch; nosy: + Nan Wu; messages: + msg252479; keywords: + patch
2015-10-05 00:42:55  martin.panter  set     keywords: + easy; nosy: + martin.panter; messages: + msg252295
2015-10-04 17:34:49  eric.smith     set     assignee: eric.smith; messages: + msg252275
2015-10-04 17:23:27  skrah          create