classification
Title:      Add f-string support to tokenize.py
Type:       behavior
Stage:      resolved
Components: Library (Lib)
Versions:   Python 3.6

process
Status:      closed
Resolution:  fixed
Dependencies:
Superseder:
Assigned To: eric.smith
Nosy List:   Nan Wu, eric.smith, martin.panter, python-dev, skrah
Priority:    normal
Keywords:    easy

Created on 2015-10-04 17:23 by skrah, last changed 2022-04-11 14:58 by admin. This issue is now closed.

Files
File name          Uploaded
tokenize.patch     Nan Wu, 2015-10-07 17:02
issue25311.diff    eric.smith, 2015-10-09 15:27
issue25311-1.diff  eric.smith, 2015-10-20 17:22
Messages (13)
msg252274 - Author: Stefan Krah (skrah) (Python committer) Date: 2015-10-04 17:23
I think tokenize.py needs to be updated to support f-strings.


BTW, the f-string implementation seems to be incredibly robust. Nice work!
msg252275 - Author: Eric V. Smith (eric.smith) (Python committer) Date: 2015-10-04 17:34
Thanks for noticing tokenize.py. And thanks for the kind note!
msg252295 - Author: Martin Panter (martin.panter) (Python committer) Date: 2015-10-05 00:42
I was just about to make the same bug report :) I guess it would be fine to tokenize f-strings the same way as other strings; it probably just needs an F added to the right regular expression.

$ ./python -btWall -m tokenize
"string"
1,0-1,8:            STRING         '"string"'     
1,8-1,9:            NEWLINE        '\n'           
b"string"
3,0-3,9:            STRING         'b"string"'    
3,9-3,10:           NEWLINE        '\n'           
f"string"
4,0-4,1:            NAME           'f'            
4,1-4,9:            STRING         '"string"'     
4,9-4,10:           NEWLINE        '\n'
msg252479 - Author: Nan Wu (Nan Wu) Date: 2015-10-07 17:02
Added 'f'/'F' to the StringPrefix regex and also updated the quote dictionary.
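
For illustration, a minimal sketch of the kind of change described above (the patterns here are simplified stand-ins, not the actual StringPrefix regex from Lib/tokenize.py):

import re

# Simplified before/after string-prefix patterns (illustrative only).
# Adding [fF] lets an f-string literal match as a single STRING token
# instead of falling apart into NAME + STRING.
OLD_PREFIX = r'(?:[bB][rR]?|[rR][bB]?|[uU])?'
NEW_PREFIX = r'(?:[bB][rR]?|[rR][bB]?|[fF]|[uU])?'

old_string = re.compile(OLD_PREFIX + r'"[^"\n]*"')
new_string = re.compile(NEW_PREFIX + r'"[^"\n]*"')

print(old_string.match('f"string"'))          # None: 'f' is not a known prefix
print(new_string.match('f"string"').group())  # f"string"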
msg252485 - Author: Martin Panter (martin.panter) (Python committer) Date: 2015-10-07 21:08
Thanks for the patch. Do you want to try adding a test case? See TokenizeTest.test_string() at /Lib/test/test_tokenize.py:189 for a guide, though I would suggest a new test_fstring() method.

Also, F-strings can be combined with the raw string syntax. I wonder if you need to add support for things like rf". . ." and FR'''. . .'''.
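
For reference, a standalone version of the check such a test would make (hypothetical helper shown here; an actual test should use the TokenizeTest harness in Lib/test/test_tokenize.py as suggested above):

import io
import tokenize

def assert_single_string_token(source):
    # With f-string support in place, the whole literal, prefix
    # included, should come back as one STRING token.
    tokens = list(tokenize.generate_tokens(io.StringIO(source).readline))
    strings = [tok for tok in tokens if tok.type == tokenize.STRING]
    assert len(strings) == 1, tokens
    assert strings[0].string == source.rstrip('\n'), strings

for literal in ('f"string"\n', 'rf". . ."\n', "FR'''. . .'''\n"):
    assert_single_string_token(literal)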
msg252522 - Author: Eric V. Smith (eric.smith) (Python committer) Date: 2015-10-08 09:28
Yes, both 'fr' and 'rf' need to be supported (and all upper/lower variants). And in the future, maybe 'fb' (and 'rfb', 'bfr', ...).

Unfortunately, the regex doesn't scale well for all of the combinations.
msg252610 - Author: Eric V. Smith (eric.smith) (Python committer) Date: 2015-10-09 13:34
I think the best way to approach this is to generate (in code) all of the places where string prefixes appear: StringPrefix, endpats, triple_quoted, and single_quoted.

With the currently valid combinations of f, b, r, and u, I count 24 combinations:
['B', 'BR', 'Br', 'F', 'FR', 'Fr', 'R', 'RB', 'RF', 'Rb', 'Rf', 'U', 'b', 'bR', 'br', 'f', 'fR', 'fr', 'r', 'rB', 'rF', 'rb', 'rf', 'u']

If I add "fb" strings (plus raw), I count 72 combinations:
['B', 'BFR', 'BFr', 'BR', 'BRF', 'BRf', 'BfR', 'Bfr', 'Br', 'BrF', 'Brf', 'F', 'FBR', 'FBr', 'FR', 'FRB', 'FRb', 'FbR', 'Fbr', 'Fr', 'FrB', 'Frb', 'R', 'RB', 'RBF', 'RBf', 'RF', 'RFB', 'RFb', 'Rb', 'RbF', 'Rbf', 'Rf', 'RfB', 'Rfb', 'U', 'b', 'bFR', 'bFr', 'bR', 'bRF', 'bRf', 'bfR', 'bfr', 'br', 'brF', 'brf', 'f', 'fBR', 'fBr', 'fR', 'fRB', 'fRb', 'fbR', 'fbr', 'fr', 'frB', 'frb', 'r', 'rB', 'rBF', 'rBf', 'rF', 'rFB', 'rFb', 'rb', 'rbF', 'rbf', 'rf', 'rfB', 'rfb', 'u']

Coding these combinations by hand seems insane.
msg252613 - Author: Eric V. Smith (eric.smith) (Python committer) Date: 2015-10-09 14:06
Oops, make that 80 combinations (I forgot the various 'fb' ones):

['B', 'BF', 'BFR', 'BFr', 'BR', 'BRF', 'BRf', 'Bf', 'BfR', 'Bfr', 'Br', 'BrF', 'Brf', 'F', 'FB', 'FBR', 'FBr', 'FR', 'FRB', 'FRb', 'Fb', 'FbR', 'Fbr', 'Fr', 'FrB', 'Frb', 'R', 'RB', 'RBF', 'RBf', 'RF', 'RFB', 'RFb', 'Rb', 'RbF', 'Rbf', 'Rf', 'RfB', 'Rfb', 'U', 'b', 'bF', 'bFR', 'bFr', 'bR', 'bRF', 'bRf', 'bf', 'bfR', 'bfr', 'br', 'brF', 'brf', 'f', 'fB', 'fBR', 'fBr', 'fR', 'fRB', 'fRb', 'fb', 'fbR', 'fbr', 'fr', 'frB', 'frb', 'r', 'rB', 'rBF', 'rBf', 'rF', 'rFB', 'rFb', 'rb', 'rbF', 'rbf', 'rf', 'rfB', 'rfb', 'u']


import itertools as _itertools

def _all_string_prefixes():
    # The valid string prefixes. Only contains the lower-case versions,
    #  and doesn't contain any permutations (includes 'fr', but not
    #  'rf'). The various permutations will be generated.
    _valid_string_prefixes = ['b', 'r', 'u', 'f', 'br', 'fr', 'fb', 'fbr']
    result = set()
    for prefix in _valid_string_prefixes:
        for t in _itertools.permutations(prefix):
            # create a list with upper and lower versions of each
            #  character
            for u in _itertools.product(*[(c, c.upper()) for c in t]):
                result.add(''.join(u))
    return result
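
A quick sanity check of the generator (driver lines added for illustration; they are not part of the message above):

# Should agree with the hand-counted list: 80 prefixes in total.
prefixes = _all_string_prefixes()
print(len(prefixes))           # 80
print(sorted(prefixes)[:6])    # ['B', 'BF', 'BFR', 'BFr', 'BR', 'BRF']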
msg252619 - Author: Eric V. Smith (eric.smith) (Python committer) Date: 2015-10-09 15:27
My first attempt. Many more tests are needed.

I'm going to need to spend some time trying to figure out how parts of tokenize.py actually work. I'm not sure, for example, that endpats is initialized correctly. There definitely aren't enough tests, since if I comment out parts of endpats the tests still pass.
msg253109 - Author: Eric V. Smith (eric.smith) (Python committer) Date: 2015-10-17 00:47
Multi-line string tests were added in changeset 91c44dc35dfd. That will make changes for this issue safer. Updated patch to come.
msg253236 - Author: Eric V. Smith (eric.smith) (Python committer) Date: 2015-10-20 17:22
This patch cleans up string matching in tokenize.py, and adds f-string support.
msg253461 - Author: Roundup Robot (python-dev) (Python triager) Date: 2015-10-26 08:38
New changeset 21f6c4378846 by Eric V. Smith in branch 'default':
Issue 25311: Add support for f-strings to tokenize.py. Also added some comments to explain what's happening, since it's not so obvious.
https://hg.python.org/cpython/rev/21f6c4378846
msg253463 - Author: Eric V. Smith (eric.smith) (Python committer) Date: 2015-10-26 08:44
I've fixed this particular problem, but the tokenize module definitely has some other issues. It recompiles regexes very often when it doesn't need to, it treats single- and triple-quoted strings differently (leading to some code bloat), and so on. I may open another issue to address some of these problems.

And I'll be adding more tests. tokenize is still woefully under-tested.
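
As a sketch of the recompilation point above (a hypothetical fix, not code from this issue's patch), the compiled patterns could simply be memoized:

import functools
import re

# Cache compiled patterns so repeated lookups of the same
# end-of-string pattern don't recompile it each time.
@functools.lru_cache(maxsize=None)
def _compile(pattern):
    return re.compile(pattern)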
History
Date                 User           Action  Args
2022-04-11 14:58:22  admin          set     github: 69498
2015-10-26 08:44:45  eric.smith     set     keywords: - patch; status: open -> closed; stage: patch review -> resolved
2015-10-26 08:44:04  eric.smith     set     resolution: fixed; messages: + msg253463
2015-10-26 08:38:11  python-dev     set     nosy: + python-dev; messages: + msg253461
2015-10-20 17:22:24  eric.smith     set     files: + issue25311-1.diff; messages: + msg253236
2015-10-17 00:47:53  eric.smith     set     messages: + msg253109
2015-10-09 20:38:16  @nkit          set     nosy: - @nkit
2015-10-09 20:17:01  @nkit          set     nosy: + @nkit
2015-10-09 15:27:08  eric.smith     set     files: + issue25311.diff; messages: + msg252619
2015-10-09 14:06:43  eric.smith     set     messages: + msg252613
2015-10-09 13:34:02  eric.smith     set     messages: + msg252610
2015-10-08 09:28:38  eric.smith     set     messages: + msg252522
2015-10-07 21:08:48  martin.panter  set     messages: + msg252485; stage: needs patch -> patch review
2015-10-07 17:02:48  Nan Wu         set     files: + tokenize.patch; nosy: + Nan Wu; messages: + msg252479; keywords: + patch
2015-10-05 00:42:55  martin.panter  set     keywords: + easy; nosy: + martin.panter; messages: + msg252295
2015-10-04 17:34:49  eric.smith     set     assignee: eric.smith; messages: + msg252275
2015-10-04 17:23:27  skrah          create