This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

classification
Title: Parsed expression has wrong col_offset when concatenating f-strings
Type: behavior
Stage: resolved
Components: Interpreter Core
Versions: Python 3.9, Python 3.8, Python 3.7
process
Status: closed
Resolution: duplicate
Dependencies:
Superseder: problem with traceback for syntax error in f-string (view: issue 34364)
Assigned To:
Nosy List: eric.smith, gvanrossum, lys.nikolaou, pablogsal
Priority: normal
Keywords:

Created on 2020-02-05 21:21 by lys.nikolaou, last changed 2022-04-11 14:59 by admin. This issue is now closed.

Messages (10)
msg361456 - (view) Author: Lysandros Nikolaou (lys.nikolaou) * (Python committer) Date: 2020-02-05 21:21
When concatenating f-strings, if there is an expression in any STRING node other than the first, col_offset of the parsed expression has a wrong value.

For example, parsing f"hello" f"{world}" outputs the following AST:

Module(
    body=[
        Expr(
            value=JoinedStr(
                values=[
                    Constant(
                        value="hello",
                        kind=None,
                        lineno=1,
                        col_offset=0,
                        end_lineno=1,
                        end_col_offset=19,
                    ),
                    FormattedValue(
                        value=Name(
                            id="world",
                            ctx=Load(),
                            lineno=1,
                            *col_offset=1,*
                            end_lineno=1,
                            *end_col_offset=6,*
                        ),
                        conversion=-1,
                        format_spec=None,
                        lineno=1,
                        col_offset=0,
                        end_lineno=1,
                        end_col_offset=19,
                    ),
                ],
                lineno=1,
                col_offset=0,
                end_lineno=1,
                end_col_offset=19,
            ),
            lineno=1,
            col_offset=0,
            end_lineno=1,
            end_col_offset=19,
        )
    ],
    type_ignores=[],
)

Here, col_offset and end_col_offset are wrong in the parsed NAME 'world'.
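The report can be reproduced with a short snippet (a sketch; on versions where the bug was later fixed, the printed offsets point into the full source line rather than into the second f-string):

```python
import ast

# Parse two implicitly concatenated f-strings and locate the Name
# node for 'world' inside the resulting JoinedStr.
source = 'f"hello" f"{world}"'
tree = ast.parse(source)
name = next(n for n in ast.walk(tree)
            if isinstance(n, ast.Name) and n.id == "world")

# On affected versions (3.7-3.9) this prints offsets relative to the
# second f-string token (1, 6) instead of the whole source line.
print(name.col_offset, name.end_col_offset)
```
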
msg361457 - (view) Author: Eric V. Smith (eric.smith) * (Python committer) Date: 2020-02-05 22:05
There are a number of bug reports about this. See issue 35212 and issue 34364, at least.

I have a stale patch that tried to fix this, but I was never able to get it completely correct.

I have another approach that I hope to discuss at the language summit at PyCon. Hopefully progress after that will be quick.

I'm going to close this as a duplicate. When I decide on a course of action, I'll either use issue 34364 or create a new issue.
msg361459 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2020-02-05 22:28
Note that this is somewhat urgent for oefen, since we strive to match the
produced AST exactly.
-- 
--Guido (mobile)
msg361462 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2020-02-06 00:48
This is indeed a duplicate of issue35212 (the second message there also shows a problem with implicit f-string concatenation). There's not much information in issue34364 (just a merged PR, and a mention of a PR of yours from 2 years ago that apparently never materialized).

FWIW I meant "pegen" (https://github.com/gvanrossum/pegen/) -- not sure why my mobile spell corrector turned that into "oefen" (maybe it's becoming multilingual :-).

And the reason I called this "urgent" is that until this is fixed in CPython 3.8 we have to disable the part of our tests that compares the AST generated by pegen to the one generated by ast.parse. (We've found several other issues with incorrect column offsets in ast.parse this way, and always managed to fix them upstream. :-)

If we came up with a hacky fix, would you oppose incorporating it?

I'm guessing your alternate approach is modifying the tokenizer to treat f"x{y}z{a}b" as several tokens:

- f"x{
- y (or multiple tokens if there's a complex expression here)
- }z{
- a (same)
- }b"

(This has been proposed before by a few others, and IIRC we did it this way in ABC, in the early '80s.)
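The split Guido sketches can be illustrated with a toy scanner (purely a sketch, not the real tokenizer; it ignores nested braces, quotes inside expressions, format specs, and escapes):

```python
def split_fstring(s):
    """Split an f-string literal into literal chunks (each ending or
    starting with its delimiting brace) and the raw expression text
    between '{' and '}'. A toy illustration only."""
    parts = []
    start = 0
    inside_expr = False
    for i, ch in enumerate(s):
        if not inside_expr and ch == "{":
            parts.append(s[start:i + 1])  # literal chunk ending with "{"
            start = i + 1
            inside_expr = True
        elif inside_expr and ch == "}":
            parts.append(s[start:i])      # expression text
            start = i                      # keep "}" with the next chunk
            inside_expr = False
    parts.append(s[start:])               # trailing literal chunk
    return parts

print(split_fstring('f"x{y}z{a}b"'))
# ['f"x{', 'y', '}z{', 'a', '}b"']
```
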
msg361467 - (view) Author: Eric V. Smith (eric.smith) * (Python committer) Date: 2020-02-06 01:48
Yes, my approach is to use the tokenizer to produce the parts of the f-string, instead of treating it as a normal string and making a second pass to parse out the pieces. I've been discussing this for a few years, and at last PyCon I talked to many tool vendors (editors, type checkers, etc.) to gather opinions, and no one was opposed. I've also had a few discussions since then, which were also positive. So I think I'll try to move ahead on it.

I'm not sure this is the best place to discuss it, but let me give a little example on why it's complicated.

Consider: f"x{y:.2f}z{a!s:10}b"

What type of token is ".2f", and how does the tokenizer recognize it? Or maybe the token is ":.2f". It can be an arbitrary sequence of characters (with a few restrictions, like no newlines), ending before a right brace. And there's the issue of backslash escapes there.
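That ambiguity is visible in today's AST, where the spec text is carried along as string data (a nested JoinedStr) rather than as tokens the tokenizer ever saw (a small demonstration):

```python
import ast

tree = ast.parse('f"x{y:.2f}z{a!s:10}b"')
fv = next(n for n in ast.walk(tree)
          if isinstance(n, ast.FormattedValue))

# The format spec ".2f" is itself represented as a JoinedStr of
# Constants, i.e. it was parsed out of the string in a second pass,
# not recognized as tokens.
assert isinstance(fv.format_spec, ast.JoinedStr)
print(ast.dump(fv.format_spec))
```
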

And the f"{z = :20}" thing we added doesn't make this any easier, since whitespace is significant after the "=" and before the ":", if it exists. And I need the entire text of the expression, before it's parsed, to make it work. I don't think that "unparsing" could ever be made to work in this case, or at least not so as to be compatible with 3.8 and preserving whitespace.
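The whitespace sensitivity of the "=" form (added in 3.8) is easy to see in a quick example:

```python
x = 3

# Whitespace around "=" is reproduced verbatim in the output:
print(f"{x = }")    # x = 3
print(f"{x=}")      # x=3

# With a format spec after ":", the value part uses format(), and the
# exact expression text (including spaces) still has to be preserved:
print(f"{x = :>4}")  # x =    3
```
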

I really haven't looked at what's involved in the implementation in any greater depth: my first goal was (and will be) to get buy-off on the change, then to look at implementation issues. PEP 536 has some discussion of this, but if I go forward I would want to start from scratch (and reject PEP 536).

As far as the patch I was working on: the general problem is that you have to "prime" the compiler with a line number and column to start on, instead of it assuming line 1 column 1. Despite spending the better part of a week on it at the core sprints 1.5 years ago, I never got it to work (although I did delete and simplify a few functions, so it was a net win anyway). I wouldn't be opposed to a patch to fix this.
msg361468 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2020-02-06 02:58
Wow. Do you still have a branch with that non-working patch? Maybe there are some ideas that can be salvaged.

I'm wondering if maybe we're better off leaving f-strings alone? The current code is quite voluminous, but it looks as if by the time you've addressed everything you mention using additional tokenizer state, you might end up with just as much or more convoluted code... F-strings totally rule, despite the current limitations!
msg361477 - (view) Author: Eric V. Smith (eric.smith) * (Python committer) Date: 2020-02-06 09:23
I'll see if I can dig up the patch today. If I can find it, I'll attach it to issue 34364.

This is really the first time I've tried to write down all of the issues related to tokenizing f-strings. It does seem a little daunting, but I'm not done noodling it through. At first blush it looks like the tokenizer would need to remember if it's inside an f-string or not and switch to different rules if so. Which doesn't exactly describe your average tokenizer, but I'm not sure how Python's tokenizer would need to be changed to deal with it, or how messy that change would be.

I should probably write an informational PEP about parsing f-strings. And I should include the reason I went with the "just a regular string which is later hand-parsed" approach: at the time, f-strings were a controversial topic (there were any number of reddit threads predicting doom and gloom if they were added). By parsing them as just regular strings with one simple added string prefix, it allowed existing tooling (editors, syntax highlighters, etc.) to easily skip over them just by recognizing 'f' as an additional string prefix.
msg361484 - (view) Author: Lysandros Nikolaou (lys.nikolaou) * (Python committer) Date: 2020-02-06 14:33
Also note that for pegen this is a solved problem. If we decide to change the parser, it may not be worth it to spend too much time on a universal solution.
msg361532 - (view) Author: Eric V. Smith (eric.smith) * (Python committer) Date: 2020-02-07 02:39
Can I ask how pegen solved this problem? Did you move parsing of f-strings into the grammar?

I found my patch and I'm working on resolving merge conflicts.
msg361535 - (view) Author: Lysandros Nikolaou (lys.nikolaou) * (Python committer) Date: 2020-02-07 03:32
> Can I ask how pegen solved this problem? Did you move parsing of f-strings into the grammar?

No, we actually do a very similar thing to what ast.c does.

I think there is only one major difference between what pegen does and what ast.c does. If I understand correctly, fstring_compile_expr calls fstring_fix_node_location with parent set to the whole STRING+ node. We OTOH only pass as parent the current STRING token (we work with tokens returned by the tokenizer) and then walk the AST tree and adjust the line/col_offset based on that. I'm not sure that this is bug-free, but after some testing, I couldn't find any obvious errors.
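The adjustment described here can be sketched as a small walk that shifts every node's location by the token's starting position (names and details are assumptions for illustration; pegen's actual code differs):

```python
import ast

def shift_locations(node, line_offset, col_offset):
    """Shift lineno/col_offset of every node so that offsets computed
    relative to a single STRING token become offsets into the
    enclosing source. Columns are only shifted for nodes still on the
    token's first line. A sketch of the idea, not pegen's code."""
    for n in ast.walk(node):
        if not hasattr(n, "lineno"):
            continue
        if n.lineno == 1:  # still on the token's starting line
            n.col_offset += col_offset
            if n.end_lineno == 1:
                n.end_col_offset += col_offset
        n.lineno += line_offset
        n.end_lineno += line_offset

# Example: 'world' parsed alone starts at column 0; if its STRING
# token begins at column 12 of the real source, shift accordingly.
expr = ast.parse("world", mode="eval").body
shift_locations(expr, 0, 12)
print(expr.col_offset, expr.end_col_offset)  # 12 17
```
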

If you want, you can take a look at the code on https://github.com/gvanrossum/pegen/pull/183.
History
Date                 User          Action  Args
2022-04-11 14:59:26  admin         set     github: 83745
2020-02-07 03:32:04  lys.nikolaou  set     messages: + msg361535
2020-02-07 02:39:57  eric.smith    set     messages: + msg361532
2020-02-06 14:33:55  lys.nikolaou  set     messages: + msg361484
2020-02-06 09:23:19  eric.smith    set     messages: + msg361477
2020-02-06 02:58:13  gvanrossum    set     messages: + msg361468
2020-02-06 01:48:00  eric.smith    set     messages: + msg361467
2020-02-06 00:48:44  gvanrossum    set     messages: + msg361462
2020-02-05 22:28:21  gvanrossum    set     messages: + msg361459
2020-02-05 22:05:07  eric.smith    set     status: open -> closed; superseder: problem with traceback for syntax error in f-string; nosy: + eric.smith; messages: + msg361457; type: behavior; resolution: duplicate; stage: resolved
2020-02-05 21:23:24  lys.nikolaou  set     title: Parsed expression has wrong line/col info when concatenating f-strings -> Parsed expression has wrong col_offset when concatenating f-strings
2020-02-05 21:21:22  lys.nikolaou  create