classification
Title: Handle annotations in the parser to avoid the need for roundtrip
Type: Stage:
Components: Versions:
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: BTaskaya, eric.smith, gvanrossum, lukasz.langa, lys.nikolaou, pablogsal
Priority: normal Keywords: patch

Created on 2020-10-07 12:13 by pablogsal, last changed 2020-10-08 00:19 by eric.smith.

Pull Requests
URL Status Linked Edit
PR 22587 closed pablogsal, 2020-10-07 12:14
Messages (24)
msg378158 - (view) Author: Pablo Galindo Salgado (pablogsal) * (Python committer) Date: 2020-10-07 12:13
For Python 3.10 now that annotations are always returned as strings, we can drop all the unparse logic by obtaining the string in the parser directly (after checking that the annotation is a valid expression).
msg378159 - (view) Author: Pablo Galindo Salgado (pablogsal) * (Python committer) Date: 2020-10-07 12:15
I have attached a draft PR for this:

https://github.com/python/cpython/pull/22587

I think this may be a considerable win as the diff is +66 −1,013
msg378164 - (view) Author: Pablo Galindo Salgado (pablogsal) * (Python committer) Date: 2020-10-07 15:03
After investigating with BTaskaya, seems that we cannot use the literal string from the tokenizer for multi-line annotations. But we could use the tokens (losing the literal whitespace), which honestly is less exciting because we will lose the literal formatting.
msg378165 - (view) Author: Pablo Galindo Salgado (pablogsal) * (Python committer) Date: 2020-10-07 15:16
On the other hand, multi-line annotations are rare enough we could consider falling back to token reconstruction then (the produced string will miss all whitespace in that case).
msg378166 - (view) Author: Batuhan Taskaya (BTaskaya) * (Python committer) Date: 2020-10-07 15:21
After scanning 1552 annotated assignments in over 25k python modules (from most populer packages in PyPI) there are only 6 AnnAssign nodes where the annotations are multiline. For 516389 annotated arguments, only 20 of them are multiline annotations.
msg378168 - (view) Author: Batuhan Taskaya (BTaskaya) * (Python committer) Date: 2020-10-07 16:03
s/For 516389 annotated arguments/For 516389 arguments/.

Example usages;
https://search.tree.science/?query=AnnAssign(annotation%20=%20expr(lineno%20=%20not%20~end_lineno))

https://search.tree.science/?query=arg(annotation=expr(lineno%20=%20not%20~end_lineno))
msg378172 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2020-10-07 17:14
What's the issue exactly with multi-line annotations? Is it that they can't be parsed by eval()? Could that be fixed by replacing comments and newlines with spaces? (Either in the parser or in get_type_hints() -- actually ForwardRef._evaluate(), which calls eval().)
msg378174 - (view) Author: Pablo Galindo Salgado (pablogsal) * (Python committer) Date: 2020-10-07 18:39
> What's the issue exactly with multi-line annotations? 

The issue is that for grabbing the source, we are inspecting the tokenize buffer and using the information in the expr node of the annotation to know what we need to pick from it.

In multi-line annotations, the tokenize buffer does not contain the whole source, it just contains the last line.

To make it work, we would need to grab the individual tokens (because the tokenizer has all of them) and append them together to form the expression, but in this process we lose the non-significant whitespace.

---

Regarding loosing the whitespace for multi-line strings (or all string if we want to be consistent): the more I think about it the more I think is not that bad. This means that the string that the parser will place in the annotation field (and therefore the one that ends in __annotations__) will not have spaces, but the ast will be able to parse it without any problems. It will just look a bit crumbled, but that field is supposed to be used by tools, not humans, so I don't think is a huge problem.
msg378175 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2020-10-07 19:36
Hm, this buffer thing is not what I had expected. I think it's a show-stopper -- if we don't reproduce spaces exactly.

Maybe we should just buffer the entire input always?

A way to handle multi-line annotations would be to surround them with parentheses? That way get_type_hints() doesn't need to change. Alternatively, strip comments and newlines in get_type_hints().
msg378176 - (view) Author: Batuhan Taskaya (BTaskaya) * (Python committer) Date: 2020-10-07 19:42
It is not actually much of a difference;
 $ cat t.py
def func(
    arg: typing.Callable[
        [int, str],
        str # test
    ]
):
    pass

import typing
print("Annotations: ", func.__annotations__)
print("get_type_hints: ", typing.get_type_hints(func))
$ ./python t.py
Annotations:  {'arg': 'typing.Callable[[int,str],str]'}
get_type_hints:  {'arg': typing.Callable[[int, str], str]}

$ ./python
>>> foo: (List[List[List[int, str, bytes]], List[(int, str)]])
>>> __annotations__
{'foo': 'List[List[List[int,str,bytes]],List[(int,str)]]'}
msg378177 - (view) Author: Pablo Galindo Salgado (pablogsal) * (Python committer) Date: 2020-10-07 20:41
> Hm, this buffer thing is not what I had expected. I think it's a show-stopper -- if we don't reproduce spaces exactly.

Hummm, I think I may have not clarified this in my previous messages: the space issue is an aesthetic one. The AST generated with and without spaces is the same. As far as I understand get_type_hints() will give the same result in all cases being considered.
msg378179 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2020-10-07 20:57
I know the final result is the same (i.e. the AST is identical and the
string differs only in the use of whitespace). I just wonder why bother
changing this unless we reproduce the spaces in the string exactly.
msg378180 - (view) Author: Pablo Galindo Salgado (pablogsal) * (Python committer) Date: 2020-10-07 21:16
>  I just wonder why bother
changing this unless we reproduce the spaces in the string exactly.

Maintenance reasons: it removes over 1000 lines of hand written code. Additionally, makes the compiler a bit faster because it does not need to transform the AST into a string.
msg378181 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2020-10-07 21:23
Okay, but doesn't *any* way of handling multi-line annotations spoil the
wins? We cannot declare those illegal. (Though we can switch to a different
way of representing those.)
msg378182 - (view) Author: Pablo Galindo Salgado (pablogsal) * (Python committer) Date: 2020-10-07 21:33
> We cannot declare those illegal.

They won't be illegal, is just that the string generated for multi line annotations will be a bit uglier because it won't have spaces between characters.
msg378183 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2020-10-07 21:37
> the string generated for multi line annotations [...] won't have spaces between characters.

Okay, though aren't there cases where spaces are necessary? E.g. between names or between a name or keyword and a string or number.

Assuming this will only happen when the annotation spans multiple lines, right?
msg378184 - (view) Author: Pablo Galindo Salgado (pablogsal) * (Python committer) Date: 2020-10-07 21:42
> Okay, though aren't there cases where spaces are necessary? E.g. between names or between a name or keyword and a string or number.

Yeah, we are trying to modify the patch to automatically add spaces between keywords and other tokens.

> Assuming this will only happen when the annotation spans multiple lines, right?

Yeah, for annotations in a single line we can just grab the string from the code as it is.
msg378186 - (view) Author: Pablo Galindo Salgado (pablogsal) * (Python committer) Date: 2020-10-07 21:50
> Yeah, we are trying to modify the patch to automatically add spaces between keywords and other tokens.

Once we can see how this looks, we could think if is worth the effort or is better to abandon the idea. One could also argue that having multi-line annotations produce some uglier strings is inconsistent enough to drop this, but I think is easier to consider once we have a fully working version.
msg378189 - (view) Author: Pablo Galindo Salgado (pablogsal) * (Python committer) Date: 2020-10-07 23:05
After some extra testing, we found that is not enough with handling spaces, unfortunately, as things like the walrus need also the parentheses to be preserved. Given that multi-line strings will complicate the code considerably, I am going to drop this for the time being unless we have a good way to solve these problems in a way that does not involve several workarounds.
msg378193 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2020-10-07 23:28
Too bad. :-(
msg378201 - (view) Author: Eric V. Smith (eric.smith) * (Python committer) Date: 2020-10-08 00:06
For what it's worth, here's how f-strings with the "=" feature work:

I remember the char* pointer where the expression starts, then I parse the expression into an AST, then I note the char* pointer where the expression ended. The text between those is what's output before the equal sign []. This is how I preserve all of the whitespace inside the expression.

In my case I keep the AST to use when the expression gets evaluated, but in the string annotation case you'd throw it away. I don't think it would be very complicated to make this approach work across newlines.

[] Actually, I keep the equal sign itself and whitespace to the right of it, which is how f'{ x = }' produces " x = 42", instead of "x=42".
msg378204 - (view) Author: Pablo Galindo Salgado (pablogsal) * (Python committer) Date: 2020-10-08 00:13
> I remember the char* pointer where the expression starts, then I parse the expression into an AST, then I note the char* pointer where the expression ended. The text between those is what's output before the equal sign []. This is how I preserve all of the whitespace inside the expression.

Unless I am missing something, the problem is that the tokenizer has thrown away the extra newlines so we cannot keep the pointer where the expression starts. Also, I am not sure if this is the same scenario because f-strings cannot contain newlines in the expression brackets, can't they?
msg378206 - (view) Author: Pablo Galindo Salgado (pablogsal) * (Python committer) Date: 2020-10-08 00:17
For instance, consider this:

x: (
   3 + 
     4
   + 5) = 3

Once the expression in the annotations is parsed, the value of parser->tok->buf is "   +5) = 3" and the tokenizer has thrown away the rest of the lines and the information there. There is no way for us to "grab" the other lines and save them as they are consumed unless we add custom logic into the parser, which is not immediately easy because is auto-generated.
msg378207 - (view) Author: Eric V. Smith (eric.smith) * (Python committer) Date: 2020-10-08 00:19
It's true that f-string expressions can't contain newlines.

f-strings are definitely easier, because the tokenizer has already tokenized the string from the input, so I'm just remembering pointers inside the tokenized string.

I was thinking that maybe you could get access to the buffer that the tokenizer is using. I'd have to check to see if it's guaranteed to all be in one contiguous buffer or not.

Anyway, since the problem is at least superficially similar (at least to me!), I thought I'd mention how f-strings handle it. There might not be anything of value here to take away.
History
Date User Action Args
2020-10-08 00:19:06eric.smithsetmessages: + msg378207
2020-10-08 00:17:47pablogsalsetmessages: + msg378206
2020-10-08 00:13:05pablogsalsetmessages: + msg378204
2020-10-08 00:06:38eric.smithsetstatus: closed -> open

nosy: + eric.smith
messages: + msg378201

resolution: rejected ->
stage: resolved ->
2020-10-07 23:28:04gvanrossumsetmessages: + msg378193
2020-10-07 23:05:38pablogsalsetstatus: open -> closed
resolution: rejected
messages: + msg378189

stage: resolved
2020-10-07 21:50:50pablogsalsetmessages: + msg378186
2020-10-07 21:42:32pablogsalsetmessages: + msg378184
2020-10-07 21:37:17gvanrossumsetmessages: + msg378183
2020-10-07 21:33:38pablogsalsetmessages: + msg378182
2020-10-07 21:23:28gvanrossumsetmessages: + msg378181
2020-10-07 21:16:10pablogsalsetmessages: + msg378180
2020-10-07 20:57:25gvanrossumsetmessages: + msg378179
2020-10-07 20:41:38pablogsalsetmessages: + msg378177
2020-10-07 19:42:28BTaskayasetmessages: + msg378176
2020-10-07 19:36:02gvanrossumsetmessages: + msg378175
2020-10-07 18:39:20pablogsalsetmessages: + msg378174
2020-10-07 17:14:49gvanrossumsetmessages: + msg378172
2020-10-07 16:03:31BTaskayasetmessages: + msg378168
2020-10-07 15:21:21BTaskayasetmessages: + msg378166
2020-10-07 15:16:54pablogsalsetmessages: + msg378165
2020-10-07 15:03:41pablogsalsetmessages: + msg378164
2020-10-07 12:18:00pablogsalsetnosy: + BTaskaya
2020-10-07 12:15:32pablogsalsetnosy: + gvanrossum, lukasz.langa, lys.nikolaou

messages: + msg378159
stage: patch review -> (no value)
2020-10-07 12:14:33pablogsalsetkeywords: + patch
stage: patch review
pull_requests: + pull_request21579
2020-10-07 12:13:41pablogsalcreate