Issue 41967: Handle annotations in the parser to avoid the need for roundtrip

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/86133

classification

Title:	Handle annotations in the parser to avoid the need for roundtrip
Type:		Stage:
Components:		Versions:

process

Status:	open	Resolution:
Dependencies:		Superseder:
Assigned To:		Nosy List:	BTaskaya, eric.smith, gvanrossum, lukasz.langa, lys.nikolaou, pablogsal
Priority:	normal	Keywords:	patch

Created on 2020-10-07 12:13 by pablogsal, last changed 2022-04-11 14:59 by admin.

Pull Requests
URL	Status	Linked	Edit
PR 22587	closed	pablogsal, 2020-10-07 12:14

Messages (24)
msg378158 - (view)	Author: Pablo Galindo Salgado (pablogsal) *	Date: 2020-10-07 12:13
For Python 3.10 now that annotations are always returned as strings, we can drop all the unparse logic by obtaining the string in the parser directly (after checking that the annotation is a valid expression).
msg378159 - (view)	Author: Pablo Galindo Salgado (pablogsal) *	Date: 2020-10-07 12:15
I have attached a draft PR for this: https://github.com/python/cpython/pull/22587 I think this may be a considerable win as the diff is +66 −1,013
msg378164 - (view)	Author: Pablo Galindo Salgado (pablogsal) *	Date: 2020-10-07 15:03
After investigating with BTaskaya, seems that we cannot use the literal string from the tokenizer for multi-line annotations. But we could use the tokens (losing the literal whitespace), which honestly is less exciting because we will lose the literal formatting.
msg378165 - (view)	Author: Pablo Galindo Salgado (pablogsal) *	Date: 2020-10-07 15:16
On the other hand, multi-line annotations are rare enough we could consider falling back to token reconstruction then (the produced string will miss all whitespace in that case).
msg378166 - (view)	Author: Batuhan Taskaya (BTaskaya) *	Date: 2020-10-07 15:21
After scanning 1552 annotated assignments in over 25k python modules (from most populer packages in PyPI) there are only 6 AnnAssign nodes where the annotations are multiline. For 516389 annotated arguments, only 20 of them are multiline annotations.
msg378168 - (view)	Author: Batuhan Taskaya (BTaskaya) *	Date: 2020-10-07 16:03
s/For 516389 annotated arguments/For 516389 arguments/. Example usages; https://search.tree.science/?query=AnnAssign(annotation%20=%20expr(lineno%20=%20not%20~end_lineno)) https://search.tree.science/?query=arg(annotation=expr(lineno%20=%20not%20~end_lineno))
msg378172 - (view)	Author: Guido van Rossum (gvanrossum) *	Date: 2020-10-07 17:14
What's the issue exactly with multi-line annotations? Is it that they can't be parsed by eval()? Could that be fixed by replacing comments and newlines with spaces? (Either in the parser or in get_type_hints() -- actually ForwardRef._evaluate(), which calls eval().)
msg378174 - (view)	Author: Pablo Galindo Salgado (pablogsal) *	Date: 2020-10-07 18:39
> What's the issue exactly with multi-line annotations? The issue is that for grabbing the source, we are inspecting the tokenize buffer and using the information in the expr node of the annotation to know what we need to pick from it. In multi-line annotations, the tokenize buffer does not contain the whole source, it just contains the last line. To make it work, we would need to grab the individual tokens (because the tokenizer has all of them) and append them together to form the expression, but in this process we lose the non-significant whitespace. --- Regarding loosing the whitespace for multi-line strings (or all string if we want to be consistent): the more I think about it the more I think is not that bad. This means that the string that the parser will place in the annotation field (and therefore the one that ends in __annotations__) will not have spaces, but the ast will be able to parse it without any problems. It will just look a bit crumbled, but that field is supposed to be used by tools, not humans, so I don't think is a huge problem.
msg378175 - (view)	Author: Guido van Rossum (gvanrossum) *	Date: 2020-10-07 19:36
Hm, this buffer thing is not what I had expected. I think it's a show-stopper -- if we don't reproduce spaces exactly. Maybe we should just buffer the entire input always? A way to handle multi-line annotations would be to surround them with parentheses? That way get_type_hints() doesn't need to change. Alternatively, strip comments and newlines in get_type_hints().
msg378176 - (view)	Author: Batuhan Taskaya (BTaskaya) *	Date: 2020-10-07 19:42
It is not actually much of a difference; $ cat t.py def func( arg: typing.Callable[ [int, str], str # test ] ): pass import typing print("Annotations: ", func.__annotations__) print("get_type_hints: ", typing.get_type_hints(func)) $ ./python t.py Annotations: {'arg': 'typing.Callable[[int,str],str]'} get_type_hints: {'arg': typing.Callable[[int, str], str]} $ ./python >>> foo: (List[List[List[int, str, bytes]], List[(int, str)]]) >>> __annotations__ {'foo': 'List[List[List[int,str,bytes]],List[(int,str)]]'}
msg378177 - (view)	Author: Pablo Galindo Salgado (pablogsal) *	Date: 2020-10-07 20:41
> Hm, this buffer thing is not what I had expected. I think it's a show-stopper -- if we don't reproduce spaces exactly. Hummm, I think I may have not clarified this in my previous messages: the space issue is an aesthetic one. The AST generated with and without spaces is the same. As far as I understand get_type_hints() will give the same result in all cases being considered.
msg378179 - (view)	Author: Guido van Rossum (gvanrossum) *	Date: 2020-10-07 20:57
I know the final result is the same (i.e. the AST is identical and the string differs only in the use of whitespace). I just wonder why bother changing this unless we reproduce the spaces in the string exactly.
msg378180 - (view)	Author: Pablo Galindo Salgado (pablogsal) *	Date: 2020-10-07 21:16
> I just wonder why bother changing this unless we reproduce the spaces in the string exactly. Maintenance reasons: it removes over 1000 lines of hand written code. Additionally, makes the compiler a bit faster because it does not need to transform the AST into a string.
msg378181 - (view)	Author: Guido van Rossum (gvanrossum) *	Date: 2020-10-07 21:23
Okay, but doesn't any way of handling multi-line annotations spoil the wins? We cannot declare those illegal. (Though we can switch to a different way of representing those.)
msg378182 - (view)	Author: Pablo Galindo Salgado (pablogsal) *	Date: 2020-10-07 21:33
> We cannot declare those illegal. They won't be illegal, is just that the string generated for multi line annotations will be a bit uglier because it won't have spaces between characters.
msg378183 - (view)	Author: Guido van Rossum (gvanrossum) *	Date: 2020-10-07 21:37
> the string generated for multi line annotations [...] won't have spaces between characters. Okay, though aren't there cases where spaces are necessary? E.g. between names or between a name or keyword and a string or number. Assuming this will only happen when the annotation spans multiple lines, right?
msg378184 - (view)	Author: Pablo Galindo Salgado (pablogsal) *	Date: 2020-10-07 21:42
> Okay, though aren't there cases where spaces are necessary? E.g. between names or between a name or keyword and a string or number. Yeah, we are trying to modify the patch to automatically add spaces between keywords and other tokens. > Assuming this will only happen when the annotation spans multiple lines, right? Yeah, for annotations in a single line we can just grab the string from the code as it is.
msg378186 - (view)	Author: Pablo Galindo Salgado (pablogsal) *	Date: 2020-10-07 21:50
> Yeah, we are trying to modify the patch to automatically add spaces between keywords and other tokens. Once we can see how this looks, we could think if is worth the effort or is better to abandon the idea. One could also argue that having multi-line annotations produce some uglier strings is inconsistent enough to drop this, but I think is easier to consider once we have a fully working version.
msg378189 - (view)	Author: Pablo Galindo Salgado (pablogsal) *	Date: 2020-10-07 23:05
After some extra testing, we found that is not enough with handling spaces, unfortunately, as things like the walrus need also the parentheses to be preserved. Given that multi-line strings will complicate the code considerably, I am going to drop this for the time being unless we have a good way to solve these problems in a way that does not involve several workarounds.
msg378193 - (view)	Author: Guido van Rossum (gvanrossum) *	Date: 2020-10-07 23:28
Too bad. :-(
msg378201 - (view)	Author: Eric V. Smith (eric.smith) *	Date: 2020-10-08 00:06
For what it's worth, here's how f-strings with the "=" feature work: I remember the char* pointer where the expression starts, then I parse the expression into an AST, then I note the char* pointer where the expression ended. The text between those is what's output before the equal sign []. This is how I preserve all of the whitespace inside the expression. In my case I keep the AST to use when the expression gets evaluated, but in the string annotation case you'd throw it away. I don't think it would be very complicated to make this approach work across newlines. [] Actually, I keep the equal sign itself and whitespace to the right of it, which is how f'{ x = }' produces " x = 42", instead of "x=42".
msg378204 - (view)	Author: Pablo Galindo Salgado (pablogsal) *	Date: 2020-10-08 00:13
> I remember the char* pointer where the expression starts, then I parse the expression into an AST, then I note the char* pointer where the expression ended. The text between those is what's output before the equal sign []. This is how I preserve all of the whitespace inside the expression. Unless I am missing something, the problem is that the tokenizer has thrown away the extra newlines so we cannot keep the pointer where the expression starts. Also, I am not sure if this is the same scenario because f-strings cannot contain newlines in the expression brackets, can't they?
msg378206 - (view)	Author: Pablo Galindo Salgado (pablogsal) *	Date: 2020-10-08 00:17
For instance, consider this: x: ( 3 + 4 + 5) = 3 Once the expression in the annotations is parsed, the value of parser->tok->buf is " +5) = 3" and the tokenizer has thrown away the rest of the lines and the information there. There is no way for us to "grab" the other lines and save them as they are consumed unless we add custom logic into the parser, which is not immediately easy because is auto-generated.
msg378207 - (view)	Author: Eric V. Smith (eric.smith) *	Date: 2020-10-08 00:19
It's true that f-string expressions can't contain newlines. f-strings are definitely easier, because the tokenizer has already tokenized the string from the input, so I'm just remembering pointers inside the tokenized string. I was thinking that maybe you could get access to the buffer that the tokenizer is using. I'd have to check to see if it's guaranteed to all be in one contiguous buffer or not. Anyway, since the problem is at least superficially similar (at least to me!), I thought I'd mention how f-strings handle it. There might not be anything of value here to take away.

History
Date	User	Action	Args
2022-04-11 14:59:36	admin	set	github: 86133
2020-10-08 00:19:06	eric.smith	set	messages: + msg378207
2020-10-08 00:17:47	pablogsal	set	messages: + msg378206
2020-10-08 00:13:05	pablogsal	set	messages: + msg378204
2020-10-08 00:06:38	eric.smith	set	status: closed -> open nosy: + eric.smith messages: + msg378201 resolution: rejected -> stage: resolved ->
2020-10-07 23:28:04	gvanrossum	set	messages: + msg378193
2020-10-07 23:05:38	pablogsal	set	status: open -> closed resolution: rejected messages: + msg378189 stage: resolved
2020-10-07 21:50:50	pablogsal	set	messages: + msg378186
2020-10-07 21:42:32	pablogsal	set	messages: + msg378184
2020-10-07 21:37:17	gvanrossum	set	messages: + msg378183
2020-10-07 21:33:38	pablogsal	set	messages: + msg378182
2020-10-07 21:23:28	gvanrossum	set	messages: + msg378181
2020-10-07 21:16:10	pablogsal	set	messages: + msg378180
2020-10-07 20:57:25	gvanrossum	set	messages: + msg378179
2020-10-07 20:41:38	pablogsal	set	messages: + msg378177
2020-10-07 19:42:28	BTaskaya	set	messages: + msg378176
2020-10-07 19:36:02	gvanrossum	set	messages: + msg378175
2020-10-07 18:39:20	pablogsal	set	messages: + msg378174
2020-10-07 17:14:49	gvanrossum	set	messages: + msg378172
2020-10-07 16:03:31	BTaskaya	set	messages: + msg378168
2020-10-07 15:21:21	BTaskaya	set	messages: + msg378166
2020-10-07 15:16:54	pablogsal	set	messages: + msg378165
2020-10-07 15:03:41	pablogsal	set	messages: + msg378164
2020-10-07 12:18:00	pablogsal	set	nosy: + BTaskaya
2020-10-07 12:15:32	pablogsal	set	nosy: + gvanrossum, lukasz.langa, lys.nikolaou messages: + msg378159 stage: patch review -> (no value)
2020-10-07 12:14:33	pablogsal	set	keywords: + patch stage: patch review pull_requests: + pull_request21579
2020-10-07 12:13:41	pablogsal	create