Handle annotations in the parser to avoid the need for roundtrip #86133

pablogsal · 2020-10-07T12:13:42Z

BPO	41967
Nosy	@gvanrossum, @ericvsmith, @ambv, @lysnikolaou, @pablogsal, @isidentical
PRs	bpo-41967: Handle annotations in the parser to avoid the need for roundtrip #22587

^{Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.}

Show more details

GitHub fields:

assignee = None
closed_at = None
created_at = <Date 2020-10-07.12:13:41.516>
labels = []
title = 'Handle annotations in the parser to avoid the need for roundtrip'
updated_at = <Date 2020-10-08.00:19:06.597>
user = 'https://github.com/pablogsal'

bugs.python.org fields:

activity = <Date 2020-10-08.00:19:06.597>
actor = 'eric.smith'
assignee = 'none'
closed = False
closed_date = None
closer = None
components = []
creation = <Date 2020-10-07.12:13:41.516>
creator = 'pablogsal'
dependencies = []
files = []
hgrepos = []
issue_num = 41967
keywords = ['patch']
message_count = 24.0
messages = ['378158', '378159', '378164', '378165', '378166', '378168', '378172', '378174', '378175', '378176', '378177', '378179', '378180', '378181', '378182', '378183', '378184', '378186', '378189', '378193', '378201', '378204', '378206', '378207']
nosy_count = 6.0
nosy_names = ['gvanrossum', 'eric.smith', 'lukasz.langa', 'lys.nikolaou', 'pablogsal', 'BTaskaya']
pr_nums = ['22587']
priority = 'normal'
resolution = None
stage = None
status = 'open'
superseder = None
type = None
url = 'https://bugs.python.org/issue41967'
versions = []

pablogsal · 2020-10-07T12:13:41Z

For Python 3.10 now that annotations are always returned as strings, we can drop all the unparse logic by obtaining the string in the parser directly (after checking that the annotation is a valid expression).

pablogsal · 2020-10-07T12:15:32Z

I have attached a draft PR for this:

#22587

I think this may be a considerable win as the diff is +66 −1,013

pablogsal · 2020-10-07T15:03:41Z

After investigating with BTaskaya, seems that we cannot use the literal string from the tokenizer for multi-line annotations. But we could use the tokens (losing the literal whitespace), which honestly is less exciting because we will lose the literal formatting.

pablogsal · 2020-10-07T15:16:54Z

On the other hand, multi-line annotations are rare enough we could consider falling back to token reconstruction then (the produced string will miss all whitespace in that case).

isidentical · 2020-10-07T15:21:22Z

After scanning 1552 annotated assignments in over 25k python modules (from most populer packages in PyPI) there are only 6 AnnAssign nodes where the annotations are multiline. For 516389 annotated arguments, only 20 of them are multiline annotations.

isidentical · 2020-10-07T16:03:31Z

s/For 516389 annotated arguments/For 516389 arguments/.

Example usages;
https://search.tree.science/?query=AnnAssign(annotation%20=%20expr(lineno%20=%20not%20\~end_lineno))

https://search.tree.science/?query=arg(annotation=expr(lineno%20=%20not%20\~end_lineno))

gvanrossum · 2020-10-07T17:14:49Z

What's the issue exactly with multi-line annotations? Is it that they can't be parsed by eval()? Could that be fixed by replacing comments and newlines with spaces? (Either in the parser or in get_type_hints() -- actually ForwardRef._evaluate(), which calls eval().)

pablogsal · 2020-10-07T18:39:20Z

What's the issue exactly with multi-line annotations?

The issue is that for grabbing the source, we are inspecting the tokenize buffer and using the information in the expr node of the annotation to know what we need to pick from it.

In multi-line annotations, the tokenize buffer does not contain the whole source, it just contains the last line.

To make it work, we would need to grab the individual tokens (because the tokenizer has all of them) and append them together to form the expression, but in this process we lose the non-significant whitespace.

---

Regarding loosing the whitespace for multi-line strings (or all string if we want to be consistent): the more I think about it the more I think is not that bad. This means that the string that the parser will place in the annotation field (and therefore the one that ends in __annotations__) will not have spaces, but the ast will be able to parse it without any problems. It will just look a bit crumbled, but that field is supposed to be used by tools, not humans, so I don't think is a huge problem.

gvanrossum · 2020-10-07T19:36:02Z

Hm, this buffer thing is not what I had expected. I think it's a show-stopper -- if we don't reproduce spaces exactly.

Maybe we should just buffer the entire input always?

A way to handle multi-line annotations would be to surround them with parentheses? That way get_type_hints() doesn't need to change. Alternatively, strip comments and newlines in get_type_hints().

isidentical · 2020-10-07T19:42:28Z

It is not actually much of a difference;
$ cat t.py
def func(
arg: typing.Callable[
[int, str],
str # test
]
):
pass

import typing
print("Annotations: ", func.__annotations__)
print("get_type_hints: ", typing.get_type_hints(func))
$ ./python t.py
Annotations:  {'arg': 'typing.Callable[[int,str],str]'}
get_type_hints:  {'arg': typing.Callable[[int, str], str]}

$ ./python
>>> foo: (List[List[List[int, str, bytes]], List[(int, str)]])
>>> __annotations__
{'foo': 'List[List[List[int,str,bytes]],List[(int,str)]]'}

pablogsal · 2020-10-07T20:41:38Z

Hm, this buffer thing is not what I had expected. I think it's a show-stopper -- if we don't reproduce spaces exactly.

Hummm, I think I may have not clarified this in my previous messages: the space issue is an aesthetic one. The AST generated with and without spaces is the same. As far as I understand get_type_hints() will give the same result in all cases being considered.

gvanrossum · 2020-10-07T20:57:26Z

I know the final result is the same (i.e. the AST is identical and the
string differs only in the use of whitespace). I just wonder why bother
changing this unless we reproduce the spaces in the string exactly.

pablogsal · 2020-10-07T21:16:11Z

I just wonder why bother
changing this unless we reproduce the spaces in the string exactly.

Maintenance reasons: it removes over 1000 lines of hand written code. Additionally, makes the compiler a bit faster because it does not need to transform the AST into a string.

gvanrossum · 2020-10-07T21:23:28Z

Okay, but doesn't *any* way of handling multi-line annotations spoil the
wins? We cannot declare those illegal. (Though we can switch to a different
way of representing those.)

pablogsal · 2020-10-07T21:33:39Z

We cannot declare those illegal.

They won't be illegal, is just that the string generated for multi line annotations will be a bit uglier because it won't have spaces between characters.

gvanrossum · 2020-10-07T21:37:17Z

the string generated for multi line annotations [...] won't have spaces between characters.

Okay, though aren't there cases where spaces are necessary? E.g. between names or between a name or keyword and a string or number.

Assuming this will only happen when the annotation spans multiple lines, right?

pablogsal · 2020-10-07T21:42:33Z

Okay, though aren't there cases where spaces are necessary? E.g. between names or between a name or keyword and a string or number.

Yeah, we are trying to modify the patch to automatically add spaces between keywords and other tokens.

Assuming this will only happen when the annotation spans multiple lines, right?

Yeah, for annotations in a single line we can just grab the string from the code as it is.

pablogsal · 2020-10-07T21:50:50Z

Yeah, we are trying to modify the patch to automatically add spaces between keywords and other tokens.

Once we can see how this looks, we could think if is worth the effort or is better to abandon the idea. One could also argue that having multi-line annotations produce some uglier strings is inconsistent enough to drop this, but I think is easier to consider once we have a fully working version.

pablogsal · 2020-10-07T23:05:37Z

After some extra testing, we found that is not enough with handling spaces, unfortunately, as things like the walrus need also the parentheses to be preserved. Given that multi-line strings will complicate the code considerably, I am going to drop this for the time being unless we have a good way to solve these problems in a way that does not involve several workarounds.

gvanrossum · 2020-10-07T23:28:04Z

Too bad. :-(

ericvsmith · 2020-10-08T00:06:37Z

For what it's worth, here's how f-strings with the "=" feature work:

I remember the char* pointer where the expression starts, then I parse the expression into an AST, then I note the char* pointer where the expression ended. The text between those is what's output before the equal sign []. This is how I preserve all of the whitespace inside the expression.

In my case I keep the AST to use when the expression gets evaluated, but in the string annotation case you'd throw it away. I don't think it would be very complicated to make this approach work across newlines.

[] Actually, I keep the equal sign itself and whitespace to the right of it, which is how f'{ x = }' produces " x = 42", instead of "x=42".

pablogsal · 2020-10-08T00:13:06Z

I remember the char* pointer where the expression starts, then I parse the expression into an AST, then I note the char* pointer where the expression ended. The text between those is what's output before the equal sign []. This is how I preserve all of the whitespace inside the expression.

Unless I am missing something, the problem is that the tokenizer has thrown away the extra newlines so we cannot keep the pointer where the expression starts. Also, I am not sure if this is the same scenario because f-strings cannot contain newlines in the expression brackets, can't they?

pablogsal · 2020-10-08T00:17:48Z

For instance, consider this:

x: (
3 +
4
+ 5) = 3

Once the expression in the annotations is parsed, the value of parser->tok->buf is " +5) = 3" and the tokenizer has thrown away the rest of the lines and the information there. There is no way for us to "grab" the other lines and save them as they are consumed unless we add custom logic into the parser, which is not immediately easy because is auto-generated.

ericvsmith · 2020-10-08T00:19:07Z

It's true that f-string expressions can't contain newlines.

f-strings are definitely easier, because the tokenizer has already tokenized the string from the input, so I'm just remembering pointers inside the tokenized string.

I was thinking that maybe you could get access to the buffer that the tokenizer is using. I'd have to check to see if it's guaranteed to all be in one contiguous buffer or not.

Anyway, since the problem is at least superficially similar (at least to me!), I thought I'd mention how f-strings handle it. There might not be anything of value here to take away.

lysnikolaou · 2023-06-19T10:46:40Z

This can be closed. If anyone takes it up again, feel free to reopen.

pablogsal closed this as completed Oct 7, 2020

ericvsmith reopened this Oct 8, 2020

ezio-melotti transferred this issue from another repository Apr 10, 2022

lysnikolaou closed this as completed Jun 19, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Handle annotations in the parser to avoid the need for roundtrip #86133

Handle annotations in the parser to avoid the need for roundtrip #86133

pablogsal commented Oct 7, 2020

pablogsal commented Oct 7, 2020

pablogsal commented Oct 7, 2020

pablogsal commented Oct 7, 2020

pablogsal commented Oct 7, 2020

isidentical commented Oct 7, 2020

isidentical commented Oct 7, 2020

gvanrossum commented Oct 7, 2020

pablogsal commented Oct 7, 2020

gvanrossum commented Oct 7, 2020

isidentical commented Oct 7, 2020

pablogsal commented Oct 7, 2020

gvanrossum commented Oct 7, 2020

pablogsal commented Oct 7, 2020

gvanrossum commented Oct 7, 2020

pablogsal commented Oct 7, 2020

gvanrossum commented Oct 7, 2020

pablogsal commented Oct 7, 2020

pablogsal commented Oct 7, 2020

pablogsal commented Oct 7, 2020

gvanrossum commented Oct 7, 2020

ericvsmith commented Oct 8, 2020

pablogsal commented Oct 8, 2020

pablogsal commented Oct 8, 2020

ericvsmith commented Oct 8, 2020

lysnikolaou commented Jun 19, 2023

Handle annotations in the parser to avoid the need for roundtrip #86133

Handle annotations in the parser to avoid the need for roundtrip #86133

Comments

pablogsal commented Oct 7, 2020

pablogsal commented Oct 7, 2020

pablogsal commented Oct 7, 2020

pablogsal commented Oct 7, 2020

pablogsal commented Oct 7, 2020

isidentical commented Oct 7, 2020

isidentical commented Oct 7, 2020

gvanrossum commented Oct 7, 2020

pablogsal commented Oct 7, 2020

gvanrossum commented Oct 7, 2020

isidentical commented Oct 7, 2020

pablogsal commented Oct 7, 2020

gvanrossum commented Oct 7, 2020

pablogsal commented Oct 7, 2020

gvanrossum commented Oct 7, 2020

pablogsal commented Oct 7, 2020

gvanrossum commented Oct 7, 2020

pablogsal commented Oct 7, 2020

pablogsal commented Oct 7, 2020

pablogsal commented Oct 7, 2020

gvanrossum commented Oct 7, 2020

ericvsmith commented Oct 8, 2020

pablogsal commented Oct 8, 2020

pablogsal commented Oct 8, 2020

ericvsmith commented Oct 8, 2020

lysnikolaou commented Jun 19, 2023