New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Handle annotations in the parser to avoid the need for roundtrip #86133
Comments
For Python 3.10 now that annotations are always returned as strings, we can drop all the unparse logic by obtaining the string in the parser directly (after checking that the annotation is a valid expression). |
I have attached a draft PR for this: I think this may be a considerable win as the diff is +66 −1,013 |
After investigating with BTaskaya, seems that we cannot use the literal string from the tokenizer for multi-line annotations. But we could use the tokens (losing the literal whitespace), which honestly is less exciting because we will lose the literal formatting. |
On the other hand, multi-line annotations are rare enough we could consider falling back to token reconstruction then (the produced string will miss all whitespace in that case). |
After scanning 1552 annotated assignments in over 25k python modules (from most populer packages in PyPI) there are only 6 AnnAssign nodes where the annotations are multiline. For 516389 annotated arguments, only 20 of them are multiline annotations. |
s/For 516389 annotated arguments/For 516389 arguments/. Example usages; https://search.tree.science/?query=arg(annotation=expr(lineno%20=%20not%20\~end_lineno)) |
What's the issue exactly with multi-line annotations? Is it that they can't be parsed by eval()? Could that be fixed by replacing comments and newlines with spaces? (Either in the parser or in get_type_hints() -- actually ForwardRef._evaluate(), which calls eval().) |
The issue is that for grabbing the source, we are inspecting the tokenize buffer and using the information in the expr node of the annotation to know what we need to pick from it. In multi-line annotations, the tokenize buffer does not contain the whole source, it just contains the last line. To make it work, we would need to grab the individual tokens (because the tokenizer has all of them) and append them together to form the expression, but in this process we lose the non-significant whitespace. --- Regarding loosing the whitespace for multi-line strings (or all string if we want to be consistent): the more I think about it the more I think is not that bad. This means that the string that the parser will place in the annotation field (and therefore the one that ends in __annotations__) will not have spaces, but the ast will be able to parse it without any problems. It will just look a bit crumbled, but that field is supposed to be used by tools, not humans, so I don't think is a huge problem. |
Hm, this buffer thing is not what I had expected. I think it's a show-stopper -- if we don't reproduce spaces exactly. Maybe we should just buffer the entire input always? A way to handle multi-line annotations would be to surround them with parentheses? That way get_type_hints() doesn't need to change. Alternatively, strip comments and newlines in get_type_hints(). |
It is not actually much of a difference; import typing
print("Annotations: ", func.__annotations__)
print("get_type_hints: ", typing.get_type_hints(func))
$ ./python t.py
Annotations: {'arg': 'typing.Callable[[int,str],str]'}
get_type_hints: {'arg': typing.Callable[[int, str], str]} $ ./python
>>> foo: (List[List[List[int, str, bytes]], List[(int, str)]])
>>> __annotations__
{'foo': 'List[List[List[int,str,bytes]],List[(int,str)]]'} |
Hummm, I think I may have not clarified this in my previous messages: the space issue is an aesthetic one. The AST generated with and without spaces is the same. As far as I understand get_type_hints() will give the same result in all cases being considered. |
I know the final result is the same (i.e. the AST is identical and the |
Maintenance reasons: it removes over 1000 lines of hand written code. Additionally, makes the compiler a bit faster because it does not need to transform the AST into a string. |
Okay, but doesn't *any* way of handling multi-line annotations spoil the |
They won't be illegal, is just that the string generated for multi line annotations will be a bit uglier because it won't have spaces between characters. |
Okay, though aren't there cases where spaces are necessary? E.g. between names or between a name or keyword and a string or number. Assuming this will only happen when the annotation spans multiple lines, right? |
Yeah, we are trying to modify the patch to automatically add spaces between keywords and other tokens.
Yeah, for annotations in a single line we can just grab the string from the code as it is. |
Once we can see how this looks, we could think if is worth the effort or is better to abandon the idea. One could also argue that having multi-line annotations produce some uglier strings is inconsistent enough to drop this, but I think is easier to consider once we have a fully working version. |
After some extra testing, we found that is not enough with handling spaces, unfortunately, as things like the walrus need also the parentheses to be preserved. Given that multi-line strings will complicate the code considerably, I am going to drop this for the time being unless we have a good way to solve these problems in a way that does not involve several workarounds. |
Too bad. :-( |
For what it's worth, here's how f-strings with the "=" feature work: I remember the char* pointer where the expression starts, then I parse the expression into an AST, then I note the char* pointer where the expression ended. The text between those is what's output before the equal sign []. This is how I preserve all of the whitespace inside the expression. In my case I keep the AST to use when the expression gets evaluated, but in the string annotation case you'd throw it away. I don't think it would be very complicated to make this approach work across newlines. [] Actually, I keep the equal sign itself and whitespace to the right of it, which is how f'{ x = }' produces " x = 42", instead of "x=42". |
Unless I am missing something, the problem is that the tokenizer has thrown away the extra newlines so we cannot keep the pointer where the expression starts. Also, I am not sure if this is the same scenario because f-strings cannot contain newlines in the expression brackets, can't they? |
For instance, consider this: x: ( Once the expression in the annotations is parsed, the value of parser->tok->buf is " +5) = 3" and the tokenizer has thrown away the rest of the lines and the information there. There is no way for us to "grab" the other lines and save them as they are consumed unless we add custom logic into the parser, which is not immediately easy because is auto-generated. |
It's true that f-string expressions can't contain newlines. f-strings are definitely easier, because the tokenizer has already tokenized the string from the input, so I'm just remembering pointers inside the tokenized string. I was thinking that maybe you could get access to the buffer that the tokenizer is using. I'd have to check to see if it's guaranteed to all be in one contiguous buffer or not. Anyway, since the problem is at least superficially similar (at least to me!), I thought I'd mention how f-strings handle it. There might not be anything of value here to take away. |
This can be closed. If anyone takes it up again, feel free to reopen. |
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
Show more details
GitHub fields:
bugs.python.org fields:
The text was updated successfully, but these errors were encountered: