Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handle annotations in the parser to avoid the need for roundtrip #86133

Closed
pablogsal opened this issue Oct 7, 2020 · 25 comments
Closed

Handle annotations in the parser to avoid the need for roundtrip #86133

pablogsal opened this issue Oct 7, 2020 · 25 comments

Comments

@pablogsal
Copy link
Member

BPO 41967
Nosy @gvanrossum, @ericvsmith, @ambv, @lysnikolaou, @pablogsal, @isidentical
PRs
  • bpo-41967: Handle annotations in the parser to avoid the need for roundtrip #22587
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = None
    closed_at = None
    created_at = <Date 2020-10-07.12:13:41.516>
    labels = []
    title = 'Handle annotations in the parser to avoid the need for roundtrip'
    updated_at = <Date 2020-10-08.00:19:06.597>
    user = 'https://github.com/pablogsal'

    bugs.python.org fields:

    activity = <Date 2020-10-08.00:19:06.597>
    actor = 'eric.smith'
    assignee = 'none'
    closed = False
    closed_date = None
    closer = None
    components = []
    creation = <Date 2020-10-07.12:13:41.516>
    creator = 'pablogsal'
    dependencies = []
    files = []
    hgrepos = []
    issue_num = 41967
    keywords = ['patch']
    message_count = 24.0
    messages = ['378158', '378159', '378164', '378165', '378166', '378168', '378172', '378174', '378175', '378176', '378177', '378179', '378180', '378181', '378182', '378183', '378184', '378186', '378189', '378193', '378201', '378204', '378206', '378207']
    nosy_count = 6.0
    nosy_names = ['gvanrossum', 'eric.smith', 'lukasz.langa', 'lys.nikolaou', 'pablogsal', 'BTaskaya']
    pr_nums = ['22587']
    priority = 'normal'
    resolution = None
    stage = None
    status = 'open'
    superseder = None
    type = None
    url = 'https://bugs.python.org/issue41967'
    versions = []

    @pablogsal
    Copy link
    Member Author

    For Python 3.10 now that annotations are always returned as strings, we can drop all the unparse logic by obtaining the string in the parser directly (after checking that the annotation is a valid expression).

    @pablogsal
    Copy link
    Member Author

    I have attached a draft PR for this:

    #22587

    I think this may be a considerable win as the diff is +66 −1,013

    @pablogsal
    Copy link
    Member Author

    After investigating with BTaskaya, seems that we cannot use the literal string from the tokenizer for multi-line annotations. But we could use the tokens (losing the literal whitespace), which honestly is less exciting because we will lose the literal formatting.

    @pablogsal
    Copy link
    Member Author

    On the other hand, multi-line annotations are rare enough we could consider falling back to token reconstruction then (the produced string will miss all whitespace in that case).

    @isidentical
    Copy link
    Sponsor Member

    After scanning 1552 annotated assignments in over 25k python modules (from most populer packages in PyPI) there are only 6 AnnAssign nodes where the annotations are multiline. For 516389 annotated arguments, only 20 of them are multiline annotations.

    @isidentical
    Copy link
    Sponsor Member

    @gvanrossum
    Copy link
    Member

    What's the issue exactly with multi-line annotations? Is it that they can't be parsed by eval()? Could that be fixed by replacing comments and newlines with spaces? (Either in the parser or in get_type_hints() -- actually ForwardRef._evaluate(), which calls eval().)

    @pablogsal
    Copy link
    Member Author

    What's the issue exactly with multi-line annotations?

    The issue is that for grabbing the source, we are inspecting the tokenize buffer and using the information in the expr node of the annotation to know what we need to pick from it.

    In multi-line annotations, the tokenize buffer does not contain the whole source, it just contains the last line.

    To make it work, we would need to grab the individual tokens (because the tokenizer has all of them) and append them together to form the expression, but in this process we lose the non-significant whitespace.

    ---

    Regarding loosing the whitespace for multi-line strings (or all string if we want to be consistent): the more I think about it the more I think is not that bad. This means that the string that the parser will place in the annotation field (and therefore the one that ends in __annotations__) will not have spaces, but the ast will be able to parse it without any problems. It will just look a bit crumbled, but that field is supposed to be used by tools, not humans, so I don't think is a huge problem.

    @gvanrossum
    Copy link
    Member

    Hm, this buffer thing is not what I had expected. I think it's a show-stopper -- if we don't reproduce spaces exactly.

    Maybe we should just buffer the entire input always?

    A way to handle multi-line annotations would be to surround them with parentheses? That way get_type_hints() doesn't need to change. Alternatively, strip comments and newlines in get_type_hints().

    @isidentical
    Copy link
    Sponsor Member

    It is not actually much of a difference;
    $ cat t.py
    def func(
    arg: typing.Callable[
    [int, str],
    str # test
    ]
    ):
    pass

    import typing
    print("Annotations: ", func.__annotations__)
    print("get_type_hints: ", typing.get_type_hints(func))
    $ ./python t.py
    Annotations:  {'arg': 'typing.Callable[[int,str],str]'}
    get_type_hints:  {'arg': typing.Callable[[int, str], str]}
    $ ./python
    >>> foo: (List[List[List[int, str, bytes]], List[(int, str)]])
    >>> __annotations__
    {'foo': 'List[List[List[int,str,bytes]],List[(int,str)]]'}

    @pablogsal
    Copy link
    Member Author

    Hm, this buffer thing is not what I had expected. I think it's a show-stopper -- if we don't reproduce spaces exactly.

    Hummm, I think I may have not clarified this in my previous messages: the space issue is an aesthetic one. The AST generated with and without spaces is the same. As far as I understand get_type_hints() will give the same result in all cases being considered.

    @gvanrossum
    Copy link
    Member

    I know the final result is the same (i.e. the AST is identical and the
    string differs only in the use of whitespace). I just wonder why bother
    changing this unless we reproduce the spaces in the string exactly.

    @pablogsal
    Copy link
    Member Author

    I just wonder why bother
    changing this unless we reproduce the spaces in the string exactly.

    Maintenance reasons: it removes over 1000 lines of hand written code. Additionally, makes the compiler a bit faster because it does not need to transform the AST into a string.

    @gvanrossum
    Copy link
    Member

    Okay, but doesn't *any* way of handling multi-line annotations spoil the
    wins? We cannot declare those illegal. (Though we can switch to a different
    way of representing those.)

    @pablogsal
    Copy link
    Member Author

    We cannot declare those illegal.

    They won't be illegal, is just that the string generated for multi line annotations will be a bit uglier because it won't have spaces between characters.

    @gvanrossum
    Copy link
    Member

    the string generated for multi line annotations [...] won't have spaces between characters.

    Okay, though aren't there cases where spaces are necessary? E.g. between names or between a name or keyword and a string or number.

    Assuming this will only happen when the annotation spans multiple lines, right?

    @pablogsal
    Copy link
    Member Author

    Okay, though aren't there cases where spaces are necessary? E.g. between names or between a name or keyword and a string or number.

    Yeah, we are trying to modify the patch to automatically add spaces between keywords and other tokens.

    Assuming this will only happen when the annotation spans multiple lines, right?

    Yeah, for annotations in a single line we can just grab the string from the code as it is.

    @pablogsal
    Copy link
    Member Author

    Yeah, we are trying to modify the patch to automatically add spaces between keywords and other tokens.

    Once we can see how this looks, we could think if is worth the effort or is better to abandon the idea. One could also argue that having multi-line annotations produce some uglier strings is inconsistent enough to drop this, but I think is easier to consider once we have a fully working version.

    @pablogsal
    Copy link
    Member Author

    After some extra testing, we found that is not enough with handling spaces, unfortunately, as things like the walrus need also the parentheses to be preserved. Given that multi-line strings will complicate the code considerably, I am going to drop this for the time being unless we have a good way to solve these problems in a way that does not involve several workarounds.

    @gvanrossum
    Copy link
    Member

    Too bad. :-(

    @ericvsmith
    Copy link
    Member

    For what it's worth, here's how f-strings with the "=" feature work:

    I remember the char* pointer where the expression starts, then I parse the expression into an AST, then I note the char* pointer where the expression ended. The text between those is what's output before the equal sign []. This is how I preserve all of the whitespace inside the expression.

    In my case I keep the AST to use when the expression gets evaluated, but in the string annotation case you'd throw it away. I don't think it would be very complicated to make this approach work across newlines.

    [] Actually, I keep the equal sign itself and whitespace to the right of it, which is how f'{ x = }' produces " x = 42", instead of "x=42".

    @ericvsmith ericvsmith reopened this Oct 8, 2020
    @ericvsmith ericvsmith reopened this Oct 8, 2020
    @pablogsal
    Copy link
    Member Author

    I remember the char* pointer where the expression starts, then I parse the expression into an AST, then I note the char* pointer where the expression ended. The text between those is what's output before the equal sign []. This is how I preserve all of the whitespace inside the expression.

    Unless I am missing something, the problem is that the tokenizer has thrown away the extra newlines so we cannot keep the pointer where the expression starts. Also, I am not sure if this is the same scenario because f-strings cannot contain newlines in the expression brackets, can't they?

    @pablogsal
    Copy link
    Member Author

    For instance, consider this:

    x: (
    3 +
    4
    + 5) = 3

    Once the expression in the annotations is parsed, the value of parser->tok->buf is " +5) = 3" and the tokenizer has thrown away the rest of the lines and the information there. There is no way for us to "grab" the other lines and save them as they are consumed unless we add custom logic into the parser, which is not immediately easy because is auto-generated.

    @ericvsmith
    Copy link
    Member

    It's true that f-string expressions can't contain newlines.

    f-strings are definitely easier, because the tokenizer has already tokenized the string from the input, so I'm just remembering pointers inside the tokenized string.

    I was thinking that maybe you could get access to the buffer that the tokenizer is using. I'd have to check to see if it's guaranteed to all be in one contiguous buffer or not.

    Anyway, since the problem is at least superficially similar (at least to me!), I thought I'd mention how f-strings handle it. There might not be anything of value here to take away.

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    @lysnikolaou
    Copy link
    Contributor

    This can be closed. If anyone takes it up again, feel free to reopen.

    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    None yet
    Projects
    None yet
    Development

    No branches or pull requests

    5 participants