Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make match patterns explicit in the AST #88058

Closed
ncoghlan opened this issue Apr 20, 2021 · 26 comments
Closed

Make match patterns explicit in the AST #88058

ncoghlan opened this issue Apr 20, 2021 · 26 comments
Assignees
Labels
3.10 only security fixes type-feature A feature request or enhancement

Comments

@ncoghlan
Copy link
Contributor

BPO 43892
Nosy @ncoghlan, @markshannon, @gvanrossum, @pablogsal, @brandtbucher, @isidentical, @freundTech
PRs
  • WIP bpo-43892: Make match patterns explicit in AST #25521
  • bpo-43892: Make match patterns explicit in AST #25585
  • bpo-43892: Validate the first term of complex literal value patterns #25735
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = 'https://github.com/brandtbucher'
    closed_at = <Date 2021-04-29.11:27:27.671>
    created_at = <Date 2021-04-20.11:59:54.593>
    labels = ['type-feature', '3.10']
    title = 'Make match patterns explicit in the AST'
    updated_at = <Date 2021-04-30.00:19:37.153>
    user = 'https://github.com/ncoghlan'

    bugs.python.org fields:

    activity = <Date 2021-04-30.00:19:37.153>
    actor = 'brandtbucher'
    assignee = 'brandtbucher'
    closed = True
    closed_date = <Date 2021-04-29.11:27:27.671>
    closer = 'ncoghlan'
    components = []
    creation = <Date 2021-04-20.11:59:54.593>
    creator = 'ncoghlan'
    dependencies = []
    files = []
    hgrepos = []
    issue_num = 43892
    keywords = ['patch']
    message_count = 26.0
    messages = ['391430', '391433', '391437', '391438', '391448', '391457', '391471', '391485', '391599', '391610', '391622', '391668', '391672', '391674', '391687', '391694', '391695', '391712', '391842', '391844', '391845', '391943', '391950', '392279', '392300', '392368']
    nosy_count = 7.0
    nosy_names = ['ncoghlan', 'Mark.Shannon', 'Guido.van.Rossum', 'pablogsal', 'brandtbucher', 'BTaskaya', 'freundTech']
    pr_nums = ['25521', '25585', '25735']
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'enhancement'
    url = 'https://bugs.python.org/issue43892'
    versions = ['Python 3.10']

    @ncoghlan
    Copy link
    Contributor Author

    In the SC submission ticket for PEP-642 (python/steering-council#43 ), Guido indicated he was in favour of the more explicit pattern matching AST changes suggested in that PEP.

    This ticket covers adapting the pattern matching implementation to explicitly separate pattern nodes from expr nodes in the AST, so the code generator doesn't need to be context dependent.

    @ncoghlan ncoghlan self-assigned this Apr 20, 2021
    @ncoghlan ncoghlan added 3.10 only security fixes type-feature A feature request or enhancement labels Apr 20, 2021
    @ncoghlan ncoghlan self-assigned this Apr 20, 2021
    @ncoghlan ncoghlan added 3.10 only security fixes type-feature A feature request or enhancement labels Apr 20, 2021
    @pablogsal
    Copy link
    Member

    Given that the AST is a public interface, I would advise to implement this before beta freeze of 3.10 to avoid making this a breaking change in the future for linters and the like

    @ncoghlan
    Copy link
    Contributor Author

    Agreed. I had wanted the AST to be part of the PEPs specifically *because* it's a public API, but didn't check until today whether or not the original PEP-634 implementation had been merged as-is, or with a cleaned up AST definition.

    https://github.com/ncoghlan/cpython/pull/8/files is the initial implementation with the AST and Grammar file updated.

    I'll only create the CPython PR once the tests are passing, but I'll try to give Brandt at least a week to review it (I'll also ping him directly via email).

    @ncoghlan
    Copy link
    Contributor Author

    It's specifically the definition of "match_case" in the AST that is affected:

    match_case = (pattern pattern, expr? guard, stmt* body)
    
        pattern = MatchAlways
             | MatchValue(expr value)
             | MatchConstant(constant value)
             | MatchSequence(pattern* patterns)
             | MatchMapping(expr* keys, pattern* patterns)
             | MatchClass(expr cls, pattern* patterns, identifier* extra_attrs, pattern* extra_patterns)
         | MatchRestOfSequence(identifier? target)
         -- A NULL entry in the MatchMapping key list handles capturing extra mapping keys
    
         | MatchAs(pattern? pattern, identifier target)
         | MatchOr(pattern* patterns)
          attributes (int lineno, int col_offset, int? end_lineno, int? end_col_offset)
    

    Relative to the PEP-642 AST, the notion of a "matchop" is gone - MatchValue always compares by equality, and there's a MatchConstant node for the identity comparisons against None, True, and False.

    The grammar, code generator, AST validator, and unparser are then updated to produce or consume the new AST nodes.

    @brandtbucher
    Copy link
    Member

    Hm, for some reason I was thinking that this was safe to do after the feature freeze. Let's get to it then!

    Batuhan, perhaps we should change the linked issue for your AST validator PR to this one. That way we can close the old catch-all PEP-634 implementation issue and keep the discussion focused here.

    Nick, you might find the discussion on this mypy PR (python/mypy#10191) helpful. In particular, Adrian raises some points about ways we could make type inference in the AST a bit neater. For example: not all patterns can be used for mapping keys. Perhaps there is a way to organize parts of the AST hierarchy that makes this a bit more concrete (though we should be careful to not limit reasonable syntax extensions in the future by doing so).

    A few notes, just from skimming the outline here:

    MatchValue(expr value)

    MatchValue technically contains an expression node, although the actual expressions it can contain are quite limited (dotted names and negative/complex numbers, if I understand correctly). Can we gain anything by splitting this up into nodes for each of these (MatchValue(identifier* parts), etc...) instead?

    MatchClass(expr cls, pattern* patterns, identifier* extra_attrs, pattern* extra_patterns)

    If I remember correctly (this part was implemented a while ago), collecting the different positional and keyword arguments in the parser is sort of simple since we can pass around sequences of keyword nodes easily. I *think* that the new proposed design would require hacks like creating dummy MatchClass nodes with *only* the keyword parts and later combining them with the positional arguments and a class name, since it keeps the keyword names separate from their arguments. Maybe some of the parser folks on this thread can chime in on that.

    Also, continuing the theme above, I think we can be more specific with "expr cls" here (maybe "identifier* cls").

    MatchRestOfSequence(identifier? target)

    In the interest of brevity, maybe "MatchStar" or something?

    A NULL entry in the MatchMapping key list handles capturing extra mapping keys

    I think we could just give MatchMapping a "identifier? rest" member, right? That way it becomes a bit easier to use, and we don't need to worry if it's last or not. It also keeps us from having to handle empty entries in the keys.

    Oh, also, the ast unparser will need to be updated as well.

    @pablogsal
    Copy link
    Member

    Also, I would like some more extensive argument or why not having context dependent nodes justifies the extra maintainance cost (I'm not against that of course, and I am sympathetic, but I think this needs some more extensive coverage or why we are doing this). Notice that we already have context dependent nodes and attributes, as well as implicit meaning of some values like None.

    @isidentical
    Copy link
    Sponsor Member

    Batuhan, perhaps we should change the linked issue for your AST validator PR to this one. That way we can close the old catch-all PEP-634 implementation issue and keep the discussion focused here.

    I think it might be better as a separate issue that has a dependency on this one. I've created bpo-43897 for this.

    @ncoghlan
    Copy link
    Contributor Author

    https://www.python.org/dev/peps/pep-0642/#representing-patterns-explicitly-in-the-abstract-syntax-tree covers the rationale for why it is potentially problematic to reuse expression nodes for match patterns: it doesn't semantically align with the fact that patterns are *not* expressions, so the AST ends up being closer to a concrete syntax tree than we would like.

    For the specifics of the AST nodes:

    • restricting sub expressions directly in the AST itself is tricky without side effects on the AST for non-pattern-matching code, so I'm intending to add any such restrictions that we want to enforce to the validator (other statements and expressions like yield, await, return, break, and continue already have separately enforced restrictions like that)
    • cls needs to be an expression node to allow for attribute lookups
    • the linked draft PR already updates the grammar to emit the new AST. MatchClass is able to collect the keyword args and their corresponding patterns in roughly the same way that MatchMapping collects its key:pattern pairs

    @ncoghlan
    Copy link
    Contributor Author

    The draft PR moved because I messed up the previous branch name: https://github.com/ncoghlan/cpython/pull/9/files

    It's to the point where everything except symtable.c compiles.

    I put an initial skeleton of a validator in, but only enough to check that the parallel sequence lengths in the class and mapping pattern nodes are consistent - there's still plenty to be done in Batuhan's PR (e.g. because the AST constant folding now just delegates to the expression folding functions for MatchValue values and MatchMapping keys, this iteration only enforces the restrictions on permitted subexpressions in the surface syntax).

    In addition to Brandt's MatchStar suggestion, I've also tweaked the MatchClassparameter names to better match PEP-634 (the old names were based on PEP-642's attribute pattern concept, so "extra_" made sense, but "kwd_" makes more sense for PEP-634):

        pattern = MatchAlways
             | MatchValue(expr value)
             | MatchConstant(constant value)
             | MatchSequence(pattern* patterns)
             | MatchMapping(expr* keys, pattern* patterns)
             | MatchClass(expr cls, pattern* patterns, identifier* kwd_attrs, pattern* kwd_patterns)
         | MatchStar(identifier? target)
         -- A NULL entry in the MatchMapping key list handles capturing extra mapping keys
    
         | MatchAs(pattern? pattern, identifier target)
         | MatchOr(pattern* patterns)
          attributes (int lineno, int col_offset, int? end_lineno, int? end_col_offset)
    

    I think the idea of making the MatchMapping node signature "MatchMapping(expr* keys, pattern* patterns, identifier? rest)" has merit, but I'd like to get this initial iteration reviewed first to minimise the diff in the compiler_pattern_mapping implementation (right now it is fairly clear that the code generation for mapping patterns hasn't changed, but that will become significantly less obvious if/when the "**rest" handling changes away from working the same way _PyAST_Dict works)

    @isidentical
    Copy link
    Sponsor Member

         | MatchStar(identifier? target)
    

    +1 on MatchStar(). Much more similiar to the existing node names, and also less semantically-named.

          attributes (int lineno, int col_offset, int? end_lineno, int? end_col_offset)
    

    I'm not really sure whether we should continue following this old tradition of int? end_* attributes for new sum types like pattern, considering that it was done for supporting the creation of old nodes 3.8< without any change on 3.8>=. pattern subclasses will only be created on 3.10>= so we should be safe by requiring end_* attributes.

    @pablogsal
    Copy link
    Member

    I'm not really sure whether we should continue following this old tradition of int? end_* attributes for new sum types like pattern, considering that it was done for supporting the creation of old nodes 3.8< without any change on 3.8>=. pattern subclasses will only be created on 3.10>= so we should be safe by requiring end_* attributes.

    Yeah, it will be a requirement very soon actually.

    @ncoghlan
    Copy link
    Contributor Author

    I've removed the "?" from the end attributes for pattern nodes.

    The draft PR branch now compiles, but I've clearly made a mistake somewhere as it segfaults when trying to compile test_patma.py.

    @ncoghlan
    Copy link
    Contributor Author

    My proposed MatchConstant node name was bothering me, as most constants are actually matched by equality in MatchValue, the same way variables are.

    So I'm switching my proposed name for that node to be MatchSingleton, since that better reflects the key difference with MatchValue (i.e. comparing by identity rather than by value, and specifically comparing with the builtin singletons).

    @ncoghlan
    Copy link
    Contributor Author

    Segfaults are fixed (I had a couple of cases where I wasn't checking for NULL, and another where I was looking up MatchMapping attributes on a MatchClass node).

    test_patma passes except for test cases 240, 241, and 242, which look like genuine regressions - the logic to reject illegal MatchValue nodes is missing from the code generator side, and these particular values make it through the parser OK. I'll need to add back some of the code I took out so these still get rejected.

    @ncoghlan
    Copy link
    Contributor Author

    To get the "0+0" matching case rejected again I had to flesh out a bit more of the AST validator. This initial piece of the validator replaces the previously pattern-matching-specific AST optimisation code that refused to fold BinOp expressions that weren't complex numbers, allowing the compiler to reject them.

    I also added two new tests (240b and 241b) to ensure that 0+0 & f"" were also rejected as mapping keys. 241b passes with the current PR, as f"" gets rejected by both the check in the AST validator *and* by the "constant or attribute lookup" restriction in the compiler, so the latter check covers for the leaky validator.

    240b is disabled for now, as it won't pass until the AST validator checks all the subexpressions being used as keys in a mapping pattern (because it gets constant folded to "0", the check in the compiler accepts it).

    That leaves fixing the unparser tests as the key thing to do before I create the main PR.

    Before that, though, I'll look at potentially simplifying the MatchMapping code by changing the signature as Brandt suggested.

    @ncoghlan
    Copy link
    Contributor Author

    ncoghlan@54940a3 implements Brandt's suggestion of disallowing NULL keys in MatchMapping and instead passing in an explicit "rest" identifier to capture the extra keys.

    This did require moving to a multi-branch pattern definition in the grammar, similar to class patterns, in order to enforce the separating comma when both ordinary key:pattern pairs and a double-star target are present. I think it's a nice ergonomic improvement on the AST node itself though, as it makes it trivial to check if a mapping patterns captures extra keys or not (vs the relatively non-obvious check to see if the last key is NULL/None).

    That just leaves updating the unparser to handle the new node types.

    Capturing the latest node definitions from the draft PR:

        pattern = MatchAlways
             | MatchValue(expr value)
             | MatchSingleton(constant value)
             | MatchSequence(pattern* patterns)
             | MatchMapping(expr* keys, pattern* patterns, identifier? rest)
             | MatchClass(expr cls, pattern* patterns, identifier* kwd_attrs, pattern* kwd_patterns)
         | MatchStar(identifier? target)
         -- The optional "rest" MatchMapping parameter handles capturing extra mapping keys
    
         | MatchAs(pattern? pattern, identifier target)
         | MatchOr(pattern* patterns)
          attributes (int lineno, int col_offset, int end_lineno, int end_col_offset)
    

    @ncoghlan
    Copy link
    Contributor Author

    Setting clear timing expectations, given beta is so close: I'm not expecting to have time to finish the unparser work over the weekend, but I do expect to have time to finish it on Monday evening my time (UTC+10).

    @brandtbucher
    Copy link
    Member

    I can probably find time to take a pass at the unparser, if you want.

    cls needs to be an expression node to allow for attribute lookups

    That's why I suggested "identifier*"! Each identifier in the list would be one part of the dotted name, so we wouldn't need to worry about having arbitrary expressions:

    Foo() -> MatchClass(["Foo"], [], [], [])
    foo.bar.Baz() -> MatchClass(["foo", "bar", "Baz"], [], [], [])

    Not a big deal, though.

    @ncoghlan
    Copy link
    Contributor Author

    Interesting idea - I had missed that you were suggested "identifier*" for the cls node. It's simple enough to check for Name-or-Attribute in the AST validator though, and sticking with a standard expr_ty node means getting a lot of standard handling in the rest of the compiler and the unparser.

    My commitments today finished earlier than I expected, so I'm about to tackle the unparser now.

    @ncoghlan
    Copy link
    Contributor Author

    Working on the unparser picked up an inconsistency I had previously missed: aside from named expressions, "target" attributes in the AST are generally full expression subnodes, rather than identifier attributes. Identifier attributes on nodes are typically called "name", including in the originally implemented MatchAs node.

    Accordingly, I've switched both MatchStar and MatchAs over to having "name" attributes instead of "target" attributes.

    @ncoghlan
    Copy link
    Contributor Author

    PR against the main repo is now up: #25585

    A couple of things in the unparser tripped me up (ast_unparse.c only handles annotation expressions, the way ast._Unparser.visit() is defined means NodeVisitor.generic_visit can't be used), so the PR also includes some clarifying comments for future editors.

    @ncoghlan
    Copy link
    Contributor Author

    PR is green now, and I don't have any further changes I want to make, so Brandt, please feel free to merge if you're happy with the proposed AST nodes, and the way I've gone about implementing them.

    If you'd like some adjustments, it probably makes the most sense to just amend the PR directly - otherwise the 24-hour turnaround on comments due to the time zone difference will potentially give us trouble with the beta deadline.

    @ncoghlan ncoghlan assigned brandtbucher and unassigned ncoghlan Apr 26, 2021
    @ncoghlan ncoghlan assigned brandtbucher and unassigned ncoghlan Apr 26, 2021
    @brandtbucher
    Copy link
    Member

    Sounds good. I'll probably be able to make time to review it today, or tomorrow at the latest.

    @brandtbucher
    Copy link
    Member

    New changeset 1e7b858 by Nick Coghlan in branch 'master':
    bpo-43892: Make match patterns explicit in the AST (GH-25585)
    1e7b858

    @ncoghlan
    Copy link
    Contributor Author

    After the next docs build, the documentation for the explicit AST nodes should be available https://docs.python.org/dev/library/ast.html#pattern-matching

    @brandtbucher
    Copy link
    Member

    New changeset dbe60ee by Brandt Bucher in branch 'master':
    bpo-43892: Validate the first term of complex literal value patterns (GH-25735)
    dbe60ee

    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    3.10 only security fixes type-feature A feature request or enhancement
    Projects
    None yet
    Development

    No branches or pull requests

    4 participants