classification
Title: Add optional type information to asdl_seq objects
Type: Stage: resolved
Components: Interpreter Core Versions: Python 3.10
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: BTaskaya, gvanrossum, lys.nikolaou, pablogsal, rhettinger
Priority: normal Keywords: patch

Created on 2020-09-08 20:31 by pablogsal, last changed 2020-10-21 19:55 by lys.nikolaou. This issue is now closed.

Pull Requests
URL Status Linked Edit
PR 22223 merged pablogsal, 2020-09-13 02:43
PR 22786 closed lys.nikolaou, 2020-10-19 21:51
PR 22864 merged lys.nikolaou, 2020-10-21 19:03
Messages (21)
msg376592 - (view) Author: Pablo Galindo Salgado (pablogsal) * (Python committer) Date: 2020-09-08 20:31
Casting incorrect elements pulled from asdl_seq objects has been a pain when developing the new PEG parser and is a source of spooky-bug-at-a-distance problems in which the consequences of an incorrect casting from void* are felt much later.

I propose to add a new field to asdl_seq objects, which is an enumeration of the possible types it can contain. The main ideas are:

* By default, the enumeration will have UNDEFINED type.
* We can add some extra macros mirroring asdl_seq_GET and asdl_seq_SET that will receive the expected type and in debug mode will assert that the type is correct. Something like:

expr_ty item = asdl_set_GET_TYPED(sequence, n, EXPR);

* Usage of asdl_seq_GET and asdl_seq_SET do not do extra checks.

* To set the type information, we can add a new constructor:

asdl_seq_new_typed(size, arena, TYPE);

I think this change is worth because is not very invasive (old usage remains the same), we can slowly migrate only the parts that we need/want and will add some extra debugging possibilities for cases that has been quite challenging.
msg376595 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2020-09-08 20:45
Let me know if I'm misunderstanding the proposal.  Are you proposing a non-standard ASDL extension to that won't work with existing ASDL parsers?  I thought the entire point of ASDL is that it was standard, portable, and had a fixed number of types easily implemented in many languages.

FWIW, Here are some of my notes from a compiler project I worked on a while back:
# ASDL's seven builtin types are:
# - identifier, int, string, bytes, object, singleton, constant
# - singleton: None, True or False
# - constant can be None, whereas None means "no value" for object
#
# Resources for ASDL:
# https://www.usenix.org/legacy/publications/library/proceedings/dsl97/full_papers/wang/wang.pdf
# https://eli.thegreenplace.net/2014/06/04/using-asdl-to-describe-asts-in-compilers
# https://www.oilshell.org/blog/2016/12/11.html
msg376596 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2020-09-08 20:49
Within standard ASDL, the sequences already have an ASDL type.  Here's a snippet that shows both builtin types and derived types:

start = Program(stmt* procs, expr* calls)

stmt = Procedure(identifier name, identifier* params, bool is_test, stmt body)
     | Block(int blocknum, stmt* stmts)
     | Assign(lvalue target, expr value)
     | If(expr cond, stmt action)
     | Loop(int times, bool fixed, body stmt)
     | AbortLoop(int blocknum)
     | QuitBlock(int blocknum)

expr = BinOp(expr value1, str op, expr value2)
     | Number(int x)
     | Bool(bool x)
     | Id(identifier name)
     | Call(identifier name, expr* args)
     | lvalue

lvalue = Output(bool is_bool)
       | Cell(int i)
msg376597 - (view) Author: Batuhan Taskaya (BTaskaya) * (Python triager) Date: 2020-09-08 20:51
We already have a specialized type for int (actually an enumeration);
typedef struct {
    Py_ssize_t size;
    void *elements[1];
} asdl_seq;

typedef struct {
    Py_ssize_t size;
    int elements[1];
} asdl_int_seq;

Why not just include these in the Python-ast.h and auto-generate them in the asdl_c.py
msg376598 - (view) Author: Pablo Galindo Salgado (pablogsal) * (Python committer) Date: 2020-09-08 20:52
> Let me know if I'm misunderstanding the proposal.  Are you proposing a non-standard ASDL extension to that won't work with existing ASDL parsers? 

No, I am not proposing any modification to the ASDL parsers nor the ASDL definition language.

I am proposing a modification to the internal data structures we use internally to carry sequences of ASDL types. Currently, asdl_seq is a type that holds an array of void* pointers and the user is supposed to cast the individual items to the correct types, which is error prone. I am proposing to add a new field to our internal structures that allows the user to have some certainty over what these sequences contain.
msg376599 - (view) Author: Pablo Galindo Salgado (pablogsal) * (Python committer) Date: 2020-09-08 20:53
> Why not just include these in the Python-ast.h and auto-generate them in the asdl_c.py

That's a possibility, but that would incur in a lot of new extra code, and some asdl_seq items in the parser contain non-standard types that only PEG uses, so is not possible to auto-generate them.
msg376600 - (view) Author: Batuhan Taskaya (BTaskaya) * (Python triager) Date: 2020-09-08 20:53
The problem with asdl_seq_int was that it was hard-coded since there was no entity that transpods data between different traversers / node visitors. PR 20193 tries to implement a general-purpose metadata system to eliminate that issue, and can be easily extended for this too!
msg376601 - (view) Author: Batuhan Taskaya (BTaskaya) * (Python triager) Date: 2020-09-08 20:54
> That's a possibility, but that would incur in a lot of new extra code, and some asdl_seq items in the parser contain non-standard types that only PEG uses, so is not possible to auto-generate them.

Well, since the extra code will be autogenerated, guess it wouldn't be much problematic. But I see about the PEG thing.
msg376602 - (view) Author: Pablo Galindo Salgado (pablogsal) * (Python committer) Date: 2020-09-08 20:57
> Well, since the extra code will be autogenerated, guess it wouldn't be much problematic. 

It will because it means that generic functions that receive asdl_seq an need to cast aliased pointers because they will still receive asdl_seq* items and you now need to cars them to the appropriate type:

int foo(asdl_seq* my_seq) {
    asdl_seq_expr* = (asdl_seq_expr*)my_seq
}

So you will end up with what I am proposing: a flag in asdl_seq that tells you of what type it is.
msg376603 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2020-09-08 21:04
> No, I am not proposing any modification to the ASDL parsers nor 
> the ASDL definition language.  I am proposing a modification to 
> the internal data structures we use internally to carry sequences
> of ASDL types.

Thanks for the clarification.  Carry on :-)
msg376604 - (view) Author: Pablo Galindo Salgado (pablogsal) * (Python committer) Date: 2020-09-08 21:15
> Well, since the extra code will be autogenerated, guess it wouldn't be much problematic. But I see about the PEG thing.


On the other hand, having explicit types for the sequences will allow us to get compile errors instead of runtime errors for the standard types, so this may be a good possibility. Maybe some mixture of both:

* asdl_seq objects can contain an enum describing what they are.
* We have specific specializations for adding to ASDL types.
msg376606 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2020-09-08 21:53
I know it's moot now, but still -- what benefit do we get from using a "standard" like ASDL? All our tooling for it is custom for Python only.
msg376608 - (view) Author: Pablo Galindo Salgado (pablogsal) * (Python committer) Date: 2020-09-08 22:04
> I know it's moot now, but still -- what benefit do we get from using a "standard" like ASDL? All our tooling for it is custom for Python only.

I think there are other tools or implementations of Python that use standard parsers for our ASDL file.

In any case, AFAIU we don't need to modify the current ASDL definition as it already has type information that we need. For instance:

     | Block(int blocknum, stmt* stmts)

that says that stmts is a sequence of stmt objects, but unfortunately, this is translated as an "asdl_seq" type, losing the information that it can only contain stmt types. This is what I am planning to address here.
msg376615 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2020-09-09 01:43
> I know it's moot now, but still -- what benefit do we get 
> from using a "standard" like ASDL?

The "standard" part of it isn't important.  AFAICT, ASDL has a low adoption rate and is not maintained.

IMO, the part that matters is that ADSL was carefully balanced to be sufficiently expressive while keeping it easy to implement and easy to automatically translate into different languages.  Presumably, this will not only help other Python implementations and third-party tooling, it will also make life easier for us in the long run.

My understanding of the origin of ASDL is that it aspired to solve a common problem in language design where people commonly described their abstract syntax in way that was too tightly bound to underlying implementation language.  This caused long-run problems when reimplementing in other languages and when trying to automatically update downstream tools that interoperate with the AST.

In this regard, my personal experience with ASDL has been favorable.  I view it as the JSON spec of the AST world, intentionally minimal yet expressive.

That said, I think it failed to establish itself as a standard and almost no tooling was created for it.  The original authors expected that ASDL would sit side-by-side with BNF and regex notation.  That was a pipe dream.
msg376617 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2020-09-09 03:40
I guess some of my gripes about ASDL have to do with our tooling. For example I find it annoying that the 'kind' enums overlap, so if I have a void* that I know points to an AST node I can't look at the kind to tell what it is.

Anyway, regarding Pablo's proposal, I think that if we do it, we should do a full migration, not a piecemeal one. I think the vast majority of asdl_seq_GET calls are immediately preceded by a cast anyway, and the majority of asdl_seq_SET calls are in generated code.
msg376619 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2020-09-09 05:55
That all makes sense (including the full migration).
msg377010 - (view) Author: Pablo Galindo Salgado (pablogsal) * (Python committer) Date: 2020-09-16 18:42
New changeset a5634c406767ef694df49b624adf9cfa6c0d9064 by Pablo Galindo in branch 'master':
bpo-41746: Add type information to asdl_seq objects (GH-22223)
https://github.com/python/cpython/commit/a5634c406767ef694df49b624adf9cfa6c0d9064
msg378962 - (view) Author: Lysandros Nikolaou (lys.nikolaou) * (Python committer) Date: 2020-10-19 17:06
Pablo, can this be closed now that PR22223 is merged?
msg379001 - (view) Author: Pablo Galindo Salgado (pablogsal) * (Python committer) Date: 2020-10-19 19:45
> Pablo, can this be closed now that PR22223 is merged?

Nop, we still need to fix the CHECK macros that do type erasure (they cast the result to void).
msg379239 - (view) Author: Lysandros Nikolaou (lys.nikolaou) * (Python committer) Date: 2020-10-21 19:53
New changeset 2e5ca9e3f68b33abb7d2c66d22ffc18dec40641a by Lysandros Nikolaou in branch 'master':
bpo-41746: Cast to typed seqs in CHECK macros to avoid type erasure (GH-22864)
https://github.com/python/cpython/commit/2e5ca9e3f68b33abb7d2c66d22ffc18dec40641a
msg379240 - (view) Author: Lysandros Nikolaou (lys.nikolaou) * (Python committer) Date: 2020-10-21 19:55
Last bit of work is now done.
History
Date User Action Args
2020-10-21 19:55:02lys.nikolaousetstatus: open -> closed
resolution: fixed
messages: + msg379240

stage: patch review -> resolved
2020-10-21 19:53:22lys.nikolaousetmessages: + msg379239
2020-10-21 19:03:08lys.nikolaousetpull_requests: + pull_request21806
2020-10-19 21:51:20lys.nikolaousetpull_requests: + pull_request21743
2020-10-19 19:45:38pablogsalsetmessages: + msg379001
2020-10-19 17:06:43lys.nikolaousetmessages: + msg378962
2020-09-16 18:42:07pablogsalsetmessages: + msg377010
2020-09-13 02:43:25pablogsalsetkeywords: + patch
stage: patch review
pull_requests: + pull_request21278
2020-09-09 05:55:04rhettingersetmessages: + msg376619
2020-09-09 03:40:12gvanrossumsetmessages: + msg376617
2020-09-09 01:43:01rhettingersetmessages: + msg376615
2020-09-08 22:04:09pablogsalsetmessages: + msg376608
2020-09-08 21:53:02gvanrossumsetmessages: + msg376606
2020-09-08 21:15:34pablogsalsetmessages: + msg376604
2020-09-08 21:04:39rhettingersetmessages: + msg376603
2020-09-08 20:57:07pablogsalsetmessages: + msg376602
2020-09-08 20:54:32BTaskayasetmessages: + msg376601
2020-09-08 20:53:25BTaskayasetmessages: + msg376600
2020-09-08 20:53:11pablogsalsetmessages: + msg376599
2020-09-08 20:52:02pablogsalsetmessages: + msg376598
2020-09-08 20:51:28BTaskayasetmessages: + msg376597
2020-09-08 20:49:03rhettingersetmessages: + msg376596
2020-09-08 20:46:23BTaskayasetnosy: + BTaskaya
2020-09-08 20:45:50rhettingersetnosy: + rhettinger
messages: + msg376595
2020-09-08 20:31:17pablogsalcreate