classification
Title: tokenize, ast: No direct way to parse tokens into AST, a gap in the language processing pipeline
Stage: resolved
Components: Library (Lib)
Versions: Python 3.10

process
Status: closed
Resolution: rejected
Nosy List: BTaskaya, lys.nikolaou, pablogsal, pfalcon, serhiy.storchaka
Priority: normal
Keywords: patch

Created on 2020-12-24 10:19 by pfalcon, last changed 2022-04-11 14:59 by admin. This issue is now closed.

Pull Requests
PR 23922 (closed) - pfalcon, 2020-12-24 10:26
Messages (11)
msg383680 - (view) Author: Paul Sokolovsky (pfalcon) * Date: 2020-12-24 10:19
Currently, it's possible:

* To get from the stream-of-characters program representation to the AST representation (ast.parse()).
* To get from an AST to a code object (compile()).
* To get from a code object to a first-class function, to execute the program (see the sketch below).
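
For illustration, a minimal sketch of that pipeline, using only documented stdlib calls:

    import ast

    source = "print('hello')"
    tree = ast.parse(source)                  # characters -> AST
    code = compile(tree, "<demo>", "exec")    # AST -> code object
    exec(code)                                # run it: prints "hello"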

Python also offers the "tokenize" module, but it stands as a disconnected island: the only thing it allows you to do is to get from the stream-of-characters program representation to a stream of tokens, and back. Yet conceptually, tokenization is not a disconnected feature; it's the first stage of the language processing pipeline. The fact that "tokenize" is disconnected from the rest of the pipeline listed above is more an artifact of the CPython implementation: both the "ast" module and compile() are backed by the underlying bytecode compiler written in C, and that's what connects them.
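
Concretely, that round trip looks like this (a minimal sketch):

    import io
    import tokenize

    source = "x = 1 + 2\n"
    tokens = list(tokenize.generate_tokens(io.StringIO(source).readline))
    print(tokenize.untokenize(tokens), end="")   # prints the original source back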

On the other hand, the "tokenize" module is pure Python, while the underlying compiler has its own tokenizer implementation (not exposed). That is the likely reason for the disconnect between "tokenize" and the rest of the infrastructure.

I propose to close that gap and establish an API which would allow parsing a token stream (an iterable) into an AST. An initial implementation for CPython can (and likely should) be naive, looping back through the surface program representation. That's ok; again, the idea is to establish a standard API for going tokens -> AST, which individual Python implementations can then optimize based on their needs.

The proposed name is ast.parse_tokens(). It follows the signature of the existing ast.parse(), except that the first parameter is "token_stream" instead of "source".

Another alternative would be to overload the existing ast.parse() to accept a token iterable. I guess at the current stage, where we try to tighten up the type strictness of the API and have clear typing signatures for API functions, that is not the favored solution.
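
A naive version of the proposed function, in the spirit of the patch attached to this issue (a sketch only; the extra parameters are assumed by analogy with ast.parse()):

    import ast
    import tokenize

    def parse_tokens(token_stream, filename="<unknown>", mode="exec"):
        # Naive approach: round-trip through the surface form,
        # then reuse the existing ast.parse().
        return ast.parse(tokenize.untokenize(token_stream), filename, mode)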
msg383681 - (view) Author: Batuhan Taskaya (BTaskaya) * (Python committer) Date: 2020-12-24 10:34
> I propose to close that gap and establish an API which would allow parsing a token stream (an iterable) into an AST. An initial implementation for CPython can (and likely should) be naive, looping back through the surface program representation.

There are different aspects to this problem (like the maintenance cost of either exposing the underlying tokenizer, or building something like Python-ast.c to convert these 2 different token types back and forth; I'm a big -1 on both of them), but the thing I don't quite get is the use case.

What prevents you from using ast.parse(tokenize.untokenize(token_stream))? It is guaranteed that you won't miss anything in terms of the positions of tokens (since it round-trips almost every case).

Also, tokens -> AST is not the only disconnected part of the underlying compiler. Stuff like AST -> Symbol Table, AST -> Optimized AST, etc. is also not available, and apparently not needed (since nobody else, except maybe me [about the AST -> ST conversion], has complained about these being missing).

I'd also suggest moving the discussion to Python-ideas, for a much greater audience.
msg383682 - (view) Author: Paul Sokolovsky (pfalcon) * Date: 2020-12-24 10:54
> What prevents you from using ast.parse(tokenize.untokenize(token_stream))?

That's exactly the implementation in the patch now submitted against this issue. But that's the patch for CPython; the motive of the proposal here is to establish a standard API call for *Python*, which different implementations can implement however they like/can/need.

> Also, tokens -> AST is not the only disconnected part of the underlying compiler.

We should address them, one by one.

> Stuff like AST -> Symbol Table 

Kinda, yes. Again, based on CPython implementation history, we have only source -> Symbol Table (https://docs.python.org/3/library/symtable.html). Would be nice to address that (separately, of course).
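
For reference, that existing source -> symbol table path (a minimal sketch):

    import symtable

    top = symtable.symtable("def f(x): return x + y", "<demo>", "exec")
    f = top.get_children()[0]     # the symbol table of f
    print(f.get_parameters())     # ('x',)
    print(f.get_globals())        # ('y',)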

> AST -> Optimized AST

Yes. PEP 511 touched on that, but as it was rejected as a whole, any useful sub-ideas from it don't seem to be making further progress either (like being able to disable some optimizations, and then maybe even exposing them as standalone passes).
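
For a taste of what such a standalone pass could look like on top of the public ast API (a toy sketch, not how CPython's optimizer actually works), here is a constant-folding transformer:

    import ast

    class FoldAdd(ast.NodeTransformer):
        # Toy pass: fold constant additions like 1 + 2 into 3.
        def visit_BinOp(self, node):
            self.generic_visit(node)    # fold children first
            if (isinstance(node.op, ast.Add)
                    and isinstance(node.left, ast.Constant)
                    and isinstance(node.right, ast.Constant)):
                return ast.copy_location(
                    ast.Constant(node.left.value + node.right.value), node)
            return node

    tree = FoldAdd().visit(ast.parse("x = 1 + 2 + 3"))
    print(ast.unparse(ast.fix_missing_locations(tree)))    # x = 6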

> I'd also suggest moving the discussion to Python-ideas, for a much greater audience.

That's what I usually do, but I've posted too much there recently. I wanted to submit a patch right away, but noticed that the standard commit message format is "bpo-XXXXX: ...", so I created a ticket here to reference in the commit.
msg383683 - (view) Author: Paul Sokolovsky (pfalcon) * Date: 2020-12-24 11:02
> but the thing I don't quite get is the use case.

And in case that went unanswered: the use case, as I'd formulate it, is to stop exposing the historical CPython implementation detail of "tokenize" being disconnected from the rest of the language processing infrastructure, and to stop making users learn tricks like going back to the character form of a program if they ever start to use "tokenize"; instead, let it all integrate well into a single processing pipeline.
msg383684 - (view) Author: Batuhan Taskaya (BTaskaya) * (Python committer) Date: 2020-12-24 11:05
> That's exactly the implementation in the patch now submitted against this issue. But that's the patch for CPython; the motive of the proposal here is to establish a standard API call for *Python*, which different implementations can implement however they like/can/need.

I don't feel great about it, but if your final motive is to address this issue for other implementations (like pycopy?), I still think that Python-ideas is the best place to discuss it, rather than the bug tracker of CPython.

> We should address them, one by one.

I am not sure about that. IIRC, @pablogsal and I talked about these a year or so ago and decided that there wouldn't be any clear benefit (since nobody has come forward with a need for it), while there would be the downside of sometimes limiting ourselves, for backward compatibility's sake, to an API that almost no one will use.

> That's what I usually do, but I've posted too much there recently. I wanted to submit a patch right away, but noticed that the standard commit message format is "bpo-XXXXX: ...", so I created a ticket here to reference in the commit.

What people do in these cases is push to their branches with a placeholder commit message ("Implement blabla") and then share the link to the branch in their post on Python-ideas.
msg383687 - (view) Author: Lysandros Nikolaou (lys.nikolaou) * (Python committer) Date: 2020-12-24 12:50
The thing is that the parser itself does not get a stream of tokens as input. It only accepts either a file or a string and it lazily converts its input to tokens.

As for the PR attached to this issue, I'm -1 on it. I don't think the use case is common enough for us to add another public function that we need to maintain and keep backwards-compatible. I concur with Batuhan that if people need this, they can use ast.parse with tokenize.untokenize.
msg383688 - (view) Author: Pablo Galindo Salgado (pablogsal) * (Python committer) Date: 2020-12-24 13:14
I am with Lysandros and Batuhan. The parser is considerably coupled with the C tokenizer, and the only way to reuse *the parser* is to make it flexible enough to receive a token stream of Python objects as input. That would not only have a performance impact on normal parsing but would also raise the complexity of the task considerably, especially taking into account that the use case is quite restricted and is something you can already achieve by transforming the token stream into text and using ast.parse.

There is considerable tension between exposing parts of the compiler pipeline for introspection and other capabilities, and our ability to do optimizations. Given how painful it has been in the past to deal with this, my view is to avoid exposing anything in the compiler pipeline as much as possible, so we don't shoot ourselves in the foot in the future if we need to change stuff around.
msg383689 - (view) Author: Paul Sokolovsky (pfalcon) * Date: 2020-12-24 13:30
> There is considerable tension between exposing parts of the compiler pipeline for introspection and other capabilities, and our ability to do optimizations. Given how painful it has been in the past to deal with this, my view is to avoid exposing anything in the compiler pipeline as much as possible, so we don't shoot ourselves in the foot in the future if we need to change stuff around.

That's a somewhat extreme outcome: the problem is known and understood, but the best solution is deemed to be doing nothing.

But the problem of "doing it wrong" is definitely known, and it exists. One known way to address it is to design generic interfaces and implement them. This ticket is exactly about that: defining a missing interface for a parser in Python. It's not about *the* CPython C parser and its peculiarities. (Though even that fits the generic interface proposed.)
msg383690 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2020-12-24 13:36
I concur with the other core developers. It seems there is no real need for this feature, and the idea was proposed purely to "close a gap". Taking into account the significant cost of implementing and maintaining this feature, I think it should not be added.
msg383691 - (view) Author: Paul Sokolovsky (pfalcon) * Date: 2020-12-24 13:55
> the idea was proposed purely to "close a gap"

That pinpoints it well. I was just writing a tutorial on implementing custom import hooks, with the idea of showing people how easy it is to do in Python. As the first step, I explained that it's a bad idea to do any transformations on the surface representation of a program; at the very least, it should be converted to a token stream. But then I found that I need to explain that we have to convert it back, which sounds pretty weird and undermines the idea:

    import tokenize

    def xform(token_stream):
        # Rewrite "function" NAME tokens into "lambda" at the token level.
        for t in token_stream:
            if t[0] == tokenize.NAME and t[1] == "function":
                yield (tokenize.NAME, "lambda") + t[2:]
            else:
                yield t

    # (Fragment of an import-hook callback; `filename` and `imphook` are
    # defined by the surrounding tutorial code.)
    with open(filename, "rb") as f:
        # Fairly speaking, tokenizing just to convert back to string form
        # isn't too efficient, but CPython doesn't offer us a way to parse
        # a token stream so far, so we have no choice.
        source = tokenize.untokenize(xform(tokenize.tokenize(f.readline)))
    mod = type(imphook)("")  # type(imphook) is the module type (types.ModuleType)
    exec(source, vars(mod))
    return mod

Having written that comment, I thought I could just as well take one more step and monkey-patch "ast" with a parse_tokens() function. I'd need to explain that too, but the explanation probably wouldn't sound worse than the one above. And then it was just one more step to actually submit a patch for ast.parse_tokens(), and that's how this ticket was created!
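
That monkey-patch could reuse the same naive approach (hypothetical, for illustration only):

    import ast
    import tokenize

    # Shim until something like this exists upstream.
    ast.parse_tokens = lambda ts, *args, **kw: ast.parse(
        tokenize.untokenize(ts), *args, **kw)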
msg383692 - (view) Author: Batuhan Taskaya (BTaskaya) * (Python committer) Date: 2020-12-24 13:56
Thank you for your patch though!
History
Date                 User              Action  Args
2022-04-11 14:59:39  admin             set     github: 86895
2020-12-24 13:56:52  BTaskaya          set     messages: + msg383692
2020-12-24 13:55:12  pfalcon           set     messages: + msg383691
2020-12-24 13:36:38  serhiy.storchaka  set     status: open -> closed; resolution: rejected; messages: + msg383690; stage: resolved
2020-12-24 13:30:34  pfalcon           set     messages: + msg383689
2020-12-24 13:14:59  pablogsal         set     messages: + msg383688
2020-12-24 12:50:56  lys.nikolaou      set     nosy: + lys.nikolaou; messages: + msg383687
2020-12-24 11:05:30  BTaskaya          set     messages: + msg383684
2020-12-24 11:02:36  pfalcon           set     messages: + msg383683
2020-12-24 10:54:42  pfalcon           set     messages: + msg383682
2020-12-24 10:34:15  BTaskaya          set     messages: + msg383681; stage: patch review -> (no value)
2020-12-24 10:26:19  pfalcon           set     keywords: + patch; stage: patch review; pull_requests: + pull_request22773
2020-12-24 10:19:55  pfalcon           create