classification
Title: Provide a supported Concrete Syntax Tree implementation in the standard library
Type: Stage:
Components: Library (Lib) Versions: Python 3.8
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: Ethan Smith, benjamin.peterson, gregory.p.smith, gvanrossum, jwilk, kernc, levkivskyi, lukasz.langa, n_rosenstein, njs, serhiy.storchaka, zsol
Priority: normal Keywords: patch

Created on 2018-04-23 01:04 by lukasz.langa, last changed 2019-01-13 07:59 by kernc.

Pull Requests
URL Status Linked
PR 6572 open lukasz.langa, 2018-04-24 03:54
Messages (16)
msg315638 - (view) Author: Łukasz Langa (lukasz.langa) * (Python committer) Date: 2018-04-23 01:04
Python includes a set of batteries that enable parsing of Python code.  This
includes its own AST (provided in the standard library under the `ast` module),
as well as a pure Python tokenizer (provided in the standard library under
`tokenize` and `token`).  It also provides an undocumented CST under lib2to3,
which contains its own outdated and patched copies of `tokenize` and `token`.

This situation causes the following issues for users of Python:
- the built-in AST does not preserve comments or whitespace;
- the built-in AST increasingly modifies the tree before presenting it to user
  code (constant folding moved to the AST in Python 3.7);
- the built-in tokenize.py can only be used to parse Python 3.7+ code;
- the version in lib2to3 is partially customized and partially outdated,
  leaving bits of new grammar not supported; new bits of grammar very often get
  overlooked in lib2to3.
- lib2to3 is not documented.

So if users want to write tools that manipulate Python code, the standard
library doesn't provide them with great options.
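To make the first limitation concrete, here is a small check using only the stdlib `ast` and `tokenize` modules: the AST drops a comment entirely, while the tokenizer at least surfaces it as an explicit COMMENT token.

```python
import ast
import io
import tokenize

source = "x = 1  # answer\n"

# The AST has no node for the comment; it is simply gone.
tree = ast.parse(source)
assert "answer" not in ast.dump(tree)

# The tokenizer, by contrast, emits an explicit COMMENT token.
tokens = tokenize.generate_tokens(io.StringIO(source).readline)
comments = [tok.string for tok in tokens if tok.type == tokenize.COMMENT]
assert comments == ["# answer"]
```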

I suggest the following plan:

1. Bring Lib/lib2to3/pgen2/tokenize.py to the same state as Lib/tokenize.py
   (leaving the bits that allow for parsing of Python 3.6 and older files).

2. Merge the two tokenizers in Python 3.8 so that Lib/tokenize.py now
   officially supports tokenizing Python 2.7 - 3.7 code.

3. Update Lib/lib2to3/pgen2 and move it under Lib/pgen.  Document it as the
   built-in CST provided by Python for use in applications which require code
   modification.  Make it still officially support parsing of Python 2.7 - 3.7
   code.

All three changes are made in a backwards-compatible fashion, existing code
should NOT break.  That being said, the parser under Lib/pgen might grow some
new behavior compared to the compatibility mode for lib2to3, I specifically
seek to improve handling of comments and error recovery.
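One taste of the gap steps 1 and 2 would close: today's Lib/tokenize.py no longer understands Python 2's octal literal spelling, so `0777` falls apart into two NUMBER tokens (a sketch of observed behavior, not an exhaustive list of incompatibilities).

```python
import io
import tokenize

# Python 2 spelled octal literals as 0777; Python 3's tokenizer only
# knows 0o777, so the old form is split into two NUMBER tokens.
tokens = tokenize.generate_tokens(io.StringIO("0777\n").readline)
numbers = [tok.string for tok in tokens if tok.type == tokenize.NUMBER]
assert numbers == ["0", "777"]
```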
msg315642 - (view) Author: Łukasz Langa (lukasz.langa) * (Python committer) Date: 2018-04-23 01:28
See BPO-33338 for an implementation of Step 1.
msg315646 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2018-04-23 04:48
I'm glad you've rediscovered pgen2!

I'm in favor of unifying the tokenizers and of updating and moving pgen2 (though I don't have time to do the work).

I'm not sure if it's technically possible to give tokenize.py the ability to tokenize Python 2.7 and up without some version-selection flag -- have you researched this part yet?

Also I think you may have to make a distinction between the parser generator and its data structures, and between the generated parser for Python vs. the parser for other LL(1) grammars one might feed into it.

And I don't think you're proposing to replace Parser/pgen.c with Lib/pgen/, right? Nor to replace the CST actually used by CPython's parser with the data structures used by pgen2's driver. So the relationship between the CST you propose to document and CPython internals wouldn't be quite the same as that between the AST used by CPython and the ast module (since those *do* actually use the same code).
msg315647 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2018-04-23 07:01
> - the built-in AST increasingly modifies the tree before presenting it to user
>   code (constant folding moved to the AST in Python 3.7);

These modifications are applied only before bytecode generation. The AST presented to the user is not modified.

> - the built-in tokenize.py can only be used to parse Python 3.7+ code;

Is this a problem? 2.7 is a dead end, its support ends in less than 2 years. Even 3.6 will be moved to a security-fixes-only stage a short time after 3.8 is released.

I'm in favor of updating Lib/lib2to3/pgen2/tokenize.py, but I don't understand why Lib/tokenize.py should parse 2.7.

I'm in favor of reimplementing pgen in Python if this will simplify the code and the building process. Python code is simpler than C code, this code is not performance critical, and in any case we need an external Python interpreter when modifying the grammar or bytecode.

See also issue30455 where I try to get rid of duplications by generating all tokens-related data and code from a single source (token.py or external text file).

For what purposes is the CST needed besides 2to3? I know only that it could help determine the correct position in docstrings for doctests and similar tools which need to process docstrings and report errors. This is not possible with the AST due to inlined '\n', escaped newlines, and string literal concatenation. Changes in 3.7 made this even worse (see issue32911).
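A minimal stdlib-only illustration of that docstring problem: implicit string concatenation and an escaped newline put a two-line docstring on one physical source line, so AST positions cannot map docstring lines back to file lines.

```python
import ast

# One physical line produces a two-line docstring: the escaped newline
# and implicit string concatenation hide the real file positions.
src = 'def f():\n    "line1\\n" "line2"\n'
module = ast.parse(src)
doc = ast.get_docstring(module.body[0])
assert doc == "line1\nline2"

# Both docstring lines come from physical line 2 of the source, so a
# tool reporting an error on docstring line 2 has no file position.
assert module.body[0].body[0].lineno == 2
```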
msg315648 - (view) Author: Łukasz Langa (lukasz.langa) * (Python committer) Date: 2018-04-23 07:36
> I'm in favor of unifying the tokenizers and of updating and moving pgen2 (though I don't have time to do the work).

I'm willing to do all the work as long as I have somebody to review it. Case in point: BPO-33338.



> Also I think you may have to make a distinction between the parser generator and its data structures, and between the generated parser for Python vs. the parser for other LL(1) grammars one might feed into it.

Technically pgen2 has the ability to parse any LL(1) grammar but so far the plumbing is tightly tied to the tokenizer.  We'd need to enable plugging that in, too.



> And I don't think you're proposing to replace Parser/pgen.c with Lib/pgen/, right?

No, I'm not.



> Nor to replace the CST actually used by CPython's parser with the data structures used by pgen2's driver.

No, I'm not.



> So the relationship between the CST you propose to document and CPython internals wouldn't be quite the same as that between the AST used by CPython and the ast module (since those *do* actually use the same code).

Right.  Once we unify the standard library tokenizers (note: *not* tokenizer.c which will stay), there wouldn't be much extra documentation to write for Lib/tokenize.py.  For Lib/pgen/ itself, we'd need to provide both an API rundown and an intro to the high-level functionality (how to create trees from files, strings, etc.; how to visit trees and edit them; and so on).


> I'm not sure if it's technically possible to give tokenize.py the ability to tokenize Python 2.7 and up without some version-selection flag -- have you researched this part yet?

There are two schools of thought here. This is going to take a while to explain :)

One school is to force the caller to declare what Python version they want to parse.  This is algorithmically cleaner because we can then literally take Grammar/Grammar from various versions of Python and have the user worry about picking the right one.

The other school is what lib2to3 does currently, which is to try to implement as much of a superset of Python versions as possible.  This is way easier to use because the grammar is very forgiving.  However, this has limitations.  There are three major incompatibilities that we need to deal with, in rising order of severity:
- async/await;
- print statements;
- exec statements.

Async and await became proper keywords in 3.7 and thus broke usage of those words as names.  It's relatively easy to work around this one seamlessly by keeping the grammar trickery we've had in place for 3.5 and 3.6.  This is what lib2to3 already does today 👍🏻

The print statement is fundamentally incompatible with the print function.  lib2to3 has two grammar variants and most users by default choose the one without the print statement.  Why?  Because it cannot be reliably sniffed anymore.  Python 3-only code will not use the __future__ import.  In fact, 2to3 doesn't do auto-detection either; it relies on the user running `2to3 -p` to indicate they mean the grammar with the print function.

The exec statement is even worse because there isn't even a __future__ import.  It's annoying because it creates a third combination. 👎🏻
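All three incompatibilities above are easy to observe from Python 3.7+ itself: none of the legacy spellings survives the parser.

```python
# None of the legacy spellings survive the Python 3.7+ parser: the
# print and exec statements are gone, and "async" is a hard keyword.
rejected = []
for legacy in ("print 'hello'", "exec 'pass'", "async = 1"):
    try:
        compile(legacy, "<legacy>", "exec")
    except SyntaxError:
        rejected.append(legacy)
assert len(rejected) == 3
```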

So now the driver has to attempt three grammars (in this order):
- the almost compatible combined Python 2 + Python 3 one (that assumes exec is a function and print is a function);
- the one that assumes exec is a *statement* but print is still a function (because __future__ import);
- the one that exposes the legacy exec and print statements.

This approach has one annoying wart.  Imagine you have a file like this:

  print('msg', file=sys.stderr)
  if

Now the driver will attempt all three grammars and fail, and will report that the parse error is on the print line.  This can be overcome by comparing syntax errors from each grammar and showing the one on the furthest line (which is the most likely to be the real culprit).  But it's still annoying and will sometimes not do what the user wanted.
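That fallback loop can be sketched as follows; `parse_any`, `parse_with`, and the grammar objects are hypothetical placeholders, not an existing lib2to3 API.

```python
def parse_any(source, grammars, parse_with):
    """Try each grammar in turn; on total failure, re-raise the syntax
    error that got furthest into the file, as it most likely points at
    the real culprit."""
    errors = []
    for grammar in grammars:
        try:
            return parse_with(grammar, source)
        except SyntaxError as exc:
            errors.append(exc)
    raise max(errors, key=lambda exc: (exc.lineno or 0, exc.offset or 0))
```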


-- OK, OK. So which to choose?

And now, while this sounds like more work and is harder to get right, I still think the combined grammar with minimal incompatibilities is the better approach.  Why?  Two reasons.

1. Nobody ever knows what Python version *exactly* a given file is.  Most files aren't even considering compatibility that fine-grained.  And having to attempt to parse not three but potentially 8 grammars (3.7 - 3.2, 2.7, 2.6) would be prohibitively slow.

2. My tool maybe wants to actually *modify* the compatibility level by, say, rewriting ''.format() with f-strings or putting trailing commas where old Pythons didn't accept them.  So it would be awkward if the grammar I used to read the file wasn't compatible with my later changes.

Unless I'm swayed otherwise, I'd continue on what lib2to3 did, with the exception that we need to add a grammar variant without the `exec` statement, and the driver needs to attempt parsing with the three grammars on its own, with proper syntax error reporting.
msg315649 - (view) Author: Łukasz Langa (lukasz.langa) * (Python committer) Date: 2018-04-23 08:01
> These modifications are applied only before bytecode generation. The AST presented to the user is not modified.

This bit me when implementing PEP 563 but I was then on the compile path, right.  Still, the latest docstring folding would qualify as an example here, too, no?


> Is this a problem? 2.7 is a dead end, its support will be ended in less than 2 years. Even 3.6 will be moved to a security only fixes stage short time after releasing 3.8.

Yes, it is a problem.  We will support Python 2 until 2020 but people will be running Python 2 code for a decade *at least*.  We need to provide those people a way to move their code forward.  Static analysis tools like formatters, linters, type checkers, or 2to3-style translators are all soon going to run on Python 3.  It would be a shame if those programs were barred from helping users that are still struggling on Python 2.

A closer example is async/await.  It would be a shame if running on Python 3.7 meant you can't write a tool that renames (or even just *detects*) invalid uses of async/await.  I firmly believe that the version of the runtime should be independent of the version it's able to analyze.


> I'm in favor of updating Lib/lib2to3/pgen2/tokenize.py, but I don't understand why Lib/tokenize.py should parse 2.7.

Hopefully I sufficiently explained that above.


> I'm in favor of reimplementing pgen in Python if this will simplify the code and the building process. Python code is simpler than C code, this code is not performance critical, and in any case we need an external Python when modify grammar of bytecode.

Well, I didn't think about abandoning pgen.  I admit that's mostly because my knee-jerk reaction was that it would be too slow.  But you're right that this is not performance critical because every `pip install` runs `compileall`.

I guess we could parse in "strict" mode for Python itself but allow for multiple grammars for standard library use (as I explained in the reply to Guido).  And this would most likely give us opportunity to iterate on grammar improvements in the future.

And yet, I'm cautious here.  Even ignoring performance, that sounds like a more ambitious task from what I'm attempting.  Unless I find partners in crime for this, I wouldn't attempt that.  And I would need thumbs up from the BDFL and performance-wary contributors.


> For what purposes the CST is needed besides 2to3?

Anywhere where you need the full view of the code which includes non-semantic pieces.  Those include:
- whitespace;
- comments;
- parentheses;
- commas;
- string prefixes.

The main use case is linters and refactoring tools.  For example mypy is using a modified AST to support type comments.  YAPF and Black are based on lib2to3 because as formatters they can't lose comments, string prefixes, and organizational parentheses either.  JEDI is using Parso, a lib2to3 fork, for similar reasons.
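A tokenizer-level round-trip already shows why such tools need a lossless view; everything the AST discards survives the token stream (stdlib only).

```python
import io
import tokenize

# Spacing, the redundant parentheses, and the comment all survive a
# round-trip through the token stream; the AST would drop them all.
source = "x = (  1 ,2)  # keep me\n"
tokens = list(tokenize.generate_tokens(io.StringIO(source).readline))
assert tokenize.untokenize(tokens) == source
```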
msg315678 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2018-04-23 20:51
+1 in general to this work. Łukasz is effectively preaching to the choir by looping me in here. :)

It is a big challenge to support Python in practice: we have no good way to parse and understand all versions of the language's syntax via a single API that does not depend on the version of the language your own tool's process is running under.

lib2to3.pgen2 is the closest thing we've got, and it is used by a notable crop of Python refactoring tools today because there really wasn't another available choice.  All they know is that they've got a ".py" file; they can't know which specific language versions it may be intended for, nor should they ever need to run _on_ that language version.  That situation is a nightmare (ex: pylint uses ast and must run on the version of the language it is to analyze).

I'd love to see a ponycorn module that everything could use to run on top of Python 3.recent yet be able to meaningfully process 2.7 and 3.4-3.7 code.  This is an area where the language versions we support parsing and analyzing should _not_ be limited to the current CPython org still supported releases.

Does this need to go in the CPython project and integrate with its internals such as pgen.c or pgen2?  I don't know.  From my perspective this could be a PyPI project.  Even if it seems odd that we have stdlib ast and lib2to3.pgen2 modules and pgen internal to CPython; at some point those could be seen as implementation details and made private in favor of tool application code using a canonical ponycorn thing on PyPI.  The important part is that it is maintained and kept up to date with future language grammar changes while maintaining "backwards grammar compatibility".
msg315680 - (view) Author: Łukasz Langa (lukasz.langa) * (Python committer) Date: 2018-04-23 21:02
> The important part is that it is maintained and kept up to date with future language grammar changes while maintaining "backwards grammar compatibility".

Yes, which is why I have trouble believing this can be effectively outsourced.  Existing third-party libraries always stalled at some point in this regard.
msg315681 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2018-04-23 21:32
But lib2to3 is proof that the stdlib is just as much subject to stalling.
Maybe lib2to3 and pgen2 would have a livelier future if they weren't
limited to updates in sync with Python releases.
msg315682 - (view) Author: Łukasz Langa (lukasz.langa) * (Python committer) Date: 2018-04-23 23:28
> But lib2to3 is proof that the stdlib is just as much subject to stalling.

The issue here is internal visibility. "lib2to3" is a library that supports "2to3" which is rather neglected internally since we started promoting `six` as a better migration strategy to Python 3.

Most core devs don't even *know* new syntax is supposed to be added to lib2to3.  Case in point: somehow Lib/tokenize.py was updated just in time for f-strings to be released but not Lib/lib2to3/pgen2/tokenize.py.

By unifying the tokenizers and moving the CST out of lib2to3's guts (and documenting it as a supported feature!), I'm pretty sure we can eliminate the danger of forgetting to update it in the future.
msg315683 - (view) Author: Nathaniel Smith (njs) * (Python committer) Date: 2018-04-24 00:44
It does seem like it'd be unfortunate to end up in a situation like "sorry, there's a bug in handling this python 2 code, so black won't be able to reformat it until the next major python release". And I assume this issue is motivated by running into limitations of the current version; waiting for 3.8 before you can fix those seems unfortunate too?

Another option to think about: make the library something that's maintained by python-dev, but released separately on PyPI.
msg315686 - (view) Author: Benjamin Peterson (benjamin.peterson) * (Python committer) Date: 2018-04-24 04:12
The stdlib is a bad place for anything that needs to evolve at a non-glacial pace. For example, even when 2to3 had not yet fallen out of favor, there were effectively 3 versions of it: one in 2.7 and two in maintained 3.x branches. That was a large pain. 2to3 also could only be updated as quickly as Python is released.
msg315704 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2018-04-24 15:09
Lukasz, please seriously consider moving to a 3rd party package. Even
pgen2.

msg315759 - (view) Author: Łukasz Langa (lukasz.langa) * (Python committer) Date: 2018-04-25 21:15
[njs]
> "there's a bug in handling this python 2 code, so black won't be able to reformat it until the next major python release"

Nah, we're still allowed to fix bugs in micro releases.  We should have more of those instead of sitting on fixed bugs for months.  That's a discussion for a different venue though.


[gutworth]
> The stdlib is a bad place for anything that needs to evolve at a non-glacial pace.

The syntax tree only needs to evolve to keep up with current Python development.  That's why I think it makes sense to tie the two.


[gvr]
> please consider seriously to move to a 3rd party package

Does that also invalidate the idea to merge the tokenizers?

And if so, does that also invalidate the idea to update lib2to3's tokenizer (BPO-33338)?
msg315760 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2018-04-25 21:37
I think merging the tokenizers still makes sense. We can then document
top-level tokenize.py (in 3.8 and later) as guaranteed to be able to
tokenize anything going back to Python 2.7. And since lib2to3/pgen2 is
undocumented, I presume removing lib2to3/pgen2/tokenize.py isn't going to
break anything -- but if we worry about that it could be made into a
trivial wrapper for top-level tokenize.py.

Still, the improvements you're planning to lib2to3 (no matter how
compatible) will benefit more people sooner if you extract it into its own
PyPI package. Not everybody can upgrade to 3.7 as soon as Instagram. :-)
msg330924 - (view) Author: Niklas Rosenstein (n_rosenstein) Date: 2018-12-03 10:17
Lukasz, have you created a 3rd party package branching off lib2to3? I'm working on a project that is based on it (in a similar use case as YAPF and Black) and was hoping that there may be some version maintained distinctly from the Python release schedule.
History
Date User Action Args
2019-01-13 07:59:02kerncsetnosy: + kernc
2018-12-03 10:17:20n_rosensteinsetnosy: + n_rosenstein
messages: + msg330924
2018-05-03 09:51:44levkivskyisetnosy: + levkivskyi
2018-04-27 16:50:02jwilksetnosy: + jwilk
2018-04-25 21:37:44gvanrossumsetmessages: + msg315760
2018-04-25 21:15:09lukasz.langasetmessages: + msg315759
2018-04-25 11:17:07Ethan Smithsetnosy: + Ethan Smith
2018-04-25 02:39:54zsolsetnosy: + zsol
2018-04-24 15:09:08gvanrossumsetmessages: + msg315704
2018-04-24 04:12:44benjamin.petersonsetmessages: + msg315686
stage: patch review ->
2018-04-24 03:54:06lukasz.langasetkeywords: + patch
stage: patch review
pull_requests: + pull_request6285
2018-04-24 00:44:57njssetnosy: + njs
messages: + msg315683
2018-04-23 23:28:30lukasz.langasetmessages: + msg315682
2018-04-23 21:32:16gvanrossumsetmessages: + msg315681
2018-04-23 21:02:27lukasz.langasetmessages: + msg315680
2018-04-23 20:51:29gregory.p.smithsetmessages: + msg315678
2018-04-23 08:01:21lukasz.langasetmessages: + msg315649
2018-04-23 07:36:12lukasz.langasetmessages: + msg315648
2018-04-23 07:01:28serhiy.storchakasetmessages: + msg315647
2018-04-23 04:48:05gvanrossumsetmessages: + msg315646
2018-04-23 01:29:14lukasz.langasetkeywords: - patch
2018-04-23 01:28:54lukasz.langasetpull_requests: - pull_request6270
2018-04-23 01:28:02lukasz.langasetmessages: + msg315642
stage: patch review -> (no value)
2018-04-23 01:09:05lukasz.langasetkeywords: + patch
stage: patch review
pull_requests: + pull_request6270
2018-04-23 01:04:07lukasz.langacreate