classification
Title: Provide a supported Concrete Syntax Tree implementation in the standard library
Type: Stage:
Components: Library (Lib) Versions: Python 3.8
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: Ethan Smith, benjamin.peterson, gregory.p.smith, gvanrossum, jwilk, kernc, levkivskyi, lukasz.langa, n_rosenstein, njs, serhiy.storchaka, zsol
Priority: normal Keywords: patch

Created on 2018-04-23 01:04 by lukasz.langa, last changed 2019-01-13 07:59 by kernc.

Pull Requests
URL Status Linked
PR 6572 open lukasz.langa, 2018-04-24 03:54
Messages (16)
msg315638 - (view) Author: Łukasz Langa (lukasz.langa) * (Python committer) Date: 2018-04-23 01:04
Python includes a set of batteries that enable parsing of Python code.  This
includes its own AST (provided in the standard library under the `ast` module),
as well as a pure Python tokenizer (provided in the standard library under
`tokenize` and `token`).  It also provides an undocumented CST under lib2to3,
which contains its own outdated and patched copies of `tokenize` and `token`.

This situation causes the following issues for users of Python:
- the built-in AST does not preserve comments or whitespace;
- the built-in AST increasingly modifies the tree before presenting it to user
  code (constant folding moved to the AST in Python 3.7);
- the built-in tokenize.py can only be used to parse Python 3.7+ code;
- the version in lib2to3 is partially customized and partially outdated,
  leaving bits of new grammar not supported; new bits of grammar very often get
  overlooked in lib2to3.
- lib2to3 is not documented.

So if users want to write tools that manipulate Python code, the standard
library doesn't provide them with great options.
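To make the first limitation concrete, here is a small check using only the stdlib `ast` and `tokenize` modules: the AST drops a comment entirely, while the tokenizer at least surfaces it as an explicit COMMENT token.

```python
import ast
import io
import tokenize

source = "x = 1  # answer\n"

# The AST has no node for the comment; it is simply gone.
tree = ast.parse(source)
assert "answer" not in ast.dump(tree)

# The tokenizer, by contrast, emits an explicit COMMENT token.
tokens = tokenize.generate_tokens(io.StringIO(source).readline)
comments = [tok.string for tok in tokens if tok.type == tokenize.COMMENT]
assert comments == ["# answer"]
```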

I suggest the following plan:

1. Bring Lib/lib2to3/pgen2/tokenize.py to the same state as Lib/tokenize.py
   (leaving the bits that allow for parsing of Python 3.6 and older files).

2. Merge the two tokenizers in Python 3.8 so that Lib/tokenize.py now
   officially supports tokenizing Python 2.7 - 3.7 code.

3. Update Lib/lib2to3/pgen2 and move it under Lib/pgen.  Document it as the
   built-in CST provided by Python for use in applications which require code
   modification.  Make it still officially support parsing of Python 2.7 - 3.7
   code.

All three changes are made in a backwards-compatible fashion, existing code
should NOT break.  That being said, the parser under Lib/pgen might grow some
new behavior compared to the compatibility mode for lib2to3, I specifically
seek to improve handling of comments and error recovery.
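One taste of the gap steps 1 and 2 would close: today's Lib/tokenize.py no longer understands Python 2's octal literal spelling, so `0777` falls apart into two NUMBER tokens (a sketch of observed behavior, not an exhaustive list of incompatibilities).

```python
import io
import tokenize

# Python 2 spelled octal literals as 0777; Python 3's tokenizer only
# knows 0o777, so the old form is split into two NUMBER tokens.
tokens = tokenize.generate_tokens(io.StringIO("0777\n").readline)
numbers = [tok.string for tok in tokens if tok.type == tokenize.NUMBER]
assert numbers == ["0", "777"]
```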
msg315642 - (view) Author: Łukasz Langa (lukasz.langa) * (Python committer) Date: 2018-04-23 01:28
See BPO-33338 for an implementation of Step 1.
msg315646 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2018-04-23 04:48
I'm glad you've rediscovered pgen2!

I'm in favor of unifying the tokenizers and of updating and moving pgen2 (though I don't have time to do the work).

I'm not sure if it's technically possible to give tokenize.py the ability to tokenize Python 2.7 and up without some version-selection flag -- have you researched this part yet?

Also I think you may have to make a distinction between the parser generator and its data structures, and between the generated parser for Python vs. the parser for other LL(1) grammars one might feed into it.

And I don't think you're proposing to replace Parser/pgen.c with Lib/pgen/, right? Nor to replace the CST actually used by CPython's parser with the data structures used by pgen2's driver. So the relationship between the CST you propose to document and CPython internals wouldn't be quite the same as that between the AST used by CPython and the ast module (since those *do* actually use the same code).
msg315647 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2018-04-23 07:01
> - the built-in AST increasingly modifies the tree before presenting it to user
>   code (constant folding moved to the AST in Python 3.7);

These modifications are applied only before bytecode generation. The AST presented to the user is not modified.

> - the built-in tokenize.py can only be used to parse Python 3.7+ code;

Is this a problem? 2.7 is a dead end, its support ends in less than 2 years. Even 3.6 will be moved to a security-fixes-only stage a short time after 3.8 is released.

I'm in favor of updating Lib/lib2to3/pgen2/tokenize.py, but I don't understand why Lib/tokenize.py should parse 2.7.

I'm in favor of reimplementing pgen in Python if this will simplify the code and the building process. Python code is simpler than C code, this code is not performance critical, and in any case we need an external Python interpreter when modifying the grammar or bytecode.

See also issue30455 where I try to get rid of duplications by generating all tokens-related data and code from a single source (token.py or external text file).

For what purposes is the CST needed besides 2to3? I know only that it could help determine the correct position in docstrings for doctests and similar tools which need to process docstrings and report errors. This is not possible with the AST due to inlined '\n', escaped newlines, and string literal concatenation. Changes in 3.7 made this even worse (see issue32911).
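A minimal stdlib-only illustration of that docstring problem: implicit string concatenation and an escaped newline put a two-line docstring on one physical source line, so AST positions cannot map docstring lines back to file lines.

```python
import ast

# One physical line produces a two-line docstring: the escaped newline
# and implicit string concatenation hide the real file positions.
src = 'def f():\n    "line1\\n" "line2"\n'
module = ast.parse(src)
doc = ast.get_docstring(module.body[0])
assert doc == "line1\nline2"

# Both docstring lines come from physical line 2 of the source, so a
# tool reporting an error on docstring line 2 has no file position.
assert module.body[0].body[0].lineno == 2
```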
msg315648 - (view) Author: Łukasz Langa (lukasz.langa) * (Python committer) Date: 2018-04-23 07:36
> I'm in favor of unifying the tokenizers and of updating and moving pgen2 (though I don't have time to do the work).

I'm willing to do all the work as long as I have somebody to review it. Case in point: BPO-33338.



> Also I think you may have to make a distinction between the parser generator and its data structures, and between the generated parser for Python vs. the parser for other LL(1) grammars one might feed into it.

Technically pgen2 has the ability to parse any LL(1) grammar but so far the plumbing is tightly tied to the tokenizer.  We'd need to enable plugging that in, too.



> And I don't think you're proposing to replace Parser/pgen.c with Lib/pgen/, right?

No, I'm not.



> Nor to replace the CST actually used by CPython's parser with the data structures used by pgen2's driver.

No, I'm not.



> So the relationship between the CST you propose to document and CPython internals wouldn't be quite the same as that between the AST used by CPython and the ast module (since those *do* actually use the same code).

Right.  Once we unify the standard library tokenizers (note: *not* tokenizer.c which will stay), there wouldn't be much extra documentation to write for Lib/tokenize.py.  For Lib/pgen/ itself, we'd need to provide both an API rundown and an intro to the high-level functionality (how to create trees from files, strings, etc.; how to visit trees and edit them; and so on).


> I'm not sure if it's technically possible to give tokenize.py the ability to tokenize Python 2.7 and up without some version-selection flag -- have you researched this part yet?

There are two schools of thought here. This is going to take a while to explain :)

One school is to force the caller to declare what Python version they want to parse.  This is algorithmically cleaner because we can then literally take Grammar/Grammar from various versions of Python and have the user worry about picking the right one.

The other school is what lib2to3 does currently, which is to try to implement as much of a superset of Python versions as possible.  This is way easier to use because the grammar is very forgiving.  However, this has limitations.  There are three major incompatibilities that we need to deal with, in rising order of severity:
- async/await;
- print statements;
- exec statements.

Async and await became proper keywords in 3.7 and thus broke usage of those words as names.  It's relatively easy to work around this one seamlessly by keeping the grammar trickery we've had in place for 3.5 and 3.6.  This is what lib2to3 already does today 👍🏻

The print statement is fundamentally incompatible with the print function.  lib2to3 has two grammar variants and most users by default choose the one without the print statement.  Why?  Because it cannot be reliably sniffed anymore.  Python 3-only code will not use the __future__ import.  In fact, 2to3 doesn't do auto-detection either; it relies on the user running `2to3 -p` to indicate they mean the grammar with the print function.

The exec statement is even worse because there isn't even a __future__ import.  It's annoying because it creates a third combination. 👎🏻
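All three incompatibilities above are easy to observe from Python 3.7+ itself: none of the legacy spellings survives the parser.

```python
# None of the legacy spellings survive the Python 3.7+ parser: the
# print and exec statements are gone, and "async" is a hard keyword.
rejected = []
for legacy in ("print 'hello'", "exec 'pass'", "async = 1"):
    try:
        compile(legacy, "<legacy>", "exec")
    except SyntaxError:
        rejected.append(legacy)
assert len(rejected) == 3
```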

So now the driver has to attempt three grammars (in this order):
- the almost compatible combined Python 2 + Python 3 one (that assumes exec is a function and print is a function);
- the one that assumes exec is a *statement* but print is still a function (because __future__ import);
- the one that exposes the legacy exec and print statements.

This approach has one annoying wart.  Imagine you have a file like this:

  print('msg', file=sys.stderr)
  if

Now the driver will attempt all three grammars and fail, and will report that the parse error is on the print line.  This can be overcome by comparing syntax errors from each grammar and showing the one on the furthest line (which is the most likely to be the real culprit).  But it's still annoying and will sometimes not do what the user wanted.
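That fallback loop can be sketched as follows; `parse_any`, `parse_with`, and the grammar objects are hypothetical placeholders, not an existing lib2to3 API.

```python
def parse_any(source, grammars, parse_with):
    """Try each grammar in turn; on total failure, re-raise the syntax
    error that got furthest into the file, as it most likely points at
    the real culprit."""
    errors = []
    for grammar in grammars:
        try:
            return parse_with(grammar, source)
        except SyntaxError as exc:
            errors.append(exc)
    raise max(errors, key=lambda exc: (exc.lineno or 0, exc.offset or 0))
```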


-- OK, OK. So which to choose?

And now, while this sounds like more work and is harder to get right, I still think the combined grammar with minimal incompatibilities is the better approach.  Why?  Two reasons.

1. Nobody ever knows what Python version *exactly* a given file is.  Most files aren't even considering compatibility that fine-grained.  And having to attempt to parse not three but potentially 8 grammars (3.7 - 3.2, 2.7, 2.6) would be prohibitively slow.

2. My tool maybe wants to actually *modify* the compatibility level by, say, rewriting ''.format() with f-strings or putting trailing commas where old Pythons didn't accept them.  So it would be awkward if the grammar I used to read the file wasn't compatible with my later changes.

Unless I'm swayed otherwise, I'd continue on what lib2to3 did, with the exception that we need to add a grammar variant without the `exec` statement, and the driver needs to attempt parsing with the three grammars on its own, with proper syntax error reporting.
msg315649 - (view) Author: Łukasz Langa (lukasz.langa) * (Python committer) Date: 2018-04-23 08:01
> These modifications are applied only before bytecode generation. The AST presented to the user is not modified.

This bit me when implementing PEP 563 but I was then on the compile path, right.  Still, the latest docstring folding would qualify as an example here, too, no?


> Is this a problem? 2.7 is a dead end, its support will be ended in less than 2 years. Even 3.6 will be moved to a security only fixes stage short time after releasing 3.8.

Yes, it is a problem.  We will support Python 2 until 2020 but people will be running Python 2 code for a decade *at least*.  We need to provide those people a way to move their code forward.  Static analysis tools like formatters, linters, type checkers, or 2to3-style translators are all soon going to run on Python 3.  It would be a shame if those programs were barred from helping users that are still struggling on Python 2.

A closer example is async/await.  It would be a shame if running on Python 3.7 meant you can't write a tool that renames (or even just *detects*) invalid uses of async/await.  I firmly believe that the version of the runtime should be independent of the version it's able to analyze.


> I'm in favor of updating Lib/lib2to3/pgen2/tokenize.py, but I don't understand why Lib/tokenize.py should parse 2.7.

Hopefully I sufficiently explained that above.


> I'm in favor of reimplementing pgen in Python if this will simplify the code and the building process. Python code is simpler than C code, this code is not performance critical, and in any case we need an external Python when modify grammar of bytecode.

Well, I didn't think about abandoning pgen.  I admit that's mostly because my knee-jerk reaction was that it would be too slow.  But you're right that this is not performance critical because every `pip install` runs `compileall`.

I guess we could parse in "strict" mode for Python itself but allow for multiple grammars for standard library use (as I explained in the reply to Guido).  And this would most likely give us opportunity to iterate on grammar improvements in the future.

And yet, I'm cautious here.  Even ignoring performance, that sounds like a more ambitious task from what I'm attempting.  Unless I find partners in crime for this, I wouldn't attempt that.  And I would need thumbs up from the BDFL and performance-wary contributors.


> For what purposes the CST is needed besides 2to3?

Anywhere where you need the full view of the code which includes non-semantic pieces.  Those include:
- whitespace;
- comments;
- parentheses;
- commas;
- string prefixes.

The main use case is linters and refactoring tools.  For example mypy is using a modified AST to support type comments.  YAPF and Black are based on lib2to3 because as formatters they can't lose comments, string prefixes, and organizational parentheses either.  JEDI is using Parso, a lib2to3 fork, for similar reasons.
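A tokenizer-level round-trip already shows why such tools need a lossless view; everything the AST discards survives the token stream (stdlib only).

```python
import io
import tokenize

# Spacing, the redundant parentheses, and the comment all survive a
# round-trip through the token stream; the AST would drop them all.
source = "x = (  1 ,2)  # keep me\n"
tokens = list(tokenize.generate_tokens(io.StringIO(source).readline))
assert tokenize.untokenize(tokens) == source
```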
msg315678 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2018-04-23 20:51
+1 in general to this work. Łukasz is effectively preaching to the choir by looping me in here. :)

It is a big challenge to support Python in practice: we have no good way to parse and understand all versions of the language's syntax via a single API that does not depend on the version of the language your own tool's process is running under.

lib2to3.pgen2 is the closest thing we've got, and it is used by a notable crop of Python refactoring tools today because there really wasn't another available choice.  All they know is that they've got a ".py" file; they can't know which specific language versions it may be intended for, nor should they ever need to run _on_ that language version.  That situation is a nightmare (ex: pylint uses ast and must run on the version of the language it is to analyze).

I'd love to see a ponycorn module that everything could use to run on top of Python 3.recent yet be able to meaningfully process 2.7 and 3.4-3.7 code.  This is an area where the language versions we support parsing and analyzing should _not_ be limited to the current CPython org still supported releases.

Does this need to go in the CPython project and integrate with its internals such as pgen.c or pgen2?  I don't know.  From my perspective this could be a PyPI project.  Even if it seems odd that we have stdlib ast and lib2to3.pgen2 modules and pgen internal to CPython; at some point those could be seen as implementation details and made private in favor of tool application code using a canonical ponycorn thing on PyPI.  The important part is that it is maintained and kept up to date with future language grammar changes while maintaining "backwards grammar compatibility".
msg315680 - (view) Author: Łukasz Langa (lukasz.langa) * (Python committer) Date: 2018-04-23 21:02
> The important part is that it is maintained and kept up to date with future language grammar changes while maintaining "backwards grammar compatibility".

Yes, which is why I have trouble believing this can be effectively outsourced.  Existing third-party libraries always stalled at some point in this regard.
msg315681 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2018-04-23 21:32
But lib2to3 is proof that the stdlib is just as much subject to stalling.
Maybe lib2to3 and pgen2 would have a livelier future if they weren't
limited to updates in sync with Python releases.
msg315682 - (view) Author: Łukasz Langa (lukasz.langa) * (Python committer) Date: 2018-04-23 23:28
> But lib2to3 is proof that the stdlib is just as much subject to stalling.

The issue here is internal visibility. "lib2to3" is a library that supports "2to3" which is rather neglected internally since we started promoting `six` as a better migration strategy to Python 3.

Most core devs don't even *know* new syntax is supposed to be added to lib2to3.  Case in point: somehow Lib/tokenize.py was updated just in time for f-strings to be released but not Lib/lib2to3/pgen2/tokenize.py.

By unifying the tokenizers and moving the CST out of lib2to3's guts (and documenting it as a supported feature!), I'm pretty sure we can eliminate the danger of forgetting to update it in the future.
msg315683 - (view) Author: Nathaniel Smith (njs) * (Python committer) Date: 2018-04-24 00:44
It does seem like it'd be unfortunate to end up in a situation like "sorry, there's a bug in handling this python 2 code, so black won't be able to reformat it until the next major python release". And I assume this issue is motivated by running into limitations of the current version; waiting for 3.8 before you can fix those seems unfortunate too?

Another option to think about: make the library something that's maintained by python-dev, but released separately on PyPI.
msg315686 - (view) Author: Benjamin Peterson (benjamin.peterson) * (Python committer) Date: 2018-04-24 04:12
The stdlib is a bad place for anything that needs to evolve at a non-glacial pace. For example, even when 2to3 had not yet fallen out of favor, there were effectively 3 versions of it: one in 2.7 and two in maintained 3.x branches. That was a large pain. 2to3 also could only be updated as quickly as Python is released.
msg315704 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2018-04-24 15:09
Lukasz, please seriously consider moving to a 3rd party package. Even
pgen2.

msg315759 - (view) Author: Łukasz Langa (lukasz.langa) * (Python committer) Date: 2018-04-25 21:15
[njs]
> "there's a bug in handling this python 2 code, so black won't be able to reformat it until the next major python release"

Nah, we're still allowed to fix bugs in micro releases.  We should have more of those instead of sitting on fixed bugs for months.  That's a discussion for a different venue though.


[gutworth]
> The stdlib is a bad place for anything that needs to evolve at a non-glacial pace.

The syntax tree only needs to evolve to keep up with current Python development.  That's why I think it makes sense to tie the two.


[gvr]
> please consider seriously to move to a 3rd party package

Does that also invalidate the idea to merge the tokenizers?

And if so, does that also invalidate the idea to update lib2to3's tokenizer (BPO-33338)?
msg315760 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2018-04-25 21:37
I think merging the tokenizers still makes sense. We can then document
top-level tokenize.py (in 3.8 and later) as guaranteed to be able to
tokenize anything going back to Python 2.7. And since lib2to3/pgen2 is
undocumented, I presume removing lib2to3/pgen2/tokenize.py isn't going to
break anything -- but if we worry about that it could be made into a
trivial wrapper for top-level tokenize.py.

Still, the improvements you're planning to lib2to3 (no matter how
compatible) will benefit more people sooner if you extract it into its own
PyPI package. Not everybody can upgrade to 3.7 as soon as Instagram. :-)
msg330924 - (view) Author: Niklas Rosenstein (n_rosenstein) Date: 2018-12-03 10:17
Lukasz, have you created a 3rd party package branching off lib2to3? I'm working on a project that is based on it (in a similar use case as YAPF and Black) and was hoping that there may be some version maintained distinctly from the Python release schedule.
History
Date User Action Args
2019-01-13 07:59:02kerncsetnosy: + kernc
2018-12-03 10:17:20n_rosensteinsetnosy: + n_rosenstein
messages: + msg330924
2018-05-03 09:51:44levkivskyisetnosy: + levkivskyi
2018-04-27 16:50:02jwilksetnosy: + jwilk
2018-04-25 21:37:44gvanrossumsetmessages: + msg315760
2018-04-25 21:15:09lukasz.langasetmessages: + msg315759
2018-04-25 11:17:07Ethan Smithsetnosy: + Ethan Smith
2018-04-25 02:39:54zsolsetnosy: + zsol
2018-04-24 15:09:08gvanrossumsetmessages: + msg315704
2018-04-24 04:12:44benjamin.petersonsetmessages: + msg315686
stage: patch review ->
2018-04-24 03:54:06lukasz.langasetkeywords: + patch
stage: patch review
pull_requests: + pull_request6285
2018-04-24 00:44:57njssetnosy: + njs
messages: + msg315683
2018-04-23 23:28:30lukasz.langasetmessages: + msg315682
2018-04-23 21:32:16gvanrossumsetmessages: + msg315681
2018-04-23 21:02:27lukasz.langasetmessages: + msg315680
2018-04-23 20:51:29gregory.p.smithsetmessages: + msg315678
2018-04-23 08:01:21lukasz.langasetmessages: + msg315649
2018-04-23 07:36:12lukasz.langasetmessages: + msg315648
2018-04-23 07:01:28serhiy.storchakasetmessages: + msg315647
2018-04-23 04:48:05gvanrossumsetmessages: + msg315646
2018-04-23 01:29:14lukasz.langasetkeywords: - patch
2018-04-23 01:28:54lukasz.langasetpull_requests: - pull_request6270
2018-04-23 01:28:02lukasz.langasetmessages: + msg315642
stage: patch review -> (no value)
2018-04-23 01:09:05lukasz.langasetkeywords: + patch
stage: patch review
pull_requests: + pull_request6270
2018-04-23 01:04:07lukasz.langacreate