

Author lukasz.langa
Recipients benjamin.peterson, gregory.p.smith, gvanrossum, lukasz.langa, serhiy.storchaka
Date 2018-04-23.07:36:11
Message-id <1524468972.68.0.682650639539.issue33337@psf.upfronthosting.co.za>
In-reply-to
Content
> I'm in favor of unifying the tokenizers and of updating and moving pgen2 (though I don't have time to do the work).

I'm willing to do all the work as long as I have somebody to review it. Case in point: BPO-33338.



> Also I think you may have to make a distinction between the parser generator and its data structures, and between the generated parser for Python vs. the parser for other LL(1) grammars one might feed into it.

Technically pgen2 has the ability to parse any LL(1) grammar but so far the plumbing is tightly tied to the tokenizer.  We'd need to enable plugging that in, too.
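As a sketch of what "plugging that in" might look like (the `TokenizerPlugin` protocol here is hypothetical, not an existing pgen2 API), the driver could accept any object that yields standard `TokenInfo` tuples instead of hard-wiring the stdlib tokenizer:

```python
import io
import tokenize
from typing import Iterator, Protocol


class TokenizerPlugin(Protocol):
    """Hypothetical interface a grammar driver could accept instead of
    being tightly tied to one tokenizer implementation."""

    def tokens(self, source: str) -> Iterator[tokenize.TokenInfo]: ...


class StdTokenizer:
    """Default implementation backed by Lib/tokenize.py."""

    def tokens(self, source: str) -> Iterator[tokenize.TokenInfo]:
        return tokenize.generate_tokens(io.StringIO(source).readline)


# Any object satisfying the protocol could be swapped in for other
# LL(1) grammars; here we just exercise the stdlib-backed default.
kinds = [tokenize.tok_name[t.type] for t in StdTokenizer().tokens("x = 1\n")]
```

A non-Python grammar would then only need its own `tokens()` implementation; the parsing machinery stays unchanged.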



> And I don't think you're proposing to replace Parser/pgen.c with Lib/pgen/, right?

No, I'm not.



> Nor to replace the CST actually used by CPython's parser with the data structures used by pgen2's driver.

No, I'm not.



> So the relationship between the CST you propose to document and CPython internals wouldn't be quite the same as that between the AST used by CPython and the ast module (since those *do* actually use the same code).

Right.  Once we unify the standard library tokenizers (note: *not* tokenizer.c, which will stay), there wouldn't be much extra documentation to write for Lib/tokenize.py.  For Lib/pgen/ itself, we'd need to provide both an API rundown and an intro to the high-level functionality (how to create trees from files, strings, etc.; how to visit and edit trees; and so on).


> I'm not sure if it's technically possible to give tokenize.py the ability to tokenize Python 2.7 and up without some version-selection flag -- have you researched this part yet?

There are two schools of thought. This is going to take a while to explain :)

One school is to force the caller to declare what Python version they want to parse.  This is algorithmically cleaner because we can then literally take Grammar/Grammar from various versions of Python and have the user worry about picking the right one.

The other school is what lib2to3 does currently, which is to implement as much of a superset of the Python versions as possible.  This is way easier to use because the grammar is very forgiving.  However, it has limitations.  There are three major incompatibilities we need to deal with, in rising order of severity:
- async/await;
- print statements;
- exec statements.

Async and await became proper keywords in 3.7 and thus broke usage of those as names.  It's relatively easy to work around this one seamlessly by keeping the grammar trickery we've had in place for 3.5 and 3.6.  This is what lib2to3 does today already. 👍🏻
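The keyword change is easy to see with the stdlib `ast` module on a current (3.7+) interpreter:

```python
import ast


def parses(source: str) -> bool:
    """Return True if the running interpreter's grammar accepts source."""
    try:
        ast.parse(source)
        return True
    except SyntaxError:
        return False


# Legal through 3.6, rejected once async/await became hard keywords in 3.7:
ok_as_name = parses("async = 1")
# The keyword usage itself is of course fine:
ok_as_keyword = parses("async def f(): await g()")
```

A combined grammar has to accept both spellings, which is exactly the trickery lib2to3 keeps in place.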

The print statement is fundamentally incompatible with the print function.  lib2to3 has two grammar variants, and most users by default choose the one without the print statement.  Why?  Because it cannot be reliably sniffed anymore: Python 3-only code will not use the __future__ import.  In fact, 2to3 doesn't do auto-detection either; it relies on the user running `2to3 -p` to indicate they mean the grammar with the print function.
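Sniffing is worse than merely unreliable: the classic Python 2 spelling can even be *valid* under the Python 3 grammar with a completely different meaning, so a successful parse proves nothing.  The stdlib `ast` module demonstrates this:

```python
import ast

# The Python 2 print statement `print >> sys.stderr, 'msg'` also parses
# under the Python 3 grammar -- as a tuple whose first element is a
# right-shift expression on the *name* `print` -- so seeing it parse
# tells you nothing about which grammar the author meant.
tree = ast.parse("print >> sys.stderr, 'msg'")
expr = tree.body[0].value
```

Since the two readings are both syntactically fine, only the user can say which one was intended, hence `2to3 -p`.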

The exec statement is even worse because there isn't even a __future__ import.  It's annoying because it creates a third combination. 👎🏻

So now the driver has to attempt three grammars (in this order):
- the almost-compatible combined Python 2 + Python 3 one (which assumes exec is a function and print is a function);
- the one that assumes exec is a *statement* but print is still a function (because of the __future__ import);
- the one that exposes the legacy exec and print statements.

This approach has one annoying wart.  Imagine you have a file like this:

  print('msg', file=sys.stderr)
  if

Now the driver will attempt all three grammars and fail, and will report that the parse error is on the print line.  This can be overcome by comparing syntax errors from each grammar and showing the one on the furthest line (which is the most likely to be the real culprit).  But it's still annoying and will sometimes not do what the user wanted.
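The furthest-error heuristic could be sketched like this (`parse_with_fallback` and the dummy grammar callables are illustrative, not lib2to3's actual driver API):

```python
def parse_with_fallback(source, parsers):
    """Try each grammar's parser in order; if all fail, re-raise the
    syntax error that occurred furthest into the file, which is the
    most likely real culprit."""
    errors = []
    for parse in parsers:
        try:
            return parse(source)
        except SyntaxError as err:
            errors.append(err)
    raise max(errors, key=lambda e: e.lineno or 0)


# Dummy stand-ins simulating two of the grammar variants:
def combined_grammar(source):
    # Chokes on line 1 in this simulation (rejects the print statement).
    raise SyntaxError("invalid syntax", ("<str>", 1, 1, "print 'msg'"))


def legacy_grammar(source):
    # Gets past line 1 and fails on the genuinely broken line 2.
    raise SyntaxError("invalid syntax", ("<str>", 2, 3, "if"))


try:
    parse_with_fallback("print 'msg'\nif\n", [combined_grammar, legacy_grammar])
except SyntaxError as reported:
    furthest = reported.lineno
```

Here the error surfaced is the one on line 2 rather than the first grammar's complaint about line 1, which matches what the user most likely wants, though, as noted, not always.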


-- OK, OK. So which to choose?

While this sounds like more work and is harder to get right, I still think the combined grammar with minimal incompatibilities is the better approach.  Why?  Two reasons.

1. Nobody ever knows *exactly* what Python version a given file targets.  Most files don't even consider compatibility at that granularity.  And having to attempt to parse not three but potentially 8 grammars (3.7 - 3.2, 2.7, 2.6) would be prohibitively slow.

2. My tool may want to actually *modify* the compatibility level by, say, rewriting ''.format() calls with f-strings or putting trailing commas where old Pythons didn't accept them.  So it would be awkward if the grammar I used to read the file weren't compatible with my later changes.

Unless I'm swayed otherwise, I'd continue with what lib2to3 does, with the exception that we need to add a grammar variant without the `exec` statement, and the driver needs to attempt parsing with the three grammars on its own, with proper syntax error reporting.