Issue 40678: Full list of Python lexical rules

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/84855

classification

Title:	Full list of Python lexical rules
Type:	enhancement	Stage:	resolved
Components:	Documentation	Versions:	Python 3.10

process

Status:	closed	Resolution:
Dependencies:		Superseder:
Assigned To:	docs@python	Nosy List:	cool-RR, docs@python, georg.brandl, gvanrossum, terry.reedy
Priority:	normal	Keywords:

Created on 2020-05-19 06:05 by cool-RR, last changed 2022-04-11 14:59 by admin. This issue is now closed.

Messages (5)
msg369320	Author: Ram Rachum (cool-RR) *	Date: 2020-05-19 06:05
I'm a noob on parsing, learning about it, so it's possible I've made a mistake somewhere. I know there's this page: https://docs.python.org/3/reference/grammar.html, which is a full listing of Python's grammar. However, looking at this page: https://docs.python.org/3/reference/lexical_analysis.html I see rules that aren't listed in grammar.html, like longstringitem. I'm guessing that's because these are lexing rules, while grammar.html is a list of parsing rules? If that's the case, shouldn't there also be a full, authoritative list of Python's lexical rules? Possibly alongside the parsing rules?
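The lexing/parsing split asked about here can be observed with the standard library's tokenize module. The following is only a minimal sketch: the helper name token_names is hypothetical, and the output reflects the pure-Python tokenize module, not necessarily every detail of the C tokenizer. It shows that a triple-quoted string reaches the parser as a single STRING token, so a rule like longstringitem never needs to appear in the parser's grammar.

```python
import io
import tokenize

def token_names(source):
    """Return (token type name, token text) pairs for a source string.

    Hypothetical helper for illustration; it simply wraps
    tokenize.generate_tokens from the standard library.
    """
    readline = io.StringIO(source).readline
    return [(tokenize.tok_name[tok.type], tok.string)
            for tok in tokenize.generate_tokens(readline)]

# A one-statement program whose right-hand side is a triple-quoted string.
pairs = token_names('s = """a\nb"""\n')
for name, text in pairs:
    print(f"{name:10} {text!r}")
```

The whole string literal, including its embedded newline, appears as one STRING entry; the parser only ever sees that token name.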
msg369696	Author: Terry J. Reedy (terry.reedy) *	Date: 2020-05-23 07:19
First note that the 3.8.3 grammar.html is stated to be the actual grammar used by the old parser, and is a bit different from the more human-readable grammar given in the reference manual. It is a bit different in 3.9, and I expect it will be much more different in 3.10 with the new PEG parser.

In the grammar, the CAPITALIZED_NAMES are token names returned by the tokenizer/lexer. This is a standard convention. I am pretty sure that the human-readable lexing rules in lexical_analysis are not what the lexer uses. I presume the latter uses barely readable REs, as does the tokenize module. Compare the float grammar in https://docs.python.org/3/reference/lexical_analysis.html#floating-point-literals to the float REs in tokenize.py.

def group(*choices): return '(' + '|'.join(choices) + ')'
def maybe(*choices): return group(*choices) + '?'
# The above are reused for multiple REs.
Exponent = r'[eE][-+]?[0-9](?:_?[0-9])*'
Pointfloat = group(r'[0-9](?:_?[0-9])*\.(?:[0-9](?:_?[0-9])*)?',
                   r'\.[0-9](?:_?[0-9])*') + maybe(Exponent)
Expfloat = r'[0-9](?:_?[0-9])*' + Exponent
Floatnumber = group(Pointfloat, Expfloat)

Note that this is (Python) code, not a text specification. You or someone else can look at what the C lexer does. But I think that the proposal should be rejected.
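To make the comparison concrete, here is a small sketch that compiles the float REs and checks them against a few literals. The patterns are restated in full, with the varargs signatures and `*` repetitions as I understand them from Lib/tokenize.py around Python 3.8; check them against your own version. FLOAT_RE and is_float_literal are hypothetical names added for illustration and are not part of tokenize.

```python
import re

def group(*choices):
    # Join alternatives into one parenthesised alternation.
    return '(' + '|'.join(choices) + ')'

def maybe(*choices):
    # Make the alternation optional.
    return group(*choices) + '?'

# Digits may be separated by single underscores (PEP 515 style).
Exponent = r'[eE][-+]?[0-9](?:_?[0-9])*'
Pointfloat = group(r'[0-9](?:_?[0-9])*\.(?:[0-9](?:_?[0-9])*)?',
                   r'\.[0-9](?:_?[0-9])*') + maybe(Exponent)
Expfloat = r'[0-9](?:_?[0-9])*' + Exponent
Floatnumber = group(Pointfloat, Expfloat)

# Hypothetical helper names, not part of tokenize.
FLOAT_RE = re.compile(Floatnumber)

def is_float_literal(text):
    """True if the whole string matches the float pattern above."""
    return FLOAT_RE.fullmatch(text) is not None

for sample in ['3.14', '10.', '.001', '1e100', '1_000.5e-3',
               '42', '1__0.5', '1._5']:
    print(f"{sample!r:14} {is_float_literal(sample)}")
```

The first five samples match and the last three do not: '42' is an integer literal, and the other two put an underscore where the pattern requires a digit. This is the same language the reference manual's floatnumber rule describes, expressed as code instead of as a specification.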
msg369705	Author: Ram Rachum (cool-RR) *	Date: 2020-05-23 08:47
Hmm, I feel this isn't right, because I still feel like there should be one place where one can see the full Python syntax specification, lexing and parsing and all. But I'm underqualified to argue because I don't understand the details. Is someone more knowledgeable interested in arguing this point?
msg369716	Author: Terry J. Reedy (terry.reedy) *	Date: 2020-05-23 14:05
What you literally seem to ask for does not exist. If you want to pursue this, I suggest posting to python-ideas and you might get support for an acceptable alternative.
msg369723	Author: Ram Rachum (cool-RR) *	Date: 2020-05-23 14:25
I understand, thank you.

History
Date	User	Action	Args
2022-04-11 14:59:31	admin	set	github: 84855
2020-05-23 14:25:19	cool-RR	set	status: open -> closed messages: + msg369723 stage: resolved
2020-05-23 14:05:17	terry.reedy	set	messages: + msg369716
2020-05-23 08:47:30	cool-RR	set	messages: + msg369705
2020-05-23 07:27:48	terry.reedy	set	versions: + Python 3.10, - Python 3.6, Python 3.7, Python 3.8, Python 3.9
2020-05-23 07:19:59	terry.reedy	set	nosy: + terry.reedy messages: + msg369696
2020-05-19 06:05:54	cool-RR	create