Message 412045 - Python tracker

Author	terry.reedy
Recipients	BTaskaya, JelleZijlstra, Kodiologist, benjamin.peterson, pablogsal, sobolevn, terry.reedy
Date	2022-01-29.00:44:17
Message-id	<1643417058.52.0.798071834626.issue46520@roundup.psfhosted.org>

Content

'Reserved words' include all double underscore words, like __reserved__. Using such is allowed, but we reserve the right to break such code by adding a use for the word. 'def' is a keyword. Using identifier normalization to smuggle keywords into compiled code is a clever hack. But I am not sure that there is an actionable bug anywhere.

The Unicode normalization rules are not defined by us. Changing how we use them or creating a custom normalization form is not to be done lightly.

Should ast.parse raise? The effect is the same as "globals()['𝕕𝕖𝕗']=1" (which is the same as passing 'def' or anything else that normalizes to it) and that in turn allows ">>> 𝕕𝕖𝕗", which returns 1. Should such identifiers be outlawed?
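A minimal sketch of the effect described above, assuming current CPython behavior. The names `FANCY_DEF` and `namespace` are made up for illustration. Note that a string used as a dictionary key is stored as written, without normalization, so the sketch uses the ASCII key 'def' on the dictionary side.

```python
import unicodedata

# U+1D555, U+1D556, U+1D557: the double-struck letters d, e, f
FANCY_DEF = "\U0001D555\U0001D556\U0001D557"

# NFKC maps the double-struck letters to plain ASCII.
normalized = unicodedata.normalize("NFKC", FANCY_DEF)

# Assigning through the fancy spelling stores the normalized name.
namespace = {}
exec(f"{FANCY_DEF} = 1", namespace)
stored_under_ascii_key = "def" in namespace
stored_under_fancy_key = FANCY_DEF in namespace

# Going the other way: set the ASCII key directly, then read it back
# through the fancy identifier.
other = {"def": 2}
read_back = eval(FANCY_DEF, other)
```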

https://docs.python.org/3/reference/lexical_analysis.html#identifiers says "All identifiers are converted into the normal form NFKC while parsing; comparison of identifiers is based on NFKC." This does not say whether an identifier is compared to the keyword set before or after normalization. Currently it is before. Changing this to after could be considered a backwards-incompatible feature change that would require a deprecation period with syntax warnings. (Do other implementations also compare before normalization?)
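The ordering can be observed from Python code. The following is a sketch under the assumption that the interpreter is CPython with the behavior discussed here; `compiles` is a hypothetical helper, not a stdlib function.

```python
import keyword
import unicodedata

FANCY_DEF = "\U0001D555\U0001D556\U0001D557"  # double-struck d, e, f


def compiles(source):
    """Return True if `source` compiles without a SyntaxError."""
    try:
        compile(source, "<sketch>", "exec")
    except SyntaxError:
        return False
    return True


# The raw spelling is not in the keyword set; the normalized one is.
raw_is_keyword = keyword.iskeyword(FANCY_DEF)
normalized_is_keyword = keyword.iskeyword(
    unicodedata.normalize("NFKC", FANCY_DEF)
)

# The keyword check sees the raw spelling, so only the ASCII form is rejected.
ascii_ok = compiles("def = 1")
fancy_ok = compiles(f"{FANCY_DEF} = 1")
```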

Batuhan already quoted https://docs.python.org/3/library/ast.html#ast.unparse and I mostly agree with his comments. The "would produce" part is contingent upon the result having no syntax errors, and that cannot be guaranteed. What could be done is to check every identifier against keywords and change the first character to a chosen NFKD equivalent. Although 'fixing' the ast this way would make unparse succeed in this case, there are other fixes that might also be suggested for the same reason.
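For reference, the round-trip failure under discussion can be reproduced roughly like this. It is a sketch assuming the CPython behavior at the time of this issue; `is_valid_source` is a hypothetical helper.

```python
import ast

FANCY_DEF = "\U0001D555\U0001D556\U0001D557"  # double-struck d, e, f


def is_valid_source(source):
    """Return True if `source` parses without a SyntaxError."""
    try:
        ast.parse(source)
    except SyntaxError:
        return False
    return True


tree = ast.parse(f"{FANCY_DEF} = 1")

# The Name node holds the normalized identifier, and unparse writes it
# out verbatim, so the generated source uses the keyword itself.
stored_name = tree.body[0].targets[0].id
unparsed = ast.unparse(tree)
round_trips = is_valid_source(unparsed)
```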

Until this is done in CPython, anyone who cares could write an AST visitor to make the same change before calling unparse. Example code could be attached to this issue.
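One possible shape for such a visitor is sketched below. It is not an official recipe: `disguise`, `KeywordEscaper`, and `safe_unparse` are hypothetical names, the choice of fullwidth letters as the replacement character is an assumption (any character that NFKC-normalizes to the original would do), and only a few identifier-bearing node types are covered.

```python
import ast
import keyword
import unicodedata

# Assumption: fullwidth forms U+FF01..U+FF5E mirror ASCII '!'..'~'.
FULLWIDTH_OFFSET = 0xFEE0


def disguise(name):
    """Return a non-keyword spelling that NFKC-normalizes back to `name`."""
    if not keyword.iskeyword(name):
        return name
    candidate = chr(ord(name[0]) + FULLWIDTH_OFFSET) + name[1:]
    # Use the candidate only if it really round-trips to the same name.
    if (candidate.isidentifier()
            and unicodedata.normalize("NFKC", candidate) == name):
        return candidate
    return name


class KeywordEscaper(ast.NodeTransformer):
    """Rewrite keyword-valued identifiers in place (not exhaustive)."""

    def visit_Name(self, node):
        node.id = disguise(node.id)
        return node

    def visit_Attribute(self, node):
        self.generic_visit(node)
        node.attr = disguise(node.attr)
        return node

    def visit_arg(self, node):
        self.generic_visit(node)
        node.arg = disguise(node.arg)
        return node

    def _visit_named(self, node):
        self.generic_visit(node)
        node.name = disguise(node.name)
        return node

    visit_FunctionDef = visit_AsyncFunctionDef = visit_ClassDef = _visit_named


def safe_unparse(tree):
    """Unparse `tree` after escaping keyword identifiers (mutates the tree)."""
    return ast.unparse(KeywordEscaper().visit(tree))
```

A call such as `safe_unparse(ast.parse("𝕕𝕖𝕗 = 1"))` should then yield source that parses again and binds the same name 'def'. Other identifier fields (import aliases, keyword arguments, exception handler names, and so on) would need the same treatment in a complete version.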
