This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

classification
Title: ast.unparse produces bad code for identifiers that become keywords
Type: behavior Stage: patch review
Components: Library (Lib) Versions: Python 3.11, Python 3.10, Python 3.9
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: BTaskaya, JelleZijlstra, Kodiologist, benjamin.peterson, pablogsal, sobolevn, terry.reedy
Priority: normal Keywords: patch

Created on 2022-01-25 15:06 by Kodiologist, last changed 2022-04-11 14:59 by admin.

Pull Requests
URL Status Linked Edit
PR 31012 open Kodiologist, 2022-01-29 18:06
Messages (7)
msg411606 - (view) Author: (Kodiologist) * Date: 2022-01-25 15:06
This works:

    𝕕𝕖𝕗 = 1

This raises SyntaxError:

    import ast
    exec(ast.unparse(ast.parse("𝕕𝕖𝕗 = 1")))

It looks like `ast.parse` creates a `Name` node with `id='def'`, which is correct per PEP 3131, but `ast.unparse` doesn't know it needs to mangle the output somehow, as "𝕕𝕖𝕗" or a similar Unicode replacement.
msg411669 - (view) Author: Nikita Sobolev (sobolevn) * (Python triager) Date: 2022-01-25 19:53
I can confirm that it happens on all versions from 3.9 to 3.11 (main).

```
Python 3.9.9 (main, Dec 21 2021, 11:35:28) 
[Clang 11.0.0 (clang-1100.0.33.16)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import ast
>>> ast.unparse(ast.parse("𝕕𝕖𝕗 = 1"))
'def = 1'
>>> exec(ast.unparse(ast.parse("𝕕𝕖𝕗 = 1")))  # SyntaxError
```

```
Python 3.11.0a4+ (heads/main-dirty:ef3ef6fa43, Jan 20 2022, 20:48:25) [Clang 11.0.0 (clang-1100.0.33.16)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import ast
>>> ast.unparse(ast.parse("𝕕𝕖𝕗 = 1"))
'def = 1'
>>> exec(ast.unparse(ast.parse("𝕕𝕖𝕗 = 1")))  # SyntaxError
```
msg411742 - (view) Author: Batuhan Taskaya (BTaskaya) * (Python committer) Date: 2022-01-26 11:56
Technically, this is a bug on the fact that it breaks the only guarantee of ast.unparse:

> Unparse an ast.AST object and generate a string with code that would produce an equivalent ast.AST object if parsed back with ast.parse().

But I am not really sure if it should be handled at all, since we don't have access to the original form of the identifier in the AST due to the parser's normalization behavior.

If we want to only create a source that would give the same AST, abusing the fact that original keywords are always basic ASCII we could embed a map of characters that convert ASCII 'a', 'b', 'c', ... to their most similar unicode versions (https://util.unicode.org/UnicodeJsps/confusables.jsp). But I feel like this is a terrible idea, with no possible gain (very limited use case) and very prone to a lot of confusions. 

I think just adding a warning to the documentation regarding this should be the definite resolution, unless @pablogsal has any other idea.
msg411780 - (view) Author: (Kodiologist) * Date: 2022-01-26 17:06
I've done very little work on CPython, but I do a lot of AST construction and call `ast.unparse` a lot in my work on Hylang, and I think this is a wart worth fixing. The real mistake was letting the user say `𝕕𝕖𝕗 = 1`, but that's been legal Python syntax for a long time, so I doubt a change to that would be welcome, especially one affecting old stable versions of Python like 3.9. Python has made its bed and now it must lie in it.

I think that with a pretty small amount of code (using code-point arithmetic instead of a dictionary with every ASCII letter), I can add Unicode "escaping" of reserved words to the part of `ast.unparse` that renders variable names. Would a patch of this kind be welcome?
msg411781 - (view) Author: (Kodiologist) * Date: 2022-01-26 17:09
And yes, while this behavior will look strange, the only code that will parse to AST nodes that require it will be code that uses exactly the same trick.
msg412045 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2022-01-29 00:44
'Reserved words' include all double underscore words, like __reserved__.  Using such is allowed, but we reserve the right to break such code by adding a use for the word.  'def' is a keyword.  Using identifier normalization to smuggle keywords into compiled code is a clever hack.  But I am not sure that there is an actionable bug anywhere.  

The Unicode normalization rules are not defined by us.  Changing how we use them or creating a custom normalization form is not to be done lightly.

Should ast.parse raise?  The effect is the same as "globals()['𝕕𝕖𝕗']=1" (which is the same as passing 'def' or anything else that normalizes to it) and that in turn allows ">>> 𝕕𝕖𝕗", which returns 1.  Should such identifiers be outlawed?

https://docs.python.org/3/reference/lexical_analysis.html#identifiers says "All identifiers are converted into the normal form NFKC while parsing; comparison of identifiers is based on NFKC."  This does not say when an identifier is compared to the keyword set, before or after normalization.  Currently is it before.  Changing this to after could be considered a backwards-incompatible feature change that would require a deprecation period with syntax warnings.  (Do other implementations also compare before normalization?)

Batuhan already quoted https://docs.python.org/3/library/ast.html#ast.unparse and I mostly agree with his comments.  The "would produce" part is contingent upon the result having no syntax errors, and that cannot be guaranteed.  What could be done is to check every identifier against keywords and change the first character to a chosen NFKD equivalent.  Although 'fixing' the ast this way would make unparse seem to work better succeed in this case, there are other fixes that might also be suggested for the same reason. 

Until this is done in CPython, anyone who cares could write an AST visitor to make the same change before calling unparse.  Example code could be attached to this issue.
msg412078 - (view) Author: (Kodiologist) * Date: 2022-01-29 15:26
(Hilariously, I couldn't post this comment on bugs.python.org due to some kind of Unicode bug ("Edit Error: 'utf8' codec can't decode bytes in position 208-210: invalid continuation byte"), so I've rendered "\U0001D555\U0001D556\U0001D557" as "DEF" in the below.)

Thanks for clarifying the terminology re: reserved words vs. keywords.

> The effect is the same as "globals()['DEF']=1" (which is the same as
> passing 'def' or anything else that normalizes to it) and that in
> turn allows ">>> DEF", which returns 1.

This doesn't quite seem to be the case, at least on Pythons 3.9 and 3.10:

    >>> globals()['DEF']=1
    >>> DEF
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    NameError: name 'def' is not defined
    >>> globals()['def']=1
    >>> DEF
    1

It looks the dictionary interface to `globals` doesn't normalize like the parser does.
History
Date User Action Args
2022-04-11 14:59:55adminsetgithub: 90678
2022-01-29 18:06:22Kodiologistsetkeywords: + patch
stage: patch review
pull_requests: + pull_request29191
2022-01-29 15:26:02Kodiologistsetmessages: + msg412078
2022-01-29 00:44:18terry.reedysetnosy: + terry.reedy

messages: + msg412045
title: `ast.unparse` produces syntactically illegal code for identifiers that look like reserved words -> ast.unparse produces bad code for identifiers that become keywords
2022-01-26 17:09:05Kodiologistsetmessages: + msg411781
2022-01-26 17:06:51Kodiologistsetmessages: + msg411780
2022-01-26 11:56:35BTaskayasetmessages: + msg411742
2022-01-25 19:53:52sobolevnsetnosy: + sobolevn

messages: + msg411669
versions: + Python 3.9, Python 3.11
2022-01-25 15:58:41JelleZijlstrasetnosy: + benjamin.peterson, JelleZijlstra, pablogsal, BTaskaya
2022-01-25 15:06:17Kodiologistcreate