Issue 46572: Unicode identifiers not necessarily unique


This issue has been migrated to GitHub: https://github.com/python/cpython/issues/90730

classification

Title:	Unicode identifiers not necessarily unique
Type:	behavior	Stage:	resolved
Components:	Parser, Unicode	Versions:	Python 3.9, Python 3.8, Python 3.7

process

Status:	closed	Resolution:	not a bug
Dependencies:		Superseder:
Assigned To:		Nosy List:	da, eryksun, ezio.melotti, lys.nikolaou, pablogsal, vstinner
Priority:	normal	Keywords:

Created on 2022-01-29 17:06 by da, last changed 2022-04-11 14:59 by admin. This issue is now closed.

Messages (4)
msg412084	Author: Diego Argueta (da) *	Date: 2022-01-29 17:06
The way Python 3 handles identifiers containing mathematical characters appears to be broken. I didn't test the entire range of U+1D400 through U+1D59F but I spot-checked them and the bug manifests itself there:

Python 3.9.7 (default, Sep 10 2021, 14:59:43)
[GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> foo = 1234567890
>>> bar = 1234567890
>>> foo is bar
False
>>> 𝖇𝖆𝖗 = 1234567890
>>> foo is 𝖇𝖆𝖗
False
>>> bar is 𝖇𝖆𝖗
True
>>> 𝖇𝖆𝖗 = 0
>>> bar
0

This differs from the behavior with other non-ASCII characters. For example, ASCII 'a' and Cyrillic 'a' are properly treated as different identifiers:

>>> а = 987654321  # Cyrillic lowercase 'a', U+0430
>>> a = 123456789  # ASCII 'a'
>>> а  # Cyrillic
987654321
>>> a  # ASCII
123456789

While a bit of a pathological case, it is a nasty surprise. It's possible this is a symptom of a larger bug in the way identifiers are resolved.

This is similar but not identical to https://bugs.python.org/issue46555

Note: I did not find this myself; I give credit to Cooper Stimson (https://github.com/6C1) for finding this bug. I merely reported it.
msg412086	Author: Pablo Galindo Salgado (pablogsal) *	Date: 2022-01-29 17:37
This seems coherent with https://www.python.org/dev/peps/pep-3131/ to me. The parser ensures all identifiers are converted into the normal form NFKC while parsing; comparison of identifiers is based on NFKC.
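
The normalization described above can be observed from Python itself. The following is a minimal sketch, assuming CPython's parser behaves as PEP 3131 describes; the variable names (fraktur_bar, parsed_name, bound_names) are illustrative and not part of the original report. It parses and executes an assignment to the Fraktur spelling of "bar" and inspects which name actually ends up in the AST and in the namespace.

```python
import ast
import unicodedata

# "bar" spelled with MATHEMATICAL BOLD FRAKTUR letters, as in the report.
fraktur_bar = (
    "\N{MATHEMATICAL BOLD FRAKTUR SMALL B}"
    "\N{MATHEMATICAL BOLD FRAKTUR SMALL A}"
    "\N{MATHEMATICAL BOLD FRAKTUR SMALL R}"
)

# NFKC maps the compatibility characters to their plain ASCII letters.
normalized = unicodedata.normalize("NFKC", fraktur_bar)

# The identifier stored in the AST is expected to be normalized already.
tree = ast.parse(f"{fraktur_bar} = 0")
parsed_name = tree.body[0].targets[0].id

# Executing the assignment binds the normalized key in the namespace.
namespace = {}
exec(f"{fraktur_bar} = 0", namespace)
bound_names = [name for name in namespace if not name.startswith("__")]

print(normalized, parsed_name, bound_names)
```

Because the name is normalized before it is stored, the two spellings never exist as separate identifiers at runtime, which is why assigning to one rebinds the other.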
msg412096	Author: Eryk Sun (eryksun) *	Date: 2022-01-29 19:24
Please read "Identifiers and keywords" [1] in the documentation. For example:

>>> import unicodedata as ud
>>> ud.normalize('NFKC', '𝖇𝖆𝖗') == 'bar'
True
>>> c = '\N{CYRILLIC SMALL LETTER A}'
>>> ud.name(ud.normalize('NFKC', c))
'CYRILLIC SMALL LETTER A'

---
[1] https://docs.python.org/3/reference/lexical_analysis.html?highlight=nfkc#identifiers
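
For anyone who wants to guard against this kind of surprise in their own code, the same unicodedata check can be turned into a small helper. This is only a sketch: nfkc_collisions is a hypothetical function name, not something provided by the standard library, and it only compares the names it is given rather than scanning source files.

```python
import unicodedata
from collections import defaultdict


def nfkc_collisions(names):
    """Group names by NFKC form and keep groups with distinct spellings.

    Hypothetical helper: returns {normalized_name: [original spellings]}.
    """
    groups = defaultdict(list)
    for name in names:
        groups[unicodedata.normalize("NFKC", name)].append(name)
    # Only report normalized forms reached by more than one spelling.
    return {key: spellings for key, spellings in groups.items()
            if len(set(spellings)) > 1}


fraktur_bar = (
    "\N{MATHEMATICAL BOLD FRAKTUR SMALL B}"
    "\N{MATHEMATICAL BOLD FRAKTUR SMALL A}"
    "\N{MATHEMATICAL BOLD FRAKTUR SMALL R}"
)
cyrillic_a = "\N{CYRILLIC SMALL LETTER A}"

collisions = nfkc_collisions(["bar", fraktur_bar, "a", cyrillic_a])
print(collisions)
```

With these inputs only the two spellings of "bar" are grouped together; Cyrillic 'а' is unchanged by NFKC, so it stays distinct from ASCII 'a', matching the behavior shown in the report.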
msg412111	Author: Diego Argueta (da) *	Date: 2022-01-30 00:06
I did read PEP-3131 before posting this but I still thought the behavior was counterintuitive.

History
Date	User	Action	Args
2022-04-11 14:59:55	admin	set	github: 90730
2022-01-30 00:06:56	da	set	messages: + msg412111
2022-01-29 19:24:56	eryksun	set	status: open -> closed nosy: + eryksun messages: + msg412096 resolution: not a bug stage: resolved
2022-01-29 17:37:09	pablogsal	set	messages: + msg412086
2022-01-29 17:06:08	da	create