Issue 10952: Don't normalize module names to NFKC?

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/55161

classification

Title:	Don't normalize module names to NFKC?
Type:		Stage:
Components:	Interpreter Core, Unicode	Versions:	Python 3.2, Python 3.3

process

Status:	closed	Resolution:	not a bug
Dependencies:		Superseder:
Assigned To:		Nosy List:	belopolsky, ezio.melotti, ishimoto, loewis, vstinner
Priority:	normal	Keywords:

Created on 2011-01-20 01:54 by vstinner, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Files
File name	Uploaded	Description
module_name.py	vstinner, 2011-01-20 01:54

Messages (18)
msg126577 - (view)	Author: STINNER Victor (vstinner) *	Date: 2011-01-20 01:54
The Python 3 parser normalizes all identifiers using NFKC (as described in PEP 3131). Examples:

- U+00B5 (µ: Micro sign) is normalized to U+03BC (μ: Greek small letter mu)
- U+FB03 (ﬃ: Latin small ligature ffi) is normalized to 'ffi'

The problem is that it also normalizes module names, but not the filename. The module name in the Python source code is written with the keyboard (e.g. U+00B5 in my case) and then normalized to NFKC (=> U+03BC). The filename is also written using the keyboard (U+00B5), but it is never normalized.

The attached script tests the current behaviour using the "µTorrent" name with U+00B5 and U+03BC: an import written with either U+00B5 or U+03BC uses the filename with U+03BC.

The problem is that I'm able to write 'µ' (U+00B5) with my keyboard, but not U+03BC (μ).
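The two normalizations mentioned above can be reproduced outside the parser with the standard `unicodedata` module. This is only a sketch of the NFKC step; the helper name `normalize_identifier` is made up for illustration and is not part of CPython.

```python
import unicodedata


def normalize_identifier(name: str) -> str:
    """Hypothetical helper: apply the NFKC form that PEP 3131 specifies for identifiers."""
    return unicodedata.normalize("NFKC", name)


micro_name = "\u00b5Torrent"   # MICRO SIGN, as typed on many keyboards
greek_name = "\u03bcTorrent"   # GREEK SMALL LETTER MU

# The micro sign form collapses onto the Greek form.
print(normalize_identifier(micro_name) == greek_name)

# The ffi ligature expands to three ASCII letters.
print(normalize_identifier("\ufb03"))

# The Greek form is already normalized, so it is left alone.
print(normalize_identifier(greek_name) == greek_name)
```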
msg126579 - (view)	Author: STINNER Victor (vstinner) *	Date: 2011-01-20 02:00
"µTorrent.py" filename example comes from #10754. This issue is unrelated to the Python parser or the import machinery: it is a surprising behaviour of the MBCS codec which replaces unencodable characters to a similar glyph. I changed the MBCS in Python 3.2 to be strict (it now raises an error on unencodable character).
msg126580 - (view)	Author: Alexander Belopolsky (belopolsky) *	Date: 2011-01-20 02:18
This proposal makes sense because it would make

    import µTorrent

behave the same as

    µTorrent = __import__('µTorrent')

However, I think this is a feature request and a language change, because the current grammar is

    import_stmt ::= "import" module ..
    module ::= (identifier ".")* identifier

and in order to implement the proposed feature, "module" will have to become a separate AST node that won't be treated as an identifier.
msg126581 - (view)	Author: STINNER Victor (vstinner) *	Date: 2011-01-20 02:21
New problem: if the parser doesn't normalize module names on import, it does still normalize module names on other instructions. Example: "import \xB5Torrent; del \xB5Torrent" raises an error on del because the parser normalized del identifier (the second module name) => "import \xB5Torrent; del \u03BCTorrent".
msg126582 - (view)	Author: STINNER Victor (vstinner) *	Date: 2011-01-20 02:22
See also #3080 (which is not directly related).
msg126583 - (view)	Author: Alexander Belopolsky (belopolsky) *	Date: 2011-01-20 02:31
On Wed, Jan 19, 2011 at 9:21 PM, STINNER Victor <report@bugs.python.org> wrote:
..
> New problem: if the parser doesn't normalize module names on import, it does still
> normalize module names on other instructions.
>
> Example: "import \xB5Torrent; del \xB5Torrent" raises an error on del because the parser
> normalized del identifier (the second module name) => "import \xB5Torrent; del \u03BCTorrent".
>

This won't be a problem if you make

"import \xB5Torrent"

behave as

"\xB5Torrent = __import__('\xB5Torrent')".

The latter is equivalent to "\u03BCTorrent = __import__('\xB5Torrent')".
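The equivalence claimed here rests on the difference between an identifier (normalized by the parser) and a string literal (left untouched). A small sketch of that difference, using only `exec`, `eval` and `ast`; the variable names are illustrative:

```python
import ast

# Bind a name written with MICRO SIGN (U+00B5) in source code.
namespace = {}
exec("\u00b5Torrent = 'bound through an identifier'", namespace)

# The key stored in the namespace is the NFKC form (GREEK SMALL LETTER MU).
print("\u03bcTorrent" in namespace)
print("\u00b5Torrent" in namespace)

# A string literal with the same character is not normalized.
literal = eval("'\u00b5Torrent'")
print(literal == "\u00b5Torrent")

# The module name in an import statement is parsed as an identifier,
# so the AST already carries the normalized name.
import_name = ast.parse("import \u00b5Torrent").body[0].names[0].name
print(import_name == "\u03bcTorrent")
```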
msg126584 - (view)	Author: STINNER Victor (vstinner) *	Date: 2011-01-20 03:02
> This won't be a problem if you make
> "import \xB5Torrent"
> behave as (...)
> "\u03BCTorrent = __import__('\xB5Torrent')"

"import name" is compiled to "IMPORT_NAME(name); STORE_NAME(name)" bytecode instructions. So you propose compiling it to "IMPORT_NAME(name); STORE_NAME(normalized_name)" if name differs from the normalized name. OK, I think that it is possible.
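The instruction pair named above can be observed with the `dis` module. The sketch below compiles a module-level `import os` and lists the opcode names; the other opcodes around the pair vary between CPython versions, so only those two are checked.

```python
import dis

# Compile a module-level import and collect the opcode names it produces.
code = compile("import os", "<example>", "exec")
opnames = [instruction.opname for instruction in dis.get_instructions(code)]

# IMPORT_NAME performs the import, STORE_NAME binds the result.
print("IMPORT_NAME" in opnames, "STORE_NAME" in opnames)

# Both instructions refer to the same entry in co_names, which is why the
# imported name and the bound name cannot currently differ.
print(code.co_names)
```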
msg126587 - (view)	Author: Alexander Belopolsky (belopolsky) *	Date: 2011-01-20 04:11
Victor> Ok, I think that it is possible.

While it is possible, I am not sure it is a good idea. For example, if a filesystem uses an encoding that is capable of distinguishing between "\xB5Torrent.py" and "\u03BCTorrent.py", should "import \xB5Torrent" and "import \u03BCTorrent" import different modules?
msg126590 - (view)	Author: Martin v. Löwis (loewis) *	Date: 2011-01-20 06:19
I think this issue falls into a similar category as support for case-insensitive but case-preserving file systems. Python uses regular file system lookups, but then may need to verify whether it got the right one. I'd like to request that PEP 3131 is followed as it stands: identifier lookup uses NFKC, period. This gives two issues: a) how can users make sure that they name the files correctly? and b) what if the file system implementation mangles file names. For b), I'd use the same approach as with case-insensitive lookups: verify that the file we read is really the one we want. For a), wrt. "I'm not able to write U+03BC with my keyboard", I say "tough luck - don't use that character in a module name, then". Somebody with a Greek keyboard will have no problems doing that. This is really the same as any other non-ASCII character which you are unable to type: it just means that you can't conveniently enter the respective Python identifier. Just try importing "саша", for example. Get a different keyboard.
msg126592 - (view)	Author: Alexander Belopolsky (belopolsky) *	Date: 2011-01-20 06:40
On Thu, Jan 20, 2011 at 1:19 AM, Martin v. Löwis <report@bugs.python.org> wrote:
..
> I'd like to request that PEP 3131 is followed as it stands: identifier lookup uses NFKC,
> period. This gives two issues: a) how can users make sure that they name the files
> correctly? and b) what if the file system implementation mangles file names.
>

There is also issue c) what if the filesystem encoding can only represent a compatibility character, say U+00B5, but not its NFKC equivalent, U+03BC?

Suppose you have a system with both locale and FS encodings being Latin-1. You can write Python code using Latin-1, and the following is a valid bytestream:

    b'# encoding: latin-1\nimport \xB5Torrent\n'

However, this code will always fail because '\xB5Torrent' will be normalized into '\u03BCTorrent' and a file named '\u03BCTorrent.py' cannot be created on a filesystem with Latin-1 encoding.
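The encoding side of issue c) can be checked without touching a real filesystem. This sketch only exercises the Latin-1 codec on the two candidate filenames; it assumes nothing about how the import machinery builds paths.

```python
micro_filename = "\u00b5Torrent.py"   # what a Latin-1 filesystem can store
greek_filename = "\u03bcTorrent.py"   # what the normalized identifier asks for

# MICRO SIGN is byte 0xB5 in Latin-1, so this filename is representable.
print(micro_filename.encode("latin-1"))

# GREEK SMALL LETTER MU has no Latin-1 byte, so encoding fails.
try:
    greek_filename.encode("latin-1")
    greek_is_encodable = True
except UnicodeEncodeError:
    greek_is_encodable = False
print(greek_is_encodable)
```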
msg126597 - (view)	Author: STINNER Victor (vstinner) *	Date: 2011-01-20 11:07
> b) what if the file system implementation mangles file names.
>
> I'd use the same approach as with case-insensitive lookups: verify
> that the file we read is really the one we want.

Only Mac OS X and the HFS+ filesystem normalize filenames (to a variant of NFD). But such normalization is a good thing! I mean that I don't think we have anything to do for that.

---
The user creates a café.py file, with the name written with the keyboard in NFD: cafe\u0301 (this is very unlikely, all operating systems prefer NFC for the keyboard, but it's just to give an example). Mac OS X normalizes the filename to NFD: cafe\u0301.py is created in the filesystem.

Then (s)he tries to import the café module: (s)he writes "import café" with his/her NFD keyboard. Python normalizes café to NFKC (caf\xe9) and then tries to read caf\xe9.py. Mac OS X normalizes the filename to NFD: cafe\u0301.py, and this file exists, so it works as expected.
---

I suppose that any filesystem normalization is good, because it avoids surprising behaviours (e.g. having two files cafe\u0301 and caf\xe9 with names rendered exactly the same on screen). We should maybe patch Windows, Mac OS, Linux & co to normalize to NFKC :-)

> a) how can users make sure that they name the files correctly?
>
> For a), wrt. "I'm not able to write U+03BC with my keyboard", I say
> "tough luck - don't use that character in a module name, then".
> Somebody with a Greek keyboard will have no problems doing that.

Even if I try to agree with "don't use that character in a module name", it can be surprising for an English speaker who would like to use the µTorrent (U+00B5) module name in his/her project. She/he can create µTorrent.py with a non-Greek keyboard (\xb5Torrent.py), but then import µTorrent (import \xb5Torrent) fails with "ImportError: No module named µTorrent". The error message is really "ImportError: No module named \u03BCTorrent": the identifier is normalized, but remember that µ (U+00B5) and μ (U+03BC) are rendered exactly the same by most fonts.

We should at least document this surprising behaviour in the import documentation. Something like:

<< WARNING: Non-ASCII characters in module names are normalized to NFKC by the Python parser ([PEP 3131]). For example, import µTorrent (µ: U+00B5) is normalized to import μTorrent (μ: U+03BC): Python will try to open "\u03BCTorrent.py" (or "\u03BCTorrent/__init__.py"), and not "\xB5Torrent.py" (or "\xB5Torrent/__init__.py"). >>

> This is really the same as any other non-ASCII character which you are
> unable to type: it just means that you can't conveniently enter the
> respective Python identifier. Just try importing "саша", for example.
> Get a different keyboard.

I disagree. For identifiers in the source code, it works (transparently) as expected. A Greek speaker starts a project using the µTorrent (\u03BCTorrent) identifier in its source code (a variable name, not a module name). An English speaker writes a patch using µTorrent written as \xB5Torrent: both forms are accepted by Python, and it works.
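The café walk-through can be traced with `unicodedata`. Note the assumption: HFS+ uses a variant of NFD, and the sketch below approximates it with standard NFD, which gives the same result for this particular name.

```python
import unicodedata

typed_name = "cafe\u0301"   # 'e' followed by COMBINING ACUTE ACCENT (NFD)

# Step 1: the filesystem stores the name in (approximately) NFD.
stored_name = unicodedata.normalize("NFD", typed_name)

# Step 2: the parser normalizes the identifier to NFKC (precomposed e-acute).
identifier = unicodedata.normalize("NFKC", typed_name)
print(identifier == "caf\u00e9")

# Step 3: the lookup name is normalized by the filesystem again,
# so it matches the stored name and the import succeeds.
lookup_name = unicodedata.normalize("NFD", identifier)
print(lookup_name == stored_name)
```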
msg126602 - (view)	Author: STINNER Victor (vstinner) *	Date: 2011-01-20 13:06
> There is also issue c) what if the filesystem encoding can only
> represent a compatibility character, say U+00B5, but not its NFKC
> equivalent, U+03BC?

It is the same problem as not being able to write U+03BC with a keyboard: in this setup, don't use U+00B5 or U+03BC. More generally: don't use non-ASCII characters if your setup is not fully Unicode compliant, or fix your setup :-)
msg126632 - (view)	Author: Alexander Belopolsky (belopolsky) *	Date: 2011-01-20 17:47
On Thu, Jan 20, 2011 at 8:06 AM, STINNER Victor <report@bugs.python.org> wrote:
..
>> There is also issue c) what if the filesystem encoding can only
>> represent a compatibility character, say U+00B5, but not its NFKC
>> equivalent, U+03BC?
>
> It is the same problem as not being able to write U+03BC with a keyboard:

No. This is a different problem, and I agree with Martin that keyboard limitations are not an issue. With proper tools one can create a '\u03BCTorrent.py' file even if the keyboard does not have a '\u03BC' key, as long as the filesystem is capable of storing such a file. Python itself is one such tool:

>>> with open('\u03BCTorrent.py'.encode(fsencoding), 'w') as f: ...

However, if fsencoding = 'latin-1', the code above will fail.

One possible solution to this problem is to define a 'compat' error handler that would detect unencodable strings with encodable compatibility equivalents and produce the encoding of an NFKC-equivalent string instead of raising an error. ISTM that in the Latin-1 encoding, there are only five affected characters:

>>> from unicodedata import decomposition, name
>>> for i in range(256):
...     dec = decomposition(chr(i))
...     if dec and dec.startswith('<compat>'):
...         print("U+00%02X '%s' (%s): %s" %(i, chr(i), name(chr(i)), dec))
...
U+00A8 '¨' (DIAERESIS): <compat> 0020 0308
U+00AF '¯' (MACRON): <compat> 0020 0304
U+00B4 '´' (ACUTE ACCENT): <compat> 0020 0301
U+00B5 'µ' (MICRO SIGN): <compat> 03BC
U+00B8 '¸' (CEDILLA): <compat> 0020 0327

I suspect that the number of affected characters in the other encodings is similarly small. If we further limit special handling to characters that are valid in identifiers, U+00B5 will end up being the only such character in Latin-1.

An import mechanism using encode(fsencoding, 'compat') will, when given either "import \u00B5Torrent" or "import \u03BCTorrent" in a source file, open "\u03BCTorrent.py" when fsencoding='utf-8' and "\u00B5Torrent.py" if fsencoding='latin-1'.

A packaging mechanism that prepares code developed on a Latin-1 filesystem for distribution would have to NFKC-normalize filenames before encoding them using UTF-8.
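A rough sketch of what such a 'compat' error handler might look like, built on `codecs.register_error`. No such handler exists in Python; the names `compat_errors` and `_compat_table` are invented here, and the reverse lookup is an assumption that only covers single-byte encodings and one-character NFKC forms.

```python
import codecs
import unicodedata
from functools import lru_cache


@lru_cache(maxsize=None)
def _compat_table(encoding: str) -> dict:
    """Map NFKC forms back to a compatibility character the encoding can store."""
    table = {}
    for byte in range(256):
        try:
            char = bytes([byte]).decode(encoding)
        except UnicodeDecodeError:
            continue  # this byte is not a complete character in the encoding
        normalized = unicodedata.normalize("NFKC", char)
        if normalized != char and len(normalized) == 1:
            table.setdefault(normalized, char)
    return table


def compat_errors(exc):
    """Hypothetical 'compat' handler: substitute an encodable compatibility character."""
    if not isinstance(exc, UnicodeEncodeError):
        raise exc
    table = _compat_table(exc.encoding)
    try:
        replacement = "".join(table[ch] for ch in exc.object[exc.start:exc.end])
    except KeyError:
        raise exc from None  # no compatibility equivalent: keep the strict behaviour
    return replacement, exc.end


codecs.register_error("compat", compat_errors)

# On Latin-1 the Greek mu falls back to the micro sign byte; UTF-8 is unaffected.
print("\u03bcTorrent.py".encode("latin-1", "compat"))
print("\u03bcTorrent.py".encode("utf-8", "compat"))
```

Characters with no compatibility equivalent in the target encoding (for example Cyrillic letters on Latin-1) still raise `UnicodeEncodeError` in this sketch.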
msg126656 - (view)	Author: STINNER Victor (vstinner) *	Date: 2011-01-20 22:23
> A packaging mechanism that prepares code developed on a Latin-1
> filesystem for distribution, would have to NFKC-normalize
> filenames before encoding them using UTF-8.

It causes portability issues: if you copy a non-ASCII module to a new host, the program will work or not depending on the filesystem encoding. Having to transform the filename when you copy a file, just to fix a corner case, is a pain.

> One possible solution to this problem is to define a 'compat' error
> handler that would detect unencodable strings with encodable
> compatibility equivalents and produce encoding of an NFKC equivalent
> string instead of raising an error.

Only a few people use non-ASCII module names, and most operating systems are able to store all Unicode characters, so I don't think that we need to support U+00B5 in a module name with a Latin-1 filesystem at all. If you use an old system with a Latin-1 filesystem, you have to limit your expectations of Python's Unicode support :-)

os.fsencode() and os.fsdecode() already use a custom error handler: surrogateescape. compat will conflict with surrogateescape. Loading a module concatenates two parts: a path from sys.path (decoded using the filesystem encoding and the surrogateescape error handler) and a module name. If compat is used to encode the filename, the module name will be encoded correctly, but not the path.
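The conflict described here comes from `encode()` accepting a single error handler for the whole string. The sketch below uses explicit codecs rather than `os.fsencode()` so that it does not depend on the local filesystem encoding; the path bytes are made up for illustration.

```python
# A sys.path entry containing a byte that is not valid UTF-8.
raw_path = b"/data/\xff/lib"

# surrogateescape smuggles the undecodable byte through as a lone surrogate...
decoded_path = raw_path.decode("utf-8", "surrogateescape")
print(ascii(decoded_path))

# ...and restores it when encoding with the same handler.
print(decoded_path.encode("utf-8", "surrogateescape") == raw_path)

# Joining that path with a normalized module name gives a string that needs
# two different fallbacks on a Latin-1 filesystem, but only one handler applies.
full_path = decoded_path + "/\u03bcTorrent.py"
try:
    full_path.encode("latin-1", "surrogateescape")
    encodable = True
except UnicodeEncodeError:
    encodable = False  # surrogateescape cannot help with the Greek mu
print(encodable)
```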
msg126666 - (view)	Author: Martin v. Löwis (loewis) *	Date: 2011-01-21 00:02
> There is also issue c) what if the filesystem encoding can only
> represent a compatibility character, say U+00B5, but not its NFKC
> equivalent, U+03BC?

That should be considered similar to file systems that just cannot represent certain characters at all, e.g. many of the non-ASCII characters, or no upper-case letters. If you have such a file system, you just cannot use these characters in a module name. Rename your modules, then, or put the modules in a zipfile (or use some other import hook).

> However, this code will always fail because '\xB5Torrent' will be
> normalized into '\u03BCTorrent' and a file named '\u03BCTorrent.py'
> cannot be created on a filesystem with Latin-1 encoding.

Tough luck. The filesystem just doesn't support GREEK SMALL LETTER MU, just as it doesn't support all the other Greek characters.

It may be fun coming up with these border cases. But I really don't see a need to support them. If you really need to have that letter in a module name, reformat your disk with a better file system.
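One way to decouple the module name from the filename, short of a full import hook, is to load the file explicitly with `importlib.util`. This is a minimal sketch, not what the thread proposes; `load_module_from_path` and the ASCII filename `mu_torrent.py` are illustrative choices.

```python
import importlib.util
import tempfile
from pathlib import Path


def load_module_from_path(module_name: str, path: str):
    """Hypothetical helper: load a module from an explicitly given file path."""
    spec = importlib.util.spec_from_file_location(module_name, path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module


with tempfile.TemporaryDirectory() as tmp:
    # The file uses an ASCII-only name that any filesystem can store.
    source = Path(tmp) / "mu_torrent.py"
    source.write_text("ANSWER = 42\n", encoding="utf-8")

    # The module itself still carries the normalized non-ASCII name.
    module = load_module_from_path("\u03bcTorrent", str(source))

print(module.__name__ == "\u03bcTorrent", module.ANSWER)
```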
msg126669 - (view)	Author: Martin v. Löwis (loewis) *	Date: 2011-01-21 00:07
> Only Mac OS X and the HFS+ filesystem normalize filenames (to a variant
> of NFD). But such normalization is a good thing! I mean that I don't
> think that we have anything to do for that.

That may well be - I don't have a case where this would cause problems, either.

> We should at least document this surprising behaviour in the import
> documentation.

There are also better ways to support the user than mere documentation. For example, the exception message could be more helpful, and IDLE could warn the user when saving the file in the first place.

> << WARNING: Non-ASCII characters in module names are normalized to NFKC
> by the Python parser ([PEP 3131]). For example, import µTorrent (µ:
> U+00B5) is normalized to import μTorrent (μ: U+03BC): Python will try to
> open "\u03BCTorrent.py" (or "\u03BCTorrent/__init__.py"), and not
> "\xB5Torrent.py" (or "\xB5Torrent/__init__.py"). >>

I can't believe this is a real problem. I'd defer warning about made-up problems until real users report them as a real problem.

> I disagree.

If you disagree strongly, please write a PEP.
msg127183 - (view)	Author: STINNER Victor (vstinner) *	Date: 2011-01-27 12:35
It looks like there is nothing interesting to do here, so I close the issue (which is not a bug :-)).
msg182911 - (view)	Author: Atsuo Ishimoto (ishimoto) *	Date: 2013-02-25 02:51
Converting identifiers to NFKC makes it problematic to work with FULLWIDTH letters such as 'ａ' (FULLWIDTH LATIN SMALL LETTER A). We can create a module named 'ａａａ.py', but this module cannot be imported on any platform I know.

>>> import ａａａ
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: No module named 'aaa'

Talking about the Japanese environment, I don't see any benefit in normalizing variable names. FULLWIDTH/HALFWIDTH compatibility characters are commonly used here, and they are recognized as different characters. It may be too late to argue, but converting to normal form NFC instead of NFKC would be better.

Python distinguishes lowercase and uppercase letters, but doesn't distinguish fullwidth and halfwidth. This is pretty surprising behavior to me.
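The difference between the two normal forms for fullwidth letters can be checked directly. A short sketch with `unicodedata`; the variable names are illustrative.

```python
import unicodedata

fullwidth_name = "\uff41\uff41\uff41"   # three FULLWIDTH LATIN SMALL LETTER A

# NFKC folds fullwidth letters onto ASCII, which is why "aaa" is looked up.
nfkc_name = unicodedata.normalize("NFKC", fullwidth_name)
print(nfkc_name)

# NFC only composes canonical equivalents and keeps the fullwidth letters.
nfc_name = unicodedata.normalize("NFC", fullwidth_name)
print(nfc_name == fullwidth_name)

# Case is never folded by either form.
print(unicodedata.normalize("NFKC", "A") == "a")
```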

History
Date	User	Action	Args
2022-04-11 14:57:11	admin	set	github: 55161
2013-02-25 03:03:01	ezio.melotti	set	nosy: + ezio.melotti versions: - Python 3.1
2013-02-25 02:51:34	ishimoto	set	messages: + msg182911
2013-02-25 00:26:54	ishimoto	set	nosy: + ishimoto
2011-01-27 12:35:25	vstinner	set	status: open -> closed messages: + msg127183 resolution: not a bug nosy: loewis, belopolsky, vstinner
2011-01-21 00:07:58	loewis	set	nosy: loewis, belopolsky, vstinner messages: + msg126669
2011-01-21 00:02:06	loewis	set	nosy: loewis, belopolsky, vstinner messages: + msg126666
2011-01-20 22:23:20	vstinner	set	nosy: loewis, belopolsky, vstinner messages: + msg126656
2011-01-20 17:47:46	belopolsky	set	nosy: loewis, belopolsky, vstinner messages: + msg126632
2011-01-20 13:06:43	vstinner	set	nosy: loewis, belopolsky, vstinner messages: + msg126602
2011-01-20 11:07:22	vstinner	set	nosy: loewis, belopolsky, vstinner messages: + msg126597
2011-01-20 06:40:20	belopolsky	set	nosy: loewis, belopolsky, vstinner messages: + msg126592
2011-01-20 06:19:02	loewis	set	nosy: loewis, belopolsky, vstinner messages: + msg126590
2011-01-20 04:11:42	belopolsky	set	nosy: + loewis messages: + msg126587
2011-01-20 03:02:17	vstinner	set	nosy: belopolsky, vstinner messages: + msg126584
2011-01-20 02:31:42	belopolsky	set	nosy: belopolsky, vstinner messages: + msg126583
2011-01-20 02:22:07	vstinner	set	nosy: belopolsky, vstinner messages: + msg126582
2011-01-20 02:21:38	vstinner	set	nosy: belopolsky, vstinner messages: + msg126581
2011-01-20 02:18:38	belopolsky	set	nosy: belopolsky, vstinner messages: + msg126580
2011-01-20 02:00:16	vstinner	set	nosy: belopolsky, vstinner messages: + msg126579
2011-01-20 01:56:08	belopolsky	set	nosy: + belopolsky
2011-01-20 01:54:54	vstinner	create