This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

classification
Title: Don't normalize module names to NFKC?
Type: Stage:
Components: Interpreter Core, Unicode Versions: Python 3.2, Python 3.3
process
Status: closed Resolution: not a bug
Dependencies: Superseder:
Assigned To: Nosy List: belopolsky, ezio.melotti, ishimoto, loewis, vstinner
Priority: normal Keywords:

Created on 2011-01-20 01:54 by vstinner, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Files
File name Uploaded Description Edit
module_name.py vstinner, 2011-01-20 01:54
Messages (18)
msg126577 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2011-01-20 01:54
The Python 3 parser normalizes all identifiers using NFKC (as described in the PEP 3131). Examples:
 - U+00B5 (µ: Micro sign) is normalized to U+03BC (μ: Greek small letter mu)
 - U+FB03 (ﬃ: Latin small ligature ffi) is normalized to 'ffi'
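
The two normalizations listed above can be reproduced with the standard `unicodedata` module. This is a small sketch; the helper name `nfkc` is made up for illustration and is not part of the parser.

```python
import unicodedata


def nfkc(identifier):
    # Apply the same normalization form that PEP 3131 prescribes for identifiers.
    return unicodedata.normalize("NFKC", identifier)


# MICRO SIGN (U+00B5) becomes GREEK SMALL LETTER MU (U+03BC).
micro_name = nfkc("\u00b5Torrent")

# LATIN SMALL LIGATURE FFI (U+FB03) expands to three ASCII letters.
ligature_name = nfkc("\ufb03")

print(ascii(micro_name), ascii(ligature_name))
```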

The problem is that it does also normalize module names, but not the filename.

The module name in the Python source code is written with the keyboard (eg. U+00B5 in my case) and then normalized to NFKC (=> U+03BC). The filename is also written using the keyboard (U+00B5), but it is never normalized.

Attached script tests the current behaviour using "µTorrent" name with U+00B5 and U+03BC: import with U+00B5 or U+03BC use the filename with U+03BC.

The problem is that I'm able to write 'µ' (U+00B5) with my keyboard, but not U+03BC (μ).
msg126579 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2011-01-20 02:00
"µTorrent.py" filename example comes from #10754.

That issue is unrelated to the Python parser or the import machinery: it is a surprising behaviour of the MBCS codec, which replaces unencodable characters with a similar glyph. I changed the MBCS codec in Python 3.2 to be strict (it now raises an error on an unencodable character).
msg126580 - (view) Author: Alexander Belopolsky (belopolsky) * (Python committer) Date: 2011-01-20 02:18
This proposal makes sense because it would make

import µTorrent

behave the same as

µTorrent = __import__('µTorrent')

However, I think this is a feature request and a language change because the current grammar is

import_stmt     ::=  "import" module ..
module          ::=  (identifier ".")* identifier

and in order to implement the proposed feature, "module" will have to become a separate AST node that won't be treated as identifier.
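
The difference between the two spellings can be seen in the AST. The sketch below assumes a CPython 3 interpreter (and `ast.Constant`, so 3.8 or later): the module name in an `import` statement goes through identifier normalization, while the string passed to `__import__` is left untouched.

```python
import ast

# The module name in an import statement is parsed as an identifier.
import_stmt = ast.parse("import \u00b5Torrent").body[0]
import_name = import_stmt.names[0].name

# The argument of __import__ is an ordinary string literal.
call = ast.parse("__import__('\u00b5Torrent')").body[0].value
call_arg = call.args[0].value

print(ascii(import_name), ascii(call_arg))
```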
msg126581 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2011-01-20 02:21
New problem: if the parser doesn't normalize module names on import, it does still normalize module names on other instructions.

Example: "import \xB5Torrent; del \xB5Torrent" raises an error on del because the parser normalized del identifier (the second module name) => "import \xB5Torrent; del \u03BCTorrent".
msg126582 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2011-01-20 02:22
See also #3080 (which is not directly related).
msg126583 - (view) Author: Alexander Belopolsky (belopolsky) * (Python committer) Date: 2011-01-20 02:31
On Wed, Jan 19, 2011 at 9:21 PM, STINNER Victor <report@bugs.python.org> wrote:
..
> New problem: if the parser doesn't normalize module names on import, it does still
> normalize module names on other instructions.
>
> Example: "import \xB5Torrent; del \xB5Torrent" raises an error on del because the parser
> normalized del identifier (the second module name) => "import \xB5Torrent; del \u03BCTorrent".
>

This won't be a problem if you make "import \xB5Torrent" behave as
"\xB5Torrent = __import__('\xB5Torrent')".  The latter is equivalent
to "\u03BCTorrent =  __import__('\xB5Torrent')".
msg126584 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2011-01-20 03:02
> This won't be a problem if you make 
> "import \xB5Torrent" 
> behave as (...)
> "\u03BCTorrent =  __import__('\xB5Torrent')"

"import name" is compiled to "IMPORT_NAME(name); STORE_NAME(name)" bytecode instructions. So you proposed to compile it to "IMPORT_NAME(name); STORE_NAME(normalized_name)" if name is different than the normalized name. Ok, I think that it is possible.
msg126587 - (view) Author: Alexander Belopolsky (belopolsky) * (Python committer) Date: 2011-01-20 04:11
Victor> Ok, I think that it is possible.

While it is possible, I am not sure it is a good idea.  For example, if a filesystem uses encoding that is capable of distinguishing between "\xB5Torrent.py" and "\u03BCTorrent.py", should "import \xB5Torrent" and "import \u03BCTorrent" import different modules?
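
For reference, the IMPORT_NAME / STORE_NAME pair mentioned above can be inspected with the `dis` module. This is a sketch assuming CPython 3; the code is only compiled, never executed, so no module has to exist. On the CPython versions I am aware of, both operands come out already normalized.

```python
import dis

# Compile only; nothing is imported.
code = compile("import \u00b5Torrent", "<demo>", "exec")

import_ops = [
    (ins.opname, ins.argval)
    for ins in dis.get_instructions(code)
    if ins.opname in ("IMPORT_NAME", "STORE_NAME")
]

for opname, argval in import_ops:
    print(opname, ascii(argval))
```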
msg126590 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2011-01-20 06:19
I think this issue falls into a similar category as support for case-insensitive but case-preserving file systems. Python uses regular file system lookups, but then may need to verify whether it got the right one.

I'd like to request that PEP 3131 is followed as it stands: identifier lookup uses NFKC, period. This gives two issues: a) how can users make sure that they name the files correctly? and b) what if the file system implementation mangles file names.

For b), I'd use the same approach as with case-insensitive lookups: verify that the file we read is really the one we want. For a), wrt. "I'm not able to write U+03BC with my keyboard", I say "tough luck - don't use that character in a module name, then". Somebody with a Greek keyboard will have no problems doing that. This is really the same as any other non-ASCII character which you are unable to type: it just means that you can't conveniently enter the respective Python identifier. Just try importing "саша", for example. Get a different keyboard.
msg126592 - (view) Author: Alexander Belopolsky (belopolsky) * (Python committer) Date: 2011-01-20 06:40
On Thu, Jan 20, 2011 at 1:19 AM, Martin v. Löwis <report@bugs.python.org> wrote:
..
> I'd like to request that PEP 3131 is followed as it stands: identifier lookup uses NFKC,
> period. This gives two issues: a) how can users make sure that they name the files
> correctly? and b) what if the file system implementation mangles file names.
>

There is also issue c) what if the filesystem encoding can only
represent a compatibility character, say U+00B5, but not its NFKC
equivalent, U+03BC?  Suppose you have a system with both locale and FS
encodings being Latin-1.  You can write Python code using Latin-1 and
the following is valid bytestream:

b'# encoding: latin-1\nimport \xB5Torrent\n'

However, this code will always fail because '\xB5Torrent' will be
normalized into '\u03BCTorrent' and a file named '\u03BCTorrent.py'
cannot be created on a filesystem with Latin-1 encoding.
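
A short sketch of this case, assuming CPython 3: the Latin-1 byte stream compiles, the name stored in the code object is the normalized one, and that normalized name has no Latin-1 encoding. The helper `encodable` is hypothetical and only used for illustration.

```python
source = b"# encoding: latin-1\nimport \xb5Torrent\n"

# Compile only; the import itself is never executed here.
code = compile(source, "<demo>", "exec")


def encodable(name, encoding):
    # True if the whole name can be represented in the given encoding.
    try:
        name.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


print(ascii(code.co_names))
print(encodable("\u00b5Torrent.py", "latin-1"))
print(encodable("\u03bcTorrent.py", "latin-1"))
```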
msg126597 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2011-01-20 11:07
> b) what if the file system implementation mangles file names.
> 
> I'd use the same approach as with case-insensitive lookups: verify
> that the file we read is really the one we want.

Only Mac OS X and the HFS+ filesystem normalize filenames (to a variant
of NFD). But such normalization is a good thing! I mean that I don't
think that we have anything to do for that.

---
The user creates café.py file, name written with the keyboard in NFD:
cafe\u0301 (this is very unlikely, all operating systems prefer NFC for
the keyboard, but it's just to give an example). Mac OS X normalizes the
filename to NFD: cafe\u0301.py is created in the filesystem.

Then (s)he tries to import the café module: write "import café" with
his/her NFD keyboard. Python normalizes café to NFKC (caf\xe9) and then
tries to read caf\xe9.py. Mac OS X normalizes the filename to NFD:
cafe\u0301.py, and this file exists, so it works as expected.
---
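
The round trip in this example can be checked with `unicodedata`. The sketch uses plain NFD as an approximation; HFS+ applies its own variant of NFD, which is not modelled here.

```python
import unicodedata

typed_name = "cafe\u0301"  # "e" followed by COMBINING ACUTE ACCENT (NFD form)

# What the parser would look for after NFKC normalization.
parser_name = unicodedata.normalize("NFKC", typed_name)

# What an NFD-normalizing filesystem would turn that lookup into.
lookup_name = unicodedata.normalize("NFD", parser_name)

print(ascii(parser_name), ascii(lookup_name))
```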

I suppose that any filesystem normalization is good, because it avoids
surprising behaviours (eg. having two files cafe\u0301 and caf\xe9 with
names rendered exactly the same on screen). We should maybe patch
Windows, Mac OS, Linux & co to normalize to NFKC :-)

> a) how can users make sure that they name the files correctly?
>
>  For a), wrt. "I'm not able to write U+03BC with my keyboard", I say
> "tough luck - don't use that character in a module name, then".
> Somebody with a Greek keyboard will have no problems doing that. 

Even if I try to agree with "don't use that character in a module name":
it can be surprising for an English speaker who would like to use µTorrent
(U+00B5) as a module name in his/her project. She/He can create µTorrent.py
with a non-Greek keyboard (\xb5Torrent.py), but then import µTorrent
(import \xb5Torrent) fails: "ImportError: No module named µTorrent". The
error message is "ImportError: No module named \u03BCTorrent": the
identifier is normalized, but remember that µ (U+00B5) and μ (U+03BC)
are rendered exactly the same by most fonts.

We should at least document this surprising behaviour in the import
documentation. Something like:

<< WARNING: Non-ASCII characters in module names are normalized to NFKC
by the Python parser ([PEP 3131]). For example, import µTorrent (µ:
U+00B5) is normalized to import μTorrent (μ: U+03BC): Python will try to
open "\u03BCTorrent.py" (or "\u03BCTorrent/__init__.py"), and not
"\xB5Torrent.py" (or "\xB5Torrent/__init__.py"). >>

> This is really the same as any other non-ASCII character which you are
> unable to type: it just means that you can't conveniently enter the
> respective Python identifier. Just try importing "саша", for example.
> Get a different keyboard.

I disagree. For identifiers in the source code, it works (transparently)
as expected.

A Greek developer starts a project using the µTorrent (\u03BCTorrent)
identifier in the source code (a variable name, not a module name). An
English developer writes a patch using µTorrent written with \xB5Torrent:
both forms are accepted by Python, and it works.

"exec"))
it works
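
The identifier case can be checked along these lines. This is an illustrative sketch, not the snippet from the original message: a variable assigned under the U+03BC spelling is read back under the U+00B5 spelling.

```python
namespace = {}

# Assign with GREEK SMALL LETTER MU, read with MICRO SIGN.
source = "\u03bcTorrent = 'it works'\nresult = \u00b5Torrent\n"
exec(compile(source, "<demo>", "exec"), namespace)

print(namespace["result"])
```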
msg126602 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2011-01-20 13:06
> There is also issue c) what if the filesystem encoding can only
> represent a compatibility character, say U+00B5, but not its NFKC
> equivalent, U+03BC?

It is the same problem as not being able to write U+03BC with a keyboard: in this setup, don't use U+00B5 or U+03BC. More generally: don't use non-ASCII characters if your setup is not fully Unicode compliant, or fix your setup :-)
msg126632 - (view) Author: Alexander Belopolsky (belopolsky) * (Python committer) Date: 2011-01-20 17:47
On Thu, Jan 20, 2011 at 8:06 AM, STINNER Victor <report@bugs.python.org> wrote:
..
>> There is also issue c) what if the filesystem encoding can only
>> represent a compatibility character, say U+00B5, but not its NFKC
>> equivalent, U+03BC?
>
> It is the same problem than not being able to write U+03BC with a keyboard:

No.  This is a different problem and I agree with Martin that keyboard
limitations are not an issue.  With proper tools one can create
'\u03BCTorrent.py' file even if the keyboard does not have a '\u03BC'
key as long as the filesystem is capable of storing such file.  Python
itself is one such tool:

>>> with open('\u03BCTorrent.py'.encode(fsencoding), 'w') as f: ...

However, if fsencoding = 'latin-1', the code above will fail.

One possible solution to this problem is to define a 'compat' error
handler that would detect unencodable strings with encodable
compatibility equivalents and produce encoding of an NFKC equivalent
string instead of raising an error.  ISTM, that in the Latin-1
encoding, there are only five affected characters:

>>> from unicodedata import decomposition, name
>>> for i in range(128, 256):
...     dec = decomposition(chr(i))
...     if dec and dec.startswith('<compat>'):
...         print("U+00%02X '%s' (%s): %s" % (i, chr(i), name(chr(i)), dec))
...
U+00A8 '¨' (DIAERESIS): <compat> 0020 0308
U+00AF '¯' (MACRON): <compat> 0020 0304
U+00B4 '´' (ACUTE ACCENT): <compat> 0020 0301
U+00B5 'µ' (MICRO SIGN): <compat> 03BC
U+00B8 '¸' (CEDILLA): <compat> 0020 0327

I suspect that the number of affected characters in the other
encodings is similarly small.  If we further limit special handling to
characters that are valid in identifiers, U+00B5 will end up being the
only such character in Latin-1.
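
The narrowing to identifier characters can be sketched as follows. It assumes the upper half of Latin-1 (U+0080 to U+00FF) is the range of interest and tests each candidate as a non-initial identifier character by prefixing an ASCII letter.

```python
from unicodedata import decomposition

# Latin-1 characters that carry a <compat> decomposition.
compat_chars = [
    chr(i)
    for i in range(0x80, 0x100)
    if decomposition(chr(i)).startswith("<compat>")
]

# Keep only those that Python accepts inside an identifier.
identifier_chars = [ch for ch in compat_chars if ("a" + ch).isidentifier()]

print([hex(ord(ch)) for ch in compat_chars])
print([hex(ord(ch)) for ch in identifier_chars])
```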

An import mechanism using encode(fsencoding, 'compat') will, when
given either "import \u00B5Torrent" or  "import \u03BCTorrent" in
source file, open  "\u03BCTorrent.py" when fsencoding='utf-8'  and
"\u00B5Torrent.py" if fsencoding='latin-1'.   A packaging mechanism
that prepares code developed on a Latin-1 filesystem for distribution,
would have to NFKC-normalize filenames before encoding them using
UTF-8.
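
A rough sketch of what such a 'compat' handler might look like, built on `codecs.register_error`. Everything here is hypothetical: the handler name, the helper `_compat_table`, and the assumption that the target is a single-byte codec whose name is reported in the exception. It has only been thought through for Latin-1 and is not how the import machinery works.

```python
import codecs
import unicodedata


def _compat_table(encoding):
    # Map NFKC forms to compatibility characters the codec can represent.
    table = {}
    for byte in range(256):
        try:
            ch = bytes([byte]).decode(encoding)
        except UnicodeDecodeError:
            continue
        normalized = unicodedata.normalize("NFKC", ch)
        if normalized != ch and len(normalized) == 1:
            table.setdefault(normalized, ch)
    return table


def compat_errors(exc):
    # Replace an unencodable character with an encodable compatibility
    # equivalent, or re-raise the original error if there is none.
    if not isinstance(exc, UnicodeEncodeError):
        raise exc
    table = _compat_table(exc.encoding)
    replacement = []
    for ch in exc.object[exc.start:exc.end]:
        if ch not in table:
            raise exc
        replacement.append(table[ch])
    return "".join(replacement), exc.end


codecs.register_error("compat", compat_errors)

encoded_latin1 = "\u03bcTorrent.py".encode("latin-1", "compat")
encoded_utf8 = "\u03bcTorrent.py".encode("utf-8", "compat")
print(encoded_latin1, encoded_utf8)
```

With UTF-8 the handler is never invoked because every character is encodable, which matches the behaviour described above for fsencoding='utf-8'.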
msg126656 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2011-01-20 22:23
> A packaging mechanism that prepares code developed on a Latin-1
> filesystem for distribution, would have to NFKC-normalize 
> filenames before encoding them using UTF-8.

It causes portability issues: if you copy a non-ASCII module on a new
host, the program will work or not depending on the filesystem encoding.
Having to transform the filename when you copy a file, just to fix a
corner case, is a pain.

> One possible solution to this problem is to define a 'compat' error
> handler that would detect unencodable strings with encodable
> compatibility equivalents and produce encoding of an NFKC equivalent
> string instead of raising an error.

Only a few people use non-ASCII module names and most operating systems
are able to store all Unicode characters, so I don't think that we need
to support U+00B5 in a module name with a Latin-1 filesystem at all. If you
use an old system with a Latin-1 filesystem, you have to limit your
expectations of Python's Unicode support :-)

os.fsencode() and os.fsdecode() already use a custom error handler:
surrogateescape. compat will conflict with surrogateescape. Loading a
module concatenates two parts: a path from sys.path (decoded from the
filesystem encoding and surrogateescape error handler) and a module
name. If compat is used to encode the filename, the module name will be
encoded correctly, but not the path.
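
The conflict can be illustrated with plain codecs, since an encode call takes exactly one error handler. The sketch below uses explicit encodings instead of `os.fsencode()` so that it does not depend on the local filesystem encoding.

```python
# Bytes that are valid Latin-1 but not valid UTF-8, as might appear in sys.path.
raw_path = b"caf\xe9"

# surrogateescape smuggles the undecodable byte through as a lone surrogate...
decoded_path = raw_path.decode("utf-8", "surrogateescape")

# ...and restores it when encoding with the same handler.
round_trip = decoded_path.encode("utf-8", "surrogateescape")

# The same handler does nothing for a genuinely unencodable character.
try:
    "\u03bcTorrent".encode("latin-1", "surrogateescape")
    mu_encodable = True
except UnicodeEncodeError:
    mu_encodable = False

print(ascii(decoded_path), round_trip, mu_encodable)
```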
msg126666 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2011-01-21 00:02
> There is also issue c) what if the filesystem encoding can only
> represent a compatibility character, say U+00B5, but not its NFKC
> equivalent, U+03BC?

That should be considered as similar to file systems that just cannot
represent certain characters at all - e.g. many of the non-ASCII
characters, or no upper-case letters. If you have such a file system,
you just cannot use these characters in a module name. Rename your
modules, then, or put the modules in a zipfile (or use some other
import hook).

> However, this code will always fail because '\xB5Torrent' will be
> normalized into '\u03BCTorrent' and a file named '\u03BCTorrent.py'
> cannot be created on a filesystem with Latin-1 encoding.

Tough luck. The filesystem just doesn't support GREEK SMALL LETTER MU,
just as it doesn't support all the other greek characters.

It may be fun coming up with these border cases. But I really don't
see a need to support them. If you really need to have that letter
in a module name, reformat your disk with a better file system.
msg126669 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2011-01-21 00:07
> Only Mac OS X and the HFS+ filesystem normalize filenames (to a variant
> of NFD). But such normalization is a good thing! I mean that I don't
> think that we have anything to do for that.

That may well be - I don't have a case where this would cause problems,
either.

> We should at least document this surprising behaviour in the import
> documentation.

There are also better ways to support the user than mere
documentation. For example, the exception message could be more
helpful, and IDLE could warn the user when saving the file in the
first place.
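
One way such a hint could be computed is sketched below. The function `suggest_misnamed_module` is hypothetical and not part of the import system; it takes a list of file names so that it does not depend on any particular filesystem.

```python
import unicodedata


def suggest_misnamed_module(module_name, filenames):
    # Return .py files whose name matches module_name only after NFKC
    # normalization, i.e. files the import statement can never find as named.
    wanted = unicodedata.normalize("NFKC", module_name)
    hits = []
    for filename in filenames:
        stem, dot, ext = filename.rpartition(".")
        if not dot or ext != "py":
            continue
        if stem != wanted and unicodedata.normalize("NFKC", stem) == wanted:
            hits.append(filename)
    return hits


candidates = suggest_misnamed_module(
    "\u00b5Torrent", ["\u00b5Torrent.py", "other.py", "README"]
)
print([ascii(name) for name in candidates])
```

An ImportError message or an editor warning could then mention the returned file names and the normalized name that Python actually looks for.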

> << WARNING: Non-ASCII characters in module names are normalized to NFKC
> by the Python parser ([PEP 3131]). For example, import µTorrent (µ: U
> +00B5) is normalized to import μTorrent (μ: U+03BC): Python will try to
> open "\u03BCTorrent.py" (or "\u03BCTorrent/__init__.py"), and not
> "\xB5Torrent.py" (or "\xB5Torrent/__init__.py"). >>

I can't believe this is a real problem. I'd defer warning about made-up
problems until real users report them as a real problem.

> I disagree.

If you disagree strongly, please write a PEP.
msg127183 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2011-01-27 12:35
It looks like there is nothing interesting to do here, so I close the issue (which is not a bug :-)).
msg182911 - (view) Author: Atsuo Ishimoto (ishimoto) * Date: 2013-02-25 02:51
Converting identifiers to NFKC is problematic when working with FULLWIDTH letters such as 'ａ' (FULLWIDTH LATIN SMALL LETTER A).

We can create a module named 'ａａａ.py', but this module cannot be imported on any platform I know.

>>> import ａａａ
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: No module named 'aaa'

Talking about the Japanese environment, I don't see the benefit of normalizing variable names. FULLWIDTH/HALFWIDTH compatibility characters are commonly used here, and they are recognized as different characters. It may be too late to argue, but converting to normal form NFC instead of NFKC would be better. Python distinguishes lowercase and uppercase letters, but doesn't distinguish fullwidth and halfwidth. This is pretty surprising behavior to me.
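
The fullwidth case can be reproduced with a short sketch, assuming CPython 3: NFKC folds the fullwidth letter into ASCII, NFC leaves it alone, and a name written in fullwidth letters ends up stored under its ASCII spelling.

```python
import unicodedata

fullwidth_a = "\uff41"  # FULLWIDTH LATIN SMALL LETTER A

nfkc_form = unicodedata.normalize("NFKC", fullwidth_a)
nfc_form = unicodedata.normalize("NFC", fullwidth_a)

# Assign to a variable spelled with three fullwidth letters.
namespace = {}
exec("\uff41\uff41\uff41 = 1", namespace)

print(ascii(nfkc_form), ascii(nfc_form), "aaa" in namespace)
```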
History
Date  User  Action  Args
2022-04-11 14:57:11  admin  set  github: 55161
2013-02-25 03:03:01  ezio.melotti  set  nosy: + ezio.melotti; versions: - Python 3.1
2013-02-25 02:51:34  ishimoto  set  messages: + msg182911
2013-02-25 00:26:54  ishimoto  set  nosy: + ishimoto
2011-01-27 12:35:25  vstinner  set  status: open -> closed; messages: + msg127183; resolution: not a bug; nosy: loewis, belopolsky, vstinner
2011-01-21 00:07:58  loewis  set  nosy: loewis, belopolsky, vstinner; messages: + msg126669
2011-01-21 00:02:06  loewis  set  nosy: loewis, belopolsky, vstinner; messages: + msg126666
2011-01-20 22:23:20  vstinner  set  nosy: loewis, belopolsky, vstinner; messages: + msg126656
2011-01-20 17:47:46  belopolsky  set  nosy: loewis, belopolsky, vstinner; messages: + msg126632
2011-01-20 13:06:43  vstinner  set  nosy: loewis, belopolsky, vstinner; messages: + msg126602
2011-01-20 11:07:22  vstinner  set  nosy: loewis, belopolsky, vstinner; messages: + msg126597
2011-01-20 06:40:20  belopolsky  set  nosy: loewis, belopolsky, vstinner; messages: + msg126592
2011-01-20 06:19:02  loewis  set  nosy: loewis, belopolsky, vstinner; messages: + msg126590
2011-01-20 04:11:42  belopolsky  set  nosy: + loewis; messages: + msg126587
2011-01-20 03:02:17  vstinner  set  nosy: belopolsky, vstinner; messages: + msg126584
2011-01-20 02:31:42  belopolsky  set  nosy: belopolsky, vstinner; messages: + msg126583
2011-01-20 02:22:07  vstinner  set  nosy: belopolsky, vstinner; messages: + msg126582
2011-01-20 02:21:38  vstinner  set  nosy: belopolsky, vstinner; messages: + msg126581
2011-01-20 02:18:38  belopolsky  set  nosy: belopolsky, vstinner; messages: + msg126580
2011-01-20 02:00:16  vstinner  set  nosy: belopolsky, vstinner; messages: + msg126579
2011-01-20 01:56:08  belopolsky  set  nosy: + belopolsky
2011-01-20 01:54:54  vstinner  create