classification
Title: "Full unicode import system" not in 3.2
Type:
Stage: resolved
Components: Documentation
Versions: Python 3.2
process
Status: closed
Resolution: out of date
Dependencies:
Superseder:
Assigned To: docs@python
Nosy List: docs@python, eric.araujo, georg.brandl, jh45, serhiy.storchaka, tchrist, vstinner
Priority: normal
Keywords:

Created on 2011-02-17 07:10 by jh45, last changed 2017-07-19 05:20 by serhiy.storchaka. This issue is now closed.

Messages (9)
msg128711 - Author: John (jh45) Date: 2011-02-17 07:10
A few months ago I read that in 3.2 it would be possible to import modules located on paths containing any Unicode character (more precisely, characters not in the local code page).

After an hour or two trying to get this to work in 3.2rc3, I went looking for clues, and found these 2 messages in which Victor Stinner says this feature is delayed until Python 3.3:
http://bugs.python.org/issue3080#msg126514
http://bugs.python.org/issue10828#msg125787

Could you please make it clear in documentation and web pages, that this feature is not working yet. 

The Python 3.2 download page includes this:
"countless fixes regarding bytes/string issues; among them full support for a bytes environment (filenames, environment variables)"
and I guessed this must cover importing from any unicode path, as there was no mention that such importing had been abandoned for this version.

-- jh
msg128724 - Author: STINNER Victor (vstinner) * (Python committer) Date: 2011-02-17 11:29
Short answer:

In Python 3.2, « import héhé » doesn't work on Windows, but you can have non-ASCII paths in sys.path.

Longer answer:

I fixed the import machinery to correctly handle non-ASCII characters in module *paths*. But the import machinery is unable to handle non-ASCII characters in module *names*: it fails if the filesystem encoding is not UTF-8 (e.g. it fails on Windows). There is another exception: Python doesn't (yet) support unencodable module paths on Windows. On Windows, you can use any character in directory names, but Python 3.2 encodes paths to the filesystem encoding (the ANSI code page), which is a smaller charset. In practice, this Windows-specific limitation on module paths doesn't really matter.
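
To illustrate the code page limitation described above, here is a minimal sketch that checks whether a path can be encoded in a given ANSI code page. The helper name `encodable_in` and the directory path are hypothetical, chosen only for illustration; `cp1252` and `cp1253` are the Western European and Greek Windows code pages.

```python
def encodable_in(path, codepage):
    """Return True if every character of `path` exists in `codepage`."""
    try:
        path.encode(codepage)
    except UnicodeEncodeError:
        return False
    return True


# Hypothetical module directory whose name contains Greek letters.
greek_dir = "C:\\modules\\\u03b1\u03b2\u03b3"

# On a Western European (English) Windows the ANSI code page is cp1252,
# which has no Greek letters, so this path cannot be encoded.
print(encodable_in(greek_dir, "cp1252"))

# On a Greek Windows the ANSI code page is cp1253, which does have them.
print(encodable_in(greek_dir, "cp1253"))
```

The same path is therefore usable or not depending on the code page of the machine, which is the situation Python 3.2 is in when it encodes module paths on Windows.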

I plan to fix all these issues in Python 3.3: see #3080.

--

> Could you please make it clear in documentation and web pages,
> that this feature is not working yet. 

What's New in Python 3.2 documentation has this sentence: "Python’s import mechanism can now load modules installed in directories with non-ASCII characters in the path name. This solved an aggravating problem with home directories for users with non-ASCII characters in their usernames." which is correct.

Which web page should be updated/fixed?
msg128774 - Author: John (jh45) Date: 2011-02-18 04:45
Victor asked "Which web page should be updated/fixed?"

Answer: The Python 3.2 download page.

But what should it say?

The main point is that people like me, who remember seeing a statement about this a few months ago, may expect unicode to work in every conceivable situation, and a prominent warning that it's not *all* fixed yet, with a link to details in the documentation, would save them from trying things that don't work.

By the way, I hadn't grasped a simple point from issue 3080: I tested on *English* Windows by putting a Greek character in the path to some Python modules. But the usual situation is where a *Greek* version of Windows has some Greek characters in the path, and from what you just wrote, that's OK now.

-- jh
msg131599 - Author: STINNER Victor (vstinner) * (Python committer) Date: 2011-03-21 01:22
>> Victor asked "Which web page should be updated/fixed?"
> Answer: The Python 3.2 download page.

Sorry, but I don't see which page says that Python 3.2 has full Unicode support for import. In http://www.python.org/download/releases/3.2/, I can read "countless fixes regarding bytes/string issues; among them full support for a bytes environment (filenames, environment variables)".

"full support for a bytes environment" means that Python 3.2 has been fixed on UNIX to support undecodable filenames, but not that Python 3.2 supports unencodable filenames on Windows.
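
As a rough illustration of what "undecodable filenames" means on UNIX, the sketch below uses the `surrogateescape` error handler (PEP 383), which is the mechanism Python 3 uses to carry such bytes through `str`. The filename bytes are a made-up example, and UTF-8 is assumed as the filesystem encoding.

```python
# A filename encoded in Latin-1: the byte 0xE9 is not valid UTF-8 here.
raw = b"caf\xe9.py"

# A strict decode would raise UnicodeDecodeError; surrogateescape maps the
# undecodable byte 0xE9 to the lone surrogate U+DCE9 instead.
name = raw.decode("utf-8", "surrogateescape")
print(ascii(name))

# Encoding with the same handler restores the original bytes exactly,
# so the file can still be opened by its real on-disk name.
round_trip = name.encode("utf-8", "surrogateescape")
print(round_trip == raw)
```

This round trip covers bytes that cannot be decoded. It does nothing for the opposite Windows case, where characters cannot be encoded to the ANSI code page.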

Can you propose a sentence which is more clear about bytes/Unicode?

Python 3.3 will have full Unicode support for modules: issue #3080 is already fixed, and I think that #11619 can be fixed (maybe not easily).
msg139923 - Author: John (jh45) Date: 2011-07-06 05:48
Sorry for the long delay.

haypo wrote:
  Can you propose a sentence which is more clear about bytes/Unicode?

On this page:
http://www.python.org/download/releases/3.2/
is this line:
"- countless fixes regarding bytes/string issues; among them full support for a bytes environment (filenames, environment variables)"

How about adding to that line something like:
" on UNIX; but on Windows the path to and name of each module you import can contain only characters that are in the ANSI codepage that your Windows is using"

and maybe
" (will be fixed in Python 3.3)"

and maybe (or not) also something like:
" (ANSI codepage = basic latin + other characters of only your own language group)"

-- jh
msg141937 - Author: Tom Christiansen (tchrist) Date: 2011-08-12 02:36
How does this work for modules that have filesystem names different from the one used for import? The issue I'm thinking about is that the Mac HFS+ filesystem keeps its Unicode filenames in (close to) NFD form. That means that a module named "caf\N{LATIN SMALL LETTER E WITH ACUTE}" with 4 graphemes and 4 code points in its name winds up in the filesystem as "cafe\N{COMBINING ACUTE ACCENT}" still with 4 graphemes but now with 5 code points.

I believe (well, suspect; I have empirical evidence not proof) Python stores its own identifiers in NFD, so this may not be quite as much of a problem as it might otherwise be.  Nonetheless, I have had users complain about what HFS+ does with such filenames, although I am not quite sure why. I think it’s because they access a file with 4 chars but they need a 5-char fileglob to wildcard it, so touch "caf\N{LATIN SMALL LETTER E WITH ACUTE}" and then you need a wildcard of "?????" with an extra ? to find it. Kinda weird.
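
The code point counts and the wildcard behaviour described above can be reproduced with a short sketch. It uses standard NFD from `unicodedata` as a stand-in for the HFS+ variant (an assumption; the two agree for this particular name) and `fnmatch.fnmatchcase` in place of a shell glob.

```python
import fnmatch
import unicodedata

# The name as typed: 4 code points, with a precomposed e-acute.
composed = "caf\N{LATIN SMALL LETTER E WITH ACUTE}"

# Roughly what ends up on disk: "e" followed by a combining acute accent.
decomposed = unicodedata.normalize("NFD", composed)

print(len(composed), len(decomposed))

# A 4-character wildcard matches the typed name but not the stored one;
# the stored name needs one extra "?".
print(fnmatch.fnmatchcase(composed, "????"))
print(fnmatch.fnmatchcase(decomposed, "????"))
print(fnmatch.fnmatchcase(decomposed, "?????"))
```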
msg141953 - Author: Tom Christiansen (tchrist) Date: 2011-08-12 13:13
Whoops, I meant that it appears that Python runs its identifiers through NFC.  How that gets along with a filesystem that has quasi-NFD filenames I'm not sure, but it seems like it might be a variant of the case-insensitivity issue in filenames.
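
One way to check the identifier behaviour empirically is the sketch below. It assigns to a name written in decomposed form and then looks at the key actually stored in the namespace. PEP 3131 specifies NFKC for identifiers, which gives the same result as NFC for this name. The variable names are illustrative only.

```python
import unicodedata

composed = "caf\N{LATIN SMALL LETTER E WITH ACUTE}"
decomposed = unicodedata.normalize("NFD", composed)  # 5 code points

# Execute "<decomposed name> = 1" and collect the names it defined.
namespace = {}
exec(decomposed + " = 1", namespace)
stored = [key for key in namespace if not key.startswith("__")]

# The parser normalized the identifier before storing it, so the key is
# the composed 4-code-point form even though the source used 5.
print([ascii(key) for key in stored])
print(stored == [composed])
```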
msg141954 - Author: STINNER Victor (vstinner) * (Python committer) Date: 2011-08-12 13:18
> The issue I'm thinking about is that the Mac HFS+ filesystem

There is no issue with HFS+ normalization. The kernel "normalizes" filenames to its own variant, Python doesn't have to care about this.

When you write "import h<é normalized to NFC>" or "import h<é normalized to NFD>", Python tries to open "h<é normalized to NFC>.py"; the HFS+ filesystem then does its own normalization (=> "h<é normalized to its variant of NFD>.py").
msg298632 - Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2017-07-19 05:20
Python 3.2 is out of maintenance. Full Unicode path support was added in Python 3.3 by issue3080.
History
Date                 User              Action  Args
2017-07-19 05:20:50  serhiy.storchaka  set     status: open -> closed; nosy: + serhiy.storchaka; messages: + msg298632; resolution: out of date; stage: resolved
2011-08-12 13:18:17  vstinner          set     messages: + msg141954
2011-08-12 13:13:09  tchrist           set     messages: + msg141953
2011-08-12 02:36:31  tchrist           set     nosy: + tchrist; messages: + msg141937
2011-07-08 00:46:03  ned.deily         set     nosy: + georg.brandl
2011-07-06 05:48:04  jh45              set     messages: + msg139923
2011-03-23 12:29:31  eric.araujo       set     nosy: + eric.araujo
2011-03-21 01:22:19  vstinner          set     nosy: vstinner, docs@python, jh45; messages: + msg131599
2011-02-18 04:45:49  jh45              set     nosy: vstinner, docs@python, jh45; messages: + msg128774
2011-02-17 11:29:39  vstinner          set     nosy: vstinner, docs@python, jh45; messages: + msg128724
2011-02-17 11:18:48  pitrou            set     nosy: + vstinner
2011-02-17 07:10:35  jh45              create