classification
Title: [Python2] Use utf-8 in the import machinery on Windows to support unicode paths
Type: Stage: resolved
Components: Interpreter Core Versions: Python 2.6
process
Status: closed Resolution: out of date
Dependencies: Superseder:
Assigned To: Nosy List: BreamoreBoy, anthonybaxter, brett.cannon, eric.araujo, ezio.melotti, kristjan.jonsson, loewis, nnorwitz, theller, vstinner
Priority: normal Keywords: patch

Created on 2006-09-05 18:11 by kristjan.jonsson, last changed 2010-10-19 02:21 by vstinner. This issue is now closed.

Files
File name Uploaded Description
Unicodeimport3.patch kristjan.jonsson, 2006-09-05 18:11 patch for 2.6 to provide unicode imports
Unicodeimport4.patch kristjan.jonsson, 2007-04-17 10:38 An updated patch for unicode import
Messages (23)
msg51081 - (view) Author: Kristján Valur Jónsson (kristjan.jonsson) * (Python committer) Date: 2006-09-05 18:11
This patch modifies the import mechanism to fully
support unicode pathnames on Windows.  It does this by
first converting each member of sys.path to utf-8.
Non-unicode strings are taken to be encoded using the current locale.

The whole of the import logic is then unchanged and
works on the utf-8 strings as though they were regular
ascii strings in the current locale.

Only when file operations are done, such as stat() and
open(), do we then convert from utf-8 back  to unicode
and use the Windows unicode APIs for the job.  This is
also done when initializing Module objects.

This approach has the benefit of having a low
impact on the importing logic, and is thus easy to
verify.  There is however some overhead with the
conversions.
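
The round trip described in this message can be sketched in a few lines of Python 3. This is only an illustration of the idea, not the patch itself (which changes the C import code): the helper names `to_internal` and `find_module_file` are hypothetical, and the locale encoding is passed in explicitly as an assumption.

```python
import os


def to_internal(entry, locale_encoding="latin-1"):
    """Normalise a path entry to UTF-8 bytes, the intermediate form.

    Byte strings are assumed to be in `locale_encoding` (an assumption of
    this sketch); text strings are encoded directly.
    """
    if isinstance(entry, bytes):
        entry = entry.decode(locale_encoding)
    return entry.encode("utf-8")


def find_module_file(internal_dirs, module_name):
    """Search UTF-8 encoded directories for `module_name`.py.

    All path manipulation happens on bytes; text is only recovered right
    before touching the file system.
    """
    sep = os.sep.encode("ascii")
    name = module_name.encode("utf-8")
    for directory in internal_dirs:
        candidate = directory.rstrip(sep) + sep + name + b".py"
        text_path = candidate.decode("utf-8")  # back to unicode for the OS call
        if os.path.isfile(text_path):
            return text_path
    return None
```

The search logic never needs to know that the bytes are UTF-8; only the two conversion points do, which is why the approach touches so little of the import code.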

At CCP games we used this approach, backported to
python 2.3, to get unicode imports working for our
game, EVE Online, thereby solving installation
issues in the Far East.


This patch is submitted as demonstration code to the
python community.  I would like to see unicode fully
supported in 2.6.

Cheers,
Kristján
msg51082 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2006-09-08 21:03
What is the value of the __file__ attribute of a module when
this patch is used?
msg51083 - (view) Author: Kristján Valur Jónsson (kristjan.jonsson) * (Python committer) Date: 2006-09-09 11:38
Off the top of my head, it is now unicode.  I considered
trying to convert it back to the default encoding but
decided not to, to keep the patch brief.
msg51084 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2006-09-09 12:31
First: Do you want to continue to work on this, or do you
consider this just "demonstration code" (i.e. not
contributed for inclusion in Python), hoping that somebody
else implements this feature?

I think the behavior of __file__ must be more consistent
across platforms, and the selected behaviour must be
documented somewhere. Several definitions of "consistent
behavior" come to mind:
1. __file__ is always a Unicode string
2. __file__ is a byte string if it's ASCII, else Unicode
3. __file__ is a byte string if it's in the system encoding,
else Unicode
4. __file__ is a byte string if it's in the file system
encoding, else Unicode.

The documentation needs to be updated in several places,
e.g. also for inspect.getfile.

I would expect that pydoc would also need to be updated.

Selecting from the options above: I believe 4 is most
compatible with previous versions; 1 and 2 are most
convenient to work with in applications like pydoc which
have to generate HTML (1 is easier to work with, 2 is more
compatible with previous versions).
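
The four candidate definitions can be compared side by side with a small Python 3 sketch. The function name `choose_file_value` and its default encodings are hypothetical, chosen only for illustration; `str` stands in for a Unicode string and `bytes` for a byte string.

```python
def choose_file_value(path, option, system_encoding="ascii", fs_encoding="utf-8"):
    """Return the value __file__ would take under options 1 to 4.

    `path` is a text string. Option 1 always keeps it as text; options 2-4
    return bytes when the path fits the relevant encoding, else the text.
    """
    if option == 1:
        return path
    encoding = {2: "ascii", 3: system_encoding, 4: fs_encoding}[option]
    try:
        return path.encode(encoding)
    except UnicodeEncodeError:
        return path  # not representable: fall back to Unicode
```

With a path such as `C:\café\mod.py` and a latin-1 file system encoding, option 2 yields a Unicode string while option 4 yields a byte string, which is the platform dependence discussed in the replies.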
msg51085 - (view) Author: Kristján Valur Jónsson (kristjan.jonsson) * (Python committer) Date: 2006-09-12 09:38
I submitted this mostly as a demonstration.  I don't think
the approach is necessarily suitable for a final
implementation because of the use of utf-8 as an
intermediate representation and the price of the conversions
that keep happening.  But perhaps this is the way to go, if
we consider utf-8 to be a stage-1 default file system
encoding for win32.

I also agree that 4 is probably the most sensible approach.
 What about discrepancies between e.g. linux and windows
then, when including from a non-trivial path?  On linux we
would get utf-8, on windows unicode?

1) would actually make a lot of sense, only in my experience
this tends to lead to a kind of unicode-hell since a program
touched by one unicode object tends to have it percolating
down into every corner.
msg51086 - (view) Author: Anthony Baxter (anthonybaxter) (Python triager) Date: 2006-09-12 11:29
There's a variety of modules in the standard library that
reference __file__ - if it's potentially going to be a
unicode string, these are going to need to be checked, as
are their callers :-/

(Now that I've looked closer at some of the issues, I'm
extremely glad this didn't go into 2.5 final at this late stage)
msg51087 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2006-09-12 20:17
krisvale: indeed, option 4 is platform dependent. Notice
that on Linux, the file system encoding won't necessarily be
UTF-8. Instead, the value depends on the locale, so it may
be latin-1, latin-9, gb2312, ... This makes it even more
dependent on the platform, and even the current user being
logged in (such is life with locale-based approaches; the
same is mostly true for Windows: "mbcs" can mean nearly
anything).
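
To make the locale dependence concrete, here is a minimal Python 3 sketch. The helper `encode_for` is hypothetical; `sys.getfilesystemencoding()` is the standard call that reports the encoding in effect for the current process.

```python
import sys


def encode_for(path, encodings):
    """Encode one text path under several candidate file system encodings.

    Returns a dict mapping each encoding name to the resulting bytes, or to
    None when the path cannot be represented in that encoding.
    """
    result = {}
    for name in encodings:
        try:
            result[name] = path.encode(name)
        except UnicodeEncodeError:
            result[name] = None
    return result


# The encoding that applies to this process; it varies with platform and locale.
current_fs_encoding = sys.getfilesystemencoding()
```

The same directory name therefore becomes different byte strings, or no byte string at all, depending on who is logged in and how their locale is configured.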

option 1) is Py3k-safe, where path names will be Unicode
strings always. As you say, Unicode is a virulent type, so
this approach would need a wide consensus.

I'm personally leaning towards option 2: it is nearly
backwards compatible, except for obscure cases where people
have mbcs-encodable entries in sys.path already, and it is
independent of manipulations of the system encoding.

I also think that processing of PYTHONPATH should take
Unicode into account, i.e. we should use _wgetenv to access
PYTHONPATH in 2.6. That would make the feature truly useful,
as then people could actually set sys.path to non-mbcs
directories from the outside. Notice that W9x support can
be dropped in 2.6, so a W9x-compatible solution won't be
required.

In any case, I'd like to encourage you to continue working
on this issue. I, too, like to see it in 2.6, but I did so
ever since 2.1 or so (before PEP 277 was implemented), and
it was wishful thinking. Somebody has to take action, and it
is likely that it won't be one of the past regular contributors
(or else they would have contributed it long ago - although I think
Thomas Heller had something working at one point).
msg51088 - (view) Author: Kristján Valur Jónsson (kristjan.jonsson) * (Python committer) Date: 2007-04-17 10:38
I have uploaded Unicodeimport4.patch, which simplifies this a bit.  __file__ and __path__ components are now stored in the filesystem encoding if possible, and non-unicode paths are assumed to be in the filesystem encoding.  This minimizes the impact of the change.
File Added: Unicodeimport4.patch
msg51089 - (view) Author: Neal Norwitz (nnorwitz) * (Python committer) Date: 2007-04-19 07:12
Any function which is not static to a file, must be prefixed with Py or _Py.  There are several lines which are over 80 columns and should be wrapped.  Why is errno set in open_utf8, etc?

Indentation was messed up at least in one place in Objects/moduleobject.c on a DECREF line.

I can't provide any guidance on the Windows-specific code.  Where do _wstat and _wfopen come from?  There isn't a man page on my Unix box.  I'm not sure if they exist in a library anywhere.  I didn't see any changes to configure to verify if these exist or not.  If Py_UNICODE_IMPORT is defined, does that necessarily mean these APIs exist?  (It's possible this code was inside an #if WINDOWS and I couldn't tell from the patch.)
msg81636 - (view) Author: Kristján Valur Jónsson (kristjan.jonsson) * (Python committer) Date: 2009-02-11 10:33
Ah, this one is still alive?
We still use this patch at CCP for our 2.x python.  I'll give it some 
more love to answer the issues raised.
Hm, is this still an issue with 3.x?  Does the import machinery use 
unicode as the internal format when working with the import paths?
msg114818 - (view) Author: Mark Lawrence (BreamoreBoy) * Date: 2010-08-24 20:29
I think #9425 supercedes this. Am I correct?
msg114820 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-08-24 20:38
> I think #9425 super*s*edes this. Am I correct?

#8611 or #9425, as you want. Anyway, I'm working on this topic and I will try to fix it before Python 3.2 release.
msg114830 - (view) Author: Kristján Valur Jónsson (kristjan.jonsson) * (Python committer) Date: 2010-08-24 21:06
Possibly.  I made a comment in issue 9425 explaining the particular trick that this here patch makes (using utf-8 as an intermediate form to avoid having to change all the machinery in import.c)
msg115283 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-08-31 22:43
The utf-8 codec (in strict mode) rejects surrogates in Python 3, so your patch doesn't support undecodable filenames (filenames decoded using the surrogateescape error handler, which produces surrogate characters). It may be possible if you use surrogateescape everywhere.

Manipulating encoded filenames is not trivial because it may quickly lead to mojibake if the encodings are different (e.g. if sys.path contains a bytes filename, you have to be careful). Using utf-8 means that you have to decode and then re-encode (to the filesystem encoding) a filename before passing it to a system call (e.g. mkdir()). The problem in #8611 is that Python 3 doesn't work if the filesystem encoding is *not* utf-8.

Your solution is attractive because it is short, but I prefer to go directly to the right solution, so as not to patch Python twice: use unicode (with surrogates, PEP 383, for undecodable filenames) everywhere.
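
A short Python 3 sketch of the two hazards named in this message: the decode and re-encode step needed before a system call, and the mojibake that results when bytes are interpreted in the wrong encoding. The helper names `utf8_to_fs` and `misread` are hypothetical.

```python
def utf8_to_fs(utf8_path, fs_encoding):
    """Convert a UTF-8 encoded path to the file system encoding.

    This is the extra step a UTF-8 intermediate form requires before the
    bytes can be handed to a byte-oriented system call.
    """
    return utf8_path.decode("utf-8").encode(fs_encoding)


def misread(raw, wrong_encoding):
    """Decode bytes with an encoding they were not written in (mojibake)."""
    return raw.decode(wrong_encoding)
```

For instance, the UTF-8 bytes for "café" must become `b"caf\xe9"` on a latin-1 file system; passing them through unchanged would make the system see the four-character-plus-garbage name "cafÃ©" instead.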
msg115284 - (view) Author: Kristján Valur Jónsson (kristjan.jonsson) * (Python committer) Date: 2010-09-01 01:23
I confess that I didn't follow the utf-8/surrogate discussion.
But the utf-8 encoding can encode all valid unicode characters:

UTF-8 may only legally be used to encode valid Unicode scalar values. According to the Unicode standard the high and low surrogate halves used by UTF-16 (U+D800 through U+DFFF) and values above U+10FFFF are not legal Unicode values, and the UTF-8 encoding of them is an invalid byte sequence and should be treated as described above. (from wikipedia)

If we encounter surrogate halves when encoding (unicode) to utf-8, it means that we are really trying to decode utf-16 and re-encode it as utf-8 (and that python is using 16 bits for its unicode chars).  The utf-8 codec should be smart enough to merge the surrogates into a utf-32 char, and encode that.
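
The merging of a surrogate pair into a single code point follows the standard UTF-16 arithmetic. A minimal Python 3 sketch, where `combine_surrogates` is a hypothetical helper name:

```python
def combine_surrogates(high, low):
    """Merge a UTF-16 surrogate pair into the code point it represents."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid high/low surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


def utf8_of_pair(high, low):
    """Encode the merged code point as UTF-8."""
    return chr(combine_surrogates(high, low)).encode("utf-8")
```

A valid pair always yields a code point above U+FFFF, which UTF-8 encodes as four bytes. A lone surrogate has no such partner, which is the case the rest of the thread argues about.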

Anyway, as you remark, my approach is a _patch_, designed to make python (2.x) work in a unicode environment, with the least amount of code change, for those willing to commit such a patch.  In 3.x you may want to do things differently.
msg115329 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-09-01 19:33
> According to the Unicode standard the high and low surrogate halves used
> by UTF-16 (...)

Yes, but in Python, the U+DC80..U+DCFF range is used to store undecodable bytes. 
E.g. b'abc\xff'.decode('ascii', 'surrogateescape') gives 'abc\udcff'.
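
The behaviour can be checked directly in Python 3. In this sketch the helper `strict_utf8_accepts` is a hypothetical name; the codec calls themselves are standard.

```python
raw = b"abc\xff"

# The undecodable byte 0xFF is smuggled into the string as the lone
# surrogate U+DCFF by the surrogateescape error handler.
text = raw.decode("ascii", "surrogateescape")


def strict_utf8_accepts(value):
    """Report whether the strict UTF-8 codec can encode `value`."""
    try:
        value.encode("utf-8")
    except UnicodeEncodeError:
        return False
    return True


# Encoding with the same error handler restores the original bytes.
restored = text.encode("utf-8", "surrogateescape")
```

Strict UTF-8 refuses `text`, while encoding with surrogateescape gives back `raw` unchanged, so a UTF-8 intermediate form only round-trips such filenames if the handler is used at every conversion.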

> Anyway, as you remark, my approach is a _patch_, designed to make python
> (2.x) work in a unicode environment, with the least amount of code
> change, for those willing to commit such a patch.

Python 2.7 is out and I think it is too late to fix Python2. Anyway, Python2 
uses bytes for sys.path or other paths, so the problem only occurs if the user 
specifies unicode paths.

> In 3.x you may want to do things differently.

I chose to rewrite the C code to manipulate unicode paths instead of byte 
paths => #9425
msg115354 - (view) Author: Kristján Valur Jónsson (kristjan.jonsson) * (Python committer) Date: 2010-09-02 01:37
> Yes, but in Python, the U+DC80..U+DCFF range is used to store undecodable bytes. 
> E.g. b'abc\xff'.decode('ascii', 'surrogateescape') gives 'abc\udcff'.

That's an inventive way of breaking the unicode standard :)
Anyway, why would you worry about that?  My patch doesn't use "surrogateescape" so there is no problem.  There are only two places where I "decode":  
1) module names and sys.path components in the system file encoding:  If they contain undecodable characters, then that is an error.  No reason to propagate that error into the import machinery.
2) when decoding utf-8 back into unicode, but that utf-8 is already legal since _we_ generated it.

If a _unicode_ input (sys.path) contains a valid surrogate pair, then the utf-8 encoder just encodes it.
But if it finds a lone surrogate as you describe (python special) then that represents an undecodable character, something that should have been covered earlier and something we know nothing about.  Clearly, that makes that particular unicode sys.path component invalid.

(Hm, I notice that 2.7 happily encodes lone surrogates to utf-8)

> Python 2.7 is out and I think it is too late to fix Python2. Anyway, Python2 
> uses bytes for sys.path or other paths, so the problem only occurs if the user 
> specifies unicode paths.
Which is precisely the case that it is designed to solve.  When a Chinese user installs EVE Online in a weird folder, then that should work.
Also, 2.x is not quite dead yet.  There are quite a few people doing their own patches for their private purposes.  Although my patch won't go into any official version, there might be others in the same situation as us:  trying to support an _embedded_ python 2.x version in an internationalized environment (on Windows :)
msg115553 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-09-04 01:00
Oh, I didn't see that the issue was specific to Python2. I updated the issue's title. If I understood correctly, the issue is also specific to Windows.

Do you know if your patch changes the public API (i.e. breaks compatibility)?

--

FYI about Python3:

> That's an inventive way of breaking the unicode standard :)

It is described in PEP 383 and it does solve a real and common issue: storing a filename that cannot be decoded with the filesystem encoding. The operation is reversible. In Python 3.2, there are os.fsdecode() and os.fsencode() functions. On UNIX/BSD, os.fsencode(os.fsdecode(x)) is x, if x is a bytes object.

PEP 383 introduces the surrogateescape error handler, which creates surrogates on decode and converts surrogates back to bytes on encode.
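
The reversibility claim can be tried out with the two functions named above. A minimal Python 3 sketch, assuming a UNIX-like system where the file system error handler is surrogateescape; `roundtrip` is a hypothetical helper name.

```python
import os


def roundtrip(raw):
    """Decode a bytes filename to text and encode it back again."""
    return os.fsencode(os.fsdecode(raw))


# Assumption: on UNIX/BSD this holds even for bytes that are not valid in
# the file system encoding, because undecodable bytes become lone surrogates.
sample = b"abc\xff"
```

On such a system `roundtrip(sample)` returns the original bytes, so a filename can travel through the interpreter as text without losing information.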

> Anyway, why would you worry about that? My patch doesn't use
> "surrogateescape" so there is no problem.

In Python3, filenames are stored as unicode. On UNIX/BSD, if a filename cannot be decoded, it is encoded with surrogates. To get full unicode support in Python3, you have to support surrogates.
msg115575 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2010-09-04 14:44
As this was never meant for inclusion in Python, and apparently confuses people, I'm closing it - it couldn't go into 2.x, anyway.
msg115683 - (view) Author: Kristján Valur Jónsson (kristjan.jonsson) * (Python committer) Date: 2010-09-06 01:47
Well, it was, originally, but it met with so little interest that I couldn't be bothered to polish it to inclusion standards.  Anyway, there was the incompatibility problem of what to do with the __file__ attribute, and the fact that the patch was Windows only.

Do we have a place where we can put in working patches for people to use at their own risk, without going through all the hoops of a successful python.org checkin?
msg115686 - (view) Author: Éric Araujo (eric.araujo) * (Python committer) Date: 2010-09-06 01:58
There is no such place that I know of, sorry.
msg115691 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2010-09-06 06:25
Having patches in the tracker is fine with me. Even if the issue is closed, the patch is still available.

Of course, there are many ways to publish code on the net: you could post the patch to Rietveld, to the Python wiki, or publish an entire clone to bitbucket.
msg119109 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-10-19 02:21
FYI, I finished my work on non-ascii filenames in Python 3.2 (#8611, #9425): Python 3.2 now supports any filename with any locale (filesystem) encoding.
History
Date User Action Args
2010-10-19 02:21:59 vstinner set messages: + msg119109
2010-09-06 06:25:25 loewis set messages: + msg115691
2010-09-06 01:58:21 eric.araujo set messages: + msg115686; stage: resolved
2010-09-06 01:47:38 kristjan.jonsson set messages: + msg115683
2010-09-04 14:44:13 loewis set status: open -> closed; resolution: out of date; messages: + msg115575
2010-09-04 01:00:39 vstinner set messages: + msg115553; title: Unicode Imports -> [Python2] Use utf-8 in the import machinery on Windows to support unicode paths
2010-09-02 01:37:12 kristjan.jonsson set messages: + msg115354
2010-09-01 19:34:10 eric.araujo set nosy: + eric.araujo
2010-09-01 19:33:12 vstinner set messages: + msg115329
2010-09-01 01:23:13 kristjan.jonsson set messages: + msg115284
2010-08-31 22:43:50 vstinner set messages: + msg115283
2010-08-24 21:06:03 kristjan.jonsson set messages: + msg114830
2010-08-24 20:38:54 vstinner set nosy: + vstinner; messages: + msg114820
2010-08-24 20:29:43 BreamoreBoy set nosy: + BreamoreBoy; messages: + msg114818
2009-04-01 18:41:18 brett.cannon set assignee: brett.cannon ->
2009-02-11 11:31:20 theller set nosy: + theller
2009-02-11 10:33:17 kristjan.jonsson set messages: + msg81636
2009-02-11 03:13:12 ajaksu2 set assignee: brett.cannon; nosy: + brett.cannon, ezio.melotti
2006-09-05 18:11:31 kristjan.jonsson create