Issue 1552880: [Python2] Use utf-8 in the import machinery on Windows to support unicode paths

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/43943

classification

Title:	[Python2] Use utf-8 in the import machinery on Windows to support unicode paths
Type:		Stage:	resolved
Components:	Interpreter Core	Versions:	Python 2.6

process

Status:	closed	Resolution:	out of date
Dependencies:		Superseder:
Assigned To:		Nosy List:	BreamoreBoy, anthonybaxter, brett.cannon, eric.araujo, ezio.melotti, kristjan.jonsson, loewis, nnorwitz, theller, vstinner
Priority:	normal	Keywords:	patch

Created on 2006-09-05 18:11 by kristjan.jonsson, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Files
File name	Uploaded	Description
Unicodeimport3.patch	kristjan.jonsson, 2006-09-05 18:11	patch for 2.6 to provide unicode imports
Unicodeimport4.patch	kristjan.jonsson, 2007-04-17 10:38	An updated patch for unicode import

Messages (23)
msg51081	Author: Kristján Valur Jónsson (kristjan.jonsson) *	Date: 2006-09-05 18:11
This patch modifies the import mechanism to fully support unicode pathnames on Windows. It does this by first converting each member of sys.path to utf-8. Non-unicode strings are treated as encoded in the current locale. The whole of the import logic is then unchanged and works on the utf-8 strings as though they were regular ascii strings in the current locale. Only when file operations are done, such as stat() and open(), do we then convert from utf-8 back to unicode and use the Windows unicode APIs for the job. This is also done when initializing Module objects. This approach has the benefit of having a low impact on the importing logic, and is thus easy to verify. There is however some overhead with the conversions. At CCP Games we used this approach, backported to python 2.3, to get unicode imports working for our game, EVE Online, thereby solving installation issues in the Far East. This patch is submitted as demonstration code to the python community. I would like to see unicode fully supported in 2.6. Cheers, Kristján
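The round trip described in this message can be sketched in a few lines. This is a Python 3 illustration only, not the C code from the patch: `to_utf8` and `from_utf8` are hypothetical helper names, and the `locale_encoding` parameter is an assumption standing in for the current locale.

```python
import locale


def to_utf8(entry, locale_encoding=None):
    """Normalize one sys.path entry to UTF-8 bytes (hypothetical helper)."""
    if isinstance(entry, bytes):
        # Assumption: byte strings are encoded in the current locale.
        encoding = locale_encoding or locale.getpreferredencoding(False)
        entry = entry.decode(encoding)
    return entry.encode("utf-8")


def from_utf8(raw):
    """Recover the unicode path just before a file operation."""
    return raw.decode("utf-8")


# The import logic only ever joins and compares byte strings.
directory = to_utf8("C:\\游戏\\lib")
candidate = directory + b"\\" + to_utf8("module.py")

# Only at the stat()/open() boundary is the path turned back into unicode.
unicode_path = from_utf8(candidate)
```

Because UTF-8 never uses the bytes for `\`, `/` or `.` inside a multi-byte sequence, byte-level path manipulation of this kind does not split a character.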
msg51082	Author: Martin v. Löwis (loewis) *	Date: 2006-09-08 21:03
What is the value of the __file__ attribute of a module when this patch is used?
msg51083	Author: Kristján Valur Jónsson (kristjan.jonsson) *	Date: 2006-09-09 11:38
Off the top of my head, it is now unicode. I considered trying to convert it back to the default encoding but decided not to, to keep the patch brief.
msg51084	Author: Martin v. Löwis (loewis) *	Date: 2006-09-09 12:31
First: Do you want to continue to work on this, or do you consider this just "demonstration code" (i.e. not contributed for inclusion in Python), hoping that somebody else implements this feature?
I think the behavior of __file__ must be more consistent across platforms, and the selected behaviour must be documented somewhere. Several definitions of "consistent behavior" come to mind:
1. __file__ is always a Unicode string
2. __file__ is a byte string if it's ASCII, else Unicode
3. __file__ is a byte string if it's in the system encoding, else Unicode
4. __file__ is a byte string if it's in the file system encoding, else Unicode.
The documentation needs to be updated in several places, e.g. also for inspect.getfile. I would expect that pydoc would also need to be updated.
Selecting from the options above: I believe 4 is most compatible with previous versions; 1 and 2 are most convenient to work with in applications like pydoc which have to generate HTML (1 is easier to work with, 2 is more compatible with previous versions).
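The four options differ only in which encoding, if any, is tried before falling back to Unicode. A minimal Python 3 sketch of that rule follows, where `str` plays the role of Python 2's `unicode` and `bytes` the role of its byte string; `file_attr` is a hypothetical helper, not part of any patch in this issue.

```python
def file_attr(path, byte_encoding=None):
    """Return the value a module's __file__ would take (hypothetical helper).

    byte_encoding=None models option 1: always Unicode.
    "ascii" models option 2; passing the system encoding or the
    file system encoding models options 3 and 4.
    """
    if byte_encoding is None:
        return path
    try:
        return path.encode(byte_encoding)
    except UnicodeEncodeError:
        # Not representable as a byte string: fall back to Unicode.
        return path


plain = file_attr("lib/site.py", "ascii")       # byte string
accented = file_attr("lib/módulo.py", "ascii")  # stays Unicode
latin1 = file_attr("lib/módulo.py", "latin-1")  # byte string again
```

The last two calls show why options 3 and 4 are platform dependent: the same path yields a different type depending on the encoding in effect.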
msg51085	Author: Kristján Valur Jónsson (kristjan.jonsson) *	Date: 2006-09-12 09:38
I submitted this mostly as a demonstration. I don't think the approach is necessarily suitable for a final implementation because of the use of utf-8 as an intermediate representation and the price of the conversions that keep happening. But perhaps this is the way to go, if we consider utf-8 to be a stage-1 default file system encoding for win32. I also agree that 4 is probably the most sensible approach. What about discrepancies between e.g. Linux and Windows then, when importing from a non-trivial path? On Linux we would get utf-8, on Windows unicode? Option 1 would actually make a lot of sense, only in my experience this tends to lead to a kind of unicode-hell, since a program touched by one unicode object tends to have it percolating down into every corner.
msg51086	Author: Anthony Baxter (anthonybaxter)	Date: 2006-09-12 11:29
There's a variety of modules in the standard library that reference __file__ - if it's potentially going to be a unicode string, these are going to need to be checked, as are their callers :-/ (Now that I've looked closer at some of the issues, I'm extremely glad this didn't go into 2.5 final at this late stage)
msg51087	Author: Martin v. Löwis (loewis) *	Date: 2006-09-12 20:17
krisvale: indeed, option 4 is platform dependent. Notice that on Linux, the file system encoding won't necessarily be UTF-8. Instead, the value depends on the locale, so it may be latin-1, latin-9, gb2312, ... This makes it even more dependent on the platform, and even on the user currently logged in (such is life with locale-based approaches; the same is mostly true for Windows: "mbcs" can mean nearly anything). Option 1 is Py3k-safe, where path names will always be Unicode strings. As you say, Unicode is a virulent type, so this approach would need a wide consensus. I'm personally leaning towards option 2: it is nearly backwards compatible, except for obscure cases where people have mbcs-encodable entries in sys.path already, and it is independent of manipulations of the system encoding. I also think that processing of PYTHONPATH should take Unicode into account, i.e. we should use _wgetenv to access PYTHONPATH in 2.6. That would make the feature truly useful, as then people could actually set sys.path to non-mbcs directories from the outside. Notice that W9x support can be dropped in 2.6, so a W9x-compatible solution won't be required. In any case, I'd like to encourage you to continue working on this issue. I, too, would like to see it in 2.6, but I have wanted that ever since 2.1 or so (before PEP 277 was implemented), and it was wishful thinking. Somebody has to take action, and it is likely that it won't be one of the past regular contributors (or else they would have contributed it long ago - although I think Thomas Heller had something working at one point).
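To make the locale dependence concrete, here is a small Python 3 sketch showing that one byte string names different text under different encodings. The byte values are chosen for illustration only and do not come from the patch.

```python
# Two bytes taken from a hypothetical directory name on disk.
raw = b"\xd6\xd0"

as_gb2312 = raw.decode("gb2312")    # one Chinese character
as_latin1 = raw.decode("latin-1")   # two accented Latin letters

# Even closely related encodings disagree: 0xA4 is the currency
# sign in latin-1 but the euro sign in latin-9 (ISO 8859-15).
latin1_char = b"\xa4".decode("latin-1")
latin9_char = b"\xa4".decode("iso8859-15")
```

A byte-string `__file__` therefore only has a defined meaning together with the encoding that was in effect when it was produced.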
msg51088	Author: Kristján Valur Jónsson (kristjan.jonsson) *	Date: 2007-04-17 10:38
I have uploaded Unicodeimport4.patch, which simplifies this a bit. __file__ and __path__ components are now stored in filesystemencoding if possible, and non-unicode paths are assumed to be in filesystemencoding. This minimizes the impact of the change. File Added: Unicodeimport4.patch
msg51089	Author: Neal Norwitz (nnorwitz) *	Date: 2007-04-19 07:12
Any function which is not static to a file must be prefixed with Py or _Py. There are several lines which are over 80 columns and should be wrapped. Why is errno set in open_utf8, etc? Indentation was messed up in at least one place in Objects/moduleobject.c on a DECREF line. I can't provide any guidance on the Windows-specific code. Where do _wstat and _wfopen come from? There isn't a man page on my Unix box. I'm not sure if they exist in a library anywhere. I didn't see any changes to configure to verify if these exist or not. If Py_UNICODE_IMPORT is defined, does that necessarily mean these APIs exist? (It's possible this code was inside an #if WINDOWS and I couldn't tell from the patch.)
msg81636	Author: Kristján Valur Jónsson (kristjan.jonsson) *	Date: 2009-02-11 10:33
Ah, this one is still alive? We still use this patch at CCP for our 2.x python. I'll give it some more love to answer the issues raised. Hm, is this still an issue with 3.x? Does the import machinery use unicode as the internal format when working with the import paths?
msg114818	Author: Mark Lawrence (BreamoreBoy) *	Date: 2010-08-24 20:29
I think #9425 supersedes this. Am I correct?
msg114820	Author: STINNER Victor (vstinner) *	Date: 2010-08-24 20:38
> I think #9425 supersedes this. Am I correct?
#8611 or #9425, as you want. Anyway, I'm working on this topic and I will try to fix it before the Python 3.2 release.
msg114830	Author: Kristján Valur Jónsson (kristjan.jonsson) *	Date: 2010-08-24 21:06
Possibly. I made a comment in issue 9425 explaining the particular trick that this here patch makes (using utf-8 as an intermediate form to avoid having to change all the machinery in import.c)
msg115283	Author: STINNER Victor (vstinner) *	Date: 2010-08-31 22:43
The utf-8 codec (in strict mode) rejects surrogates in Python 3, so your patch doesn't support undecodable filenames (filenames decoded using the surrogateescape error handler, which produces surrogate characters). It may be possible if you use surrogateescape everywhere. Manipulating encoded filenames is not trivial because it may quickly lead to mojibake if the encodings are different (e.g. if sys.path contains a bytes filename, you have to be careful). Using utf-8 means that you have to decode and then reencode (to the filesystem encoding) a filename before passing it to a system call (e.g. mkdir()). The problem in #8611 is that Python 3 doesn't work if the filesystem encoding is not utf-8. Your solution is attractive because it is short, but I prefer to go directly to the right solution so as not to patch Python twice: use unicode (with surrogates, PEP 383, for undecodable filenames) everywhere.
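The strict-mode rejection mentioned here is easy to reproduce in Python 3. The sketch below uses a made-up filename and only standard codec calls; it is an illustration of the behaviour, not code from either patch.

```python
# A filename containing a byte that is not valid UTF-8.
raw = b"data\xff.txt"

# surrogateescape maps the undecodable byte 0xFF to the lone surrogate U+DCFF.
name = raw.decode("utf-8", "surrogateescape")

# The strict UTF-8 codec refuses to encode that lone surrogate.
try:
    name.encode("utf-8")
    strict_accepts = True
except UnicodeEncodeError:
    strict_accepts = False

# With surrogateescape on the way out, the original bytes come back.
restored = name.encode("utf-8", "surrogateescape")
```

An intermediate UTF-8 form would therefore have to use surrogateescape consistently to carry such names through unchanged.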
msg115284	Author: Kristján Valur Jónsson (kristjan.jonsson) *	Date: 2010-09-01 01:23
I confess that I didn't follow the utf-8/surrogate discussion. But the utf-8 encoding can encode all valid unicode characters:
"UTF-8 may only legally be used to encode valid Unicode scalar values. According to the Unicode standard the high and low surrogate halves used by UTF-16 (U+D800 through U+DFFF) and values above U+10FFFF are not legal Unicode values, and the UTF-8 encoding of them is an invalid byte sequence and should be treated as described above." (from Wikipedia)
If we encounter surrogate halves when encoding (unicode) to utf-8, it means that we are really trying to decode utf-16 and reencode it as utf-8 (and that python is using 16 bits for its unicode chars). The utf-8 codec should be smart enough to merge the surrogates into a utf-32 char, and encode that.
Anyway, as you remark, my approach is a _patch_, designed to make python (2.x) work in a unicode environment, with the least amount of code change, for those willing to commit such a patch. In 3.x you may want to do things differently.
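The relationship between a UTF-16 surrogate pair and the single code point it stands for can be checked directly. This is a Python 3 sketch, where `str` holds whole code points; it does not reproduce the 16-bit behaviour of a narrow 2.x build, and the sample character is chosen arbitrarily.

```python
# One code point outside the Basic Multilingual Plane.
char = "\U0001F600"

# UTF-16 needs a surrogate pair (D83D, DE00); UTF-8 needs four bytes.
utf16 = char.encode("utf-16-le")
utf8 = char.encode("utf-8")

# Two separate surrogate code units, as a 16-bit representation would hold them.
pair = "\ud83d\ude00"

# Merging the pair: write the code units out as UTF-16, then decode normally.
merged = pair.encode("utf-16-le", "surrogatepass").decode("utf-16-le")
```

A lone surrogate has no such partner to merge with, which is the case the rest of the discussion is about.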
msg115329	Author: STINNER Victor (vstinner) *	Date: 2010-09-01 19:33
> According to the Unicode standard the high and low surrogate halves used
> by UTF-16 (...)
Yes, but in Python, the U+DC80..U+DCFF range is used to store undecodable bytes. E.g. b'abc\xff'.decode('ascii', 'surrogateescape') gives 'abc\udcff'.
> Anyway, as you remark, my approach is a _patch_, designed to make python
> (2.x) work in an unicode environment, with the least amount of code
> change, for those willing to commit such a patch.
Python 2.7 is out and I think it is too late to fix Python2. Anyway, Python2 uses bytes for sys.path or other paths, so the problem only occurs if the user specifies unicode paths.
> In 3.x you may want to do things differently.
I chose to rewrite the C code to manipulate unicode paths instead of byte paths => #9425
msg115354	Author: Kristján Valur Jónsson (kristjan.jonsson) *	Date: 2010-09-02 01:37
> Yes, but in Python, U+DC80..U+DCFF range is used to store undecodable bytes.
> Eg. b'abc\xff'.decode('ascii', 'surrogateescape') gives 'abc\udcff'.
That's an inventive way of breaking the unicode standard :) Anyway, why would you worry about that? My patch doesn't use "surrogateescape" so there is no problem. There are only two places where I "decode":
1) module names and sys.path components in the system file encoding: If they contain undecodable characters, then that is an error. No reason to propagate that error into the import machinery.
2) when decoding utf-8 back into unicode, but that utf-8 is already legal since _we_ generated it.
If a _unicode_ input (sys.path) contains a valid surrogate pair, then the utf-8 encoder just encodes it. But if it finds a lone surrogate as you describe (python special) then that represents an undecodable character, something that should have been covered earlier and something we know nothing about. Clearly, that makes that particular unicode sys.path component invalid. (Hm, I notice that 2.7 happily encodes lone surrogates to utf-8)
> Python 2.7 is out and I think it is too late to fix Python2. Anyway, Python2
> uses bytes for sys.path or other paths, so the problem only occurs if the user
> specifies unicode paths.
Which is precisely the case that it is designed to solve. When the Chinese user installs EVE Online in a weird folder, then that should work. Also, 2.x is not quite dead yet. There are quite a few people doing their own patches for their private purposes. Although my patch won't go into any official version, there might be others in the same situation as us: trying to support an _embedded_ python 2.x version in an internationalized environment (on Windows :)
msg115553	Author: STINNER Victor (vstinner) *	Date: 2010-09-04 01:00
Oh, I didn't see that the issue was specific to Python2. I updated the issue's title. If I understood correctly, the issue is also specific to Windows. Do you know if your patch changes the public API? (break the compatibility)
--
FYI about Python3:
> That's an inventive way of breaking the unicode standard :)
It is described in PEP 383 and it does solve a real and common issue: storing a filename that cannot be decoded with the filesystem encoding. The operation is reversible. In Python 3.2, there are os.fsdecode() and os.fsencode() functions. On UNIX/BSD, os.fsencode(os.fsdecode(x)) == x, if x is a bytes object. PEP 383 introduces the surrogateescape error handler, which creates surrogates on decode and converts surrogates back to bytes on encode.
> Anyway, why would you worry about that? My patch doesn't use
> "surrogateescape" so there is no problem.
In Python3, filenames are stored as unicode. On UNIX/BSD, if a filename cannot be decoded, it is represented with surrogates. To get full unicode support in Python3, you have to support surrogates.
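The reversibility claimed for `os.fsdecode()` and `os.fsencode()` can be checked with a short Python 3 sketch. It assumes a POSIX system, as the message does; the filename is made up, and on Windows the file system error handler differs, so the same bytes may be rejected instead.

```python
import os

# A byte filename that may not be valid in the file system encoding.
raw = b"report-\xff.txt"

# On POSIX, undecodable bytes become lone surrogates instead of raising.
name = os.fsdecode(raw)

# Encoding with the same file system encoding restores the exact bytes.
restored = os.fsencode(name)
```

The decoded `name` is an ordinary `str`, so it can be stored in `sys.path` or `__file__` and still be mapped back to the original bytes at the system call boundary.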
msg115575	Author: Martin v. Löwis (loewis) *	Date: 2010-09-04 14:44
As this was never meant for inclusion in Python, and apparently confuses people, I'm closing it - it couldn't go into 2.x, anyway.
msg115683	Author: Kristján Valur Jónsson (kristjan.jonsson) *	Date: 2010-09-06 01:47
Well, it was, originally, but it met with so little interest that I couldn't be bothered to polish it to inclusion standards. Anyway, there was the incompatibility problem of what to do with the __file__ attribute, and the fact that the patch was Windows only. Do we have a place where we can put in working patches for people to use at their own risk, without going through all the hoops of a successful python.org checkin?
msg115686	Author: Éric Araujo (eric.araujo) *	Date: 2010-09-06 01:58
There is no such place that I know of, sorry.
msg115691	Author: Martin v. Löwis (loewis) *	Date: 2010-09-06 06:25
Having patches in the tracker is fine to me. Even if the patch is closed, it's still available. Of course, there are many ways to publish code on the net: you could post the patch to Rietveld, to the Python wiki, or publish an entire clone to bitbucket.
msg119109	Author: STINNER Victor (vstinner) *	Date: 2010-10-19 02:21
FYI, I finished my work on non-ascii filenames in Python 3.2 (#8611, #9425): Python 3.2 now supports any filename with any locale (filesystem) encoding.

History
Date	User	Action	Args
2022-04-11 14:56:20	admin	set	github: 43943
2010-10-19 02:21:59	vstinner	set	messages: + msg119109
2010-09-06 06:25:25	loewis	set	messages: + msg115691
2010-09-06 01:58:21	eric.araujo	set	messages: + msg115686 stage: resolved
2010-09-06 01:47:38	kristjan.jonsson	set	messages: + msg115683
2010-09-04 14:44:13	loewis	set	status: open -> closed resolution: out of date messages: + msg115575
2010-09-04 01:00:39	vstinner	set	messages: + msg115553 title: Unicode Imports -> [Python2] Use utf-8 in the import machinery on Windows to support unicode paths
2010-09-02 01:37:12	kristjan.jonsson	set	messages: + msg115354
2010-09-01 19:34:10	eric.araujo	set	nosy: + eric.araujo
2010-09-01 19:33:12	vstinner	set	messages: + msg115329
2010-09-01 01:23:13	kristjan.jonsson	set	messages: + msg115284
2010-08-31 22:43:50	vstinner	set	messages: + msg115283
2010-08-24 21:06:03	kristjan.jonsson	set	messages: + msg114830
2010-08-24 20:38:54	vstinner	set	nosy: + vstinner messages: + msg114820
2010-08-24 20:29:43	BreamoreBoy	set	nosy: + BreamoreBoy messages: + msg114818
2009-04-01 18:41:18	brett.cannon	set	assignee: brett.cannon ->
2009-02-11 11:31:20	theller	set	nosy: + theller
2009-02-11 10:33:17	kristjan.jonsson	set	messages: + msg81636
2009-02-11 03:13:12	ajaksu2	set	assignee: brett.cannon nosy: + brett.cannon, ezio.melotti
2006-09-05 18:11:31	kristjan.jonsson	create