Message 125819 - Python tracker



Author	vstinner
Recipients	ingemar, r.david.murray, terry.reedy, vstinner
Date	2011-01-09.02:40:20
Message-id	<1294540825.33.0.360294796512.issue10828@psf.upfronthosting.co.za>

Content

> ANSI code page: cp1252 ...os.fsencode('ä') => b'\xe4'

Hum, I ran your example with a debugger, and ok, I now remember the whole thing.

I fixed Python to support non-ASCII characters (... only non-ASCII characters encodable to the ANSI code page for Windows) in the *search path*, not in the module name.

The import machinery encodes each search path to the filesystem encoding, but it encodes the module name to UTF-8. Concatenating two byte strings encoded with different encodings doesn't work (it leads to mojibake).
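To illustrate the mismatch, here is a minimal sketch. It is not the real import code: `join_like_old_import` is a hypothetical helper, `C:\mods` is a made-up directory, and cp1252 is assumed to be the ANSI code page.

```python
def join_like_old_import(search_path, module_name, fs_encoding="cp1252"):
    """Hypothetical helper: mix two encodings the way described above."""
    path_bytes = search_path.encode(fs_encoding)  # search path: filesystem encoding
    name_bytes = module_name.encode("utf-8")      # module name: UTF-8
    return path_bytes + b"\\" + name_bytes + b".py"


mixed = join_like_old_import("C:\\mods", "ä")
# The filesystem interprets the whole byte string with the ANSI code page,
# so the two UTF-8 bytes of 'ä' show up as two unrelated characters.
print(mixed.decode("cp1252"))  # C:\mods\Ã¤.py
```

The file on disk is named `ä.py` (b'\xe4' in cp1252), so the lookup for `Ã¤.py` fails.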

To fix this problem, there are two solutions:

 a) encode the module name to the filesystem encoding
 b) manipulate paths as unicode strings; to access the filesystem, use the wide character (unicode) API on Windows and encode paths to the filesystem encoding on UNIX/BSD
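The two approaches could be sketched roughly as follows. This is only an illustration in Python, not the import machinery itself: `module_file_a`, `module_file_b` and `FS_ENCODING` are hypothetical names, cp1252 is an assumed ANSI code page, and `ntpath` is used so that Windows path rules apply on any platform.

```python
import ntpath  # Windows path rules, importable on any platform

FS_ENCODING = "cp1252"  # assumption: the ANSI code page


def module_file_a(search_path, module_name):
    """(a): encode both parts to the filesystem encoding, then join bytes."""
    return ntpath.join(search_path.encode(FS_ENCODING),
                       (module_name + ".py").encode(FS_ENCODING))


def module_file_b(search_path, module_name):
    """(b): stay in str; conversion is left to the layer that touches the OS."""
    return ntpath.join(search_path, module_name + ".py")


print(module_file_a("C:\\mods", "ä"))     # b'C:\\mods\\\xe4.py'
print(module_file_b("C:\\mods", "日本"))  # C:\mods\日本.py
```

With (a), a name such as `日本` would raise UnicodeEncodeError under cp1252; with (b), the string is passed through unchanged.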

It is easier to implement (a) than (b), but (a) only gives you support for paths and module names encodable to the ANSI code page.

(b) gives you full unicode support because it never *encodes* paths to the filesystem encoding, although it may *decode* paths from the filesystem encoding. Encoding a path raises a UnicodeEncodeError on the first character not encodable to the ANSI code page, whereas decoding a path never fails (except if the user manually changed their code page to a rare ANSI code page like UTF-8).
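The asymmetry between encoding and decoding can be seen with a short sketch, again assuming cp1252 as the ANSI code page. `try_encode` is a hypothetical helper, and Python's `cp1252` codec stands in here for the Windows conversion functions.

```python
FS_ENCODING = "cp1252"  # assumption: the ANSI code page


def try_encode(path):
    """Return the encoded path, or the exception if encoding fails."""
    try:
        return path.encode(FS_ENCODING)
    except UnicodeEncodeError as exc:
        return exc


ok = try_encode("C:\\mods\\ä.py")     # 'ä' exists in cp1252
bad = try_encode("C:\\mods\\日本.py")  # '日' does not

print(ok)         # b'C:\\mods\\\xe4.py'
print(bad.start)  # 8, the index of the first unencodable character

# Decoding bytes that came from the filesystem gives back a str.
print(ok.decode(FS_ENCODING))  # C:\mods\ä.py
```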

I implemented (b) in my import_unicode SVN branch, but as I wrote, I still have some work to do to merge this branch into py3k, and anyway I will wait for Python 3.3.

History
Date	User	Action	Args
2011-01-09 02:40:25	vstinner	set	recipients: + vstinner, terry.reedy, r.david.murray, ingemar
2011-01-09 02:40:25	vstinner	set	messageid: <1294540825.33.0.360294796512.issue10828@psf.upfronthosting.co.za>
2011-01-09 02:40:20	vstinner	link	issue10828 messages
2011-01-09 02:40:20	vstinner	create