Author vstinner
Recipients benjamin.peterson, gz, pitrou, poolie, r.david.murray, vila, vstinner
Date 2011-12-22.02:15:55
Message-id <4EF2935A.8080502@haypocalc.com>
In-reply-to <CAA9uavB66=DuLwwfY_69=FfaU9dRo453v8pLnft78pn79A7s4g@mail.gmail.com>
Content
> The problem as I see it is this:
>
> On Linux, filenames are generally (but not always) in UTF-8; people
> fairly commonly end up with no locale configured, which causes Python
> to decode filenames as ascii.  It is easy for this to end up with them
> hitting UnicodeErrors.

I don't think that your problem is decoding, but encoding filenames.

>> Where does this string come from? (It is an important question).
>
> It comes, for example, from the name of a file, or a directory, or the
> contents of a symlink.

For all these cases, Python is able to decode them (undecodable bytes 
are stored as surrogates, see PEP 383).
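
A minimal sketch of the PEP 383 behaviour mentioned above, using the "surrogateescape" error handler directly. The codec is given explicitly as ASCII so that the result does not depend on the locale; the byte string is only an illustration.

```python
# A filename as raw bytes: 0xE9 is "é" in Latin-1, and is not valid ASCII.
raw = b"h\xe9.txt"

# Decoding with surrogateescape never fails: each undecodable byte
# 0xNN becomes the lone surrogate U+DCNN.
name = raw.decode("ascii", "surrogateescape")

# Encoding with the same handler restores the original bytes exactly.
roundtrip = name.encode("ascii", "surrogateescape")

# Without the handler, the lone surrogate cannot be encoded.
try:
    name.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False
```

So decoding a filename always succeeds; it is the later encoding step (printing the name, or passing it to another codec) that can fail.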

> Or the problem applies equally when the
> program has got a unicode string (for example off the network in a
> defined encoding) and it is trying to use it to access the filesystem.

Hum, you can hit the problem if you try to decompress a ZIP archive 
containing a Unicode filename. ZIP stores filenames as cp437 or UTF-8 
depending on a flag (well, that's not exact: some buggy tools store 
filenames in a different encoding, such as the Windows ANSI code page...). 
If you try to decompress a ZIP containing non-ASCII filenames stored as 
UTF-8 while your locale encoding is ASCII, you will get a UnicodeEncodeError.
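
To illustrate the flag, here is a small sketch that builds an archive in memory with the zipfile module and reads back the flag bits of each entry. The helper name_flags() and the constant UTF8_FLAG are names made up for this example; 0x800 is bit 11 of the general purpose flags, which marks a UTF-8 filename. Nothing is extracted to disk, so the locale plays no role here.

```python
import io
import zipfile

UTF8_FLAG = 0x800  # general purpose bit 11: filename is stored as UTF-8


def name_flags(names):
    """Write the given names to an in-memory ZIP and report, for each
    entry, whether the UTF-8 filename flag is set when read back."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for n in names:
            zf.writestr(n, b"data")
    buf.seek(0)
    with zipfile.ZipFile(buf) as zf:
        return {info.filename: bool(info.flag_bits & UTF8_FLAG)
                for info in zf.infolist()}


flags = name_flags(["plain.txt", "h\xe9.txt"])
```

The ASCII name is stored without the flag, the non-ASCII one with it. The UnicodeEncodeError only appears later, when an extracted name has to be encoded for the file system.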

I would suggest fixing your environment: if you want to play with 
non-ASCII filenames, you should first fix your locale. Otherwise other 
programs will also fail because of your locale.

(Maybe something could be done in the zipfile module to allow creating 
files using the original raw bytes of the filename. See also issues #10614 
and #10972.)

>> If your locale encoding is ASCII, you cannot write such non-ASCII
>> filenames using the keyboard for example.
>
> Sure you can.  The user could enter a backslash-escaped name, which
> the program knows to decode to unicode.

How exactly? Users do not usually write backslash-escaped names. Users 
prefer to click on icons :-)

> with user input, whereas it does not have as
> much control in Python of how filenames are encoded.

Ah? The application *can* control how filenames are encoded. Example:

Create a UTF-8 filename with a UTF-8 locale encoding.

$ python3
Python 3.2.1 (default, Jul 11 2011, 18:54:42)
>>> import locale; print(locale.getpreferredencoding())
UTF-8
>>> f=open("hé.txt", "w"); f.write("unicode!"); f.close()

Read the file content, even if the locale encoding is ASCII.

$ LANG=C python3
Python 3.2.1 (default, Jul 11 2011, 18:54:42)
>>> import locale; print(locale.getpreferredencoding())
ANSI_X3.4-1968
>>> f=open("h\xe9.txt", "r"); print(f.read()); f.close()
Traceback (most recent call last):
   ...
UnicodeEncodeError: 'ascii' codec can't encode character '\xe9' in position 1: ordinal not in range(128)
>>> f=open("h\xe9.txt".encode("utf-8"), "r"); print(f.read()); f.close()
unicode!

You cannot pass directly "h\xe9.txt", but if you know the "correct" file 
system encoding, you can encode it explicitly using str.encode("utf-8").
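
The same idea as a small script, as a sketch only: encode_path() and demo() are hypothetical helpers, and the default of UTF-8 for fs_encoding is an assumption that the application has to justify itself. The file is created in a temporary directory so that the example does not depend on an existing "hé.txt".

```python
import os
import tempfile


def encode_path(directory, name, fs_encoding="utf-8"):
    """Return a bytes path for `name` inside `directory`, encoding the
    name explicitly instead of relying on the locale encoding."""
    raw_name = name.encode(fs_encoding, "surrogateescape")
    return os.path.join(os.fsencode(directory), raw_name)


def demo():
    """Create and read back a non-ASCII filename through a bytes path."""
    with tempfile.TemporaryDirectory() as tmp:
        path = encode_path(tmp, "h\xe9.txt")
        # The encoding of the *content* is independent of the filename.
        with open(path, "w", encoding="utf-8") as f:
            f.write("unicode!")
        with open(path, "r", encoding="utf-8") as f:
            return f.read()


content = demo()
```

Because open() receives bytes, Python passes them to the operating system unchanged and the locale encoding is never used for the filename.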

You are trying to do something complex (add hacks for filenames, for a 
specific configuration) to solve a simple problem: configuring locales 
correctly. If you know and you are sure that you are using UTF-8, why not 
simply set your locale to a UTF-8 locale?
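
A quick way to check whether a given encoding can represent a filename is sketched below. can_encode() is a made-up helper; in a real program you would probably pass sys.getfilesystemencoding() as the encoding, which is left out here to keep the result independent of the environment.

```python
def can_encode(name, encoding):
    """Return True if `name` can be encoded with `encoding` (strict)."""
    try:
        name.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


ascii_ok = can_encode("h\xe9.txt", "ascii")   # what LANG=C gives you
utf8_ok = can_encode("h\xe9.txt", "utf-8")    # what a UTF-8 locale gives you
```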
History
Date                 User      Action         Args
2011-12-22 02:15:56  vstinner  setrecipients  + vstinner, pitrou, vila, benjamin.peterson, r.david.murray, gz, poolie
2011-12-22 02:15:55  vstinner  link           issue13643 messages
2011-12-22 02:15:55  vstinner  create