Message 150066 - Python tracker


Author	poolie
Recipients	benjamin.peterson, gz, pitrou, poolie, r.david.murray, vila, vstinner
Date	2011-12-22.01:50:35
Message-id	<CAA9uavB66=DuLwwfY_69=FfaU9dRo453v8pLnft78pn79A7s4g@mail.gmail.com>
In-reply-to	<4EF2893E.1000403@haypocalc.com>

Content

On 22 December 2011 12:32, STINNER Victor <report@bugs.python.org> wrote:
>
> STINNER Victor <victor.stinner@haypocalc.com> added the comment:
>
> On 22/12/2011 02:16, Martin Pool wrote:
>> The proposal is that in some cases where Python currently assumes
>> filenames are ascii on Linux, it ought to instead assume they are
>> utf-8.
>
> Oh, I expected a use case describing the problem, not the proposed
> solution :-)

The problem as I see it is this:

On Linux, filenames are generally (but not always) in UTF-8; people
fairly commonly end up with no locale configured, which causes Python
to decode filenames as ASCII.  It is then easy for them to hit
UnicodeErrors.
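
A minimal sketch of that failure, assuming a filename stored as UTF-8
bytes and a strict decode.  The helper name decode_strict is made up for
illustration, and the strict error handling is a simplification: Python
3's own filesystem calls use the more lenient surrogateescape handler,
so there the error tends to surface later, for instance when the name is
printed or re-encoded.

```python
# Bytes of a name containing U+00E9 (e with acute accent), as a UTF-8
# filesystem would store them.
raw_name = "caf\u00e9.txt".encode("utf-8")


def decode_strict(raw, encoding):
    """Decode filename bytes, raising UnicodeDecodeError on any bad byte."""
    return raw.decode(encoding, "strict")


try:
    decode_strict(raw_name, "ascii")
    ascii_failed = False
except UnicodeDecodeError:
    # 0xC3 0xA9 (the UTF-8 form of U+00E9) is outside the ASCII range.
    ascii_failed = True

# The same bytes decode cleanly once UTF-8 is assumed.
utf8_name = decode_strict(raw_name, "utf-8")
```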

>>> You want to use UTF-8 instead of ASCII, so what? What do you
>>> want to do with your nicely well decoded filenames? You cannot print it
>>> to your terminal nor pass it to a subprocess, because your terminal uses
>>> ASCII, as subprocess. I don't see how it would help you.
>>
>> When the application has a unicode string,
>
> Where does this string come from? (It is an important question).

It comes, for example, from the name of a file or a directory, or from
the contents of a symlink.  The problem applies equally when the
program has obtained a unicode string some other way (for example off
the network in a defined encoding) and is trying to use it to access
the filesystem.
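
The second direction can be sketched the same way.  This assumes a name
that arrived off the network as Latin-1 bytes (an arbitrary choice for
the example) and a hypothetical helper to_fs_bytes standing in for
whatever the program does before calling into the filesystem.

```python
# A name received off the network, in an encoding the protocol defines
# (Latin-1 is assumed here purely for the example).
wire_bytes = b"r\xe9sum\xe9.txt"
name = wire_bytes.decode("latin-1")


def to_fs_bytes(text, fs_encoding):
    """Encode a unicode name to the bytes handed to the filesystem."""
    return text.encode(fs_encoding, "strict")


try:
    to_fs_bytes(name, "ascii")
    ascii_ok = True
except UnicodeEncodeError:
    # The name holds U+00E9, which ASCII cannot represent.
    ascii_ok = False

# With UTF-8 as the filesystem encoding the same name encodes fine.
utf8_bytes = to_fs_bytes(name, "utf-8")
```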

> If your locale encoding is ASCII, you cannot write such non-ASCII
> filenames using the keyboard for example.

Sure you can.  The user could enter a backslash-escaped name, which
the program knows to decode to unicode.  The point is that the program
has a choice of how it deals with user input, whereas in Python it does
not have as much control over how filenames are encoded.
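
As a rough illustration of the backslash-escaped input: the sketch below
assumes the program accepts Python-style \uXXXX escapes, which is only
one possible convention, and the helper name unescape is invented for
the example.

```python
# What a user could type on an ASCII-only terminal: every character is ASCII.
typed = r"caf\u00e9.txt"


def unescape(text):
    """Turn backslash escapes such as \\u00e9 into the characters they name."""
    return text.encode("ascii").decode("unicode_escape")


# The program, not the locale, decides how to interpret what was typed.
name = unescape(typed)
```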

>  > with working around this when the filenames really are
>  > valid in what should be the user's locale,
>
> On your computer, UTF-8 is maybe a good candidate for "what should be
> the user's locale", but you cannot generalize for all computers.
>
> I also wanted to force UTF-8 everywhere, but you cannot do that or your
> program will just not work in some configurations.

Just to be clear, I'm not proposing to force UTF-8 everywhere.  I am
only proposing to 'break' the case where the user has non-ASCII
filenames but, intentionally or not, is running under a locale that
specifies only ASCII.  With this change, Python would try to decode
them as UTF-8, and fail if they're not UTF-8.
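
A sketch of that policy, not of how the interpreter actually works: the
function names guess_filename_encoding and decode_filename are
hypothetical, and the locale encoding is passed in as a plain string
rather than read from the environment.

```python
import codecs


def guess_filename_encoding(locale_encoding):
    """Hypothetical policy: an ASCII-only locale is treated as UTF-8."""
    # codecs.lookup() normalises aliases such as "US-ASCII" to "ascii".
    if codecs.lookup(locale_encoding).name == "ascii":
        return "utf-8"
    return locale_encoding


def decode_filename(raw, locale_encoding):
    """Decode strictly: names that are not valid in the chosen encoding fail."""
    return raw.decode(guess_filename_encoding(locale_encoding), "strict")


# ASCII locale, UTF-8 name: decoded as UTF-8 under the proposed policy.
from_ascii_locale = decode_filename(b"caf\xc3\xa9", "ascii")

# A locale that names a real encoding is left alone.
from_latin1_locale = decode_filename(b"caf\xe9", "iso-8859-1")

# ASCII locale, name that is not UTF-8: still an error.
try:
    decode_filename(b"caf\xe9", "ascii")
    bad_name_failed = False
except UnicodeDecodeError:
    bad_name_failed = True
```

Only the ASCII case changes; every other locale keeps its current
behaviour.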

I am coming to think the best step here is just for the OS to do more
to make sure the application does get the appropriate locale.  (For
example, Ubuntu in recent releases uses a PAM hook to set LANG for
cron jobs, to avoid the example described above.)
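
For what it's worth, a program can at least notice when it has been
started without a locale.  The sketch below assumes the usual POSIX
precedence of LC_ALL, then LC_CTYPE, then LANG; the function name
locale_from_env and the two sample environments are made up for the
example.

```python
def locale_from_env(env):
    """Return the first non-empty value of LC_ALL, LC_CTYPE, LANG, or None."""
    for var in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = env.get(var)
        if value:
            return value
    return None


# A bare environment of the kind a cron job may see (illustrative values).
cron_like_env = {"PATH": "/usr/bin:/bin", "SHELL": "/bin/sh"}

# An environment where the OS has set LANG for the process.
configured_env = {"PATH": "/usr/bin:/bin", "LANG": "en_AU.UTF-8"}

cron_locale = locale_from_env(cron_like_env)
configured_locale = locale_from_env(configured_env)
```

In a real program the argument would be os.environ.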

History
Date	User	Action	Args
2011-12-22 01:50:36	poolie	set	recipients: + poolie, pitrou, vstinner, vila, benjamin.peterson, r.david.murray, gz
2011-12-22 01:50:35	poolie	link	issue13643 messages
2011-12-22 01:50:35	poolie	create