Author poolie
Recipients benjamin.peterson, gz, pitrou, poolie, r.david.murray, vila, vstinner
Date 2011-12-21.23:02:30
Message-id <CAA9uavDwz1NR1NRiviJBUdS6d+N7YyrnkKUYg_6-9oVeX-K06g@mail.gmail.com>
In-reply-to <1324431645.611.2.camel@localhost.localdomain>
Content
On 21 December 2011 12:41, Antoine Pitrou <report@bugs.python.org> wrote:
>
> Antoine Pitrou <pitrou@free.fr> added the comment:
>
>> The standard encoding is UTF-8.
>
> How so? I don't know of any Linux or Unix spec which says so. If you get
> the Linux heads to standardize this then I'll certainly be very happy
> (and countless others will, too). But AFAIK this is not the case and I
> don't see why you are asking Python to make a choice that OS vendors
> refuse to make. You are certainly asking the wrong project to solve this
> problem.

It is a de facto standard, not a de jure one: UTF-8 is how file names
are typically stored.  Other software (e.g. the GNOME file handling
utilities) makes this assumption.  See e.g.
<http://www.cl.cam.ac.uk/~mgk25/unicode.html#linux>.

I would be happy to see an authoritative document saying this is how
things _should_ be stored, but I can't find one yet.  But in Unix
there are no ultimate authorities: even if someone announced that
filenames are UTF-8, there would obviously continue to be many machines
where in practice they are not.
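
To make "in practice they are not" concrete, here is a rough sketch of
how one might survey a directory for names that are not valid UTF-8.
The helper name non_utf8_names is made up for illustration; it is not
part of Python or of any patch on this issue.

```python
import os

def non_utf8_names(names):
    """Return the byte-string names that do not decode as UTF-8.

    Hypothetical helper, for illustration only.
    """
    bad = []
    for raw in names:
        try:
            raw.decode("utf-8")
        except UnicodeDecodeError:
            bad.append(raw)
    return bad

# Passing bytes to os.listdir() asks for the raw, undecoded names.
print(non_utf8_names(os.listdir(b".")))
```

On a machine where everything really is stored as UTF-8 this should
print an empty list.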

I started asking about it over here, to see if at least Ubuntu can
have an opinion that this is how things should normally be:
https://lists.ubuntu.com/archives/ubuntu-devel/2011-December/034588.html

I'm not sure what you expect a technical solution at the OS level
to look like.  The API takes 8-bit strings and that's not likely to
change.  It's possible to have a situation where no locale is
specified.  Applications unavoidably need to have some opinion about
what to do there.  Other applications assume the filenames are UTF-8.
Python 3 already assumes that text in general will be UTF-8
(sys.getdefaultencoding() returns 'utf-8').
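
For anyone who wants to see the mismatch on their own machine, the
following sketch prints the encodings Python 3 reports.  The function
name report_encodings is only illustrative; the three calls it wraps
are standard library functions.  Running it once normally and once
with LANG and the LC_* variables unset should show what Python picks
when no locale is specified; the result depends on the Python version
and platform, so I am not quoting expected output here.

```python
import locale
import sys

def report_encodings():
    """Collect the encodings Python uses for text and for file names."""
    return {
        # Default for str.encode() / bytes.decode() with no argument.
        "default": sys.getdefaultencoding(),
        # Used to convert file names between str and bytes.
        "filesystem": sys.getfilesystemencoding(),
        # What the locale settings suggest for text data.
        "locale": locale.getpreferredencoding(False),
    }

print(report_encodings())
```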

It is almost like your caricature of OS developers as being
anglocentric, but in fact here it's Python that assumes file names are
probably ASCII when no locale is set - or, more charitably, it is just
assuming that failing when things aren't ASCII is the best tradeoff.
Maybe it is.

One OS-level fix is to try to reduce the number of situations where
people see no locale, or the C locale, and give them C.UTF-8 instead.
That is probably worth doing.  But having no locale can still happen,
and I think Python could handle that better, so the changes are
complementary.
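
As a sketch of what "handle that better" could mean at the application
level (this is an assumption about one possible approach, not a
description of what Python does or of any proposed patch): decode file
names as UTF-8, and use the surrogateescape error handler so that names
which are not valid UTF-8 still round-trip unchanged instead of
failing.  The two function names are hypothetical.

```python
def decode_filename(raw):
    """Decode a byte-string file name, assuming UTF-8.

    Bytes that are not valid UTF-8 become lone surrogates rather
    than raising UnicodeDecodeError.
    """
    return raw.decode("utf-8", "surrogateescape")

def encode_filename(name):
    """Inverse of decode_filename: recover the original bytes."""
    return name.encode("utf-8", "surrogateescape")

# A valid UTF-8 name and a Latin-1 name both survive a round trip.
for raw in (b"caf\xc3\xa9", b"caf\xe9"):
    assert encode_filename(decode_filename(raw)) == raw
```

The tradeoff is that a non-UTF-8 name is carried around as a string
containing surrogates, which cannot be printed or encoded strictly,
but at least the file can still be opened.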
History
Date User Action Args
2011-12-21 23:02:31 poolie set recipients: + poolie, pitrou, vstinner, vila, benjamin.peterson, r.david.murray, gz
2011-12-21 23:02:31 poolie link issue13643 messages
2011-12-21 23:02:30 poolie create