classification
Title: locale.getpreferredencoding() must not set temporary LC_CTYPE
Type: behavior Stage: resolved
Components: IO, Library (Lib) Versions: Python 3.3
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: Arfrever, haypo, lemburg, loewis, pitrou, python-dev, r.david.murray, sdaoden
Priority: normal Keywords: patch

Created on 2011-01-27 11:00 by sdaoden, last changed 2012-07-08 10:08 by python-dev. This issue is now closed.

Files
File name Uploaded Description
io_dont_set_locale.patch haypo, 2011-02-01 00:03
Messages (17)
msg127177 - (view) Author: Steffen Daode Nurpmeso (sdaoden) Date: 2011-01-27 11:00
This bug may be based on same problem as Issue 6203.
- My system locale is en_GB.UTF-8.
- Given a latin1 text file, open()+ will fail with
  'UnicodeDecodeError: 'utf8' codec can't decode byte 0xf6...'
- Using locale.setlocale(..., ...)
- Re-open causes same error, I/O layer codec has not been changed!
- Using os.environ["LC_ALL"] = ...
- Re-open works properly, I/O layer codec has been changed.
P.S.: i am new to Python, please don't assume i can help in solving the problem!
msg127178 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2011-01-27 11:19
> - Using locale.setlocale(..., ...)
> - Re-open causes same error, I/O layer codec has not been changed!

Yes, this is the expected behaviour with the current code.

TextIOWrapper indirectly uses locale.getpreferredencoding() to choose your file encoding. If locale has the CODESET constant, this function sets LC_CTYPE to "" and uses nl_langinfo(CODESET) to get the locale encoding.

locale.getpreferredencoding() has an option to not set the LC_CTYPE to "": locale.getpreferredencoding(False).

Example:
---------------------------
$ python3.1
Type "help", "copyright", "credits" or "license" for more information.
>>> from locale import getpreferredencoding, setlocale, LC_CTYPE
>>> from locale import nl_langinfo, CODESET

>>> setlocale(LC_CTYPE, None)
'fr_FR.utf8'
>>> getpreferredencoding()
'UTF-8'
>>> getpreferredencoding(False)
'UTF-8'

>>> setlocale(LC_CTYPE, 'fr_FR.iso88591')
'fr_FR.iso88591'
>>> nl_langinfo(CODESET)
'ISO-8859-1'
>>> getpreferredencoding()
'UTF-8'
>>> getpreferredencoding(False)
'ISO-8859-1'
---------------------------

Setting LC_CTYPE directly changes the result of nl_langinfo(CODESET), but not the result of getpreferredencoding(), because getpreferredencoding() doesn't care about the current locale: it uses its own LC_CTYPE value ("").

getpreferredencoding(False) uses the current locale and gives the expected result.
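
The difference between the two calls can be checked with a small script. This is only a sketch: `compare_encodings` is a hypothetical helper name, and the values it reports depend on the platform, the environment variables, and the Python version (with UTF-8 mode enabled, both calls may simply report UTF-8).

```python
import locale


def compare_encodings():
    """Report the encoding seen with and without the temporary setlocale()."""
    # Query (do not change) the current LC_CTYPE so we can compare afterwards.
    before = locale.setlocale(locale.LC_CTYPE)
    result = {
        # Default call: may temporarily switch LC_CTYPE to "" (environment).
        "env_based": locale.getpreferredencoding(),
        # do_setlocale=False: uses the locale that is currently active.
        "current_locale": locale.getpreferredencoding(False),
    }
    # nl_langinfo() and CODESET are not available on every platform.
    if hasattr(locale, "nl_langinfo") and hasattr(locale, "CODESET"):
        result["nl_langinfo"] = locale.nl_langinfo(locale.CODESET)
    # Neither call should leave LC_CTYPE changed once it has returned.
    result["lc_ctype_unchanged"] = locale.setlocale(locale.LC_CTYPE) == before
    return result


if __name__ == "__main__":
    for key, value in compare_encodings().items():
        print(key, "=", value)
```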

> - Using os.environ["LC_ALL"] = ...
> - Re-open works properly, I/O layer codec has been changed.

Setting LC_ALL works because getpreferredencoding() sets LC_CTYPE to "", which reads the current values of the "LC_ALL" and "LC_CTYPE" environment variables.

--

Actually, TextIOWrapper doesn't use the current locale, it only uses (indirectly) the environment variables. I don't know which behaviour is better.

If you would like TextIOWrapper to use your current locale, use: open(filename, encoding=locale.getpreferredencoding(False)).

Anyway, I don't understand why you change your locale, since you know that your file encoding is Latin1. Why don't you directly use: open(filename, encoding='latin1')?
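
As an illustration of passing the encoding explicitly, here is a minimal sketch. The file content and the helper name `read_latin1_demo` are made up for the example; only open() and its encoding argument come from the discussion.

```python
import os
import tempfile


def read_latin1_demo():
    """Write a Latin-1 file, then read it as UTF-8 and as Latin-1."""
    fd, path = tempfile.mkstemp(suffix=".txt")
    try:
        # 0xf6 is "ö" in ISO-8859-1; on its own it is not valid UTF-8.
        with os.fdopen(fd, "wb") as f:
            f.write(b"sch\xf6n\n")

        # Decoding as UTF-8 fails, as in the original report.
        try:
            with open(path, encoding="utf-8") as f:
                f.read()
            utf8_failed = False
        except UnicodeDecodeError:
            utf8_failed = True

        # Naming the real encoding works regardless of the locale.
        with open(path, encoding="latin1") as f:
            text = f.read()
        return utf8_failed, text
    finally:
        os.remove(path)
```

Because the encoding is named in the call, the result does not depend on LC_CTYPE, LC_ALL, or any setlocale() call.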
msg127179 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2011-01-27 11:24
> This bug may be based on same problem as Issue 6203.

Nope, the two issues are different. Here you want TextIOWrapper to read your current locale, and not your environment variables. Issue #6203 asks why LC_CTYPE is not C by default but the user's locale LC_CTYPE (read from the LC_ALL or LC_CTYPE environment variables).
msg127188 - (view) Author: Steffen Daode Nurpmeso (sdaoden) Date: 2011-01-27 13:47
> Anyway, I don't understand why you change your locale,
> since you know that your file encoding is Latin1. Why don't you
> directly use: open(filename, encoding='latin1')?

Fortunately Issue 9124 is being solved soon due to the very active
happy hacker haypo ...
I have read haypo's add-on to Issue 6203 and, since he refers to
this issue here, i'll add some thoughts of mine, though they possibly
do not belong in a bug tracker ...

My misunderstanding was based upon an old project of mine,
where i've used the environment to initialize the library state
upon program startup only, but afterwards the entire handling was centralized upon some "Locale" class (changes therein were dispatched
to and thus reflected by a TextCodec etc. - you may see Issue 9727,
though my solution was hardwired).
That way, turning one screw managed the entire system.

If Python were my project, i would change this code,
because i do not see a real difference between os.environ[LC_]=
and locale.setlocale(LC_,)!
Both cases indicate the users desire to change a specific locale
setting and thus - of course - all the changes which that implies!
So why should there be a difference?

What i really have to say is that the (3.1) implementation of getpreferredencoding() is a horror, not only in respect to SMP
(it's a no-go, then, even with locking, but that's not present).
If Python were mine (after thinking one hour without any
feedback from anybody else), i would do the following:
- upon program startup, init LibC environment:
  setlocale(LC_ALL, "");
  (see <http://pubs.opengroup.org/onlinepubs/009695399/functions/setlocale.html>)
  Then init this very basic codeset in an unthreaded state:
  global_very_default_codeset = nl_langinfo(CODESET);
  After that, rename this terrible "do_setlocale" argument to
  "use_locale_active_on_program_startup".
  Then i would start a discussion whether such an argument is useful at
  all, because you possibly always say "False".
  Do ya???
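
The proposal above could be sketched in Python roughly as follows. This is an illustration only, not code from the tracker: `init_startup_codeset` and `startup_codeset` are hypothetical names, and the sketch assumes it is called once, before any threads exist.

```python
import locale

# Hypothetical module-level cache, filled exactly once at startup.
_STARTUP_CODESET = None


def init_startup_codeset():
    """Adopt the user's locale once and remember its codeset."""
    global _STARTUP_CODESET
    try:
        # Equivalent of the C call setlocale(LC_ALL, "").
        locale.setlocale(locale.LC_ALL, "")
    except locale.Error:
        # The environment names a locale the system lacks; keep the current one.
        pass
    # With do_setlocale=False no further setlocale() call is made here.
    _STARTUP_CODESET = locale.getpreferredencoding(False)
    return _STARTUP_CODESET


def startup_codeset():
    """Return the cached codeset without touching the locale again."""
    if _STARTUP_CODESET is None:
        raise RuntimeError("init_startup_codeset() has not been called")
    return _STARTUP_CODESET
```

Later lookups go through startup_codeset() and never call setlocale(), which avoids the threading concern raised above.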
msg127209 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2011-01-27 18:50
> Both cases indicate the users desire to change a specific locale
> setting and thus - of course - all the changes which that implies!
> So why should there be a difference?

I don't think it's intentional. I would be +1 on changing to getpreferredencoding(False).
msg127214 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2011-01-27 19:31
>> Both cases indicate the users desire to change a specific locale
>> setting and thus - of course - all the changes which that implies!
>> So why should there be a difference?
> 
> I don't think it's intentional. I would be +1 on changing to getpreferredencoding(False).

That won't actually work. If you always use the C library's locale
setting, most scripts will run in the C locale, and fail to read text
files properly.
msg127231 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2011-01-27 22:22
Set version to 3.3: I think it is too late to change such critical code in Python 3.2.
msg127233 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2011-01-27 22:32
> upon program startup, init LibC environment: setlocale(LC_ALL, "");

Python 3 does something like that: Py_InitializeEx() calls setlocale(LC_CTYPE, ""). But I (and others) consider that as a bug (see #6203 discussion): Python should not do that (nor any library) implicitly, but a *program* can do that (once) at startup (explicitly).
msg127234 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2011-01-27 22:35
> >> Both cases indicate the users desire to change a specific locale
> >> setting and thus - of course - all the changes which that implies!
> >> So why should there be a difference?
> > 
> > I don't think it's intentional. I would be +1 on changing to getpreferredencoding(False).
> 
> That won't actually work. If you always use the C library's locale
> setting, most scripts will run in the C locale, and fail to read text
> files properly.

Well, is it any different from today? That's an innocent question: I
don't know if there's a difference between "C locale" and "empty
locale".
msg127263 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2011-01-28 09:29
STINNER Victor wrote:
> 
> STINNER Victor <victor.stinner@haypocalc.com> added the comment:
> 
>> upon program startup, init LibC environment: setlocale(LC_ALL, "");
> 
> Python 3 does something like that: Py_InitializeEx() calls setlocale(LC_CTYPE, ""). But I (and others) consider that as a bug (see #6203 discussion): Python should not do that (nor any library) implicitly, but a *program* can do that (once) at startup (explicitly).

Agreed. See the discussion on the ticket for more details.

setlocale() should only be called by applications, not by libraries.
For Python this means: calling it in main() is fine, but not
in Py_InitializeEx().
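
The split between application and library code might look like this in a Python program. It is a sketch under assumptions: `read_text` and `main` are hypothetical names, and falling back to getpreferredencoding(False) is one possible choice, not something this message prescribes.

```python
import locale
import sys


def read_text(path, encoding=None):
    """Library-style helper: reads the locale but never calls setlocale()."""
    if encoding is None:
        # Use whatever locale the application has made current.
        encoding = locale.getpreferredencoding(False)
    with open(path, encoding=encoding) as f:
        return f.read()


def main(argv):
    """Application entry point: the only place that calls setlocale()."""
    try:
        locale.setlocale(locale.LC_ALL, "")
    except locale.Error:
        # Unsupported locale in the environment; continue with the current one.
        pass
    for path in argv[1:]:
        sys.stdout.write(read_text(path))
    return 0


if __name__ == "__main__":
    sys.exit(main(sys.argv))
```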
msg127416 - (view) Author: Steffen Daode Nurpmeso (sdaoden) Date: 2011-01-29 13:50
Also in respect to Issue 6203 i could talk about a project which did not link against anything in the end, only ld(1) and syscalls and the undocumented third 'char **envp' arg to UNIX main()s.
Thus: all of you should be *very* happy about the warm and cosy environment of LibC etc.!
You've decided to re-Python as Py3k; i guess it has got something to do with, let me describe it as, UNICODE.
Thus: you need a locale.

- Environment: has an encoding, though keys are ok to parse in ASCII
  (unless your OS allows wide characters *optionally*).
  Still, LC_ values may be specified in a *lot* of different ways,
  but one thing is true: it's hard to do in plain C without being
  able to use stuff which *may* depend upon an initialized library
- Path names: have an encoding
- Console I/O: has an encoding
- File I/O: this is all dumb bytes, just do what you want

Conclusion: you need a locale.

- Hardcode defaults
- Spread specific things all across the implementation.
  I.e., in path access, use some os.path._sysdep.default_codeset(),
  in console I/O do os.console._sysdep.default_codeset() etc.
  (i'm lying about names)
- Perform an initial global initialization

So - what are you all talking about?
No one - and i really mean NO ONE - can assume that a fully blown environment like python(1) can be used as an isolated sandbox thing
like ECMAScript!  File I/O, child processes ...  Shall an entire interpreter lifecycle be possible in a signal(3) handler
(uuhh, just kiddin')?  Even if that would be true for 2.7 (don't know), in Py3k there is graceful and seamless UNICODE support.
You need a locale.

I would indeed insist on the following:
- The interpreter *has* to be initialized in the cosy LibC
  (or whatever native thing) environment.
  Like this it embeds itself seamlessly in there.
  This *has* to be performed in an *unthreaded* state.
  If you are really concerned about anything here,
  add an additional argument (or is it there yet?  I did *not*
  look in there - i would/will need long months to get an idea
  of the entire python(1) system) to your interpreter's setup()
  like thing, or allow NULL to nevertheless use setlocale() directly.
  Like this the embedder can choose herself which approach she
  wants to adhere to.
- Even if 3.DID_IT ends up with a lot of 'encoding=STRING' instead
  of 'codec=None' (aka 'codec=codec_instance'), i would implement
  the system in a way that a change at a single place is automatically
  reflected all through the system (on a no-arg-then-use-default)
  base.

After the end:
someone who earned about 150 bucks from me for two books i bought
almost a decade ago once i've started Thinking In ... programming
said some years ago (as i've read in the german magazine c't):
    "In Python i am even more productive than with Java."
(I always was in doubt about that person - someone who is productive
in Java, who may that be?)
Thanks for python(1), and have a nice weekend.
msg127675 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2011-02-01 00:03
Attached patch replaces locale.getpreferredencoding() by locale.getpreferredencoding(False) in _io.TextIOWrapper and _pyio.TextIOWrapper.
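
One way to observe the intended effect is to check whether opening a text file goes through locale.setlocale(). The following is a sketch, not part of the patch: `setlocale_called_by_open` is a hypothetical helper, and it can only see calls made through the Python-level locale.setlocale, not a setlocale() call made directly in C.

```python
import locale
import os
import tempfile
from unittest import mock


def setlocale_called_by_open():
    """Return (was locale.setlocale called?, encoding chosen by open())."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        # Replace locale.setlocale with a recording stand-in for the duration.
        with mock.patch.object(locale, "setlocale") as fake_setlocale:
            with open(path) as f:  # no explicit encoding on purpose
                chosen = f.encoding
        return fake_setlocale.called, chosen
    finally:
        os.remove(path)


if __name__ == "__main__":
    called, chosen = setlocale_called_by_open()
    print("setlocale called:", called, "- encoding:", chosen)
```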
msg127679 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2011-02-01 02:43
Steffen: I'm not sure what your post means, but I think there is a chance you might be confused about something.  Python should *never* change the locale from the C locale.  A Python *program* can do so, by calling setlocale, but Python itself should not.  This is because when an arbitrary Python program is run, it needs to run in the C locale *unless it chooses otherwise*.  To do anything else would produce myriad portability problems for any code that is affected by locale settings (especially when the programmer doesn't know that it is so affected).

This is orthogonal to the issue of deciding what encoding to use for various bits of I/O, where Python may need to discover what locale the user has chosen as a default.  It's too bad libc makes this so hard to do safely.
msg127693 - (view) Author: Steffen Daode Nurpmeso (sdaoden) Date: 2011-02-01 11:58
Most of this is much too loud for a newbie who is about to read PEP 7 anyway.  And if this community has chosen to try (?!?) not to break compatibility with code which does not have a notion of a locale setting (i.e. naively uses other code in that spirit), you know, then this is simply the way it is.  Thus: you're right.  I do agree with what you say, we here have a (8-bit) C++ library which does this in its setup():

        // Initialize those Locale variables we're responsible for
        Locale::_ctype_cclass = Locale::_posix_cclass;
        Locale::_ctype_ccase = Locale::_posix_ccase;

(Like i said: we here went completely crazy and avoid system libraries whenever possible and at least directly, doing the stuff ourselves and only with syscalls.)

Besides that i would agree with me that unthreaded init, optional embedder locale argument, cleanup of .getprefer...() and other drops of setlocale() are/would be good design decisions.  And of course: "keeping the thing simple and understandable" is a thing to keep in mind in respect to a normal user.

After the end (i have to excuse myself once again for a book):
I, f.e., opened an issue 11059 on Saturday because the HG repo was (2.7 may still be) not cloneable, and i did so at selenic, too.  Notes on that:
- pitrou closed it because this tracker is of course for Python bugs.  (I asked him to decide - thanks.)
- The selenic people told me that i added my trace to a completely wrong issue.  (Just searched - that's more than shown in trace dump.)
- I've found out that many, *many* issues seem to have been created due to this repo failure at python.org (at selenic), and i've added a note that they possibly should include a prominent notice that people should look for "most recent call last" before creating a new one.  (I guess that most of these people are programmers - who else uses HG?)
- Conclusion: maybe even os.environ[]= == locale.setlocale() is not simple minded enough.
msg162329 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2012-06-05 05:37
I think it's absolutely necessary that text files, by default, are opened in the encoding of the user's locale, whether the script has called setlocale or not.

There are reasons for C to not automatically call setlocale at startup (mostly backwards compatibility), but they don't apply to Python.
msg162339 - (view) Author: Roundup Robot (python-dev) Date: 2012-06-05 11:48
New changeset 2587328c7c9c by Victor Stinner in branch 'default':
Close #11022: TextIOWrapper doesn't call locale.setlocale() anymore
http://hg.python.org/cpython/rev/2587328c7c9c
msg164989 - (view) Author: Roundup Robot (python-dev) Date: 2012-07-08 10:08
New changeset 6651c932d014 by Florent Xicluna in branch 'default':
Issue #11022 and #15287: correctly remove the TESTFN file in test_builtin.
http://hg.python.org/cpython/rev/6651c932d014
History
Date User Action Args
2012-07-08 10:08:58 python-dev set messages: + msg164989
2012-06-05 11:48:17 python-dev set status: open -> closed; nosy: + python-dev; messages: + msg162339; resolution: fixed; stage: patch review -> resolved
2012-06-05 05:37:22 loewis set messages: + msg162329
2012-06-04 23:47:22 haypo set title: locale.setlocale() doesn't change I/O codec, os.environ does -> locale.getpreferredencoding() must not set temporary LC_CTYPE
2011-02-01 11:58:24 sdaoden set nosy: lemburg, loewis, pitrou, haypo, Arfrever, r.david.murray, sdaoden; messages: + msg127693
2011-02-01 02:43:23 r.david.murray set nosy: + r.david.murray; messages: + msg127679
2011-02-01 00:05:46 pitrou set nosy: lemburg, loewis, pitrou, haypo, Arfrever, sdaoden; stage: patch review
2011-02-01 00:03:27 haypo set files: + io_dont_set_locale.patch; messages: + msg127675; keywords: + patch; nosy: lemburg, loewis, pitrou, haypo, Arfrever, sdaoden
2011-01-29 13:50:25 sdaoden set nosy: lemburg, loewis, pitrou, haypo, Arfrever, sdaoden; messages: + msg127416
2011-01-28 15:01:13 Arfrever set nosy: lemburg, loewis, pitrou, haypo, Arfrever, sdaoden; title: locale.setlocale() doesn't change I/O codec, os.environ -> locale.setlocale() doesn't change I/O codec, os.environ does
2011-01-28 09:29:33 lemburg set nosy: + lemburg; title: locale.setlocale() doesn't change I/O codec, os.environ -> locale.setlocale() doesn't change I/O codec, os.environ; messages: + msg127263
2011-01-27 22:35:50 pitrou set nosy: loewis, pitrou, haypo, Arfrever, sdaoden; messages: + msg127234; title: locale.setlocale() doesn't change I/O codec, os.environ -> locale.setlocale() doesn't change I/O codec, os.environ
2011-01-27 22:32:55 haypo set nosy: loewis, pitrou, haypo, Arfrever, sdaoden; messages: + msg127233
2011-01-27 22:22:31 haypo set nosy: loewis, pitrou, haypo, Arfrever, sdaoden; messages: + msg127231; versions: + Python 3.3, - Python 3.1, Python 3.2
2011-01-27 19:31:02 loewis set nosy: loewis, pitrou, haypo, Arfrever, sdaoden; messages: + msg127214; title: locale.setlocale() doesn't change I/O codec, os.environ[] does -> locale.setlocale() doesn't change I/O codec, os.environ
2011-01-27 18:50:21 pitrou set versions: + Python 3.2; nosy: + loewis, pitrou; messages: + msg127209; components: + IO
2011-01-27 16:58:48 Arfrever set nosy: + Arfrever
2011-01-27 13:47:22 sdaoden set messages: + msg127188
2011-01-27 11:24:41 haypo set messages: + msg127179
2011-01-27 11:19:27 haypo set nosy: + haypo; messages: + msg127178
2011-01-27 11:01:17 sdaoden set type: behavior
2011-01-27 11:00:23 sdaoden create