msg127177 - (view) |
Author: Steffen Daode Nurpmeso (sdaoden) |
Date: 2011-01-27 11:00 |
This bug may be based on same problem as Issue 6203.
- My system locale is en_GB.UTF-8.
- Given a latin1 text file, open()+ will fail with
'UnicodeDecodeError: 'utf8' codec can't decode byte 0xf6...'
- Using locale.setlocale(..., ...)
- Re-open causes same error, I/O layer codec has not been changed!
- Using os.environ["LC_ALL"] = ...
- Re-open works properly, I/O layer codec has been changed.
P.S.: i am new to Python, please don't assume i can help in solving the problem!
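For readers who want to reproduce the decode error without changing their locale, here is a minimal sketch. It assumes a small Latin-1 file containing the byte 0xf6 ("ö") and passes encoding="utf-8" explicitly to stand in for a UTF-8 locale default; the sample data and the helper name `read_as_utf8` are made up for illustration.

```python
import os
import tempfile


def read_as_utf8(path):
    """Return the decoded text, or the UnicodeDecodeError raised while reading."""
    try:
        with open(path, encoding="utf-8") as f:
            return f.read()
    except UnicodeDecodeError as exc:
        return exc


# Hypothetical sample data: "schön" in Latin-1, so 'ö' is the single byte 0xf6.
fd, sample_path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(b"sch\xf6n")

result = read_as_utf8(sample_path)
os.remove(sample_path)

# The byte 0xf6 is not valid UTF-8 here, so an exception object comes back.
print(type(result).__name__)
```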
|
msg127178 - (view) |
Author: STINNER Victor (vstinner) * |
Date: 2011-01-27 11:19 |
> - Using locale.setlocale(..., ...)
> - Re-open causes same error, I/O layer codec has not been changed!
Yes, this is the expected behaviour with the current code.
TextIOWrapper indirectly uses locale.getpreferredencoding() to choose the file encoding. If the locale module has the CODESET constant, this function temporarily sets LC_CTYPE to "" and uses nl_langinfo(CODESET) to get the locale encoding.
locale.getpreferredencoding() has an option to skip setting LC_CTYPE to "": locale.getpreferredencoding(False).
Example:
---------------------------
$ python3.1
Type "help", "copyright", "credits" or "license" for more information.
>>> from locale import getpreferredencoding, setlocale, LC_CTYPE
>>> from locale import nl_langinfo, CODESET
>>> setlocale(LC_CTYPE, None)
'fr_FR.utf8'
>>> getpreferredencoding()
'UTF-8'
>>> getpreferredencoding(False)
'UTF-8'
>>> setlocale(LC_CTYPE, 'fr_FR.iso88591')
'fr_FR.iso88591'
>>> nl_langinfo(CODESET)
'ISO-8859-1'
>>> getpreferredencoding()
'UTF-8'
>>> getpreferredencoding(False)
'ISO-8859-1'
---------------------------
Setting LC_CTYPE directly changes the result of nl_langinfo(CODESET), but not the result of getpreferredencoding(), because getpreferredencoding() ignores the current locale: it uses its own LC_CTYPE value ("").
getpreferredencoding(False) uses the current locale and gives the expected result.
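The save/switch/restore behaviour described in this message can be sketched roughly as follows. This is an approximation written from the description above, not the standard library source; the function name `preferred_encoding_sketch` is hypothetical, and the fallback for platforms without nl_langinfo() is an assumption.

```python
import locale


def preferred_encoding_sketch(do_setlocale=True):
    """Approximate the behaviour described above; not the stdlib source."""
    if not hasattr(locale, "nl_langinfo"):
        # Assumption: platforms without nl_langinfo() just defer to the stdlib.
        return locale.getpreferredencoding(False)
    if not do_setlocale:
        # Trust whatever LC_CTYPE the program currently has.
        return locale.nl_langinfo(locale.CODESET)
    saved = locale.setlocale(locale.LC_CTYPE)  # query only, no change
    try:
        try:
            # Re-read LC_ALL / LC_CTYPE / LANG from the environment.
            locale.setlocale(locale.LC_CTYPE, "")
        except locale.Error:
            pass  # unsupported environment locale: keep the current one
        return locale.nl_langinfo(locale.CODESET)
    finally:
        # Restore the caller's locale. The locale is process-wide, so other
        # threads can observe the temporary switch.
        locale.setlocale(locale.LC_CTYPE, saved)
```

With do_setlocale=True the answer depends only on the environment variables, which is why a prior locale.setlocale() call by the program has no visible effect on it.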
> - Using os.environ["LC_ALL"] = ...
> - Re-open works properly, I/O layer codec has been changed.
Setting LC_ALL works because getpreferredencoding() sets LC_CTYPE to "", which makes the C library read the current values of the "LC_ALL" and "LC_CTYPE" environment variables.
--
Actually, TextIOWrapper doesn't use the current locale, it only uses (indirectly) the environment variables. I don't know which behaviour is better.
If you would like TextIOWrapper to use your current locale, use: open(filename, encoding=locale.getpreferredencoding(False)).
Anyway, I don't understand why you change your locale, since you know that your file encoding is Latin1. Why don't you directly use: open(filename, encoding='latin1')?
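A minimal sketch of the explicit-encoding suggestion, assuming a throwaway sample file; the helper name `read_latin1` and the file contents are illustrative only. Because the encoding is passed to open(), neither the locale nor the environment variables influence the result.

```python
import os
import tempfile


def read_latin1(path):
    """Read a file whose encoding is known to be Latin-1, ignoring the locale."""
    with open(path, encoding="latin1") as f:
        return f.read()


# Hypothetical sample file containing the byte 0xf6 ('ö' in Latin-1).
fd, sample_path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(b"sch\xf6n\n")

text = read_latin1(sample_path)
os.remove(sample_path)
```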
|
msg127179 - (view) |
Author: STINNER Victor (vstinner) * |
Date: 2011-01-27 11:24 |
> This bug may be based on same problem as Issue 6203.
Nope, the two issues are different. Here you want TextIOWrapper to read your current locale rather than your environment variables. Issue #6203 asks why LC_CTYPE defaults to the user's locale (read from the LC_ALL or LC_CTYPE environment variables) rather than to C.
|
msg127188 - (view) |
Author: Steffen Daode Nurpmeso (sdaoden) |
Date: 2011-01-27 13:47 |
> Anyway, I don't understand why you change your locale,
> since you know that your file encoding is Latin1. Why don't you
> directly use: open(filename, encoding='latin1')?
Fortunately Issue 9124 is being solved soon due to the very active
happy hacker haypo ...
I have read haypo's add-on to Issue 6203 and, since he refers to
this issue here, i'll add some thoughts of mine, though they possibly
do not belong in a bug tracker ...
My misunderstanding was based upon an old project of mine,
where i used the environment to initialize the library state
upon program startup only, but afterwards the entire handling was
centralized upon some "Locale" class (changes therein were dispatched
to and thus reflected by a TextCodec etc. - you may see Issue 9727,
though my solution was hardwired).
That way, turning one screw managed the entire system.
If Python were my project, i would change this code,
because i do not see a real difference between os.environ[LC_]=
and locale.setlocale(LC_,)!
Both cases indicate the user's desire to change a specific locale
setting and thus - of course - all the changes which that implies!
So why should there be a difference?
What i really have to say is that the (3.1) implementation of getpreferredencoding() is a horror, not only with respect to SMP
(there it's a no-go even with locking, and locking is not present).
If Python were mine (after thinking for one hour without any
feedback from anybody else), i would do the following:
- upon program startup, init LibC environment:
setlocale(LC_ALL, "");
(see <http://pubs.opengroup.org/onlinepubs/009695399/functions/setlocale.html>)
Then init this very basic codeset in an unthreaded state:
global_very_default_codeset = nl_langinfo(CODESET);
After that, rename this terrible "do_setlocale" argument to
"use_locale_active_on_program_startup".
Then i would start a discussion whether such an argument is useful at
all, because you possibly always say "False".
Do ya???
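The startup scheme proposed in this message could look roughly like this in Python. It is only a sketch of the proposal, not anything CPython does; the names `init_default_codeset`, `default_codeset` and `_DEFAULT_CODESET` are hypothetical, and locale.getpreferredencoding(False) is used in place of a bare nl_langinfo(CODESET) call so that the sketch also runs on platforms without nl_langinfo().

```python
import locale

_DEFAULT_CODESET = None  # hypothetical module-level cache


def init_default_codeset():
    """Call once at program startup, before any threads exist."""
    global _DEFAULT_CODESET
    try:
        # Adopt the user's locale from the environment, as setlocale(LC_ALL, "")
        # does in C.
        locale.setlocale(locale.LC_ALL, "")
    except locale.Error:
        pass  # the environment names an unsupported locale: stay as we are
    # Read the codeset of the now-current locale without switching it again.
    _DEFAULT_CODESET = locale.getpreferredencoding(False)
    return _DEFAULT_CODESET


def default_codeset():
    """Return the codeset cached at startup."""
    if _DEFAULT_CODESET is None:
        raise RuntimeError("init_default_codeset() has not been called")
    return _DEFAULT_CODESET
```

Later lookups go through default_codeset() and never touch the process locale, which sidesteps the threading concern raised above.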
|
msg127209 - (view) |
Author: Antoine Pitrou (pitrou) * |
Date: 2011-01-27 18:50 |
> Both cases indicate the users desire to change a specific locale
> setting and thus - of course - all the changes which that implies!
> So why should there be a difference?
I don't think it's intentional. I would be +1 on changing to getpreferredencoding(False).
|
msg127214 - (view) |
Author: Martin v. Löwis (loewis) * |
Date: 2011-01-27 19:31 |
>> Both cases indicate the users desire to change a specific locale
>> setting and thus - of course - all the changes which that implies!
>> So why should there be a difference?
>
> I don't think it's intentional. I would be +1 on changing to getpreferredencoding(False).
That won't actually work. If you always use the C library's locale
setting, most scripts will run in the C locale, and fail to read text
files properly.
|
msg127231 - (view) |
Author: STINNER Victor (vstinner) * |
Date: 2011-01-27 22:22 |
Set version to 3.3: I think it is too late to change such critical code in Python 3.2.
|
msg127233 - (view) |
Author: STINNER Victor (vstinner) * |
Date: 2011-01-27 22:32 |
> upon program startup, init LibC environment: setlocale(LC_ALL, "");
Python 3 does something like that: Py_InitializeEx() calls setlocale(LC_CTYPE, ""). But I (and others) consider that a bug (see the #6203 discussion): Python (or any library) should not do that implicitly, but a *program* can do it explicitly, once, at startup.
|
msg127234 - (view) |
Author: Antoine Pitrou (pitrou) * |
Date: 2011-01-27 22:35 |
> >> Both cases indicate the users desire to change a specific locale
> >> setting and thus - of course - all the changes which that implies!
> >> So why should there be a difference?
> >
> > I don't think it's intentional. I would be +1 on changing to getpreferredencoding(False).
>
> That won't actually work. If you always use the C library's locale
> setting, most scripts will run in the C locale, and fail to read text
> files properly.
Well, is it any different from today? That's an innocent question: I
don't know if there's a difference between "C locale" and "empty
locale".
|
msg127263 - (view) |
Author: Marc-Andre Lemburg (lemburg) * |
Date: 2011-01-28 09:29 |
STINNER Victor wrote:
>
> STINNER Victor <victor.stinner@haypocalc.com> added the comment:
>
>> upon program startup, init LibC environment: setlocale(LC_ALL, "");
>
> Python 3 does something like that: Py_InitializeEx() calls setlocale(LC_CTYPE, ""). But I (and others) consider that as a bug (see #6203 discussion): Python should not do that (nor any library) implicitly, but a *program* can do that (once) at startup (explicitly).
Agreed. See the discussion on the ticket for more details.
setlocale() should only be called by applications, not by libraries.
For Python this means: calling it in main() is fine, but not
in Py_InitializeEx().
|
msg127416 - (view) |
Author: Steffen Daode Nurpmeso (sdaoden) |
Date: 2011-01-29 13:50 |
Also in respect to Issue 6203 i could talk about a project which did not link against anything in the end, only ld(1) and syscalls and the undocumented third 'char **envp' arg to UNIX main()s.
Thus: all of you should be *very* happy about the warm and cosy environment of LibC etc.!
You've decided to redo Python as Py3k; i guess it has got something to do with, let me describe it as, UNICODE.
Thus: you need a locale.
- Environment: has an encoding, though keys are ok to parse in ASCII
(unless your OS allows wide characters *optionally*).
Still, LC_ values may be specified in a *lot* of different ways,
but one thing is true: it's hard to do in plain C without being
able to use stuff which *may* depend upon an initialized library.
- Path names: have an encoding
- Console I/O: has an encoding
- File I/O: this is all dumb bytes, just do what you want
Conclusion: you need a locale.
- Hardcode defaults
- Spread specific things all across the implementation.
I.e., in path access, use some os.path._sysdep.default_codeset(),
in console I/O do os.console._sysdep.default_codeset() etc.
(i'm lying about names)
- Perform an initial global initialization
So - what are you all talking about?
No one - and i really mean NO ONE - can assume that a full-blown environment like python(1) can be used as an isolated sandbox thing
like ECMAScript! File I/O, child processes ... Shall an entire interpreter lifecycle be possible in a signal(3) handler
(uuhh, just kiddin')? Even if that were true for 2.7 (don't know), in Py3k there is graceful and seamless UNICODE support.
You need a locale.
I would indeed insist on the following:
- The interpreter *has* to be initialized in the cosy LibC
(or whatever native thing) environment.
That way it embeds itself seamlessly in there.
This *has* to be performed in an *unthreaded* state.
If you are really concerned about anything here,
add an additional argument (or is it there already? I did *not*
look in there - i would/will need long months to get an idea
of the entire python(1) system) to your interpreter's setup()-like
thing, or allow NULL to nevertheless use setlocale() directly.
That way the embedder can choose for herself which approach she
wants to adhere to.
- Even if 3.DID_IT ends up with a lot of 'encoding=STRING' instead
of 'codec=None' (aka 'codec=codec_instance'), i would implement
the system in a way that a change at a single place is automatically
reflected all through the system (on a no-arg-then-use-default
basis).
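The "change at a single place" idea from the last bullet might be sketched like this. Everything here is hypothetical (the names `set_default_encoding` and `decode_stream`, and the initial "utf-8" default); it only illustrates a no-arg-then-use-default lookup and is not an existing Python API.

```python
import codecs
import io

_default_encoding = "utf-8"  # hypothetical process-wide default


def set_default_encoding(name):
    """Change the central default; later no-arg callers pick it up."""
    global _default_encoding
    codecs.lookup(name)  # raises LookupError early for unknown codec names
    _default_encoding = name


def decode_stream(raw, encoding=None):
    """Wrap a binary stream, using the central default if no encoding is given."""
    return io.TextIOWrapper(raw, encoding=encoding or _default_encoding)


# Usage: one call changes the behaviour of every no-arg consumer.
set_default_encoding("latin-1")
implicit = decode_stream(io.BytesIO(b"sch\xf6n")).read()
explicit = decode_stream(io.BytesIO(b"\xc3\xb6"), encoding="utf-8").read()
```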
After the end:
someone who earned about 150 bucks from me for two books i bought
almost a decade ago once i've started Thinking In ... programming
said some years ago (as i've read in the german magazine c't):
"In Python i am even more productive than with Java."
(I always was in doubt about that person - someone who is productive
in Java, who may that be?)
Thanks for python(1), and have a nice weekend.
|
msg127675 - (view) |
Author: STINNER Victor (vstinner) * |
Date: 2011-02-01 00:03 |
Attached patch replaces locale.getpreferredencoding() by locale.getpreferredencoding(False) in _io.TextIOWrapper and _pyio.TextIOWrapper.
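In Python terms, the effect of the patch can be pictured with the sketch below: when no encoding is given, the default is looked up with getpreferredencoding(False), so the process locale is read but never switched. The wrapper function `make_text_reader` is hypothetical and is not the code in the attached patch.

```python
import io
import locale


def make_text_reader(raw, encoding=None):
    """Wrap a binary stream, resolving the default without touching LC_CTYPE."""
    if encoding is None:
        # False: use the locale as it is right now, no temporary setlocale().
        encoding = locale.getpreferredencoding(False)
    return io.TextIOWrapper(raw, encoding=encoding)


locale_before = locale.setlocale(locale.LC_CTYPE)  # query only
reader = make_text_reader(io.BytesIO(b"abc"))
locale_after = locale.setlocale(locale.LC_CTYPE)
```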
|
msg127679 - (view) |
Author: R. David Murray (r.david.murray) * |
Date: 2011-02-01 02:43 |
Steffen: I'm not sure what your post means, but I think there is a chance you might be confused about something. Python should *never* change the locale from the C locale. A Python *program* can do so, by calling setlocale, but Python itself should not. This is because when an arbitrary Python program is run, it needs to run in the C locale *unless it chooses otherwise*. To do anything else would produce myriad portability problems for any code that is affected by locale settings (especially when the programmer doesn't know that it is so affected).
This is orthogonal to the issue of deciding what encoding to use for various bits of I/O, where Python may need to discover what locale the user has chosen as a default. It's too bad libc makes this so hard to do safely.
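One way to discover the user's chosen default without calling setlocale() at all is to inspect the environment directly. The sketch below follows the POSIX precedence LC_ALL, then LC_CTYPE, then LANG, and pulls the codeset out of a name of the form language_TERRITORY.codeset@modifier. It is a simplification: the function names are hypothetical, and locale names without a codeset part (or platform-specific aliases) are not resolved.

```python
import os


def env_locale_name(environ=None):
    """Return the locale name the environment selects for LC_CTYPE, or None."""
    environ = os.environ if environ is None else environ
    for var in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = environ.get(var)
        if value:  # unset and empty both count as "not specified"
            return value
    return None


def codeset_from_locale_name(name):
    """Extract the codeset from 'lang_TERRITORY.codeset@modifier', if present."""
    if not name or "." not in name:
        return None
    codeset = name.split(".", 1)[1]
    return codeset.split("@", 1)[0] or None
```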
|
msg127693 - (view) |
Author: Steffen Daode Nurpmeso (sdaoden) |
Date: 2011-02-01 11:58 |
Most of this is much too loud for a newbie who is about to read PEP 7 anyway. And if this community has chosen to try (?!?) not to break compatibility with code which does not have a notion of a locale setting (i.e. naively uses other code in that spirit), you know, then this is simply the way it is. Thus: you're right. I do agree with what you say; we have a (8-bit) C++ library here which does this in its setup():
// Initialize those Locale variables we're responsible for
Locale::_ctype_cclass = Locale::_posix_cclass;
Locale::_ctype_ccase = Locale::_posix_ccase;
(Like i said: we here went completely crazy and avoid system libraries whenever possible, at least directly, doing the stuff ourselves and only with syscalls.)
Besides that i would agree with myself that unthreaded init, an optional embedder locale argument, cleanup of .getprefer...() and other drops of setlocale() are/would be good design decisions. And of course: "keeping the thing simple and understandable" is a thing to keep in mind with respect to a normal user.
After the end (i have to excuse myself once again for a book):
I, f.e., opened an issue 11059 on saturday because the HG repo was (2.7 may still be) not cloneable, and i did so at selenic, too. Notes on that:
- pitrou closed it because this tracker is of course for Python bugs. (I asked him to decide - thanks.)
- The selenic people told me that i added my trace to a completely wrong issue. (Just searched - that's more than shown in trace dump.)
- I've found out that many, *many* issues seem to have been created due to this repo failure at python.org (at selenic), and i've added a note that they possibly should include a prominent notice that people should look for "most recent call last" before creating a new one. (I guess that most of these people are programmers - who else uses HG?)
- Conclusion: maybe even os.environ[]= == locale.setlocale() is not simple minded enough.
|
msg162329 - (view) |
Author: Martin v. Löwis (loewis) * |
Date: 2012-06-05 05:37 |
I think it's absolutely necessary that text files, by default, are opened in the encoding of the user's locale, whether the script has called setlocale or not.
There are reasons for C to not automatically call setlocale at startup (mostly backwards compatibility), but they don't apply to Python.
|
msg162339 - (view) |
Author: Roundup Robot (python-dev) |
Date: 2012-06-05 11:48 |
New changeset 2587328c7c9c by Victor Stinner in branch 'default':
Close #11022: TextIOWrapper doesn't call locale.setlocale() anymore
http://hg.python.org/cpython/rev/2587328c7c9c
|
msg164989 - (view) |
Author: Roundup Robot (python-dev) |
Date: 2012-07-08 10:08 |
New changeset 6651c932d014 by Florent Xicluna in branch 'default':
Issue #11022 and #15287: correctly remove the TESTFN file in test_builtin.
http://hg.python.org/cpython/rev/6651c932d014
|
|
Date |
User |
Action |
Args |
2022-04-11 14:57:11 | admin | set | github: 55231 |
2012-07-08 10:08:58 | python-dev | set | messages:
+ msg164989 |
2012-06-05 11:48:17 | python-dev | set | status: open -> closed
nosy:
+ python-dev messages:
+ msg162339
resolution: fixed stage: patch review -> resolved |
2012-06-05 05:37:22 | loewis | set | messages:
+ msg162329 |
2012-06-04 23:47:22 | vstinner | set | title: locale.setlocale() doesn't change I/O codec, os.environ does -> locale.getpreferredencoding() must not set temporary LC_CTYPE |
2011-02-01 11:58:24 | sdaoden | set | nosy:
lemburg, loewis, pitrou, vstinner, Arfrever, r.david.murray, sdaoden messages:
+ msg127693 |
2011-02-01 02:43:23 | r.david.murray | set | nosy:
+ r.david.murray messages:
+ msg127679
|
2011-02-01 00:05:46 | pitrou | set | nosy:
lemburg, loewis, pitrou, vstinner, Arfrever, sdaoden stage: patch review |
2011-02-01 00:03:27 | vstinner | set | files:
+ io_dont_set_locale.patch
messages:
+ msg127675 keywords:
+ patch nosy:
lemburg, loewis, pitrou, vstinner, Arfrever, sdaoden |
2011-01-29 13:50:25 | sdaoden | set | nosy:
lemburg, loewis, pitrou, vstinner, Arfrever, sdaoden messages:
+ msg127416 |
2011-01-28 15:01:13 | Arfrever | set | nosy:
lemburg, loewis, pitrou, vstinner, Arfrever, sdaoden title: locale.setlocale() doesn't change I/O codec, os.environ -> locale.setlocale() doesn't change I/O codec, os.environ does |
2011-01-28 09:29:33 | lemburg | set | nosy:
+ lemburg title: locale.setlocale() doesn't change I/O codec, os.environ -> locale.setlocale() doesn't change I/O codec, os.environ messages:
+ msg127263
|
2011-01-27 22:35:50 | pitrou | set | nosy:
loewis, pitrou, vstinner, Arfrever, sdaoden messages:
+ msg127234 title: locale.setlocale() doesn't change I/O codec, os.environ -> locale.setlocale() doesn't change I/O codec, os.environ |
2011-01-27 22:32:55 | vstinner | set | nosy:
loewis, pitrou, vstinner, Arfrever, sdaoden messages:
+ msg127233 |
2011-01-27 22:22:31 | vstinner | set | nosy:
loewis, pitrou, vstinner, Arfrever, sdaoden messages:
+ msg127231 versions:
+ Python 3.3, - Python 3.1, Python 3.2 |
2011-01-27 19:31:02 | loewis | set | nosy:
loewis, pitrou, vstinner, Arfrever, sdaoden messages:
+ msg127214 title: locale.setlocale() doesn't change I/O codec, os.environ[] does -> locale.setlocale() doesn't change I/O codec, os.environ |
2011-01-27 18:50:21 | pitrou | set | versions:
+ Python 3.2 nosy:
+ loewis, pitrou
messages:
+ msg127209
components:
+ IO |
2011-01-27 16:58:48 | Arfrever | set | nosy:
+ Arfrever
|
2011-01-27 13:47:22 | sdaoden | set | messages:
+ msg127188 |
2011-01-27 11:24:41 | vstinner | set | messages:
+ msg127179 |
2011-01-27 11:19:27 | vstinner | set | nosy:
+ vstinner messages:
+ msg127178
|
2011-01-27 11:01:17 | sdaoden | set | type: behavior |
2011-01-27 11:00:23 | sdaoden | create | |