locale.getpreferredencoding() must not set temporary LC_CTYPE #55231

Closed
sdaoden mannequin opened this issue Jan 27, 2011 · 17 comments
Labels
stdlib (Python modules in the Lib dir), topic-IO, type-bug (An unexpected behavior, bug, or error)

Comments

@sdaoden
Mannequin

sdaoden mannequin commented Jan 27, 2011

BPO 11022
Nosy @malemburg, @loewis, @pitrou, @vstinner, @bitdancer
Files
  • io_dont_set_locale.patch
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    GitHub fields:

    assignee = None
    closed_at = <Date 2012-06-05.11:48:17.090>
    created_at = <Date 2011-01-27.11:00:23.388>
    labels = ['type-bug', 'library', 'expert-IO']
    title = 'locale.getpreferredencoding() must not set temporary LC_CTYPE'
    updated_at = <Date 2012-07-08.10:08:58.615>
    user = 'https://bugs.python.org/sdaoden'

    bugs.python.org fields:

    activity = <Date 2012-07-08.10:08:58.615>
    actor = 'python-dev'
    assignee = 'none'
    closed = True
    closed_date = <Date 2012-06-05.11:48:17.090>
    closer = 'python-dev'
    components = ['Library (Lib)', 'IO']
    creation = <Date 2011-01-27.11:00:23.388>
    creator = 'sdaoden'
    dependencies = []
    files = ['20637']
    hgrepos = []
    issue_num = 11022
    keywords = ['patch']
    message_count = 17.0
    messages = ['127177', '127178', '127179', '127188', '127209', '127214', '127231', '127233', '127234', '127263', '127416', '127675', '127679', '127693', '162329', '162339', '164989']
    nosy_count = 8.0
    nosy_names = ['lemburg', 'loewis', 'pitrou', 'vstinner', 'Arfrever', 'r.david.murray', 'sdaoden', 'python-dev']
    pr_nums = []
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue11022'
    versions = ['Python 3.3']

    @sdaoden
    Mannequin Author

    sdaoden mannequin commented Jan 27, 2011

    This bug may be based on same problem as bpo-6203.

    • My system locale is en_GB.UTF-8.
    • Given a latin1 text file, open()+ will fail with
      'UnicodeDecodeError: 'utf8' codec can't decode byte 0xf6...'
    • Using locale.setlocale(..., ...)
    • Re-open causes same error, I/O layer codec has not been changed!
    • Using os.environ["LC_ALL"] = ...
    • Re-open works properly, I/O layer codec has been changed.
      P.S.: i am new to Python, please don't assume i can help in solving the problem!

    @sdaoden sdaoden mannequin added the stdlib (Python modules in the Lib dir) and type-bug (An unexpected behavior, bug, or error) labels Jan 27, 2011
    @vstinner
    Member

    > • Using locale.setlocale(..., ...)
    > • Re-open causes same error, I/O layer codec has not been changed!

    Yes, this is the expected behaviour with the current code.

    TextIOWrapper indirectly uses locale.getpreferredencoding() to choose your file encoding. If the locale module has the CODESET constant, this function temporarily sets LC_CTYPE to "" and uses nl_langinfo(CODESET) to get the locale encoding.

    locale.getpreferredencoding() has an option to not set the LC_CTYPE to "": locale.getpreferredencoding(False).

    Example:
    ---------------------------

    $ python3.1
    Type "help", "copyright", "credits" or "license" for more information.
    >>> from locale import getpreferredencoding, setlocale, LC_CTYPE
    >>> from locale import nl_langinfo, CODESET
    
    >>> setlocale(LC_CTYPE, None)
    'fr_FR.utf8'
    >>> getpreferredencoding()
    'UTF-8'
    >>> getpreferredencoding(False)
    'UTF-8'
    
    >>> setlocale(LC_CTYPE, 'fr_FR.iso88591')
    'fr_FR.iso88591'
    >>> nl_langinfo(CODESET)
    'ISO-8859-1'
    >>> getpreferredencoding()
    'UTF-8'
    >>> getpreferredencoding(False)
    'ISO-8859-1'

    Setting LC_CTYPE directly changes the result of nl_langinfo(CODESET), but not the result of getpreferredencoding(), because getpreferredencoding() ignores the current locale: it uses its own LC_CTYPE value ("").

    getpreferredencoding(False) uses the current locale and gives the expected result.
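A minimal sketch of the two call styles discussed above, using only the standard `locale` module. The helper name `compare_preferred_encodings` is made up for illustration, and the values returned depend on the platform, the Python version, and the environment (in UTF-8 mode, for instance, both calls report UTF-8).

```python
import locale


def compare_preferred_encodings():
    """Return (current_locale_encoding, environment_encoding, ctype_restored)."""
    # Query the active LC_CTYPE without changing it.
    before = locale.setlocale(locale.LC_CTYPE)

    # do_setlocale=False: report the encoding of the locale that is active now.
    from_current_locale = locale.getpreferredencoding(False)

    # do_setlocale=True (the default): LC_CTYPE may be switched temporarily
    # to "" so that the environment variables decide the answer.
    from_environment = locale.getpreferredencoding(True)

    # The temporary change, if any, should be undone by the time we get here.
    after = locale.setlocale(locale.LC_CTYPE)
    return from_current_locale, from_environment, before == after


if __name__ == "__main__":
    print(compare_preferred_encodings())
```

The third element only checks that LC_CTYPE looks the same before and after the calls; it says nothing about what other threads might observe while the temporary switch is in effect.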

    > • Using os.environ["LC_ALL"] = ...
    > • Re-open works properly, I/O layer codec has been changed.

    Setting LC_ALL works because getpreferredencoding() sets LC_CTYPE to "", which makes the C library read the current values of the LC_ALL and LC_CTYPE environment variables.
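The mechanism can be sketched roughly as follows, assuming a POSIX system where `setlocale(LC_CTYPE, "")` consults the environment. The helper name `ctype_from_environment` is hypothetical, and the sketch restores both the variable and the locale before returning.

```python
import locale
import os


def ctype_from_environment(value="C"):
    """Set LC_ALL, let the C library re-read the environment, then restore."""
    saved_locale = locale.setlocale(locale.LC_CTYPE)  # query only
    saved_env = os.environ.get("LC_ALL")
    try:
        os.environ["LC_ALL"] = value
        # An empty string asks the C library to pick the locale from the
        # environment (LC_ALL takes precedence over LC_CTYPE and LANG).
        return locale.setlocale(locale.LC_CTYPE, "")
    finally:
        # Put the environment variable and the locale back as they were.
        if saved_env is None:
            del os.environ["LC_ALL"]
        else:
            os.environ["LC_ALL"] = saved_env
        locale.setlocale(locale.LC_CTYPE, saved_locale)


if __name__ == "__main__":
    print(ctype_from_environment("C"))
```

A locale name that is not installed on the system makes `locale.setlocale()` raise `locale.Error`; the `finally` clause still restores the previous state in that case.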

    --

    Actually, TextIOWrapper doesn't use the current locale, it only uses (indirectly) the environment variables. I don't know which behaviour is better.

    If you would like TextIOWrapper to use your current locale, use: open(filename, encoding=locale.getpreferredencoding(False)).

    Anyway, I don't understand why you change your locale, since you know that your file encoding is Latin1. Why don't you directly use: open(filename, encoding='latin1')?
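As an illustration of the explicit-encoding route, the sketch below writes a small Latin-1 file containing the byte 0xf6 and reads it back twice. The helper name `read_latin1_demo` and the sample text are invented for this example.

```python
import os
import tempfile


def read_latin1_demo():
    """Return (utf8_failed, text) for a temporary Latin-1 encoded file."""
    sample = "Gr\u00f6\u00dfe"          # encodes to b'Gr\xf6\xdfe' in Latin-1
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(sample.encode("latin-1"))

        # Decoding as UTF-8 fails because 0xf6 cannot start a UTF-8 sequence.
        try:
            with open(path, encoding="utf-8") as f:
                f.read()
            utf8_failed = False
        except UnicodeDecodeError:
            utf8_failed = True

        # Naming the encoding explicitly works regardless of the locale.
        with open(path, encoding="latin-1") as f:
            text = f.read()
    finally:
        os.remove(path)
    return utf8_failed, text


if __name__ == "__main__":
    print(read_latin1_demo())
```

Passing `encoding=` keeps the result independent of both the environment variables and any `setlocale()` call made earlier in the program.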

    @vstinner
    Member

    > This bug may be based on same problem as bpo-6203.

    Nope, the two issues are different. Here you want TextIOWrapper to read your current locale, not your environment variables. Issue bpo-6203 asks why the default LC_CTYPE is not C but the user's locale LC_CTYPE (read from the LC_ALL or LC_CTYPE environment variables).

    @sdaoden
    Mannequin Author

    sdaoden mannequin commented Jan 27, 2011

    > Anyway, I don't understand why you change your locale,
    > since you know that your file encoding is Latin1. Why don't you
    > directly use: open(filename, encoding='latin1')?

    Fortunately bpo-9124 is being solved soon due to the very active
    happy hacker haypo ...
    I have read haypo's add-on to bpo-6203 and, since he refers to
    this issue here, i'll add some thoughts of mine, though they possibly
    should not belong into a bug tracker ...

    My misunderstanding was based upon an old project of mine,
    where i've used the environment to initialize the library state
    upon program startup only, but afterwards the entire handling was centralized upon some "Locale" class (changes therein were dispatched
    to and thus reflected by a TextCodec etc. - you may see bpo-9727,
    though my solution was hardwired).
    Like that turning one screw managed the entire system.

    If Python would be my project, i would change this code,
    because i do not see a real difference in os.environ[LC_]=
    and locale.setlocale(LC_,)!
    Both cases indicate the user's desire to change a specific locale
    setting and thus - of course - all the changes which that implies!
    So why should there be a difference?

    What i really have to say is that the (3.1) implementation of getpreferredencoding() is horror, not only in respect to SMP
    (it's a no-go, then, even with locking, but that's not present).
    If Python would be mine (after thinking one hour without any
    feedback of anybody else), i would do the following:

    • upon program startup, init LibC environment:
      setlocale(LC_ALL, "");
      (see <http://pubs.opengroup.org/onlinepubs/009695399/functions/setlocale.html>)
      Then init this very basic codeset in an unthreaded state:
      global_very_default_codeset = nl_langinfo(CODESET);
      After that, rename this terrible "do_setlocale" argument to
      "use_locale_active_on_program_startup".
      Then i would start a discussion whether such an argument is useful at
      all, because you would probably always pass "False".
      Do ya???
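The proposal above could look roughly like this at the Python level. This is only a sketch of the idea, not what CPython does: the names `_STARTUP_CODESET` and `init_locale_once` are hypothetical, and `locale.nl_langinfo` is not available on every platform, hence the fallback.

```python
import locale

# Hypothetical module-level cache for the codeset found at startup.
_STARTUP_CODESET = None


def init_locale_once():
    """Adopt the user's locale once and remember its codeset."""
    global _STARTUP_CODESET
    if _STARTUP_CODESET is None:
        try:
            # Meant to run once, early, before any threads are started.
            locale.setlocale(locale.LC_ALL, "")
        except locale.Error:
            # The environment names a locale the system does not support;
            # keep whatever locale is currently active.
            pass
        if hasattr(locale, "nl_langinfo"):
            _STARTUP_CODESET = locale.nl_langinfo(locale.CODESET)
        else:
            _STARTUP_CODESET = locale.getpreferredencoding(False)
    return _STARTUP_CODESET


if __name__ == "__main__":
    print(init_locale_once())
```

Later calls return the cached value without touching the locale again, which is the point of doing the initialization in a single place.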

    @pitrou
    Member

    pitrou commented Jan 27, 2011

    > Both cases indicate the user's desire to change a specific locale
    > setting and thus - of course - all the changes which that implies!
    > So why should there be a difference?

    I don't think it's intentional. I would be +1 on changing to getpreferredencoding(False).

    @loewis
    Mannequin

    loewis mannequin commented Jan 27, 2011

    >> Both cases indicate the user's desire to change a specific locale
    >> setting and thus - of course - all the changes which that implies!
    >> So why should there be a difference?
    >
    > I don't think it's intentional. I would be +1 on changing to getpreferredencoding(False).

    That won't actually work. If you always use the C library's locale
    setting, most scripts will run in the C locale, and fail to read text
    files properly.

    @loewis loewis mannequin changed the title from "locale.setlocale() doesn't change I/O codec, os.environ[] does" to "locale.setlocale() doesn't change I/O codec, os.environ" Jan 27, 2011
    @vstinner
    Member

    Set version to 3.3, I think that it is too late to change such critical code in Python 3.2.

    @vstinner
    Member

    > upon program startup, init LibC environment: setlocale(LC_ALL, "");

    Python 3 does something like that: Py_InitializeEx() calls setlocale(LC_CTYPE, ""). But I (and others) consider that a bug (see the bpo-6203 discussion): neither Python nor any library should do that implicitly, but a *program* can do it explicitly, once, at startup.

    @pitrou
    Member

    pitrou commented Jan 27, 2011

    >>> Both cases indicate the user's desire to change a specific locale
    >>> setting and thus - of course - all the changes which that implies!
    >>> So why should there be a difference?
    >>
    >> I don't think it's intentional. I would be +1 on changing to getpreferredencoding(False).
    >
    > That won't actually work. If you always use the C library's locale
    > setting, most scripts will run in the C locale, and fail to read text
    > files properly.

    Well, is it any different from today? That's an innocent question: I
    don't know if there's a difference between "C locale" and "empty
    locale".

    @pitrou pitrou changed the title from "locale.setlocale() doesn't change I/O codec, os.environ" to "locale.setlocale() doesn't change I/O codec, os.environ" Jan 27, 2011
    @malemburg
    Member

    STINNER Victor wrote:

    > STINNER Victor <victor.stinner@haypocalc.com> added the comment:
    >
    >> upon program startup, init LibC environment: setlocale(LC_ALL, "");
    >
    > Python 3 does something like that: Py_InitializeEx() calls setlocale(LC_CTYPE, ""). But I (and others) consider that a bug (see bpo-6203 discussion): Python should not do that (nor any library) implicitly, but a *program* can do that (once) at startup (explicitly).

    Agreed. See the discussion on the ticket for more details.

    setlocale() should only be called by applications, not by libraries.
    For Python this means: calling it in main() is fine, but not
    in Py_InitializeEx().

    @malemburg malemburg changed the title from "locale.setlocale() doesn't change I/O codec, os.environ" to "locale.setlocale() doesn't change I/O codec, os.environ" Jan 28, 2011
    @Arfrever Arfrever mannequin changed the title from "locale.setlocale() doesn't change I/O codec, os.environ" to "locale.setlocale() doesn't change I/O codec, os.environ does" Jan 28, 2011
    @sdaoden
    Mannequin Author

    sdaoden mannequin commented Jan 29, 2011

    Also in respect to bpo-6203 i could talk about a project which did not link against anything in the end, only ld(1) and syscalls and the undocumented third 'char **envp' arg to UNIX main()s.
    Thus: all of you should be *very* happy about the warm and cosy environment of LibC etc.!
    You've decided to redo Python as Py3k; i guess that has got something to do with, let me describe it as, UNICODE.
    Thus: you need a locale.

    • Environment: has an encoding, though keys are ok to parse in ASCII
      (unless your OS allows wide characters *optionally*).
      Still, LC_ values may be specified in a *lot* of different ways,
      but one thing is true: it's hard to do in plain C without being
      able to use stuff which *may* depend upon an initialized library
    • Path names: have an encoding
    • Console I/O: has an encoding
    • File I/O: this is all dumb bytes, just do what you want

    Conclusion: you need a locale.

    • Hardcode defaults
    • Spread specific things all across the implementation.
      I.e., in path access, use some os.path._sysdep.default_codeset(),
      in console I/O do os.console._sysdep.default_codeset() etc.
      (i'm lying about names)
    • Perform an initial global initialization

    So - what are you all talking about?
    No one - and i really mean NO ONE - can assume that a fully blown environment like python(1) can be used as an isolated sandbox thing
    like ECMAScript! File I/O, child processes ... Shall an entire interpreter lifecycle be possible in a signal(3) handler
    (uuhh, just kiddin')? Even if that would be true for 2.7 (don't know), in Py3k there is graceful and seamless UNICODE support.
    You need a locale.

    I would indeed insist on the following:

    • The interpreter *has* to be initialized in the cosy LibC
      (or whatever native thing) environment.
      Like this it embeds itself seamlessly in there.
      This *has* to be performed in an *unthreaded* state.
      If you are really concerned about anything here,
      add an additional argument (or is it there yet? I did *not*
      look in there - i would/will need long months to get an idea
      of the entire python(1) system) to your interpreter's setup()
      like thing, or allow NULL to nevertheless use setlocale() directly.
      Like this the embedder can choose herself which approach she
      wants to adhere.
    • Even if 3.DID_IT ends up with a lot of 'encoding=STRING' instead
      of 'codec=None' (aka 'codec=codec_instance'), i would implement
      the system in a way that a change at a single place is automatically
      reflected all through the system (on a no-arg-then-use-default)
      base.

    After the end:
    someone who earned about 150 bucks from me for two books i bought
    almost a decade ago once i've started Thinking In ... programming
    said some years ago (as i've read in the german magazine c't):
    "In Python i am even more productive than with Java."
    (I always was in doubt about that person - someone who is productive
    in Java, who may that be?)
    Thanks for python(1), and have a nice weekend.

    @vstinner
    Member

    vstinner commented Feb 1, 2011

    Attached patch replaces locale.getpreferredencoding() by locale.getpreferredencoding(False) in _io.TextIOWrapper and _pyio.TextIOWrapper.
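One way to observe the effect of this change is to let TextIOWrapper pick its own default and compare it with `locale.getpreferredencoding(False)`. The sketch below only reports the two names; the helper `default_text_encoding` is hypothetical. On interpreters that include the change the two are expected to name the same codec, though the spelling may differ.

```python
import io
import locale


def default_text_encoding():
    """Return (encoding chosen by TextIOWrapper, getpreferredencoding(False))."""
    # No encoding argument is given, so TextIOWrapper chooses its default.
    wrapper = io.TextIOWrapper(io.BytesIO(b""))
    try:
        return wrapper.encoding, locale.getpreferredencoding(False)
    finally:
        # Closing the wrapper also closes the underlying BytesIO.
        wrapper.close()


if __name__ == "__main__":
    print(default_text_encoding())
```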

    @bitdancer
    Member

    Steffan: I'm not sure what your post means, but I think there is a chance you might be confused about something. Python should *never* change the locale from the C locale. A Python *program* can do so, by calling setlocale, but Python itself should not. This is because when an arbitrary Python program is run, it needs to run in the C locale *unless it chooses otherwise*. To do anything else would produce myriad portability problems for any code that is affected by locale settings (especially when the programmer doesn't know that it is so affected).

    This is orthogonal to the issue of deciding what encoding to use for various bits of I/O, where Python may need to discover what locale the user has chosen as a default. It's too bad libc makes this so hard to do safely.

    @sdaoden
    Mannequin Author

    sdaoden mannequin commented Feb 1, 2011

    Most of this is much too loud for a newbie who is about to read PEP-7 anyway. And if this community has chosen to try (?!?) not to break compatibility with code which does not have a notion of a locale setting (i.e. naively uses other code in that spirit), you know, then this is simply the way it is. Thus: you're right. I do agree with what you say, we here have a (8-bit) C++ library which does this in its setup():

        // Initialize those Locale variables we're responsible for
        Locale::_ctype_cclass = Locale::_posix_cclass;
        Locale::_ctype_ccase = Locale::_posix_ccase;
    

    (Like i said: we here went completely crazy and avoid system libraries whenever possible and at least directly, doing the stuff ourselves and only with syscalls.)

    Besides that i would agree with me that unthreaded init, optional embedder locale argument, cleanup of .getprefer...() and other drops of setlocale() are/would be good design decisions. And of course: "keeping the thing simple and understandable" is a thing to keep in mind in respect to a normal user.

    After the end (i have to excuse myself once again for a book):
    I, f.e., opened bpo-11059 on Saturday because the HG repo was (2.7 may still be) not cloneable, and i did so at selenic, too. Notes on that:

    • pitrou closed it because this tracker is of course for Python bugs. (I asked him to decide - thanks.)
    • The selenic people told me that i added my trace to a completely wrong issue. (Just searched - that's more than shown in trace dump.)
    • I've found out that many, *many* issues seem to have been created due to this repo failure at python.org (at selenic), and i've added a note that they possibly should include a prominent notice that people should look for "most recent call last" before creating a new one. (I guess that most of these people are programmers - who else uses HG?)
    • Conclusion: maybe even os.environ[]= == locale.setlocale() is not simple minded enough.

    @vstinner vstinner changed the title from "locale.setlocale() doesn't change I/O codec, os.environ does" to "locale.getpreferredencoding() must not set temporary LC_CTYPE" Jun 4, 2012
    @loewis
    Mannequin

    loewis mannequin commented Jun 5, 2012

    I think it's absolutely necessary that text files, by default, are opened in the encoding of the user's locale, whether the script has called setlocale or not.

    There are reasons for C to not automatically call setlocale at startup (mostly backwards compatibility), but they don't apply to Python.

    @python-dev
    Mannequin

    python-dev mannequin commented Jun 5, 2012

    New changeset 2587328c7c9c by Victor Stinner in branch 'default':
    Close bpo-11022: TextIOWrapper doesn't call locale.setlocale() anymore
    http://hg.python.org/cpython/rev/2587328c7c9c

    @python-dev python-dev mannequin closed this as completed Jun 5, 2012
    @python-dev
    Mannequin

    python-dev mannequin commented Jul 8, 2012

    New changeset 6651c932d014 by Florent Xicluna in branch 'default':
    Issue bpo-11022 and bpo-15287: correctly remove the TESTFN file in test_builtin.
    http://hg.python.org/cpython/rev/6651c932d014

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022

    4 participants