locale documentation doesn't mention that LC_CTYPE is changed at startup #50452

Closed
ned-deily opened this issue Jun 5, 2009 · 27 comments
Labels
docs Documentation in the Doc dir topic-unicode type-bug An unexpected behavior, bug, or error

Comments

@ned-deily
Member

BPO 6203
Nosy @malemburg, @loewis, @birkenfeld, @pitrou, @vstinner, @ned-deily, @ezio-melotti, @bitdancer, @akheron
Files
  • locale_doc.patch
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = None
    closed_at = <Date 2012-06-05.23:39:39.854>
    created_at = <Date 2009-06-05.10:56:37.084>
    labels = ['type-bug', 'expert-unicode', 'docs']
    title = "locale documentation doesn't mention that LC_CTYPE is changed at startup"
    updated_at = <Date 2012-06-05.23:39:39.832>
    user = 'https://github.com/ned-deily'

    bugs.python.org fields:

    activity = <Date 2012-06-05.23:39:39.832>
    actor = 'python-dev'
    assignee = 'docs@python'
    closed = True
    closed_date = <Date 2012-06-05.23:39:39.854>
    closer = 'python-dev'
    components = ['Documentation', 'Unicode']
    creation = <Date 2009-06-05.10:56:37.084>
    creator = 'ned.deily'
    dependencies = []
    files = ['25830']
    hgrepos = []
    issue_num = 6203
    keywords = ['patch']
    message_count = 27.0
    messages = ['88932', '89016', '89077', '89084', '89088', '89089', '89090', '89101', '89102', '89120', '89136', '127180', '127262', '127265', '127283', '127347', '127350', '127351', '127417', '141830', '141847', '141872', '141890', '147174', '162340', '162355', '162380']
    nosy_count = 13.0
    nosy_names = ['lemburg', 'loewis', 'georg.brandl', 'pitrou', 'vstinner', 'ned.deily', 'ezio.melotti', 'Arfrever', 'r.david.murray', 'alexis', 'sdaoden', 'python-dev', 'petri.lehtinen']
    pr_nums = []
    priority = 'high'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue6203'
    versions = ['Python 3.2', 'Python 3.3']

    @ned-deily
    Member Author

    In the Library Reference section 22.2.1 for locale, it states:

    "Initially, when a program is started, the locale is the C locale, no
    matter what the user’s preferred locale is. The program must explicitly
    say that it wants the user’s preferred locale settings by calling
    setlocale(LC_ALL, '')."

    This is the case for python2.x:

    $ export LANG=en_US.UTF-8
    $ python2.5
    Python 2.5.4 (r254:67916, Feb 17 2009, 20:16:45) 
    [GCC 4.3.3] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import locale; locale.getlocale()
    (None, None)
    >>> locale.getdefaultlocale()
    ('en_US', 'UTF8')
    >>> 
    
    but not for 3.1:
    $ python3.1
    Python 3.1a1+ (py3k, Mar 23 2009, 00:12:12) 
    [GCC 4.3.3] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import locale; locale.getlocale()
    ('en_US', 'UTF8')
    >>> locale.getdefaultlocale()
    ('en_US', 'UTF8')
    >>> 

    Either the code is incorrect in 3.1 or the documentation should be
    updated.
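A quick way to check which behaviour a given interpreter has is to query the locale without changing it. The following is a minimal sketch using only the standard `locale` module; the helper name `current_ctype` is made up for illustration and does not come from the issue. Calling `setlocale()` without a locale argument only queries the setting.

```python
import locale


def current_ctype():
    """Return the LC_CTYPE setting as the C library reports it.

    Calling setlocale() without a locale argument only queries the
    current value; it does not change anything.
    """
    return locale.setlocale(locale.LC_CTYPE)


value = current_ctype()
print("LC_CTYPE right now:", value)
# "C" (or "POSIX") means the category was left untouched; anything else
# means it was set before this code ran, by Python itself or by an embedder.
print("still the C locale:", value in ("C", "POSIX"))
```

Run as the first statement of a script, this reports what the interpreter did during startup, independently of how `getlocale()` chooses to parse the name.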

    @ned-deily ned-deily added docs Documentation in the Doc dir type-bug An unexpected behavior, bug, or error labels Jun 5, 2009
    @ezio-melotti
    Member

    Confirmed for 3.1, 3.0 still returns (None, None).

    @ezio-melotti ezio-melotti added the stdlib Python modules in the Lib dir label Jun 6, 2009
    @birkenfeld
    Member

    Deferring to Martin which one is correct :)

    @birkenfeld birkenfeld assigned loewis and unassigned birkenfeld Jun 8, 2009
    @bitdancer
    Member

    This is definitely a bug in 3.1, for the same reason that a C program
    uses the C locale until an explicit setlocale is done: otherwise, a
    non-locale-aware program can run into bugs resulting from locale issues
    when run under a different locale than that of the program author.

    I have a memory of this being reported before somewhere and someone
    tracking it down to a change in Python initialization, but I can't find
    a bug report and my google-fu is failing me.

    @pitrou
    Member

    pitrou commented Jun 8, 2009

    For some reason only LC_CTYPE is affected:

    >>> locale.getlocale(locale.LC_CTYPE)
    ('fr_FR', 'UTF8')
    >>> locale.getlocale(locale.LC_MESSAGES)
    (None, None)
    >>> locale.getlocale(locale.LC_TIME)
    (None, None)
    >>> locale.getlocale(locale.LC_NUMERIC)
    (None, None)
    >>> locale.getlocale(locale.LC_COLLATE)
    (None, None)
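The same per-category check can be scripted. This is a sketch only: the list `CATEGORY_NAMES` and the helper `category_settings` are illustrative names, and the code queries with `setlocale()` rather than `getlocale()` because `getlocale()` parses the locale name and can raise `ValueError` for names it does not recognise.

```python
import locale

# Categories to inspect. LC_MESSAGES is not available on every platform
# (for example Windows), so each constant is looked up defensively.
CATEGORY_NAMES = ["LC_CTYPE", "LC_COLLATE", "LC_TIME", "LC_NUMERIC",
                  "LC_MONETARY", "LC_MESSAGES"]


def category_settings():
    """Map each available category name to its current setting string."""
    settings = {}
    for name in CATEGORY_NAMES:
        category = getattr(locale, name, None)
        if category is not None:
            # Query only: without a locale argument nothing is changed.
            settings[name] = locale.setlocale(category)
    return settings


for name, value in category_settings().items():
    print(f"{name:12} {value}")
```

On an interpreter with the behaviour described in this issue, only the `LC_CTYPE` row would be expected to differ from `C`.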

    @bitdancer
    Member

    Ah, I can tell you exactly why that is, then. I noticed this in
    pythonrun.c while grepping the source:

    #ifdef HAVE_SETLOCALE
            /* Set up the LC_CTYPE locale, so we can obtain
               the locale's charset without having to switch
               locales. */
            setlocale(LC_CTYPE, "");
    #endif

    SVN blames Martin in r56922, so this case is assigned appropriately.
    Perhaps changing only LC_CTYPE is safe? I must admit to ignorance as to
    what all the LC variables mean/control.

    @pitrou
    Member

    pitrou commented Jun 8, 2009

    It would still be better if it was unset afterwards. Third-party
    extensions could have LC_CTYPE-dependent behaviour.

    @loewis
    Mannequin

    loewis mannequin commented Jun 8, 2009

    > It would still be better if it was unset afterwards. Third-party
    > extensions could have LC_CTYPE-dependent behaviour.

    In principle, they could, yes - but what specific behavior might that
    be? What will change is character classification, which I consider
    fairly harmless. Also, multi-byte conversion routines will change, which
    is the primary reason for leaving it modified.

    @loewis loewis mannequin changed the title 3.x locale does not default to C, contrary to the documentation and to 2.x behavior 3.x locale does not default to C, contrary to the documentation and to 2.x behavior Jun 8, 2009
    @pitrou
    Member

    pitrou commented Jun 8, 2009

    > In principle, they could, yes - but what specific behavior might that
    > be? What will change is character classification, which I consider
    > fairly harmless. Also, multi-byte conversion routines will change, which
    > is the primary reason for leaving it modified.

    Ok, so I suppose we could leave the code as-is.

    @pitrou pitrou changed the title 3.x locale does not default to C, contrary to the documentation and to 2.x behavior 3.x locale does not default to C, contrary to the documentation and to 2.x behavior Jun 8, 2009
    @bitdancer
    Member

    Since it controls what is considered to be whitespace, it is possible
    this will lead to subtle bugs, but I agree that it seems relatively
    benign, especially considering 3.x's unicode orientation. So, this
    becomes a doc bug...

    @bitdancer bitdancer removed stdlib Python modules in the Lib dir release-blocker labels Jun 8, 2009
    @loewis
    Mannequin

    loewis mannequin commented Jun 9, 2009

    To add a little bit more analysis: posix.device_encoding requires that
    the LC_CTYPE is set. Setting it just in this function would not be
    possible, as setlocale is not thread-safe.

    So for 3.1, it seems that Python must set LC_CTYPE. If somebody can
    propose a patch that avoids that for 3.2, I'd be certainly in favor.

    @loewis loewis mannequin removed their assignment Jun 9, 2009
    @admin admin mannequin assigned docspython and unassigned birkenfeld Oct 29, 2010
    @vstinner
    Member

    > To add a little bit more analysis: posix.device_encoding requires that
    > the LC_CTYPE is set. Setting it just in this function would not be
    > possible, as setlocale is not thread-safe.

    open() indirectly (through locale.getpreferredencoding()) changes the locale temporarily (it sets LC_CTYPE to "") if the file is not a TTY; if it is a TTY, device_encoding() calls nl_langinfo(CODESET) without changing the current locale. If setlocale() is not thread-safe, we may have a problem here. See also bpo-11022, a report from a user who did not understand why setlocale() doesn't affect the encoding used by open() (TextIOWrapper). A quick solution is to call locale.getpreferredencoding(False), which doesn't change the locale.

    Do you really need os.device_encoding()? If we change TextIOWrapper to call locale.getpreferredencoding(False), os.device_encoding() and locale.getpreferredencoding(False) will give the same result, except on Windows: there, os.device_encoding() uses GetConsoleCP() if fd==0 and GetConsoleOutputCP() if fd in (1, 2). But we can use GetConsoleCP() and GetConsoleOutputCP() directly in initstdio(). If someone closes the sys.std* streams and recreates them later, os.device_encoding() can be used explicitly to keep the previous behaviour.
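The difference between the two lookups can be observed directly. The sketch below assumes only the standard `codecs`, `locale` and `os` modules; the helper name `encoding_without_setlocale` is illustrative. It queries `LC_CTYPE` before and after calling `locale.getpreferredencoding(False)` to show that the call leaves the locale alone, then prints what `os.device_encoding()` reports for the standard descriptors.

```python
import codecs
import locale
import os


def encoding_without_setlocale():
    """Ask for the preferred encoding without calling setlocale()."""
    return locale.getpreferredencoding(False)


before = locale.setlocale(locale.LC_CTYPE)   # query only
encoding = encoding_without_setlocale()
after = locale.setlocale(locale.LC_CTYPE)    # query only

print("preferred encoding:", encoding)
print("normalized codec name:", codecs.lookup(encoding).name)
print("LC_CTYPE unchanged:", before == after)

# os.device_encoding() returns None unless the descriptor is a terminal.
for fd in (0, 1, 2):
    try:
        print("fd", fd, "->", os.device_encoding(fd))
    except OSError:
        print("fd", fd, "-> closed or invalid descriptor")
```

When output is redirected to a file or pipe, the last loop prints `None` for that descriptor, which is the case where the locale-based lookup is used instead.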

    > It would still be better if it was unset afterwards. Third-party
    > extensions could have LC_CTYPE-dependent behaviour.

    If Python is embedded, it should not change the locale. Even if it is not embedded, it is maybe better to never set LC_CTYPE.

    It is too late to touch such critical point in Python 3.2, but we may change it in Python 3.3.

    @malemburg
    Member

    Python can be embedded into other applications and unconditionally
    changing the locale (esp. the LC_CTYPE) is not good practice, since
    it's not thread-safe and affects the entire process. An application
    may have set LC_CTYPE (or the locale) to something completely
    different.

    If at all, Python should be more careful using this call (pseudo
    code):

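    /* Needs <locale.h>, <stdlib.h>, <string.h>. */
    char *lc_ctype = setlocale(LC_CTYPE, NULL);
    if (lc_ctype == NULL || strcmp(lc_ctype, "") == 0 || strcmp(lc_ctype, "C") == 0) {
        /* Copy: later setlocale() calls may overwrite the returned string. */
        char *saved = strdup(lc_ctype ? lc_ctype : "C");
        char *env_lc_ctype = setlocale(LC_CTYPE, "");
        env_lc_ctype = env_lc_ctype ? strdup(env_lc_ctype) : NULL;
        setlocale(LC_CTYPE, saved);   /* restore the original setting */
        free(saved);
        lc_ctype = env_lc_ctype;      /* the caller frees this copy */
    }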

    Then use lc_ctype to figure out encodings, etc.

    While this is not thread-safe, it at least reverts the change back
    to the original setting and only applies the change if needed. That's
    still not optimal, but better than nothing.
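The same save, switch and restore pattern can be sketched at the Python level with a context manager. This is an illustration, not the approach CPython took: the name `temporary_ctype` is hypothetical, and the code shares the limitation noted above, since `setlocale()` affects the whole process and is not thread-safe.

```python
import contextlib
import locale


@contextlib.contextmanager
def temporary_ctype(name=""):
    """Switch LC_CTYPE to `name` inside the block, then restore it.

    Not thread-safe: setlocale() changes the locale of the whole process.
    """
    saved = locale.setlocale(locale.LC_CTYPE)   # query the current value
    try:
        try:
            locale.setlocale(locale.LC_CTYPE, name)
        except locale.Error:
            pass  # unsupported locale: carry on with the current one
        yield locale.setlocale(locale.LC_CTYPE)
    finally:
        locale.setlocale(locale.LC_CTYPE, saved)


original = locale.setlocale(locale.LC_CTYPE)
with temporary_ctype("") as active:
    # "" asks the C library to use the environment (LC_ALL, LC_CTYPE, LANG).
    print("inside the block:", active)
print("restored:", locale.setlocale(locale.LC_CTYPE) == original)
```

The `finally` clause restores the saved value even if the body raises, which the C pseudocode would need to handle separately on each error path.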

    A clean alternative would be adding LC_* variable parsing code to
    Python to avoid the setlocale() call altogether.

    @loewis
    Mannequin

    loewis mannequin commented Jan 28, 2011

    > A clean alternative would be adding LC_* variable parsing code to
    > Python to avoid the setlocale() call altogether.

    That would be highly non-portable, and repeat the mistakes of
    getdefaultlocale.

    @loewis loewis mannequin changed the title 3.x locale does not default to C, contrary to the documentation and to 2.x behavior 3.x locale does not default to C, contrary to the documentation and to 2.x behavior Jan 28, 2011
    @malemburg
    Member

    Martin v. Löwis wrote:

    > Martin v. Löwis <martin@v.loewis.de> added the comment:
    >
    > > A clean alternative would be adding LC_* variable parsing code to
    > > Python to avoid the setlocale() call altogether.
    >
    > That would be highly non-portable, and repeat the mistakes of
    > getdefaultlocale.

    You say that often, but I don't really know why. It's certainly portable
    between various Unix platforms, perhaps not Windows, but then i18n
    on Windows is a different story altogether.

    BTW: For Windows, you can adjust setlocale() to work thread-based
    using: _configthreadlocale()
    (http://msdn.microsoft.com/de-de/library/26c0tb7x(v=vs.80).aspx)

    Perhaps we ought to expose this in _locale and use it in
    getdefaultlocale() on Windows to query the locale settings
    via the pseudocode I posted.

    @Arfrever Arfrever mannequin changed the title 3.x locale does not default to C, contrary to the documentation and to 2.x behavior 3.x locale does not default to C, contrary to the documentation and to 2.x behavior Jan 28, 2011
    @loewis
    Mannequin

    loewis mannequin commented Jan 28, 2011

    > > That would be highly non-portable, and repeat the mistakes of
    > > getdefaultlocale.
    >
    > You say that often, but I don't really know why. It's certainly portable
    > between various Unix platforms, perhaps not Windows, but then i18n
    > on Windows is a different story altogether.

    No, it's absolutely not portable across Unix platforms. Looking at
    LANG or LC_ALL does *not* allow you to infer the region name, or
    the locale's character set. For example, using glibc, in some
    installations, /etc/locale.alias is considered to map a value of LANG
    to the final locale name. As an option, glibc also considers a
    LOCALE_ALIAS_PATH that may point to a (colon-separated) path of
    files to search for locale aliases.

    Other systems may use other databases to map a locale name to locale
    properties.

    Unless you know exactly what version of C library is running on
    a system, parsing environment variables yourself is doomed to fail.
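To illustrate the argument, the sketch below contrasts a deliberately naive parser for `LANG`-style values with asking the C library directly. Both helper names (`naive_codeset`, `libc_codeset`) and the sample values are illustrative assumptions; `german` stands in for the kind of alias that a locale alias file may map to a full locale name.

```python
import locale


def naive_codeset(value):
    """Guess the character set by parsing a LANG-style string.

    Deliberately naive: it only understands 'language_REGION.codeset'
    and returns None for anything else, such as an alias.
    """
    if not value or "." not in value:
        return None
    codeset = value.split(".", 1)[1]
    return codeset.split("@", 1)[0] or None   # drop a modifier like @euro


def libc_codeset():
    """Ask the C library for the codeset of the current LC_CTYPE locale."""
    if hasattr(locale, "nl_langinfo"):        # not available on Windows
        return locale.nl_langinfo(locale.CODESET)
    return locale.getpreferredencoding(False)


for sample in ("en_US.UTF-8", "de_DE.ISO-8859-15@euro", "german", "C"):
    print(f"{sample!r:28} parsed -> {naive_codeset(sample)}")
print("C library says:", libc_codeset())
```

The parser has no answer for values that carry no explicit codeset, whereas the C library resolves them through whatever database the platform uses.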

    @loewis loewis mannequin changed the title 3.x locale does not default to C, contrary to the documentation and to 2.x behavior 3.x locale does not default to C, contrary to the documentation and to 2.x behavior Jan 28, 2011
    @Arfrever
    Mannequin

    Arfrever mannequin commented Jan 28, 2011

    Martin v. Löwis:
    It seems that your web browser replaces ", " with ",\t" in the title (where "\t" is a tab character) each time you add a comment.

    @loewis
    Mannequin

    loewis mannequin commented Jan 28, 2011

    More likely, it's my email reader. Sorry about that.

    @sdaoden
    Mannequin

    sdaoden mannequin commented Jan 29, 2011

    User lemburg pointed me to this, but no, I've posted msg127416 to bpo-11022.

    @alexis
    Mannequin

    alexis mannequin commented Aug 9, 2011

    Maybe it would be useful to specify in the documentation that getlocale() is not intended to be used to find out what the system's locale is?

    This is not explained currently, so it's a bit weird to have getlocale() return (None, None) even when your locale is set.

    @bitdancer
    Member

    This issue is about the fact that it doesn't return (None, None). We should probably decide what we are going to do about that before changing the docs if they need it.

    @alexis
    Mannequin

    alexis mannequin commented Aug 10, 2011

    I see two different things here:

    1. the fact that getlocale() doesn't return (None, None) on some python
      versions
    2. the fact that having it returning (None, None) by default is a bit
      misleading as users may think that getlocale() is tied to environment
      variables. That's what was at the origin of bpo-12699

    My last remark is about the second bit. Maybe I should start a new issue
    for this?

    @alexis alexis mannequin changed the title 3.x locale does not default to C, contrary to the documentation and to 2.x behavior 3.x locale does not default to C, contrary to the documentation and to 2.x behavior Aug 10, 2011
    @bitdancer
    Member

    Yes, a new issue would be more appropriate.

    @akheron
    Member

    akheron commented Nov 6, 2011

    If the thread safety of setlocale() is a problem, does anybody know how portable uselocale() is? It sets the locale of the current thread only, so it's safe to temporarily change the locale and then set it back.

    @vstinner
    Member

    vstinner commented Jun 5, 2012

    > Either the code is incorrect in 3.1
    > or the documentation should be updated.

    Leaving LC_CTYPE unchanged (use the "C" locale, which is ASCII in most
    cases) at Python startup would be a major change in Python 3. I don't
    want to change this. You would see a lot of mojibake in your GUIs and get a lot of ugly surrogate characters in filenames (because of PEP 383) if we don't set LC_CTYPE to the user's preferred encoding at startup anymore.

    Setting LC_CTYPE to the user's preferred encoding is just very
    convenient and helps Python to speak to the user through the console,
    to the filesystem, to pass arguments on a command line of a
    subprocess, etc. For example, you cannot pass non-ASCII characters to
    a subprocess, characters written by the user in your GUI, if your
    current LC_CTYPE locale is C (ASCII): you get a Unicode encode error.
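The failure mode described here can be reproduced without touching the process locale by encoding with explicit codecs. This is a sketch under the assumption that OS-facing strings are encoded with the `surrogateescape` error handler (PEP 383); the helper name `encode_for_os` is hypothetical.

```python
text = "caf\u00e9"   # non-ASCII text, as a user might type it in a GUI


def encode_for_os(value, encoding):
    """Encode a string the way OS-facing strings are encoded (PEP 383)."""
    return value.encode(encoding, "surrogateescape")


# With an ASCII locale encoding, the text cannot be passed on.
try:
    encode_for_os(text, "ascii")
except UnicodeEncodeError as exc:
    print("ASCII locale encoding fails:", exc.reason)

# With a UTF-8 locale encoding, it encodes without trouble.
print("UTF-8 locale encoding:", encode_for_os(text, "utf-8"))

# In the other direction, undecodable bytes become lone surrogates
# instead of raising an error. These are the "ugly surrogate characters".
name = b"caf\xe9".decode("ascii", "surrogateescape")
print("decoded under ASCII:", ascii(name))
```

The lone surrogate round-trips back to the original byte when encoded with the same handler, which keeps such filenames usable but makes them unpleasant to display.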

    So it's just a documentation issue: see my attached patch.

    @vstinner vstinner changed the title 3.x locale does not default to C, contrary to the documentation and to 2.x behavior locale documentation doesn't mention that LC_CTYPE is changed at startup Jun 5, 2012
    @ned-deily
    Member Author

    LGTM

    @python-dev
    Mannequin

    python-dev mannequin commented Jun 5, 2012

    New changeset 113cdce4663c by Victor Stinner in branch 'default':
    Close bpo-6203: Document that Python 3 sets LC_CTYPE at startup to the user's preferred locale encoding
    http://hg.python.org/cpython/rev/113cdce4663c

    @python-dev python-dev mannequin closed this as completed Jun 5, 2012
    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022