Message 284647 - Python tracker


Author	vstinner
Recipients	Jan Niklas Hasse, Sworddragon, abarry, akira, barry, ezio.melotti, lemburg, methane, ncoghlan, r.david.murray, vstinner, yan12125
Date	2017-01-04.16:06:08
Message-id	<CAMpsgwYA3Cs8ofigMC+NoUEJmbi9Gia45CFeD-WdY2mEd501JQ@mail.gmail.com>
In-reply-to	<1483541173.33.0.931483633089.issue28180@psf.upfronthosting.co.za>

Content
> The default encoding in the C/POSIX locale is ASCII (which is the entire source of the problem).

The reality is more complex than that :-) It depends on the OS.

Some OSes use Latin1 for the POSIX locale. Some OSes announce ASCII
for the POSIX locale, but use Latin1 in practice :-) On these
lying OSes, Python decodes bytes 0x80..0xff using mbstowcs() to check
whether we get ASCII or Latin1: see the check_force_ascii() function.
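
To see what a given system *announces* (as opposed to what its libc
actually does), something like the following can be run from Python.
This is only a minimal sketch: announced_locale_info() is a hypothetical
helper name, not part of CPython, and nl_langinfo()/CODESET are assumed
to be available only on POSIX-like platforms.

```python
import locale


def announced_locale_info():
    """Return (LC_CTYPE locale name, announced codeset or None).

    Hypothetical helper for illustration; it only reports what the C
    library announces, not what mbstowcs() really decodes.
    """
    # Passing None queries the current setting without changing it.
    ctype_name = locale.setlocale(locale.LC_CTYPE, None)

    # nl_langinfo() and CODESET do not exist on every platform.
    if hasattr(locale, "nl_langinfo") and hasattr(locale, "CODESET"):
        codeset = locale.nl_langinfo(locale.CODESET)
    else:
        codeset = None
    return ctype_name, codeset


if __name__ == "__main__":
    name, codeset = announced_locale_info()
    print("LC_CTYPE locale:", name)
    print("announced codeset:", codeset)
```

The printed values depend on the OS and on the environment, which is
the point: the announced codeset alone does not tell you how bytes
0x80..0xff are really decoded.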

/* Workaround FreeBSD and OpenIndiana locale encoding issue with the C locale.
   On these operating systems, nl_langinfo(CODESET) announces an alias of the
   ASCII encoding, whereas mbstowcs() and wcstombs() functions use the
   ISO-8859-1 encoding. The problem is that os.fsencode() and os.fsdecode() use
   locale.getpreferredencoding() codec. For example, if command line arguments
   are decoded by mbstowcs() and encoded back by os.fsencode(), we get a
   UnicodeEncodeError instead of retrieving the original byte string.

   The workaround is enabled if setlocale(LC_CTYPE, NULL) returns "C",
   nl_langinfo(CODESET) announces "ascii" (or an alias to ASCII), and at least
   one byte in range 0x80-0xff can be decoded from the locale encoding. The
   workaround is also enabled on error, for example if getting the locale
   failed.

    (...) */
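
The conditions listed in that comment can be modelled in pure Python.
This is a rough sketch, not the C implementation: should_force_ascii(),
is_ascii_alias() and the two decoder callables are hypothetical names
introduced here, and the decoders merely stand in for what mbstowcs()
would do on an honest and on a "lying" libc.

```python
import codecs


def is_ascii_alias(codeset):
    """True if Python's codec registry resolves the name to ASCII."""
    try:
        return codecs.lookup(codeset).name == "ascii"
    except LookupError:
        return False


def should_force_ascii(ctype_locale, announced_codeset, decode_byte):
    """Model of the decision described in the comment above.

    decode_byte(b) is an assumed stand-in for mbstowcs() on one byte;
    it returns True if the byte could be decoded.
    """
    # On error (e.g. the locale could not be read), enable the workaround.
    if ctype_locale is None or not announced_codeset:
        return True
    if ctype_locale != "C":
        return False
    if not is_ascii_alias(announced_codeset):
        return False
    # Announced ASCII: lying if any byte in 0x80..0xff still decodes.
    return any(decode_byte(bytes([b])) for b in range(0x80, 0x100))


def latin1_libc(b):
    """Stand-in for a libc whose mbstowcs() really uses ISO-8859-1."""
    b.decode("latin-1")
    return True


def ascii_libc(b):
    """Stand-in for a libc whose mbstowcs() really uses ASCII."""
    try:
        b.decode("ascii")
        return True
    except UnicodeDecodeError:
        return False


def round_trip_fails(arg):
    """Decode like a Latin1 mbstowcs(), encode back with announced ASCII."""
    text = arg.decode("latin-1")
    try:
        text.encode("ascii")
    except UnicodeEncodeError:
        return True
    return False


if __name__ == "__main__":
    print(should_force_ascii("C", "US-ASCII", latin1_libc))
    print(should_force_ascii("C", "US-ASCII", ascii_libc))
    print(round_trip_fails(b"caf\xe9"))
```

With the workaround enabled, using the same codec in both directions
avoids the mismatch: for instance, decoding and encoding with ASCII and
the surrogateescape error handler returns the original byte string.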
