Message 200957 - Python tracker



Author	vstinner
Recipients	BreamoreBoy, eric.smith, loewis, mark.dickinson, mcepl, skrah, vstinner
Date	2013-10-22.14:19:06
Message-id	<1382451546.55.0.791294110743.issue7442@psf.upfronthosting.co.za>

Content

msg95988> Hi, the following works in 2.7 but not in 3.x: ...

Sure, it works because Python 2 passes the raw byte string through; it does not try to decode it. But did you try to display the result, in a terminal for example?

Example with Python 2 in a UTF-8 terminal:

$ python
Python 2.7.5 (default, Oct  8 2013, 12:19:40) 
[GCC 4.8.1 20130603 (Red Hat 4.8.1-1)] on linux2
>>> import locale
>>> # set the locale encoding to UTF-8
... locale.setlocale(locale.LC_CTYPE, 'fr_FR.utf8')
'fr_FR.utf8'
>>> # set the thousand separator to U+00A0
... locale.setlocale(locale.LC_NUMERIC, 'fi_FI')
'fi_FI'
>>> locale.getlocale(locale.LC_CTYPE)
('fr_FR', 'UTF-8')
>>> locale.getlocale(locale.LC_NUMERIC)
('fi_FI', 'ISO8859-15')
>>> locale.format('%d', 123456, True)
'123\xa0456'
>>> print(locale.format('%d', 123456, True))
123�456

Mojibake! � means that b'\xA0' cannot be decoded from the locale encoding (UTF-8).
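
The decoding failure can be reproduced without changing any locale. The following is a minimal Python 3 sketch, assuming the byte string from the session above and assuming that the fi_FI locale data is encoded as ISO-8859-15 (as reported by getlocale() in that session); the variable names are illustrative only.

```python
# Raw bytes as returned by locale.format() in the Python 2 session above.
raw = b"123\xa0456"

# Decoded with the encoding of LC_NUMERIC (ISO-8859-15), 0xA0 is U+00A0,
# the NO-BREAK SPACE used as the thousands separator.
text = raw.decode("iso8859-15")

# Decoded with the encoding of LC_CTYPE (UTF-8), 0xA0 is invalid on its own:
# it is a continuation byte without a lead byte.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Roughly what a UTF-8 terminal displays: the bad byte becomes U+FFFD.
shown = raw.decode("utf-8", errors="replace")

print(repr(text), utf8_ok, repr(shown))
```

The same bytes are valid or invalid depending only on which encoding is assumed, which is why the result depends on whether LC_NUMERIC and LC_CTYPE agree.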


There is probably the same issue with LC_MONETARY when it uses a different encoding than LC_CTYPE.


> I suspect that this is related: #5905

It is unrelated: time.strftime() uses LC_CTYPE, but Python was using the wrong encoding. Python used the locale encoding read at startup, whereas the *current* locale encoding must be used.

This issue is specific to LC_NUMERIC combined with an LC_CTYPE that uses a different encoding.


> If I set LC_CTYPE and LC_NUMERIC together, things work.

Sure, because in this case, LC_NUMERIC produces data in the same encoding as LC_CTYPE.


> call setlocale(LC_CTYPE, setlocale(LC_NUMERIC, NULL)) before
> mbstowcs. This is not really an option.

Setting a locale is process-wide and should be avoided. FYI, locale.getpreferredencoding() temporarily changes LC_CTYPE by default; it only uses the current LC_CTYPE if you pass False. Because of that, open() temporarily changed LC_CTYPE in Python 3.0-3.2 (see issue #11022).
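
As a small illustration of the call mentioned above, here is a sketch that asks for the encoding of the current LC_CTYPE without letting the function touch the locale. It assumes Python 3; the value returned depends on the environment (and on UTF-8 mode in recent Python versions), so no specific output is shown.

```python
import locale

# Passing False (do_setlocale=False) asks the function to inspect the
# current LC_CTYPE only, without calling setlocale() behind the scenes.
current_encoding = locale.getpreferredencoding(False)

print(current_encoding)
```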

The following PostgreSQL issue looks to be the same as this Python issue:
http://www.postgresql.org/message-id/20100422015552.4B7E07541D0@cvs.postgresql.org

The fix temporarily changes LC_CTYPE:

#ifdef WIN32
setlocale(LC_CTYPE, locale_monetary);
#endif

(I don't know why the code is specific to Windows.)
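
For comparison, the "temporarily change LC_CTYPE" approach could look like this in Python. This is only a sketch of the technique, not a proposed fix: temporary_lc_ctype is a hypothetical helper, the "C" locale is used only because it exists everywhere, and the change is still process-wide (other threads see it while the block runs), which is the reason given above for avoiding it.

```python
import locale
from contextlib import contextmanager


@contextmanager
def temporary_lc_ctype(name):
    """Hypothetical helper: switch LC_CTYPE to `name`, then restore it."""
    # Calling setlocale() without a second argument only queries the setting.
    saved = locale.setlocale(locale.LC_CTYPE)
    locale.setlocale(locale.LC_CTYPE, name)
    try:
        yield
    finally:
        # Always restore the previous LC_CTYPE, even if the body raised.
        locale.setlocale(locale.LC_CTYPE, saved)


before = locale.setlocale(locale.LC_CTYPE)
with temporary_lc_ctype("C"):
    # Locale-dependent decoding would happen here.
    inside = locale.setlocale(locale.LC_CTYPE)
after = locale.setlocale(locale.LC_CTYPE)

print(before, inside, after)
```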

History
Date	User	Action	Args
2013-10-22 14:19:06	vstinner	set	recipients: + vstinner, loewis, mark.dickinson, eric.smith, mcepl, skrah, BreamoreBoy
2013-10-22 14:19:06	vstinner	set	messageid: <1382451546.55.0.791294110743.issue7442@psf.upfronthosting.co.za>
2013-10-22 14:19:06	vstinner	link	issue7442 messages
2013-10-22 14:19:06	vstinner	create