Message 324219 - Python tracker

Author	vstinner
Recipients	Michael.Felt, michael-o, terry.reedy, vstinner
Date	2018-08-28.08:11:02
Message-id	<1535443862.67.0.56676864532.issue34403@psf.upfronthosting.co.za>

Content
>   File "/var/osipovmi/cpython/Lib/test/test_utf8_mode.py", line 214, in check
>     self.assertEqual(args, ascii(expected), out)
> AssertionError: "['h\\xa7\\xe9']" != "['h\\xcf\\xd5']"
> - ['h\xa7\xe9']
> + ['h\xcf\xd5']
>  : roman8:['h\xa7\xe9']

Hum, it looks like a bug in the HP-UX C library. It announces that the locale encoding is "roman8", but the mbstowcs() function decodes from Latin1. The updated test uses the byte string b'h\xa7\xe9'. Since the OS announces the roman8 encoding, the test expects the Unicode string b'h\xa7\xe9'.decode('roman8') == 'h\xcf\xd5', but it gets 'h\xa7\xe9', which looks as if the byte string had been decoded from Latin1: b'h\xa7\xe9'.decode('latin1') == 'h\xa7\xe9'.
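The two decodings can be compared directly in Python. This is only a small illustration of the reasoning above; the variable names are mine, and it assumes a CPython build that ships the hp_roman8 codec under its "roman8" alias.

```python
# Byte string used by the updated test.
raw = b'h\xa7\xe9'

# What the test expects when the announced locale encoding (roman8) is honoured.
expected = raw.decode('roman8')

# What HP-UX appears to produce: the same bytes decoded as Latin1.
observed = raw.decode('latin1')

print(ascii(expected))  # 'h\xcf\xd5'
print(ascii(observed))  # 'h\xa7\xe9'
print(expected == observed)  # False
```

The mismatch between the two printed values is the same mismatch reported by the AssertionError in the traceback.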

Michael: would you mind compiling and running the attached c_locale.c test program? It sets the LC_ALL locale to C, dumps the locales (LC_ALL, LC_CTYPE, nl_langinfo(CODESET)), and then decodes all bytes from the locale encoding (LC_CTYPE). The output should help me understand what the *effective* encoding of HP-UX is for the C locale.

You may also modify c_locale.c to replace "C" with "POSIX" to see if the behaviour is different.
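For readers without the attachment, here is a rough Python approximation of the same experiment. It is not the attached c_locale.c. It is a sketch that assumes a POSIX system where ctypes can reach the C library's mbstowcs(), and the function name dump_locale_decoding and its locale_name parameter are hypothetical.

```python
import ctypes
import ctypes.util
import locale


def dump_locale_decoding(locale_name="C"):
    """Return (codeset, mapping) for the given locale.

    mapping[byte] is the code point produced by mbstowcs() for that single
    byte, or None if the C library refuses to decode it.
    """
    # Same idea as the test program: select the locale for all categories.
    locale.setlocale(locale.LC_ALL, locale_name)
    codeset = locale.nl_langinfo(locale.CODESET)

    # Assumption: the C library is reachable this way (true on typical POSIX).
    libc = ctypes.CDLL(ctypes.util.find_library("c"))
    libc.mbstowcs.argtypes = [ctypes.c_wchar_p, ctypes.c_char_p, ctypes.c_size_t]
    libc.mbstowcs.restype = ctypes.c_size_t
    failure = ctypes.c_size_t(-1).value  # mbstowcs() returns (size_t)-1 on error

    mapping = {}
    for byte in range(1, 256):  # byte 0 is the C string terminator, so skip it
        buf = ctypes.create_unicode_buffer(2)
        n = libc.mbstowcs(buf, bytes([byte]), 2)
        mapping[byte] = ord(buf[0]) if n == 1 and n != failure else None
    return codeset, mapping


if __name__ == "__main__":
    codeset, mapping = dump_locale_decoding("C")  # or "POSIX"
    print("LC_ALL:", locale.setlocale(locale.LC_ALL))
    print("LC_CTYPE:", locale.setlocale(locale.LC_CTYPE))
    print("nl_langinfo(CODESET):", codeset)
    for byte, code_point in mapping.items():
        shown = "undecodable" if code_point is None else f"U+{code_point:04X}"
        print(f"0x{byte:02x} -> {shown}")
```

If the announced codeset and the code points for bytes 0x80 to 0xff disagree (for example, roman8 is announced but 0xa7 maps to U+00A7), that would point to the effective encoding being something other than what the locale claims.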

History
Date	User	Action	Args
2018-08-28 08:11:02	vstinner	set	recipients: + vstinner, terry.reedy, Michael.Felt, michael-o
2018-08-28 08:11:02	vstinner	set	messageid: <1535443862.67.0.56676864532.issue34403@psf.upfronthosting.co.za>
2018-08-28 08:11:02	vstinner	link	issue34403 messages
2018-08-28 08:11:02	vstinner	create