
Author Michael.Felt
Recipients Michael.Felt, michael-o, terry.reedy, vstinner
Date 2018-08-27.20:58:46
Message-id <77b3d8a0-5304-cd3e-fab8-fb2af52359af@felt.demon.nl>
In-reply-to <1535376128.65.0.56676864532.issue34403@psf.upfronthosting.co.za>
Content
On 27/08/2018 15:22, Michael Osipov wrote:
> Michael Osipov <1983-01-06@gmx.net> added the comment:
>
> So I changed the test code to:
>
> diff --git a/Lib/test/test_utf8_mode.py b/Lib/test/test_utf8_mode.py
> index 26e2e13ec5..d9f8a3ed8b 100644
> --- a/Lib/test/test_utf8_mode.py
> +++ b/Lib/test/test_utf8_mode.py
> @@ -208,7 +208,7 @@ class UTF8ModeTests(unittest.TestCase):
>      def test_cmd_line(self):
>          arg = 'h\xe9\u20ac'.encode('utf-8')
>          arg_utf8 = arg.decode('utf-8')
> -        arg_ascii = arg.decode('ascii', 'surrogateescape')
> +        arg_ascii = arg.decode('roman8', 'surrogateescape')
>          code = 'import locale, sys; print("%s:%s" % (locale.getpreferredencoding(), ascii(sys.argv[1:])))'
>
>          def check(utf8_opt, expected, **kw):
>
> and the output is:
> ======================================================================
> FAIL: test_cmd_line (test.test_utf8_mode.UTF8ModeTests)
> ----------------------------------------------------------------------
> Traceback (most recent call last):
>   File "/var/osipovmi/cpython/Lib/test/test_utf8_mode.py", line 224, in test_cmd_line
>     check('utf8=0', [c_arg], LC_ALL='C')
>   File "/var/osipovmi/cpython/Lib/test/test_utf8_mode.py", line 217, in check
>     self.assertEqual(args, ascii(expected), out)
> AssertionError: "['h\\xc3\\xa9\\xe2\\x82\\xac']" != "['h\\xfb\\u02cb\\xe3\\x82\\u02dc']"
> - ['h\xc3\xa9\xe2\x82\xac']
> + ['h\xfb\u02cb\xe3\x82\u02dc']
>  : roman8:['h\xc3\xa9\xe2\x82\xac']
>
> I still don't understand that.
Something I found helpful was to change:

        check('utf8=0', [c_arg], LC_ALL='C')

to

        check('utf8=0', [c_arg], LC_ALL='C', failure=True)

This also fails, but it shows what is being executed.

Further, my 'understanding' is that ascii(whatever) does more than
whatever.decode('ascii', ...). ascii() escapes code points below U+0100
with the \x shorthand, while decode('ascii', 'surrogateescape') turns each
non-ASCII byte into a lone surrogate, which is displayed with a \udc prefix.
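
A minimal sketch of that difference, reusing the `arg` bytes from test_cmd_line; the name `escaped` is only illustrative and does not appear in the test file:

```python
# Same bytes as `arg` in test_cmd_line: b'h\xc3\xa9\xe2\x82\xac'
arg = 'h\xe9\u20ac'.encode('utf-8')

# decode('ascii', 'surrogateescape') maps each non-ASCII byte 0xNN to the
# lone surrogate U+DCNN, so ascii() displays it with a \udc prefix.
escaped = arg.decode('ascii', 'surrogateescape')
print(ascii(escaped))    # 'h\udcc3\udca9\udce2\udc82\udcac'

# ascii() on a str whose code points are all below U+0100 uses \x escapes.
print(ascii('h\xe9'))    # 'h\xe9'

# The escaping is reversible: encoding with the same handler restores the bytes.
assert escaped.encode('ascii', 'surrogateescape') == arg
```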

And, while you might still consider it a 'bug', did you try using
c_arg = arg.decode('iso-8859-1')?
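
To illustrate why that might be worth trying: ISO-8859-1 maps every byte 0xNN to the code point U+00NN, so the decoded string prints with exactly the \x escapes seen on the left-hand side of the AssertionError quoted above. This is only a sketch run outside the test suite; whether it makes the test pass on HP-UX is not confirmed here.

```python
# Same bytes as `arg` in test_cmd_line.
arg = 'h\xe9\u20ac'.encode('utf-8')

# ISO-8859-1 (Latin-1) decodes each byte to the code point of the same value.
c_arg = arg.decode('iso-8859-1')
print(ascii([c_arg]))    # ['h\xc3\xa9\xe2\x82\xac']

# Every character corresponds 1:1 to the original byte.
assert [ord(ch) for ch in c_arg] == list(arg)
```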

Michael (F)

>
> I believe that surrogate escape only works for ASCII and nothing else. If so, this test must be skipped on HP-UX and AIX.
>
> ----------
>
> _______________________________________
> Python tracker <report@bugs.python.org>
> <https://bugs.python.org/issue34403>
> _______________________________________
>
History
Date                 User          Action  Args
2018-08-27 20:58:46  Michael.Felt  set     recipients: + Michael.Felt, terry.reedy, vstinner, michael-o
2018-08-27 20:58:46  Michael.Felt  link    issue34403 messages
2018-08-27 20:58:46  Michael.Felt  create