Message 324196 - Python tracker


Author	Michael.Felt
Recipients	Michael.Felt, michael-o, terry.reedy, vstinner
Date	2018-08-27.20:58:46
Message-id	<77b3d8a0-5304-cd3e-fab8-fb2af52359af@felt.demon.nl>
In-reply-to	<1535376128.65.0.56676864532.issue34403@psf.upfronthosting.co.za>

Content

On 27/08/2018 15:22, Michael Osipov wrote:
> Michael Osipov <1983-01-06@gmx.net> added the comment:
>
> So I changed the test code to:
>
> diff --git a/Lib/test/test_utf8_mode.py b/Lib/test/test_utf8_mode.py
> index 26e2e13ec5..d9f8a3ed8b 100644
> --- a/Lib/test/test_utf8_mode.py
> +++ b/Lib/test/test_utf8_mode.py
> @@ -208,7 +208,7 @@ class UTF8ModeTests(unittest.TestCase):
>      def test_cmd_line(self):
>          arg = 'h\xe9\u20ac'.encode('utf-8')
>          arg_utf8 = arg.decode('utf-8')
> -        arg_ascii = arg.decode('ascii', 'surrogateescape')
> +        arg_ascii = arg.decode('roman8', 'surrogateescape')
>          code = 'import locale, sys; print("%s:%s" % (locale.getpreferredencoding(), ascii(sys.argv[1:])))'
>
>          def check(utf8_opt, expected, **kw):
>
> and the output is:
> ======================================================================
> FAIL: test_cmd_line (test.test_utf8_mode.UTF8ModeTests)
> ----------------------------------------------------------------------
> Traceback (most recent call last):
>   File "/var/osipovmi/cpython/Lib/test/test_utf8_mode.py", line 224, in test_cmd_line
>     check('utf8=0', [c_arg], LC_ALL='C')
>   File "/var/osipovmi/cpython/Lib/test/test_utf8_mode.py", line 217, in check
>     self.assertEqual(args, ascii(expected), out)
> AssertionError: "['h\\xc3\\xa9\\xe2\\x82\\xac']" != "['h\\xfb\\u02cb\\xe3\\x82\\u02dc']"
> - ['h\xc3\xa9\xe2\x82\xac']
> + ['h\xfb\u02cb\xe3\x82\u02dc']
>  : roman8:['h\xc3\xa9\xe2\x82\xac']
>
> I still don't understand that.
Something I found helpful was to change:

        check('utf8=0', [c_arg], LC_ALL='C')

to

        check('utf8=0', [c_arg], LC_ALL='C', failure=True)

This also fails, but it shows what is being executed.

Further, my 'understanding' is that ascii(whatever) does much more than
whatever.decode('ascii', ...): ascii() returns a printable repr that uses
the \x, \u and \U escapes, while decode('ascii', 'surrogateescape') turns
each non-ASCII byte into a lone surrogate, which is shown with the \udc prefix.
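
A minimal sketch of that difference, assuming CPython 3 and reusing the
`arg` value from the quoted test; the other variable names are only for
illustration and are not part of test_utf8_mode.py:

```python
# Same bytes as in the quoted test: UTF-8 encoding of 'h', U+00E9, U+20AC.
arg = 'h\xe9\u20ac'.encode('utf-8')

# decode() turns bytes into str; with 'surrogateescape', every byte that
# ASCII cannot decode (0x80-0xFF) becomes a lone surrogate U+DC80..U+DCFF.
arg_ascii = arg.decode('ascii', 'surrogateescape')

# ascii() does not decode anything; it returns a printable repr of an object.
escaped_repr = ascii(arg_ascii)
utf8_repr = ascii(arg.decode('utf-8'))

print(escaped_repr)  # 'h\udcc3\udca9\udce2\udc82\udcac'
print(utf8_repr)     # 'h\xe9\u20ac'

# The surrogates keep the original bytes, so encoding restores them exactly.
restored = arg_ascii.encode('ascii', 'surrogateescape')
print(restored == arg)  # True
```

So the \udc prefix comes from the error handler used while decoding,
whereas \x and \u are simply how ascii() prints code points outside ASCII.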

And, while you might still consider it a 'bug', did you try using
c_arg = arg.decode('iso-8859-1')?
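
As a rough sketch of what that would produce (computed standalone here,
not inside the test; `observed` is copied from the left-hand side of the
AssertionError quoted above):

```python
# Same bytes as in the quoted test.
arg = 'h\xe9\u20ac'.encode('utf-8')

# ISO-8859-1 maps every byte 0x00-0xFF to the code point with the same
# value, so decoding never fails and no surrogates are produced.
c_arg = arg.decode('iso-8859-1')

# Left-hand side of the AssertionError in the quoted traceback.
observed = "['h\\xc3\\xa9\\xe2\\x82\\xac']"

print(ascii([c_arg]))              # ['h\xc3\xa9\xe2\x82\xac']
print(ascii([c_arg]) == observed)  # True
```

This only shows that the two strings match; it does not establish why the
child process decodes its arguments that way under LC_ALL='C'.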

Michael (F)

>
> I believe that surrogate escape only works for ASCII and nothing else. If so, this test must be skipped on HP-UX and AIX.

History
Date	User	Action	Args
2018-08-27 20:58:46	Michael.Felt	set	recipients: + Michael.Felt, terry.reedy, vstinner, michael-o
2018-08-27 20:58:46	Michael.Felt	link	issue34403 messages
2018-08-27 20:58:46	Michael.Felt	create