AIX: test_utf8_mode.test_cmd_line fails #78528

aixtools · 2018-08-06T16:06:51Z

BPO	34347
Nosy	@vstinner, @ambv, @aixtools, @michael-o
PRs	bpo-34347: Fix test_utf8_mode.test_cmd_line for AIX #8923 [3.7] bpo-34347: Fix test_utf8_mode.test_cmd_line for AIX (GH-8923) #14233
Dependencies	bpo-34207: test_cmd_line test_utf8_mode test_warnings fail in all FreeBSD 3.x (3.8) buildbots
Files	pEpkey.asc

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


GitHub fields:

assignee = None
closed_at = <Date 2018-08-31.14:17:50.382>
created_at = <Date 2018-08-06.16:06:50.873>
labels = ['interpreter-core', '3.8', 'type-bug', 'tests', '3.7']
title = 'AIX: test_utf8_mode.test_cmd_line fails'
updated_at = <Date 2019-06-19.20:07:50.320>
user = 'https://github.com/aixtools'

bugs.python.org fields:

activity = <Date 2019-06-19.20:07:50.320>
actor = 'vstinner'
assignee = 'none'
closed = True
closed_date = <Date 2018-08-31.14:17:50.382>
closer = 'vstinner'
components = ['Interpreter Core', 'Tests']
creation = <Date 2018-08-06.16:06:50.873>
creator = 'Michael.Felt'
dependencies = ['34207']
files = ['47733']
hgrepos = []
issue_num = 34347
keywords = ['patch', '3.7regression']
message_count = 18.0
messages = ['323214', '323222', '323223', '323250', '323319', '323831', '323941', '323942', '323961', '323996', '324067', '324097', '324179', '324181', '324409', '324419', '337636', '346078']
nosy_count = 4.0
nosy_names = ['vstinner', 'lukasz.langa', 'Michael.Felt', 'michael-o']
pr_nums = ['8923', '14233']
priority = 'normal'
resolution = 'fixed'
stage = 'resolved'
status = 'closed'
superseder = None
type = 'behavior'
url = 'https://bugs.python.org/issue34347'
versions = ['Python 3.7', 'Python 3.8']

aixtools · 2018-08-06T16:06:51Z

The test fails because

byte_str.decode('ascii', 'surrogateescape')

is not what ascii() returns for the argument when it is passed on the command line.

Assumption: since check('utf8', [arg_utf8]) succeeds, the parsing of the command line is correct.

DETAILS
>>> arg = 'h\xe9\u20ac'.encode('utf-8')
>>> arg
b'h\xc3\xa9\xe2\x82\xac'

>>> arg.decode('ascii', 'surrogateescape')
'h\udcc3\udca9\udce2\udc82\udcac'

I am having a difficult time getting the syntax correct for all the "escapes", so I added a print statement in the check routine:

test_cmd_line (test.test_utf8_mode.UTF8ModeTests) ...
code:import locale, sys; print("%s:%s" % (locale.getpreferredencoding(), ascii(sys.argv[1:]))) arg:b'h\xc3\xa9\xe2\x82\xac'
out:UTF-8:['h\xe9\u20ac']

code:import locale, sys; print("%s:%s" % (locale.getpreferredencoding(), ascii(sys.argv[1:]))) arg:b'h\xc3\xa9\xe2\x82\xac'
out:ISO8859-1:['h\xc3\xa9\xe2\x82\xac']

test code with my debug statement (to generate above):

    def test_cmd_line(self):
        arg = 'h\xe9\u20ac'.encode('utf-8')
        arg_utf8 = arg.decode('utf-8')
        arg_ascii = arg.decode('ascii', 'surrogateescape')
        code = 'import locale, sys; print("%s:%s" % (locale.getpreferredencoding(), ascii(sys.argv[1:])))'

        def check(utf8_opt, expected, **kw):
            out = self.get_output('-X', utf8_opt, '-c', code, arg, **kw)
            print("\ncode:%s arg:%s\nout:%s" % (code, arg, out))
            args = out.partition(':')[2].rstrip()
            self.assertEqual(args, ascii(expected), out)

        check('utf8', [arg_utf8])
        if sys.platform == 'darwin' or support.is_android:
            c_arg = arg_utf8
        else:
            c_arg = arg_ascii
        check('utf8=0', [c_arg], LC_ALL='C')

So the first check succeeds:

        check('utf8', [arg_utf8])

But the second does not:

FAIL: test_cmd_line (test.test_utf8_mode.UTF8ModeTests)
----------------------------------------------------------------------

Traceback (most recent call last):
  File "/data/prj/python/src/python3-3.7.0/Lib/test/test_utf8_mode.py", line 225, in test_cmd_line
    check('utf8=0', [c_arg], LC_ALL='C')
  File "/data/prj/python/src/python3-3.7.0/Lib/test/test_utf8_mode.py", line 218, in check
    self.assertEqual(args, ascii(expected), out)
AssertionError: "['h\\xc3\\xa9\\xe2\\x82\\xac']" != "['h\\udcc3\\udca9\\udce2\\udc82\\udcac']"
- ['h\xc3\xa9\xe2\x82\xac']
+ ['h\udcc3\udca9\udce2\udc82\udcac']
 : ISO8859-1:['h\xc3\xa9\xe2\x82\xac']

I tried using arg as the "expected" value, but arg is still a bytes object and the command-line result is not (it is printed as a str):

AssertionError: "['h\\xc3\\xa9\\xe2\\x82\\xac']" != "[b'h\\xc3\\xa9\\xe2\\x82\\xac']"

- ['h\xc3\xa9\xe2\x82\xac']
+ [b'h\xc3\xa9\xe2\x82\xac']
?  +
 : ISO8859-1:['h\xc3\xa9\xe2\x82\xac']

aixtools · 2018-08-06T20:10:55Z

In short, I do not understand how this passes on Linux.

This is python3-3.4.6 on sles12:

>>> 'h\xe9\u20ac'.encode('utf-8')
b'h\xc3\xa9\xe2\x82\xac'
>>> ascii('h\xe9\u20ac'.encode('utf-8'))
"b'h\\xc3\\xa9\\xe2\\x82\\xac'"
>>> 'h\xe9\u20ac'.encode('utf-8').decode('us-ascii', 'surrogateescape')
'h\udcc3\udca9\udce2\udc82\udcac'
>>>

This is python3-3.7.0 on AIX:
>>> 'h\xe9\u20ac'.encode('utf-8')
b'h\xc3\xa9\xe2\x82\xac'
>>> ascii('h\xe9\u20ac'.encode('utf-8'))
"b'h\\xc3\\xa9\\xe2\\x82\\xac'"
>>> 'h\xe9\u20ac'.encode('utf-8').decode('us-ascii', 'surrogateescape')
'h\udcc3\udca9\udce2\udc82\udcac'

If I am missing something essential here - please be blunt!

aixtools · 2018-08-06T20:26:58Z

Also seeing the same with Windows.
C:\Users\MICHAELFelt>python
Python 3.7.0 (v3.7.0:1bf9cc5093, Jun 27 2018, 04:06:47) [MSC v.1914 32
bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> 'h\xe9\u20ac'.encode('utf-8')
b'h\xc3\xa9\xe2\x82\xac'
>>> ascii('h\xe9\u20ac'.encode('utf-8'))
"b'h\\xc3\\xa9\\xe2\\x82\\xac'"
>>> 'h\xe9\u20ac'.encode('utf-8').decode('ascii','surrogateescape')
'h\udcc3\udca9\udce2\udc82\udcac'
>>>

aixtools · 2018-08-07T20:23:35Z

Common "experts" - feedback needed!

Original
test test_utf8_mode failed -- Traceback (most recent call last):
  File "/data/prj/python/git/python3-3.8/Lib/test/test_utf8_mode.py", line 225, in test_cmd_line
    check('utf8=0', [c_arg], LC_ALL='C')
  File "/data/prj/python/git/python3-3.8/Lib/test/test_utf8_mode.py", line 217, in check
    self.assertEqual(args, ascii(expected), out)
AssertionError: "['h\\xc3\\xa9\\xe2\\x82\\xac']" != "['h\\udcc3\\udca9\\udce2\\udc82\\udcac']"
- ['h\xc3\xa9\xe2\x82\xac']
+ ['h\udcc3\udca9\udce2\udc82\udcac']
 : ISO8859-1:['h\xc3\xa9\xe2\x82\xac']

Modification #1:
        if sys.platform == 'darwin' or support.is_android:
            c_arg = arg_utf8
        elif sys.platform.startswith("aix"):
            c_arg = arg_ascii.encode('utf-8', 'surrogateescape')
        else:
            c_arg = arg_ascii
        check('utf8=0', [c_arg], LC_ALL='C')

Result:
AssertionError: "['h\\xc3\\xa9\\xe2\\x82\\xac']" != "[b'h\\xc3\\xa9\\xe2\\x82\\xac']"

- ['h\xc3\xa9\xe2\x82\xac']
+ [b'h\xc3\xa9\xe2\x82\xac']
?  +
 : ISO8859-1:['h\xc3\xa9\xe2\x82\xac']

Modification #2:
        if sys.platform == 'darwin' or support.is_android:
            c_arg = arg_utf8
        elif sys.platform.startswith("aix"):
            c_arg = arg
        else:
            c_arg = arg_ascii
        check('utf8=0', [c_arg], LC_ALL='C')

AssertionError: "['h\\xc3\\xa9\\xe2\\x82\\xac']" != "[b'h\\xc3\\xa9\\xe2\\x82\\xac']"

- ['h\xc3\xa9\xe2\x82\xac']
+ [b'h\xc3\xa9\xe2\x82\xac']
?  +
 : ISO8859-1:['h\xc3\xa9\xe2\x82\xac']

The "expected" continues to be a "bytes" object, while the CLI code returns a non-byte string.
Or - the original has an ascii string object but uses \udc rather than \x

\udc is common (i.e., I see it frequently in googled results on other things) - should something in ascii() be changed to output \udc rather than \x ?
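To make the \x versus \udc difference concrete, here is a small sketch (the variable names are illustrative and not taken from the test). ascii() only escapes the code points a string already contains, so the two spellings correspond to two different strings rather than to two ways of printing the same one:

```python
arg = 'h\xe9\u20ac'.encode('utf-8')          # b'h\xc3\xa9\xe2\x82\xac'

as_surrogates = arg.decode('ascii', 'surrogateescape')
as_latin1 = arg.decode('iso-8859-1')

# Each undecodable byte 0xNN becomes the lone surrogate U+DCNN.
print([hex(ord(c)) for c in as_surrogates])
# Each byte 0xNN becomes the code point U+00NN.
print([hex(ord(c)) for c in as_latin1])

print(ascii(as_surrogates))   # escapes are written as \udcNN
print(ascii(as_latin1))       # escapes are written as \xNN
```

So changing ascii() would not help here; the question is which of the two strings the child process actually builds from the command-line bytes.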

Thx!

aixtools · 2018-08-09T11:55:07Z

Starting this discussion again. Please take time to read. I have spent hours trying to understand what is failing. Please spend a few minutes reading.

Sadly, there is a lot of text - but I do not know what I could leave out without damaging the process of discovery.

The failing result is:

self.assertEqual(args, ascii(expected), out)

AssertionError: "['h\\xc3\\xa9\\xe2\\x82\\xac']" != "['h\\udcc3\\udca9\\udce2\\udc82\\udcac']"

- ['h\xc3\xa9\xe2\x82\xac']
+ ['h\udcc3\udca9\udce2\udc82\udcac']
 : ISO8859-1:['h\xc3\xa9\xe2\x82\xac']

The test code is:
+207     @unittest.skipIf(MS_WINDOWS, 'test specific to Unix')
+208     def test_cmd_line(self):
+209         arg = 'h\xe9\u20ac'.encode('utf-8')
+210         arg_utf8 = arg.decode('utf-8')
+211         arg_ascii = arg.decode('ascii', 'surrogateescape')
+212         code = 'import locale, sys; print("%s:%s" % (locale.getpreferredencoding(), ascii(sys.argv[1:])))'
+213
+214         def check(utf8_opt, expected, **kw):
+215             out = self.get_output('-X', utf8_opt, '-c', code, arg, **kw)
+216             args = out.partition(':')[2].rstrip()
+217             self.assertEqual(args, ascii(expected), out)
+218
+219         check('utf8', [arg_utf8])
+220         if sys.platform == 'darwin' or support.is_android:
+221             c_arg = arg_utf8
+222         else:
+223             c_arg = arg_ascii
+224         check('utf8=0', [c_arg], LC_ALL='C')

Question 1: why is Windows excluded? Because it does not use UTF-8 as its default (its default is CP1252)

Question 2: It seems that what the test is 'checking' is that object.encode('utf-8') gets decoded by ascii() based on the utf8_mode set.

+215 out = self.get_output('-X', utf8_opt, '-c', code, arg, **kw)

rewrites (less indent) as:
+215 out = self.get_output('-X', utf8_opt, '-c', code, 'h\xe9\u20ac'.encode('utf-8'), **kw)

or
out = self.get_output('-X', utf8_opt, '-c', code, b'h\xc3\xa9\xe2\x82\xac', **kw)

Finally, in Lib/test/support/script_helper.py we have
+127     print("\n", cmd_line) # debug info, ignore
+128     proc = subprocess.Popen(cmd_line, stdin=subprocess.PIPE,
+129                             stdout=subprocess.PIPE, stderr=subprocess.PIPE,
+130                             env=env, cwd=cwd)

Which gives:

['/data/prj/python/python3-3.8/python', '-X', 'faulthandler', '-X', 'utf8', '-c', 'import locale, sys; print("%s:%s" % (locale.getpreferredencoding(), ascii(sys.argv[1:])))', b'h\xc3\xa9\xe2\x82\xac']

Above - utf8=1 - is successful

['/data/prj/python/python3-3.8/python', '-X', 'faulthandler', '-X', 'utf8=0', '-c', 'import locale, sys; print("%s:%s" % (locale.getpreferredencoding(), ascii(sys.argv[1:])))', b'h\xc3\xa9\xe2\x82\xac']

Here: utf8=0 fails. The arg to the CLI is equal in both cases.
FAIL

## Going back to check() and what it has:
## Add some debug. The first line is the 'raw' expected,
## the second line is ascii(decoded)
## the final is the value extracted from get_output

+214         def check(utf8_opt, expected, **kw):
+215             out = self.get_output('-X', utf8_opt, '-c', code, arg, **kw)
+216             args = out.partition(':')[2].rstrip()
+217             print("")
+218             print("%s: expected\n%s:ascii(expected)\n%s:out" % (expected, ascii(expected), out))
+219             self.assertEqual(args, ascii(expected), out)

For: utf8 mode true, it works:
['h▒\u20ac']: expected
['h\xe9\u20ac']:ascii(expected)
UTF-8:['h\xe9\u20ac']:out

+221 check('utf8', [arg_utf8])

But not for utf8=0
+226 check('utf8=0', [c_arg], LC_ALL='C')
# note, different values for LC_ALL='C' have been tried
['h\udcc3\udca9\udce2\udc82\udcac']: expected
['h\udcc3\udca9\udce2\udc82\udcac']:ascii(expected)
ISO8859-1:['h\xc3\xa9\xe2\x82\xac']:out

## re: expected and ascii(expected)
When utf8=1, expected and ascii(expected) differ. "arg" looks different from both - but after processing by get_output() expected and out match.

When utf8=0 there is no difference in the "arg1" passed to "code".
However, within check the values for both expected and ascii(expected) are identical. And, sadly, the value coming back via get_output looks nothing like 'expected'.

In short, when utf8=1 ascii(b'h\xc3\xa9\xe2\x82\xac') becomes ['h\xe9\u20ac'], which is what is desired. But when utf8=0 ascii(b'h\xc3\xa9\xe2\x82\xac') is b'h\xc3\xa9\xe2\x82\xac', not 'h\udcc3\udca9\udce2\udc82\udcac'.

Finally, when I run the command from the command line (after rewrites)

What passes:
./python '-X' 'faulthandler' '-X' 'utf8=1' '-c' 'import locale, sys; print("%s:%s" % (locale.getpreferredencoding(), ascii(sys.argv[1:])))' b'h\xc3\xa9\xe2\x82\xac'
UTF-8:['bh\\xc3\\xa9\\xe2\\x82\\xac']

encoding is UTF-8, but the result of ascii(argv[1]) is the same as argv[1]

./python '-X' 'faulthandler' '-X' 'utf8=0' '-c' 'import locale, sys; print("%s:%s" % (locale.getpreferredencoding(), ascii(sys.argv[1:])))' b'h\xc3\xa9\xe2\x82\xac'

ISO8859-1:['bh\\xc3\\xa9\\xe2\\x82\\xac']

Here, the only difference in the output is that "UTF-8" has been changed to "ISO8859-1", i.e., I was expecting a difference in the result of ascii('bh\\xc3\\xa9\\xe2\\x82\\xac'). Instead, I see "bytes obj in", "bytes obj out" -- apparently unchanged. HOWEVER, the result returned by get_output is always different, even if the difference is limited to removing the 'b' quality.

Again: test result includes:
ISO8859-1:['h\xc3\xa9\xe2\x82\xac'] - which is not equal to manual CLI with ISO8859-1:['bh\\xc3\\xa9\\xe2\\x82\\xac']
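One possible explanation for the 'bh...' output in these manual runs, offered as a sketch: a POSIX shell has no bytes literals, so in b'h\xc3...' the b is an ordinary character and the quotes are removed before Python ever sees the argument. The standard shlex module tokenizes in a POSIX-shell-like way and can illustrate this (that the shell used above behaves like shlex here is an assumption):

```python
import shlex

# What was typed at the shell prompt as the last argument.
typed = r"b'h\xc3\xa9\xe2\x82\xac'"

tokens = shlex.split(typed)
# The quotes are gone and the backslashes are literal characters.
print(tokens[0])
# Same shape as the ISO8859-1:['bh\\xc3...'] line from the manual run.
print(ascii(tokens))
```

Under that reading, the manual run passes the ASCII text bh\xc3... to the child, whereas the test passes the five raw bytes after h, so the two outputs are not comparable.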

So, I feel the issue is not with test, but within what happens after:

+127     proc = subprocess.Popen(cmd_line, stdin=subprocess.PIPE,
+128                             stdout=subprocess.PIPE, stderr=subprocess.PIPE,
+129                             env=env, cwd=cwd)

Specifically: here.

+130     with proc:
+131         try:
+132             out, err = proc.communicate()
+133         finally:
+134             proc.kill()
+135             subprocess._cleanup()
+136     rc = proc.returncode
+137     err = strip_python_stderr(err)
+138     return _PythonRunResult(rc, out, err), cmd_line

PASS:
['/data/prj/python/python3-3.8/python', '-X', 'faulthandler', '-X', 'utf8', '-c', 'import locale, sys; print("%s:%s" % (locale.getpreferredencoding(), ascii(sys.argv[1:])))', b'h\xc3\xa9\xe2\x82\xac']
0 b"UTF-8:['h\\xe9\\u20ac']\n" b''

FAIL:
['/data/prj/python/python3-3.8/python', '-X', 'faulthandler', '-X', 'utf8=0', '-c', 'import locale, sys; print("%s:%s" % (locale.getpreferredencoding(), ascii(sys.argv[1:])))', b'h\xc3\xa9\xe2\x82\xac']
0 b"ISO8859-1:['h\\xc3\\xa9\\xe2\\x82\\xac']\n" b''

Seems the 'b' quality disappears somehow with:
+216 args = out.partition(':')[2].rstrip()

So, maybe it is in test - in that line.
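A sketch of what that line does with the FAIL output shown above may help. No bytes object reaches the parent at this point: the child prints the ascii() of a list of str, and partition() only slices that text. ast.literal_eval is used here purely for illustration; the test itself compares the text.

```python
import ast

# Decoded stdout of the child, as shown in the FAIL case above.
out = "ISO8859-1:['h\\xc3\\xa9\\xe2\\x82\\xac']\n"

encoding, _, args = out.partition(':')
args = args.rstrip()
print(encoding)
print(args)

# args is the text of a list repr; evaluating it gives back str items.
value = ast.literal_eval(args)
print(type(value[0]).__name__)
print(value == [b'h\xc3\xa9\xe2\x82\xac'.decode('iso-8859-1')])
```

The str items compare equal to the ISO-8859-1 decoding of the original bytes, so the 'b' was never there on the child side; the child had already decoded its argument.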

However, this goes well beyond my comprehension of python internal workings.

Hope this helps. Please comment.

ambv · 2018-08-21T14:55:43Z

I have no idea what's going on here yet but just wanted to report that we are seeing this issue on one FreeBSD buildbot, too:

https://buildbot.python.org/all/#/builders/124/builds/508/steps/4/logs/stdio

I can also reproduce on CentOS 7.

Could this be related to LC_ALL= or related environment variables?

vstinner · 2018-08-23T10:48:50Z

I fixed bpo-34207.

vstinner · 2018-08-23T10:51:30Z

Your issue is about decoding command line arguments, which is done from the main() function. It doesn't use Python codecs, but functions like Py_DecodeLocale().

> Question 1: why is Windows excluded? Because it does not use UTF-8 as its default (its default is CP1252)

Windows uses wmain() which gets command line arguments as wchar_t* strings: Unicode. No decoding is needed.
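The point about decoding happening in the child's startup code can be observed from outside with a small standalone sketch. It mirrors what the test's check() does, but run_child is a hypothetical helper rather than the test suite's get_output, and passing a bytes argument assumes a POSIX platform:

```python
import os
import subprocess
import sys

# Same child program as in the test.
code = ('import locale, sys; '
        'print("%s:%s" % (locale.getpreferredencoding(), ascii(sys.argv[1:])))')
arg = 'h\xe9\u20ac'.encode('utf-8')


def run_child(utf8_opt, **env_overrides):
    """Run a child interpreter and return its decoded stdout (hypothetical helper)."""
    env = dict(os.environ, **env_overrides)
    cmd = [sys.executable, '-X', utf8_opt, '-c', code, arg]
    proc = subprocess.run(cmd, env=env, stdout=subprocess.PIPE, check=True)
    return proc.stdout.decode('ascii').rstrip()


if sys.platform != 'win32':          # bytes arguments are a POSIX feature
    print(run_child('utf8'))
    print(run_child('utf8=0', LC_ALL='C'))
```

The second line of output depends on how the platform's C locale decodes bytes above 0x7f, which is the behaviour this issue is about.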

aixtools · 2018-08-23T17:14:25Z

On 23/08/2018 12:51, STINNER Victor wrote:
> Your issue is about decoding command line argument which is done from main() function. It doesn't use Python codecs, but functions like Py_DecodeLocale().

This is beyond my understanding atm.
Early on I tried making the expected just be 'arg' and went from situation A to situation B - which looked much closer, BUT, the 'types' differed:

Situation A (original)
AssertionError: "['h\\xc3\\xa9\\xe2\\x82\\xac']" != "['h\\udcc3\\udca9\\udce2\\udc82\\udcac']"

- ['h\xc3\xa9\xe2\x82\xac']
+ ['h\udcc3\udca9\udce2\udc82\udcac']
 : ISO8859-1:['h\xc3\xa9\xe2\x82\xac']

I tried using arg as the "expected" value, but arg is still a bytes object and the command-line result is not (it is printed as a str).

Situation B
AssertionError: "['h\\xc3\\xa9\\xe2\\x82\\xac']" != "[b'h\\xc3\\xa9\\xe2\\x82\\xac']"

- ['h\xc3\xa9\xe2\x82\xac']
+ [b'h\xc3\xa9\xe2\x82\xac']
?  +
 : ISO8859-1:['h\xc3\xa9\xe2\x82\xac']

After further digging, to understand why the output used \x escapes rather than \udc, I looked at what was happening here:

out = self.get_output('-X', utf8_opt, '-c', code, arg, **kw)
becomes
out = self.get_output('-X', utf8_opt, '-c', code, 'h\xe9\u20ac'.encode('utf-8'), **kw)
becomes
out = self.get_output('-X', utf8_opt, '-c', code, b'h\xc3\xa9\xe2\x82\xac', **kw)

And finally, at the CLI becomes:
['/data/prj/python/python3-3.8/python', '-X', 'faulthandler', '-X', 'utf8', '-c', 'import locale, sys; print("%s:%s" % (locale.getpreferredencoding(), ascii(sys.argv[1:])))', b'h\xc3\xa9\xe2\x82\xac']

/data/prj/python/python3-3.8/python '-X' 'faulthandler' '-X' 'utf8' '-c' 'import locale, sys; print("%s:%s" % (locale.getpreferredencoding(), ascii(sys.argv[1:])))', b'h\xc3\xa9\xe2\x82\xac'
UTF-8:['bh\\xc3\\xa9\\xe2\\x82\\xac']

/data/prj/python/python3-3.8/python '-X' 'faulthandler' '-X' 'utf8=0' '-c' 'import locale, sys; print("%s:%s" % (locale.getpreferredencoding(), ascii(sys.argv[1:])))', b'h\xc3\xa9\xe2\x82\xac'
ISO8859-1:['bh\\xc3\\xa9\\xe2\\x82\\xac']

Note:
/data/prj/python/python3-3.8/python '-X' 'faulthandler' '-X' 'utf8=0' '-c' 'import locale, sys; print("%s:%s" % (locale.getpreferredencoding(), ascii(sys.argv[1:])))', 'h\udcc3\udca9\udce2\udc82\udcac'
ISO8859-1:['h\\udcc3\\udca9\\udce2\\udc82\\udcac']

/data/prj/python/python3-3.8/python '-X' 'faulthandler' '-X' 'utf8=0' '-c' 'import locale, sys; print("%s:%s" % (locale.getpreferredencoding(), ascii(sys.argv[1:])))', b'h\udcc3\udca9\udce2\udc82\udcac'
ISO8859-1:['bh\\udcc3\\udca9\\udce2\\udc82\\udcac']

root@x066:[/data/prj/python/python3-3.8]/data/prj/python/python3-3.8/python '-X' 'faulthandler' '-X' 'utf8' '-c' 'import locale, sys; print("%s:%s" % (>
UTF-8:['bh\\udcc3\\udca9\\udce2\\udc82\\udcac']

Summary:
a) concerned about how b'h....' becomes 'bh....'
b) whatever argv[1] is, is very close to what is returned - so whatever happens during the transformation from
self.get_output('-X', utf8_opt, '-c', code, arg, **kw)
determines the output and the (failed) comparison.


aixtools · 2018-08-24T10:28:43Z

p.s. also tried:
michael@x071:[/data/prj/python/git/python3-3.8]/data/prj/python/python3-3.8/python '-X' 'faulthandler' '-X' 'utf8=0' '-c' 'import locale, sys; print("%s:%s" % (locale.getpreferredencoding(), ascii(sys.argv[1:])))', 'h\xe9\u20ac'.encode\('utf-8'\)
ISO8859-1:['h\\xe9\\u20ac.encode(utf-8)']
michael@x071:[/data/prj/python/git/python3-3.8]/data/prj/python/python3-3.8/python '-X' 'faulthandler' '-X' 'utf8=1' '-c' 'import locale, sys; print("%s:%s" % (locale.getpreferredencoding(), ascii(sys.argv[1:])))', 'h\xe9\u20ac'.encode\('utf-8'\)
UTF-8:['h\\xe9\\u20ac.encode(utf-8)']

Really unclear to me what this test is trying to verify. The CLI seems to just 'echo' what it is provided.

aixtools · 2018-08-25T13:40:42Z

Solution much simpler than I thought:

not arg.decode('ascii', 'surrogateescape'), but arg.decode('iso-8859-1')
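A minimal sketch of how this could slot into the platform switch from Modification #1 follows. expected_c_locale_arg is a hypothetical helper written for illustration, the Android check from the test is left out, and the change actually merged in GH-8923 may differ in detail:

```python
import sys

arg = 'h\xe9\u20ac'.encode('utf-8')


def expected_c_locale_arg(arg, platform=sys.platform):
    """Expected sys.argv[1] in the child for -X utf8=0 with LC_ALL=C.

    Hypothetical helper; the Android check from the test is omitted.
    """
    if platform == 'darwin':
        return arg.decode('utf-8')
    if platform.startswith('aix'):
        # AIX reported ISO8859-1 above: every byte maps to one code point.
        return arg.decode('iso-8859-1')
    return arg.decode('ascii', 'surrogateescape')


print(ascii([expected_c_locale_arg(arg, 'aix7')]))
print(ascii([expected_c_locale_arg(arg, 'linux')]))
```

With the AIX branch, the expected text is the same ['h\xc3\xa9\xe2\x82\xac'] that the child printed in the failing runs.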

michael-o · 2018-08-25T19:46:27Z

This is a very thorough analysis. Kudos for that.

vstinner · 2018-08-27T13:40:22Z

New changeset 7ef1697 by Victor Stinner (Michael Felt) in branch 'master':
bpo-34347: Fix test_utf8_mode.test_cmd_line for AIX (GH-8923)
7ef1697

michael-o · 2018-08-27T14:24:44Z

Interestingly, the very same approach does not work for HP-UX, even if I swap in the params for HP-UX:

$ ./python -m test test_utf8_mode
Run tests sequentially
0:00:00 [1/1] test_utf8_mode
test test_utf8_mode failed -- Traceback (most recent call last):
  File "/var/osipovmi/cpython/Lib/test/test_utf8_mode.py", line 226, in test_cmd_line
    check('utf8=0', [c_arg], LC_ALL='C')
  File "/var/osipovmi/cpython/Lib/test/test_utf8_mode.py", line 217, in check
    self.assertEqual(args, ascii(expected), out)
AssertionError: "['h\\xc3\\xa9\\xe2\\x82\\xac']" != "['h\\xfb\\u02cb\\xe3\\x82\\u02dc']"
- ['h\xc3\xa9\xe2\x82\xac']
+ ['h\xfb\u02cb\xe3\x82\u02dc']
 : roman8:['h\xc3\xa9\xe2\x82\xac']
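A short sketch of why the swap does not help, assuming the codec swapped in was Python's hp-roman8 (the traceback does not say which name was used). The child's output keeps each byte as one code point of the same value, which matches the ISO-8859-1 decoding and not the roman8 one:

```python
arg = 'h\xe9\u20ac'.encode('utf-8')

# What the child printed on HP-UX: each byte kept as one code point.
child_saw = 'h\xc3\xa9\xe2\x82\xac'

via_latin1 = arg.decode('iso-8859-1')
via_roman8 = arg.decode('hp-roman8')   # assumed to be the codec swapped in

print(ascii(via_latin1), via_latin1 == child_saw)
print(ascii(via_roman8), via_roman8 == child_saw)
```

So on HP-UX the reported locale encoding name and the way the argument bytes were actually decoded do not line up in the same way as on AIX.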

aixtools · 2018-08-31T09:19:17Z

The buildbots seem happy. This may be closed.

vstinner · 2018-08-31T14:17:50Z

> The buildbots seem happy. This may be closed.

Cool, thank you for checking, and thanks for your fix! I'm closing the issue.

aixtools · 2019-03-10T18:48:13Z

Could this be backported to version 3.7?

vstinner · 2019-06-19T20:07:50Z

New changeset 15e7d24 by Victor Stinner (Michael Felt) in branch '3.7':
[3.7] bpo-34347: Fix test_utf8_mode.test_cmd_line for AIX (GH-8923) (GH-14233)
15e7d24

aixtools added the 3.7 (EOL), 3.8 (only security fixes), interpreter-core, tests, and type-bug labels Aug 6, 2018

vstinner closed this as completed Aug 31, 2018

ezio-melotti transferred this issue from another repository Apr 10, 2022
