classification
Title: AIX: test_utf8_mode.test_cmd_line fails
Type: behavior Stage: resolved
Components: Interpreter Core, Tests Versions: Python 3.8, Python 3.7
process
Status: closed Resolution: fixed
Dependencies: 34207 Superseder:
Assigned To: Nosy List: Michael.Felt, lukasz.langa, michael-o, vstinner
Priority: normal Keywords: 3.7regression, patch

Created on 2018-08-06 16:06 by Michael.Felt, last changed 2018-08-31 14:17 by vstinner. This issue is now closed.

Files
File name Uploaded Description
pEpkey.asc Michael.Felt, 2018-08-06 20:26
Pull Requests
URL Status Linked
PR 8923 merged Michael.Felt, 2018-08-25 13:38
Messages (16)
msg323214 - (view) Author: Michael Felt (Michael.Felt) * Date: 2018-08-06 16:06
The test fails because

byte_str.decode('ascii', 'surrogateescape')

is not what ascii(byte_str) returns when called from the command line.

Assumption: since check('utf8', [arg_utf8]) succeeds, I assume the parsing of the command line is correct.

DETAILS
>>> arg = 'h\xe9\u20ac'.encode('utf-8')
>>> arg
b'h\xc3\xa9\xe2\x82\xac'

>>> arg.decode('ascii', 'surrogateescape')
'h\udcc3\udca9\udce2\udc82\udcac'
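For reference, a small sketch of what the surrogateescape handler does with these bytes: each non-ASCII byte 0xNN becomes the lone surrogate U+DCNN, and encoding with the same handler restores the original bytes. The variable names follow the test; nothing here depends on the locale.

```python
arg = 'h\xe9\u20ac'.encode('utf-8')

# Each byte >= 0x80 becomes the lone surrogate U+DC00 + byte value.
arg_ascii = arg.decode('ascii', 'surrogateescape')
surrogates = [hex(ord(ch)) for ch in arg_ascii[1:]]
print(surrogates)

# Encoding with the same error handler restores the original bytes.
round_trip = arg_ascii.encode('ascii', 'surrogateescape')
print(round_trip == arg)
```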


I am having a difficult time getting the syntax correct for all the "escapes", so I added a print statement in the check routine:

test_cmd_line (test.test_utf8_mode.UTF8ModeTests) ...
code:import locale, sys; print("%s:%s" % (locale.getpreferredencoding(), ascii(sys.argv[1:]))) arg:b'h\xc3\xa9\xe2\x82\xac'
out:UTF-8:['h\xe9\u20ac']

code:import locale, sys; print("%s:%s" % (locale.getpreferredencoding(), ascii(sys.argv[1:]))) arg:b'h\xc3\xa9\xe2\x82\xac'
out:ISO8859-1:['h\xc3\xa9\xe2\x82\xac']

test code with my debug statement (to generate above):

    def test_cmd_line(self):
        arg = 'h\xe9\u20ac'.encode('utf-8')
        arg_utf8 = arg.decode('utf-8')
        arg_ascii = arg.decode('ascii', 'surrogateescape')
        code = 'import locale, sys; print("%s:%s" % (locale.getpreferredencoding(), ascii(sys.argv[1:])))'

        def check(utf8_opt, expected, **kw):
            out = self.get_output('-X', utf8_opt, '-c', code, arg, **kw)
            print("\ncode:%s arg:%s\nout:%s" % (code, arg, out))
            args = out.partition(':')[2].rstrip()
            self.assertEqual(args, ascii(expected), out)

        check('utf8', [arg_utf8])
        if sys.platform == 'darwin' or support.is_android:
            c_arg = arg_utf8
        else:
            c_arg = arg_ascii
        check('utf8=0', [c_arg], LC_ALL='C')

So the first check succeeds:

        check('utf8', [arg_utf8])

But the second does not:

FAIL: test_cmd_line (test.test_utf8_mode.UTF8ModeTests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/data/prj/python/src/python3-3.7.0/Lib/test/test_utf8_mode.py", line 225, in test_cmd_line
    check('utf8=0', [c_arg], LC_ALL='C')
  File "/data/prj/python/src/python3-3.7.0/Lib/test/test_utf8_mode.py", line 218, in check
    self.assertEqual(args, ascii(expected), out)
AssertionError: "['h\\xc3\\xa9\\xe2\\x82\\xac']" != "['h\\udcc3\\udca9\\udce2\\udc82\\udcac']"
- ['h\xc3\xa9\xe2\x82\xac']
+ ['h\udcc3\udca9\udce2\udc82\udcac']
 : ISO8859-1:['h\xc3\xa9\xe2\x82\xac']

I tried using arg itself as the "expected" value, but arg is a bytes object, while the cmd_line result is a str (and is printed as such).

AssertionError: "['h\\xc3\\xa9\\xe2\\x82\\xac']" != "[b'h\\xc3\\xa9\\xe2\\x82\\xac']"
- ['h\xc3\xa9\xe2\x82\xac']
+ [b'h\xc3\xa9\xe2\x82\xac']
?  +
 : ISO8859-1:['h\xc3\xa9\xe2\x82\xac']
msg323222 - (view) Author: Michael Felt (Michael.Felt) * Date: 2018-08-06 20:10
In short, I do not understand how this passes on Linux.

This is python3-3.4.6 on sles12:

>>> 'h\xe9\u20ac'.encode('utf-8')
b'h\xc3\xa9\xe2\x82\xac'
>>> ascii('h\xe9\u20ac'.encode('utf-8'))
"b'h\\xc3\\xa9\\xe2\\x82\\xac'"
>>> 'h\xe9\u20ac'.encode('utf-8').decode('us-ascii', 'surrogateescape')
'h\udcc3\udca9\udce2\udc82\udcac'
>>>

This is python3-3.7.0 on AIX:
>>> 'h\xe9\u20ac'.encode('utf-8')
b'h\xc3\xa9\xe2\x82\xac'
>>> ascii('h\xe9\u20ac'.encode('utf-8'))
"b'h\\xc3\\xa9\\xe2\\x82\\xac'"
>>> 'h\xe9\u20ac'.encode('utf-8').decode('us-ascii', 'surrogateescape')
'h\udcc3\udca9\udce2\udc82\udcac'

If I am missing something essential here - please be blunt!
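The sessions above only exercise Python codecs, which behave the same on every platform. What can differ is how a child interpreter decodes its argv under LC_ALL=C. A minimal reproduction sketch outside the test suite, assuming a POSIX system and Python 3.7 or later; run_child is a hypothetical helper for illustration, not part of test.support.

```python
import os
import subprocess
import sys

CODE = ('import locale, sys; '
        'print("%s:%s" % (locale.getpreferredencoding(), ascii(sys.argv[1:])))')

def run_child(utf8_opt, arg, **env_overrides):
    """Run a child interpreter with -X <utf8_opt> and one raw bytes argument.

    Hypothetical helper; assumes a POSIX platform, where subprocess
    passes bytes arguments to the child unchanged.
    """
    env = dict(os.environ, **env_overrides)
    cmd = [sys.executable, '-X', utf8_opt, '-c', CODE, arg]
    proc = subprocess.run(cmd, env=env, stdout=subprocess.PIPE, check=True)
    return proc.stdout.decode('ascii').rstrip()

arg = 'h\xe9\u20ac'.encode('utf-8')
utf8_out = run_child('utf8', arg)
c_out = run_child('utf8=0', arg, LC_ALL='C')
print(utf8_out)
print(c_out)
```

The second line of output is the platform-dependent one: the encoding name before the colon shows which locale encoding the child used for argv.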
msg323223 - (view) Author: Michael Felt (Michael.Felt) * Date: 2018-08-06 20:26
Also seeing the same with Windows.
C:\Users\MICHAELFelt>python
Python 3.7.0 (v3.7.0:1bf9cc5093, Jun 27 2018, 04:06:47) [MSC v.1914 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> 'h\xe9\u20ac'.encode('utf-8')
b'h\xc3\xa9\xe2\x82\xac'
>>> ascii('h\xe9\u20ac'.encode('utf-8'))
"b'h\\xc3\\xa9\\xe2\\x82\\xac'"
>>> 'h\xe9\u20ac'.encode('utf-8').decode('ascii','surrogateescape')
'h\udcc3\udca9\udce2\udc82\udcac'
>>>
msg323250 - (view) Author: Michael Felt (Michael.Felt) * Date: 2018-08-07 20:23
Come on, "experts" - feedback needed!

Original
test test_utf8_mode failed -- Traceback (most recent call last):
  File "/data/prj/python/git/python3-3.8/Lib/test/test_utf8_mode.py", line 225, in test_cmd_line
    check('utf8=0', [c_arg], LC_ALL='C')
  File "/data/prj/python/git/python3-3.8/Lib/test/test_utf8_mode.py", line 217, in check
    self.assertEqual(args, ascii(expected), out)
AssertionError: "['h\\xc3\\xa9\\xe2\\x82\\xac']" != "['h\\udcc3\\udca9\\udce2\\udc82\\udcac']"
- ['h\xc3\xa9\xe2\x82\xac']
+ ['h\udcc3\udca9\udce2\udc82\udcac']
 : ISO8859-1:['h\xc3\xa9\xe2\x82\xac']

Modification #1:
        if sys.platform == 'darwin' or support.is_android:
            c_arg = arg_utf8
        elif sys.platform.startswith("aix"):
            c_arg = arg_ascii.encode('utf-8', 'surrogateescape')
        else:
            c_arg = arg_ascii
        check('utf8=0', [c_arg], LC_ALL='C')

Result:
AssertionError: "['h\\xc3\\xa9\\xe2\\x82\\xac']" != "[b'h\\xc3\\xa9\\xe2\\x82\\xac']"
- ['h\xc3\xa9\xe2\x82\xac']
+ [b'h\xc3\xa9\xe2\x82\xac']
?  +
 : ISO8859-1:['h\xc3\xa9\xe2\x82\xac']

Modification #2:
        if sys.platform == 'darwin' or support.is_android:
            c_arg = arg_utf8
        elif sys.platform.startswith("aix"):
            c_arg = arg
        else:
            c_arg = arg_ascii
        check('utf8=0', [c_arg], LC_ALL='C')

AssertionError: "['h\\xc3\\xa9\\xe2\\x82\\xac']" != "[b'h\\xc3\\xa9\\xe2\\x82\\xac']"
- ['h\xc3\xa9\xe2\x82\xac']
+ [b'h\xc3\xa9\xe2\x82\xac']
?  +
 : ISO8859-1:['h\xc3\xa9\xe2\x82\xac']

The "expected" continues to be a "bytes" object, while the CLI code returns a non-byte string.
Or - the original has an ascii string object but uses \udc rather than \x

\udc is common (i.e., I see it frequently in googled results on other things) - should something in ascii() be changed to output \udc rather than \x ?
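A sketch that may help with this question: \x and \udc are not two spellings of the same character. ascii() escapes whatever code points the str holds, so U+00C3 prints as \xc3 and U+DCC3 prints as \udcc3, and a bytes object always keeps its b prefix. The iso-8859-1 decoding is used here only to produce code points below U+0100.

```python
arg = b'h\xc3\xa9\xe2\x82\xac'

as_latin1 = arg.decode('iso-8859-1')                 # code points U+00C3, U+00A9, ...
as_escaped = arg.decode('ascii', 'surrogateescape')  # code points U+DCC3, U+DCA9, ...

print(ascii(arg))         # bytes object: the repr keeps the b prefix
print(ascii(as_latin1))   # str with code points below U+0100: \x escapes
print(ascii(as_escaped))  # str with lone surrogates: \udc escapes
print(as_latin1 == as_escaped)
```

The two str values are different strings, so changing the way ascii() formats them would not make the comparison pass.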

Thx!
msg323319 - (view) Author: Michael Felt (Michael.Felt) * Date: 2018-08-09 11:55
Starting this discussion again. Please take time to read. I have spent hours trying to understand what is failing. Please spend a few minutes reading.

Sadly, there is a lot of text - but I do not know what I could leave out without damaging the process of discovery.

The failing result is:

    self.assertEqual(args, ascii(expected), out)
AssertionError: "['h\\xc3\\xa9\\xe2\\x82\\xac']" != "['h\\udcc3\\udca9\\udce2\\udc82\\udcac']"
- ['h\xc3\xa9\xe2\x82\xac']
+ ['h\udcc3\udca9\udce2\udc82\udcac']
 : ISO8859-1:['h\xc3\xa9\xe2\x82\xac']

The test code is:
  +207      @unittest.skipIf(MS_WINDOWS, 'test specific to Unix')
  +208      def test_cmd_line(self):
  +209          arg = 'h\xe9\u20ac'.encode('utf-8')
  +210          arg_utf8 = arg.decode('utf-8')
  +211          arg_ascii = arg.decode('ascii', 'surrogateescape')
  +212          code = 'import locale, sys; print("%s:%s" % (locale.getpreferredencoding(), ascii(sys.argv[1:])))'
  +213
  +214          def check(utf8_opt, expected, **kw):
  +215              out = self.get_output('-X', utf8_opt, '-c', code, arg, **kw)
  +216              args = out.partition(':')[2].rstrip()
  +217              self.assertEqual(args, ascii(expected), out)
  +218
  +219          check('utf8', [arg_utf8])
  +220          if sys.platform == 'darwin' or support.is_android:
  +221              c_arg = arg_utf8
  +222          else:
  +223              c_arg = arg_ascii
  +224          check('utf8=0', [c_arg], LC_ALL='C')

Question 1: why is Windows excluded? Because it does not use UTF-8 as its default (its default is CP1252)

Question 2: It seems that what the test is 'checking' is that object.encode('utf-8') gets decoded by ascii() based on the utf8_mode set.

 +215              out = self.get_output('-X', utf8_opt, '-c', code, arg, **kw)

rewrites (less indent) as:
 +215  out = self.get_output('-X', utf8_opt, '-c', code, 'h\xe9\u20ac'.encode('utf-8'), **kw)

or
out = self.get_output('-X', utf8_opt, '-c', code, b'h\xc3\xa9\xe2\x82\xac', **kw)

Finally, in  Lib/test/support/script_helper.py we have
  +127      print("\n", cmd_line) # debug info, ignore
  +128      proc = subprocess.Popen(cmd_line, stdin=subprocess.PIPE,
  +129                           stdout=subprocess.PIPE, stderr=subprocess.PIPE,
  +130                           env=env, cwd=cwd)

Which gives:

 ['/data/prj/python/python3-3.8/python', '-X', 'faulthandler', '-X', 'utf8', '-c', 'import locale, sys; print("%s:%s" % (locale.getpreferredencoding(), ascii(sys.argv[1:])))', b'h\xc3\xa9\xe2\x82\xac']

Above - utf8=1 - is successful

 ['/data/prj/python/python3-3.8/python', '-X', 'faulthandler', '-X', 'utf8=0', '-c', 'import locale, sys; print("%s:%s" % (locale.getpreferredencoding(), ascii(sys.argv[1:])))', b'h\xc3\xa9\xe2\x82\xac']

Here: utf8=0 fails. The arg to the CLI is equal in both cases.
FAIL

## Going back to check() and what does it have:
## Add some debug. The first line is the 'raw' expected,
## the second line is ascii(decoded)
## the final is the value extracted from get_output

  +214          def check(utf8_opt, expected, **kw):
  +215              out = self.get_output('-X', utf8_opt, '-c', code, arg, **kw)
  +216              args = out.partition(':')[2].rstrip()
  +217              print("")
  +218              print("%s: expected\n%s:ascii(expected)\n%s:out" % (expected, ascii(expected), out))
  +219              self.assertEqual(args, ascii(expected), out)

For: utf8 mode true, it works:
['h▒\u20ac']: expected
['h\xe9\u20ac']:ascii(expected)
UTF-8:['h\xe9\u20ac']:out

  +221          check('utf8', [arg_utf8])

But not for utf8=0
  +226          check('utf8=0', [c_arg], LC_ALL='C')
 # note, different values for LC_ALL='C' have been tried
['h\udcc3\udca9\udce2\udc82\udcac']: expected
['h\udcc3\udca9\udce2\udc82\udcac']:ascii(expected)
ISO8859-1:['h\xc3\xa9\xe2\x82\xac']:out

## re: expected and ascii(expected)
When utf8=1, expected and ascii(expected) differ. "arg" looks different from both, but after processing by get_output() expected and out match.

When utf8=0, there is no difference in the "arg" passed to "code".
However, within check, the values of expected and ascii(expected) are identical. And, sadly, the value coming back via get_output looks nothing like 'expected'.

In short, when utf8=1 ascii(b'h\xc3\xa9\xe2\x82\xac') becomes 'h\xe9\u20ac', which is what is desired. But when utf8=0 ascii(b'h\xc3\xa9\xe2\x82\xac') is b'h\xc3\xa9\xe2\x82\xac', not 'h\udcc3\udca9\udce2\udc82\udcac'.

Finally, when I run the command from the command line (after rewrites)

What passes:
./python '-X' 'faulthandler' '-X' 'utf8=1' '-c' 'import locale, sys; print("%s:%s" % (locale.getpreferredencoding(), ascii(sys.argv[1:])))' b'h\xc3\xa9\xe2\x82\xac'
UTF-8:['bh\\xc3\\xa9\\xe2\\x82\\xac']

encoding is UTF-8, but the result of ascii(argv[1]) is the same as argv[1]

./python '-X' 'faulthandler' '-X' 'utf8=0' '-c' 'import locale, sys; print("%s:%s" % (locale.getpreferredencoding(), ascii(sys.argv[1:])))' b'h\xc3\xa9\xe2\x82\xac'

ISO8859-1:['bh\\xc3\\xa9\\xe2\\x82\\xac']

Here, the only difference in the output is that "UTF-8" has been changed to "ISO8859-1", i.e., I was expecting a difference in the result of ascii('bh\\xc3\\xa9\\xe2\\x82\\xac'). Instead, I see "bytes obj in", "bytes obj out" -- apparently unchanged. HOWEVER, the result returned by get_output is always different, even if it is just limited to removing the 'b' quality.

Again: test result includes:
 ISO8859-1:['h\xc3\xa9\xe2\x82\xac'] - which is not equal to manual CLI with ISO8859-1:['bh\\xc3\\xa9\\xe2\\x82\\xac']
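One likely reason the manual runs differ from the test: in a POSIX shell, the word b'h\xc3\xa9\xe2\x82\xac' is not a bytes literal. The shell joins the unquoted b to the single-quoted text and keeps the backslashes literally. A sketch of what the interpreter then receives, compared with what the test passes:

```python
# What a POSIX shell passes for the word  b'h\xc3\xa9\xe2\x82\xac'  :
# the unquoted "b" joined to the single-quoted text, backslashes kept as-is.
shell_word = 'b' + r'h\xc3\xa9\xe2\x82\xac'

# What the test passes through subprocess: six raw bytes.
test_arg = 'h\xe9\u20ac'.encode('utf-8')

print(len(shell_word), len(test_arg))
print(ascii([shell_word]))
```

The second print shows the doubled backslashes seen in the manual output, so the manual command exercises plain ASCII text rather than the non-ASCII bytes the test sends.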

So, I feel the issue is not with test, but within what happens after:

  +127      proc = subprocess.Popen(cmd_line, stdin=subprocess.PIPE,
  +128                           stdout=subprocess.PIPE, stderr=subprocess.PIPE,
  +129                           env=env, cwd=cwd)

Specifically: here.

  +130      with proc:
  +131          try:
  +132              out, err = proc.communicate()
  +133          finally:
  +134              proc.kill()
  +135              subprocess._cleanup()
  +136      rc = proc.returncode
  +137      err = strip_python_stderr(err)
  +138      return _PythonRunResult(rc, out, err), cmd_line


PASS:
['/data/prj/python/python3-3.8/python', '-X', 'faulthandler', '-X', 'utf8', '-c', 'import locale, sys; print("%s:%s" % (locale.getpreferredencoding(), ascii(sys.argv[1:])))', b'h\xc3\xa9\xe2\x82\xac']
 0 b"UTF-8:['h\\xe9\\u20ac']\n" b''

FAIL:
['/data/prj/python/python3-3.8/python', '-X', 'faulthandler', '-X', 'utf8=0', '-c', 'import locale, sys; print("%s:%s" % (locale.getpreferredencoding(), ascii(sys.argv[1:])))', b'h\xc3\xa9\xe2\x82\xac']
 0 b"ISO8859-1:['h\\xc3\\xa9\\xe2\\x82\\xac']\n" b''

Seems the 'b' quality disappears somehow with:
  +216              args = out.partition(':')[2].rstrip()

So, maybe it is in test - in that line.
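A sketch of that step, using the failing output copied from above. It assumes get_output() hands check() a decoded str (get_output itself is not shown in this issue). The b prefix in the debug output belongs to the pipe data, not to the argument the child printed.

```python
# Raw stdout of the failing child, copied from the debug output above.
raw_stdout = b"ISO8859-1:['h\\xc3\\xa9\\xe2\\x82\\xac']\n"

# Assumption: get_output() gives check() a decoded str.
out = raw_stdout.decode('ascii')
args = out.partition(':')[2].rstrip()
print(args)

# The child printed the repr of a list of str, so there is no b prefix to lose.
print(args.startswith("['h"))
```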

However, this goes well beyond my comprehension of python internal workings.

Hope this helps. Please comment.
msg323831 - (view) Author: Łukasz Langa (lukasz.langa) * (Python committer) Date: 2018-08-21 14:55
I have no idea what's going on here yet but just wanted to report that we are seeing this issue on one FreeBSD buildbot, too:

https://buildbot.python.org/all/#/builders/124/builds/508/steps/4/logs/stdio

I can also reproduce on CentOS 7.

Could this be related to LC_ALL= or related environment variables?
msg323941 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2018-08-23 10:48
I fixed bpo-34207.
msg323942 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2018-08-23 10:51
Your issue is about decoding command line argument which is done from main() function. It doesn't use Python codecs, but functions like Py_DecodeLocale().

> Question 1: why is Windows excluded? Because it does not use UTF-8 as its default (its default is CP1252)

Windows uses wmain() which gets command line arguments as wchar_t* strings: Unicode. No decoding is needed.
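A rough Python-level emulation of the decoding described above, as a sketch only. The real work happens in C; emulate_argv_decoding is a hypothetical helper, and it assumes that bytes the locale encoding cannot decode are escaped with the surrogateescape handler.

```python
def emulate_argv_decoding(raw, locale_encoding):
    """Approximate how a raw argv entry turns into a str.

    Hypothetical helper: CPython does this in C, not with Python codecs.
    Undecodable bytes are escaped as lone surrogates.
    """
    return raw.decode(locale_encoding, 'surrogateescape')

arg = 'h\xe9\u20ac'.encode('utf-8')
for encoding in ('utf-8', 'ascii', 'iso-8859-1'):
    print(encoding, ascii([emulate_argv_decoding(arg, encoding)]))
```

With an 8-bit encoding such as iso-8859-1 every byte decodes, so no surrogates appear. That matches the AIX output reported earlier, where the preferred encoding under LC_ALL=C was ISO8859-1.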
msg323961 - (view) Author: Michael Felt (Michael.Felt) * Date: 2018-08-23 17:14
On 23/08/2018 12:51, STINNER Victor wrote:
> STINNER Victor <vstinner@redhat.com> added the comment:
>
> Your issue is about decoding command line argument which is done from main() function. It doesn't use Python codecs, but functions like Py_DecodeLocale().
This is beyond my understanding atm.
Early on I tried making the expected just be 'arg' and went from
situation A to situation B - which looked much closer, BUT, the 'types'
differed:

Situation A (original)
AssertionError: "['h\\xc3\\xa9\\xe2\\x82\\xac']" != "['h\\udcc3\\udca9\\udce2\\udc82\\udcac']"
- ['h\xc3\xa9\xe2\x82\xac']
+ ['h\udcc3\udca9\udce2\udc82\udcac']
 : ISO8859-1:['h\xc3\xa9\xe2\x82\xac']

I tried saying the "expected" is arg, but arg is still a byte object, the cmd_line result is not (printed as such).

Situation B
AssertionError: "['h\\xc3\\xa9\\xe2\\x82\\xac']" != "[b'h\\xc3\\xa9\\xe2\\x82\\xac']"
- ['h\xc3\xa9\xe2\x82\xac']
+ [b'h\xc3\xa9\xe2\x82\xac']
?  +
 : ISO8859-1:['h\xc3\xa9\xe2\x82\xac']

After further digging - to understand why it was coming as "\x encoding rather than \udc"

I looked at what was happening here:

out = self.get_output('-X', utf8_opt, '-c', code, arg, **kw)
becomes
out = self.get_output('-X', utf8_opt, '-c', code, 'h\xe9\u20ac'.encode('utf-8'), **kw)
becomes
out = self.get_output('-X', utf8_opt, '-c', code, b'h\xc3\xa9\xe2\x82\xac', **kw)

And finally, at the CLI becomes:
['/data/prj/python/python3-3.8/python', '-X', 'faulthandler', '-X', 'utf8', '-c', 'import locale, sys; print("%s:%s" % (locale.getpreferredencoding(), ascii(sys.argv[1:])))', b'h\xc3\xa9\xe2\x82\xac']

/data/prj/python/python3-3.8/python '-X' 'faulthandler' '-X' 'utf8' '-c' 'import locale, sys; print("%s:%s" % (locale.getpreferredencoding(), ascii(sys.argv[1:])))', b'h\xc3\xa9\xe2\x82\xac'
UTF-8:['bh\\xc3\\xa9\\xe2\\x82\\xac']

/data/prj/python/python3-3.8/python '-X' 'faulthandler' '-X' 'utf8=0' '-c' 'import locale, sys; print("%s:%s" % (locale.getpreferredencoding(), ascii(sys.argv[1:])))', b'h\xc3\xa9\xe2\x82\xac'
ISO8859-1:['bh\\xc3\\xa9\\xe2\\x82\\xac']

Note:
/data/prj/python/python3-3.8/python '-X' 'faulthandler' '-X' 'utf8=0' '-c' 'import locale, sys; print("%s:%s" % (locale.getpreferredencoding(), ascii(sys.argv[1:])))', 'h\udcc3\udca9\udce2\udc82\udcac'
ISO8859-1:['h\\udcc3\\udca9\\udce2\\udc82\\udcac']

/data/prj/python/python3-3.8/python '-X' 'faulthandler' '-X' 'utf8=0' '-c' 'import locale, sys; print("%s:%s" % (locale.getpreferredencoding(), ascii(sys.argv[1:])))', b'h\udcc3\udca9\udce2\udc82\udcac'
ISO8859-1:['bh\\udcc3\\udca9\\udce2\\udc82\\udcac']

root@x066:[/data/prj/python/python3-3.8]/data/prj/python/python3-3.8/python '-X' 'faulthandler' '-X' 'utf8' '-c' 'import locale, sys; print("%s:%s" % (>
UTF-8:['bh\\udcc3\\udca9\\udce2\\udc82\\udcac']

Summary:
a) concerned about how b'h....' becomes 'bh....'
b) whatever argv[1] is, it is very close to what is returned - so whatever happens during the transformation from
self.get_output('-X', utf8_opt, '-c', code, arg, **kw)
determines the output and the (failed) comparison.

msg323996 - (view) Author: Michael Felt (Michael.Felt) * Date: 2018-08-24 10:28
p.s. also tried:
michael@x071:[/data/prj/python/git/python3-3.8]/data/prj/python/python3-3.8/python '-X' 'faulthandler' '-X' 'utf8=0' '-c' 'import locale, sys; print("%s:%s" % (locale.getpreferredencoding(), ascii(sys.argv[1:])))', 'h\xe9\u20ac'.encode\('utf-8'\)
ISO8859-1:['h\\xe9\\u20ac.encode(utf-8)']
michael@x071:[/data/prj/python/git/python3-3.8]/data/prj/python/python3-3.8/python '-X' 'faulthandler' '-X' 'utf8=1' '-c' 'import locale, sys; print("%s:%s" % (locale.getpreferredencoding(), ascii(sys.argv[1:])))', 'h\xe9\u20ac'.encode\('utf-8'\)
UTF-8:['h\\xe9\\u20ac.encode(utf-8)']

Really unclear to me what this test is trying to verify. The CLI seems
to just 'echo' what it is provided.
msg324067 - (view) Author: Michael Felt (Michael.Felt) * Date: 2018-08-25 13:40
Solution much simpler than I thought:

not arg.decode('ascii', 'surrogateescape'), but arg.decode('iso-8859-1')
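A sketch of how the expected value could be chosen per platform, combining the branches of the original test with this finding. expected_c_locale_arg is a hypothetical helper written for illustration; the merged patch may be structured differently.

```python
def expected_c_locale_arg(arg, platform, is_android=False):
    """Pick the str that a child run with LC_ALL=C should report for arg.

    Hypothetical helper mirroring the branches discussed in this issue.
    """
    if platform == 'darwin' or is_android:
        return arg.decode('utf-8')
    if platform.startswith('aix'):
        # The C locale on AIX reported ISO8859-1, so every byte decodes.
        return arg.decode('iso-8859-1')
    return arg.decode('ascii', 'surrogateescape')

arg = 'h\xe9\u20ac'.encode('utf-8')
for platform in ('linux', 'darwin', 'aix7'):
    print(platform, ascii([expected_c_locale_arg(arg, platform)]))
```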
msg324097 - (view) Author: Michael Osipov (michael-o) * Date: 2018-08-25 19:46
This is a very thorough analysis. Kudos to that.
msg324179 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2018-08-27 13:40
New changeset 7ef1697be54a74314d5214d9ba0580d4e620694c by Victor Stinner (Michael Felt) in branch 'master':
bpo-34347: Fix test_utf8_mode.test_cmd_line for AIX (GH-8923)
https://github.com/python/cpython/commit/7ef1697be54a74314d5214d9ba0580d4e620694c
msg324181 - (view) Author: Michael Osipov (michael-o) * Date: 2018-08-27 14:24
Interestingly, the very same approach does not work for HP-UX, even if I swap in the parameters for HP-UX:

$ ./python -m test test_utf8_mode
Run tests sequentially
0:00:00 [1/1] test_utf8_mode
test test_utf8_mode failed -- Traceback (most recent call last):
  File "/var/osipovmi/cpython/Lib/test/test_utf8_mode.py", line 226, in test_cmd_line
    check('utf8=0', [c_arg], LC_ALL='C')
  File "/var/osipovmi/cpython/Lib/test/test_utf8_mode.py", line 217, in check
    self.assertEqual(args, ascii(expected), out)
AssertionError: "['h\\xc3\\xa9\\xe2\\x82\\xac']" != "['h\\xfb\\u02cb\\xe3\\x82\\u02dc']"
- ['h\xc3\xa9\xe2\x82\xac']
+ ['h\xfb\u02cb\xe3\x82\u02dc']
 : roman8:['h\xc3\xa9\xe2\x82\xac']
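The HP-UX failure can be examined the same way. This is a sketch that assumes Python's hp_roman8 codec corresponds to the roman8 encoding the child reported; it compares the codec-based expectation with a byte-for-byte decoding, which is what the child's output looks like.

```python
arg = 'h\xe9\u20ac'.encode('utf-8')

# Expectation built with the codec (assumed to match the platform's roman8).
via_codec = arg.decode('hp_roman8')
# Byte-for-byte decoding, which is what the child's output looks like.
byte_for_byte = arg.decode('iso-8859-1')

print(ascii([via_codec]))
print(ascii([byte_for_byte]))
print(via_codec == byte_for_byte)
```

If the two differ, as in the traceback above, the child's argv decoding did not go through the same mapping as the Python codec, so simply swapping the codec name into the test is not enough on that platform.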
msg324409 - (view) Author: Michael Felt (Michael.Felt) * Date: 2018-08-31 09:19
The buildbots seem happy. This may be closed.
msg324419 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2018-08-31 14:17
> The buildbots seem happy. This may be closed.

Cool, thank you for checking, and thanks for your fix! I close the issue.
History
Date | User | Action | Args
2018-08-31 14:17:50 | vstinner | set | status: open -> closed; resolution: fixed; messages: + msg324419; stage: patch review -> resolved
2018-08-31 09:19:16 | Michael.Felt | set | messages: + msg324409
2018-08-27 14:24:43 | michael-o | set | messages: + msg324181
2018-08-27 13:40:21 | vstinner | set | messages: + msg324179
2018-08-25 19:46:27 | michael-o | set | nosy: + michael-o; messages: + msg324097
2018-08-25 13:40:42 | Michael.Felt | set | messages: + msg324067
2018-08-25 13:38:20 | Michael.Felt | set | keywords: + patch; stage: patch review; pull_requests: + pull_request8396
2018-08-24 10:28:43 | Michael.Felt | set | messages: + msg323996
2018-08-23 17:14:25 | Michael.Felt | set | messages: + msg323961
2018-08-23 10:51:30 | vstinner | set | messages: + msg323942
2018-08-23 10:48:50 | vstinner | set | nosy: + vstinner; messages: + msg323941
2018-08-21 14:56:54 | lukasz.langa | set | keywords: + 3.7regression
2018-08-21 14:56:17 | lukasz.langa | set | dependencies: + test_cmd_line test_utf8_mode test_warnings fail in all FreeBSD 3.x (3.8) buildbots
2018-08-21 14:55:43 | lukasz.langa | set | nosy: + lukasz.langa; messages: + msg323831
2018-08-09 11:55:07 | Michael.Felt | set | messages: + msg323319
2018-08-07 20:23:35 | Michael.Felt | set | messages: + msg323250
2018-08-06 20:26:57 | Michael.Felt | set | files: + pEpkey.asc; messages: + msg323223
2018-08-06 20:10:54 | Michael.Felt | set | messages: + msg323222
2018-08-06 16:06:50 | Michael.Felt | create |