AIX: test_utf8_mode.test_cmd_line fails #78528

Closed
aixtools opened this issue Aug 6, 2018 · 18 comments
Labels
3.7 (EOL) end of life 3.8 only security fixes interpreter-core (Objects, Python, Grammar, and Parser dirs) tests Tests in the Lib/test dir type-bug An unexpected behavior, bug, or error

Comments

@aixtools
Contributor

aixtools commented Aug 6, 2018

BPO 34347
Nosy @vstinner, @ambv, @aixtools, @michael-o
PRs
  • bpo-34347: Fix test_utf8_mode.test_cmd_line for AIX #8923
  • [3.7] bpo-34347: Fix test_utf8_mode.test_cmd_line for AIX (GH-8923) #14233
Dependencies
  • bpo-34207: test_cmd_line test_utf8_mode test_warnings fail in all FreeBSD 3.x (3.8) buildbots
Files
  • pEpkey.asc
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    GitHub fields:

    assignee = None
    closed_at = <Date 2018-08-31.14:17:50.382>
    created_at = <Date 2018-08-06.16:06:50.873>
    labels = ['interpreter-core', '3.8', 'type-bug', 'tests', '3.7']
    title = 'AIX: test_utf8_mode.test_cmd_line fails'
    updated_at = <Date 2019-06-19.20:07:50.320>
    user = 'https://github.com/aixtools'

    bugs.python.org fields:

    activity = <Date 2019-06-19.20:07:50.320>
    actor = 'vstinner'
    assignee = 'none'
    closed = True
    closed_date = <Date 2018-08-31.14:17:50.382>
    closer = 'vstinner'
    components = ['Interpreter Core', 'Tests']
    creation = <Date 2018-08-06.16:06:50.873>
    creator = 'Michael.Felt'
    dependencies = ['34207']
    files = ['47733']
    hgrepos = []
    issue_num = 34347
    keywords = ['patch', '3.7regression']
    message_count = 18.0
    messages = ['323214', '323222', '323223', '323250', '323319', '323831', '323941', '323942', '323961', '323996', '324067', '324097', '324179', '324181', '324409', '324419', '337636', '346078']
    nosy_count = 4.0
    nosy_names = ['vstinner', 'lukasz.langa', 'Michael.Felt', 'michael-o']
    pr_nums = ['8923', '14233']
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue34347'
    versions = ['Python 3.7', 'Python 3.8']

    @aixtools
    Contributor Author

    aixtools commented Aug 6, 2018

    The test fails because the expected value,

    byte_str.decode('ascii', 'surrogateescape')

    is not what ascii(sys.argv[1:]) returns when the bytes are passed on the command line.

    Assumption: since check('utf8', [arg_utf8]) succeeds, the parsing of the command line itself is correct.

    DETAILS
    >>> arg = 'h\xe9\u20ac'.encode('utf-8')
    >>> arg
    b'h\xc3\xa9\xe2\x82\xac'
    
    >>> arg.decode('ascii', 'surrogateescape')
    'h\udcc3\udca9\udce2\udc82\udcac'

    I am having a difficult time getting the syntax correct for all the "escapes", so I added a print statement in the check routine:

    test_cmd_line (test.test_utf8_mode.UTF8ModeTests) ...
    code:import locale, sys; print("%s:%s" % (locale.getpreferredencoding(), ascii(sys.argv[1:]))) arg:b'h\xc3\xa9\xe2\x82\xac'
    out:UTF-8:['h\xe9\u20ac']

    code:import locale, sys; print("%s:%s" % (locale.getpreferredencoding(), ascii(sys.argv[1:]))) arg:b'h\xc3\xa9\xe2\x82\xac'
    out:ISO8859-1:['h\xc3\xa9\xe2\x82\xac']

    test code with my debug statement (to generate above):

        def test_cmd_line(self):
            arg = 'h\xe9\u20ac'.encode('utf-8')
            arg_utf8 = arg.decode('utf-8')
            arg_ascii = arg.decode('ascii', 'surrogateescape')
            code = 'import locale, sys; print("%s:%s" % (locale.getpreferredencoding(), ascii(sys.argv[1:])))'
    
            def check(utf8_opt, expected, **kw):
                out = self.get_output('-X', utf8_opt, '-c', code, arg, **kw)
                print("\ncode:%s arg:%s\nout:%s" % (code, arg, out))
                args = out.partition(':')[2].rstrip()
                self.assertEqual(args, ascii(expected), out)
    
            check('utf8', [arg_utf8])
            if sys.platform == 'darwin' or support.is_android:
                c_arg = arg_utf8
            else:
                c_arg = arg_ascii
            check('utf8=0', [c_arg], LC_ALL='C')

    So the first check succeeds:

            check('utf8', [arg_utf8])

    But the second does not:

    FAIL: test_cmd_line (test.test_utf8_mode.UTF8ModeTests)
    ----------------------------------------------------------------------

    Traceback (most recent call last):
      File "/data/prj/python/src/python3-3.7.0/Lib/test/test_utf8_mode.py", line 225, in test_cmd_line
        check('utf8=0', [c_arg], LC_ALL='C')
      File "/data/prj/python/src/python3-3.7.0/Lib/test/test_utf8_mode.py", line 218, in check
        self.assertEqual(args, ascii(expected), out)
    AssertionError: "['h\\xc3\\xa9\\xe2\\x82\\xac']" != "['h\\udcc3\\udca9\\udce2\\udc82\\udcac']"
    - ['h\xc3\xa9\xe2\x82\xac']
    + ['h\udcc3\udca9\udce2\udc82\udcac']
     : ISO8859-1:['h\xc3\xa9\xe2\x82\xac']

    I tried saying the "expected" is arg, but arg is still a byte object, the cmd_line result is not (printed as such).

    AssertionError: "['h\\xc3\\xa9\\xe2\\x82\\xac']" != "[b'h\\xc3\\xa9\\xe2\\x82\\xac']"

    • ['h\xc3\xa9\xe2\x82\xac']
      + [b'h\xc3\xa9\xe2\x82\xac']
      ? +
      : ISO8859-1:['h\xc3\xa9\xe2\x82\xac']
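
The extra `b` in the second diff comes from the type of the expected value, not from the subprocess. A minimal sketch of that comparison, with the child's output copied from the log above (the variable names are illustrative only):

```python
# Entries of sys.argv are always str, so ascii(sys.argv[1:]) never shows a b'' prefix.
arg = 'h\xe9\u20ac'.encode('utf-8')            # bytes passed to the child
child_out = "['h\\xc3\\xa9\\xe2\\x82\\xac']"    # text printed by the child on AIX

# Using the bytes object itself as "expected" adds the bytes-literal prefix.
expected_from_bytes = ascii([arg])
print(expected_from_bytes)
print(expected_from_bytes == child_out)

# Apart from that prefix, the two strings are identical.
print(expected_from_bytes.replace("[b'", "['") == child_out)
```

So the expected value has to be a str whose code points equal the byte values, rather than the bytes object.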

    @aixtools aixtools added 3.7 (EOL) end of life 3.8 only security fixes interpreter-core (Objects, Python, Grammar, and Parser dirs) tests Tests in the Lib/test dir type-bug An unexpected behavior, bug, or error labels Aug 6, 2018
    @aixtools
    Contributor Author

    aixtools commented Aug 6, 2018

    In short, I do not understand how this passes on Linux.

    This is python3-3.4.6 on sles12:

    >>> 'h\xe9\u20ac'.encode('utf-8')
    b'h\xc3\xa9\xe2\x82\xac'
    >>> ascii('h\xe9\u20ac'.encode('utf-8'))
    "b'h\\xc3\\xa9\\xe2\\x82\\xac'"
    >>> 'h\xe9\u20ac'.encode('utf-8').decode('us-ascii', 'surrogateescape')
    'h\udcc3\udca9\udce2\udc82\udcac'
    >>>
    
    This is python3-3.7.0 on AIX:
    >>> 'h\xe9\u20ac'.encode('utf-8')
    b'h\xc3\xa9\xe2\x82\xac'
    >>> ascii('h\xe9\u20ac'.encode('utf-8'))
    "b'h\\xc3\\xa9\\xe2\\x82\\xac'"
    >>> 'h\xe9\u20ac'.encode('utf-8').decode('us-ascii', 'surrogateescape')
    'h\udcc3\udca9\udce2\udc82\udcac'

    If I am missing something essential here - please be blunt!

    @aixtools
    Contributor Author

    aixtools commented Aug 6, 2018

    Also seeing the same with Windows.
    C:\Users\MICHAELFelt>python
    Python 3.7.0 (v3.7.0:1bf9cc5093, Jun 27 2018, 04:06:47) [MSC v.1914 32
    bit (Intel)] on win32
    Type "help", "copyright", "credits" or "license" for more information.
    >>> 'h\xe9\u20ac'.encode('utf-8')
    b'h\xc3\xa9\xe2\x82\xac'
    >>> ascii('h\xe9\u20ac'.encode('utf-8'))
    "b'h\\xc3\\xa9\\xe2\\x82\\xac'"
    >>> 'h\xe9\u20ac'.encode('utf-8').decode('ascii','surrogateescape')
    'h\udcc3\udca9\udce2\udc82\udcac'
    >>>

    @aixtools
    Contributor Author

    aixtools commented Aug 7, 2018

    Common "experts" - feedback needed!

    Original
    test test_utf8_mode failed -- Traceback (most recent call last):
      File "/data/prj/python/git/python3-3.8/Lib/test/test_utf8_mode.py", line 225, in test_cmd_line
        check('utf8=0', [c_arg], LC_ALL='C')
      File "/data/prj/python/git/python3-3.8/Lib/test/test_utf8_mode.py", line 217, in check
        self.assertEqual(args, ascii(expected), out)
    AssertionError: "['h\\xc3\\xa9\\xe2\\x82\\xac']" != "['h\\udcc3\\udca9\\udce2\\udc82\\udcac']"
    - ['h\xc3\xa9\xe2\x82\xac']
    + ['h\udcc3\udca9\udce2\udc82\udcac']
     : ISO8859-1:['h\xc3\xa9\xe2\x82\xac']

    Modification #1:
        if sys.platform == 'darwin' or support.is_android:
            c_arg = arg_utf8
        elif sys.platform.startswith("aix"):
            c_arg = arg_ascii.encode('utf-8', 'surrogateescape')
        else:
            c_arg = arg_ascii
        check('utf8=0', [c_arg], LC_ALL='C')

    Result:
    AssertionError: "['h\\xc3\\xa9\\xe2\\x82\\xac']" != "[b'h\\xc3\\xa9\\xe2\\x82\\xac']"

    • ['h\xc3\xa9\xe2\x82\xac']
      + [b'h\xc3\xa9\xe2\x82\xac']
      ? +
      : ISO8859-1:['h\xc3\xa9\xe2\x82\xac']

    Modification #2:
        if sys.platform == 'darwin' or support.is_android:
            c_arg = arg_utf8
        elif sys.platform.startswith("aix"):
            c_arg = arg
        else:
            c_arg = arg_ascii
        check('utf8=0', [c_arg], LC_ALL='C')

    AssertionError: "['h\\xc3\\xa9\\xe2\\x82\\xac']" != "[b'h\\xc3\\xa9\\xe2\\x82\\xac']"

    • ['h\xc3\xa9\xe2\x82\xac']
      + [b'h\xc3\xa9\xe2\x82\xac']
      ? +
      : ISO8859-1:['h\xc3\xa9\xe2\x82\xac']

    With either modification, "expected" is still a bytes object, while the CLI code returns a str.
    With the original code, "expected" is a str, but ascii() renders it with \udc escapes rather than \x escapes.

    \udc is common (i.e., I see it frequently in googled results on other things) - should something in ascii() be changed to output \udc rather than \x ?
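
On the \udc versus \x question: ascii() chooses the escape form from each character's code point, so the two renderings correspond to different strings. A small sketch (the sample labels are illustrative):

```python
# ascii() picks the escape form from the code point, not from any codec.
samples = {
    'latin-1 range': '\xc3',      # U+00C3
    'lone surrogate': '\udcc3',   # U+DCC3, what surrogateescape yields for byte 0xC3
    'BMP': '\u20ac',              # U+20AC
}

for label, ch in samples.items():
    print(label, hex(ord(ch)), ascii(ch))

escaped = {label: ascii(ch) for label, ch in samples.items()}
```

A \udcXX escape therefore means the string holds a lone surrogate, and a \xXX escape means it holds a code point below U+0100. Changing ascii() would not make the two strings equal.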

    Thx!

    @aixtools
    Contributor Author

    aixtools commented Aug 9, 2018

    Starting this discussion again. Please take time to read. I have spent hours trying to understand what is failing. Please spend a few minutes reading.

    Sadly, there is a lot of text - but I do not know what I could leave out without damaging the process of discovery.

    The failing result is:

    self.assertEqual(args, ascii(expected), out)
    

    AssertionError: "['h\\xc3\\xa9\\xe2\\x82\\xac']" != "['h\\udcc3\\udca9\\udce2\\udc82\\udcac']"

    • ['h\xc3\xa9\xe2\x82\xac']
      + ['h\udcc3\udca9\udce2\udc82\udcac']
      : ISO8859-1:['h\xc3\xa9\xe2\x82\xac']

    The test code is:
    +207     @unittest.skipIf(MS_WINDOWS, 'test specific to Unix')
    +208     def test_cmd_line(self):
    +209         arg = 'h\xe9\u20ac'.encode('utf-8')
    +210         arg_utf8 = arg.decode('utf-8')
    +211         arg_ascii = arg.decode('ascii', 'surrogateescape')
    +212         code = 'import locale, sys; print("%s:%s" % (locale.getpreferredencoding(), ascii(sys.argv[1:])))'
    +213
    +214         def check(utf8_opt, expected, **kw):
    +215             out = self.get_output('-X', utf8_opt, '-c', code, arg, **kw)
    +216             args = out.partition(':')[2].rstrip()
    +217             self.assertEqual(args, ascii(expected), out)
    +218
    +219         check('utf8', [arg_utf8])
    +220         if sys.platform == 'darwin' or support.is_android:
    +221             c_arg = arg_utf8
    +222         else:
    +223             c_arg = arg_ascii
    +224         check('utf8=0', [c_arg], LC_ALL='C')

    Question 1: why is Windows excluded? Because it does not use UTF-8 as its default (its default is CP1252)

    Question 2: It seems that what the test is 'checking' is that object.encode('utf-8') gets decoded by ascii() based on the utf8_mode set.

    +215 out = self.get_output('-X', utf8_opt, '-c', code, arg, **kw)

    rewrites (less indent) as:
    +215 out = self.get_output('-X', utf8_opt, '-c', code, 'h\xe9\u20ac'.encode('utf-8'), **kw)

    or
    out = self.get_output('-X', utf8_opt, '-c', code, b'h\xc3\xa9\xe2\x82\xac', **kw)

    Finally, in Lib/test/support/script_helper.py we have
    +127 print("\n", cmd_line) # debug info, ignore
    +128 proc = subprocess.Popen(cmd_line, stdin=subprocess.PIPE,
    +129 stdout=subprocess.PIPE, stderr=subprocess.PIPE,
    +130 env=env, cwd=cwd)

    Which gives:

    ['/data/prj/python/python3-3.8/python', '-X', 'faulthandler', '-X', 'utf8', '-c', 'import locale, sys; print("%s:%s" % (locale.getpreferredencoding(), ascii(sys.argv[1:])))', b'h\xc3\xa9\xe2\x82\xac']

    Above - utf8=1 - is successful

    ['/data/prj/python/python3-3.8/python', '-X', 'faulthandler', '-X', 'utf8=0', '-c', 'import locale, sys; print("%s:%s" % (locale.getpreferredencoding(), ascii(sys.argv[1:])))', b'h\xc3\xa9\xe2\x82\xac']

    Here: utf8=0 fails. The arg to the CLI is equal in both cases.
    FAIL

    ## Going back to check() and what it has:
    ## Add some debug. The first line is the 'raw' expected,
    ## the second line is ascii(decoded)
    ## the final is the value extracted from get_output

    +214         def check(utf8_opt, expected, **kw):
    +215             out = self.get_output('-X', utf8_opt, '-c', code, arg, **kw)
    +216             args = out.partition(':')[2].rstrip()
    +217             print("")
    +218             print("%s: expected\n%s:ascii(expected)\n%s:out" % (expected, ascii(expected), out))
    +219             self.assertEqual(args, ascii(expected), out)

    For: utf8 mode true, it works:
    ['h▒\u20ac']: expected
    ['h\xe9\u20ac']:ascii(expected)
    UTF-8:['h\xe9\u20ac']:out

    +221 check('utf8', [arg_utf8])

    But not for utf8=0
    +226 check('utf8=0', [c_arg], LC_ALL='C')
    # note, different values for LC_ALL='C' have been tried
    ['h\udcc3\udca9\udce2\udc82\udcac']: expected
    ['h\udcc3\udca9\udce2\udc82\udcac']:ascii(expected)
    ISO8859-1:['h\xc3\xa9\xe2\x82\xac']:out

    ## re: expected and ascii(expected)
    When utf8=1, expected and ascii(expected) differ. "arg" looks different from both, but after processing by get_output() expected and out match.

    When utf8=0, there is no difference in the "arg" passed to "code".
    However, within check() the values of expected and ascii(expected) are identical. And, sadly, the value coming back via get_output() looks nothing like 'expected'.

    In short, when utf8=1, ascii(b'h\xc3\xa9\xe2\x82\xac') becomes ['h\xe9\u20ac'], which is what is desired. But when utf8=0, ascii(b'h\xc3\xa9\xe2\x82\xac') is b'h\xc3\xa9\xe2\x82\xac', not 'h\udcc3\udca9\udce2\udc82\udcac'.

    Finally, when I run the command from the command line (after rewrites)

    What passes:
    ./python '-X' 'faulthandler' '-X' 'utf8=1' '-c' 'import locale, sys; print("%s:%s" % (locale.getpreferredencoding(), ascii(
    sys.argv[1:])))' b'h\xc3\xa9\xe2\x82\xac'
    UTF-8:['bh\\xc3\\xa9\\xe2\\x82\\xac']

    encoding is UTF-8, but the result of ascii(argv[1]) is the same as argv[1]

    ./python '-X' 'faulthandler' '-X' 'utf8=0' '-c' 'import locale, sys; print("%s:%s" % (locale.getpreferredencoding(), ascii(
    sys.argv[1:])))' b'h\xc3\xa9\xe2\x82\xac'

    ISO8859-1:['bh\\xc3\\xa9\\xe2\\x82\\xac']

    Here, the only difference in the output is that "UTF-8" has been changed to "ISO8859-1", i.e., I was expecting a difference in the result of ascii('bh\\xc3\\xa9\\xe2\\x82\\xac'). Instead, I see "bytes obj in", "bytes obj out" -- apparently unchanged. HOWEVER, the result returned by get_output is always different, even if it is just limited to removing the 'b' quality.
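
The 'bh\\xc3...' output of the manual runs can be reproduced without a subprocess. A hedged sketch, using shlex.split as a stand-in for the shell's word splitting (an assumption; the actual shell on the AIX host is not shown in the report):

```python
import shlex

# What was typed at the shell prompt: a Python bytes literal, as plain text.
typed = r"b'h\xc3\xa9\xe2\x82\xac'"

# POSIX-style splitting removes the quotes, joins the leading b to the quoted
# part, and keeps backslashes literal inside single quotes.
argv = shlex.split(typed)

print(argv[0])       # the argument as the child process receives it
print(ascii(argv))   # what the child's ascii(sys.argv[1:]) then prints
```

The child in the manual runs receives ASCII text with literal backslashes, not the five non-ASCII bytes the test passes, so the manual runs and the test are not comparable.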

    Again: test result includes:
    ISO8859-1:['h\xc3\xa9\xe2\x82\xac'] - which is not equal to manual CLI with ISO8859-1:['bh\\xc3\\xa9\\xe2\\x82\\xac']

    So, I feel the issue is not with test, but within what happens after:

    +127 proc = subprocess.Popen(cmd_line, stdin=subprocess.PIPE,
    +128 stdout=subprocess.PIPE, stderr=subprocess.PIPE,
    +129 env=env, cwd=cwd)

    Specifically: here.

    +130 with proc:
    +131 try:
    +132 out, err = proc.communicate()
    +133 finally:
    +134 proc.kill()
    +135 subprocess._cleanup()
    +136 rc = proc.returncode
    +137 err = strip_python_stderr(err)
    +138 return _PythonRunResult(rc, out, err), cmd_line

    PASS:
    ['/data/prj/python/python3-3.8/python', '-X', 'faulthandler', '-X', 'utf8', '-c', 'import locale, sys; print("%s:%s" % (locale.getpreferredencoding(), ascii(sys.argv[1:])))', b'h\xc3\xa9\xe2\x82\xac']
    0 b"UTF-8:['h\\xe9\\u20ac']\n" b''

    FAIL:
    ['/data/prj/python/python3-3.8/python', '-X', 'faulthandler', '-X', 'utf8=0', '-c', 'import locale, sys; print("%s:%s" % (locale.getpreferredencoding(), ascii(sys.argv[1:])))', b'h\xc3\xa9\xe2\x82\xac']
    0 b"ISO8859-1:['h\\xc3\\xa9\\xe2\\x82\\xac']\n" b''

    Seems the 'b' quality disappears somehow with:
    +216 args = out.partition(':')[2].rstrip()

    So, maybe it is in test - in that line.
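
A quick check of that line, using the raw stdout copied from the FAIL case above. The decode step is an assumption about what the test helper does before partition() is called:

```python
# Raw stdout captured from the child, copied from the FAIL case above.
raw = b"ISO8859-1:['h\\xc3\\xa9\\xe2\\x82\\xac']\n"

out = raw.decode()                      # assumed: the helper decodes stdout to str
encoding, _, args = out.partition(':')  # split at the first colon only
args = args.rstrip()

print(encoding)
print(args)
```

The only `b` is the prefix of the captured stdout itself. The list printed by the child contains a str, so partition() has nothing to remove.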

    However, this goes well beyond my comprehension of python internal workings.

    Hope this helps. Please comment.

    @ambv
    Contributor

    ambv commented Aug 21, 2018

    I have no idea what's going on here yet but just wanted to report that we are seeing this issue on one FreeBSD buildbot, too:

    https://buildbot.python.org/all/#/builders/124/builds/508/steps/4/logs/stdio

    I can also reproduce on CentOS 7.

    Could this be related to LC_ALL= or related environment variables?
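
One way to probe that question is to ask a child interpreter which encoding it reports under a given environment. A rough sketch; the helper name is hypothetical, and PYTHONCOERCECLOCALE=0 is added here as an assumption so that C locale coercion does not mask the result:

```python
import os
import subprocess
import sys

def preferred_encoding(**env_overrides):
    """Return locale.getpreferredencoding() as seen by a child interpreter."""
    env = dict(os.environ, **env_overrides)
    code = 'import locale; print(locale.getpreferredencoding())'
    result = subprocess.run(
        [sys.executable, '-X', 'utf8=0', '-c', code],
        env=env, stdout=subprocess.PIPE, check=True,
    )
    return result.stdout.decode('ascii').strip()

# The reported name is platform-specific (the AIX log above shows ISO8859-1).
print(preferred_encoding(LC_ALL='C', PYTHONCOERCECLOCALE='0'))
```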

    @vstinner
    Member

    I fixed bpo-34207.

    @vstinner
    Member

    Your issue is about decoding command-line arguments, which is done from the main() function. It doesn't use Python codecs, but functions like Py_DecodeLocale().

    > Question 1: why is windows excluded? Because it does not use UTF-8 as it's default (it's default is CP1252)

    Windows uses wmain(), which gets command-line arguments as wchar_t* strings: Unicode. No decoding is needed.
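
The effect of locale-based decoding can be modelled in pure Python. This is only a model: the real decoding happens in C and is platform-specific, and the helper name is hypothetical. It relies on Py_DecodeLocale() being documented as using the surrogateescape error handler:

```python
def decode_argv_model(raw: bytes, locale_encoding: str) -> str:
    """Python-level model of decoding one argv entry from the locale encoding."""
    # surrogateescape maps each undecodable byte 0xNN to the lone surrogate U+DCNN.
    return raw.decode(locale_encoding, 'surrogateescape')

arg = 'h\xe9\u20ac'.encode('utf-8')

# Same bytes, three locale encodings, three different sys.argv strings.
for name in ('ascii', 'iso-8859-1', 'utf-8'):
    print(name, ascii(decode_argv_model(arg, name)))
```

With an ASCII locale every high byte is undecodable and becomes a \udcXX surrogate. With ISO8859-1, the encoding the AIX log reports for LC_ALL=C, every byte is valid and maps to the code point of the same value, which gives the \xXX output.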

    @aixtools
    Contributor Author

    On 23/08/2018 12:51, STINNER Victor wrote:

    > STINNER Victor <vstinner@redhat.com> added the comment:
    >
    > Your issue is about decoding command line argument which is done from main() function. It doesn't use Python codecs, but functions like Py_DecodeLocale().

    This is beyond my understanding atm.
    Early on I tried making the expected just be 'arg' and went from
    situation A to situation B - which looked much closer, BUT, the 'types'
    differed:

    Situation A (original)
    AssertionError: "['h\\xc3\\xa9\\xe2\\x82\\xac']" != "['h\\udcc3\\udca9\\udce2\\udc82\\udcac']"

    • ['h\xc3\xa9\xe2\x82\xac']
      + ['h\udcc3\udca9\udce2\udc82\udcac']
      : ISO8859-1:['h\xc3\xa9\xe2\x82\xac']

    I tried saying the "expected" is arg, but arg is still a byte object, the cmd_line result is not (printed as such).

    Situation B
    AssertionError: "['h\\xc3\\xa9\\xe2\\x82\\xac']" != "[b'h\\xc3\\xa9\\xe2\\x82\\xac']"

    • ['h\xc3\xa9\xe2\x82\xac']
      + [b'h\xc3\xa9\xe2\x82\xac']
      ? +
      : ISO8859-1:['h\xc3\xa9\xe2\x82\xac']

    After further digging - to understand why it was coming out with "\x" escapes rather than "\udc"

    I looked at what was happening here:

    out = self.get_output('-X', utf8_opt, '-c', code, arg, **kw)
    becomes
    out = self.get_output('-X', utf8_opt, '-c', code, 'h\xe9\u20ac'.encode('utf-8'), **kw)
    becomes
    out = self.get_output('-X', utf8_opt, '-c', code, b'h\xc3\xa9\xe2\x82\xac', **kw)

    And finally, at the CLI becomes:
    ['/data/prj/python/python3-3.8/python', '-X', 'faulthandler', '-X', 'utf8', '-c', 'import locale, sys; print("%s:%s" % (locale.getpreferredencoding(), ascii(sys.argv[1:])))', b'h\xc3\xa9\xe2\x82\xac']

    /data/prj/python/python3-3.8/python '-X' 'faulthandler' '-X' 'utf8' '-c' 'import locale, sys; print("%s:%s" % (locale.getpreferredencoding(), ascii(sys.ar
    gv[1:])))', b'h\xc3\xa9\xe2\x82\xac'
    UTF-8:['bh\\xc3\\xa9\\xe2\\x82\\xac']

    /data/prj/python/python3-3.8/python '-X' 'faulthandler' '-X' 'utf8=0' '-c' 'import locale, sys; print("%s:%s" % (locale.getpreferredencoding(), ascii(sys.
    argv[1:])))', b'h\xc3\xa9\xe2\x82\xac'
    ISO8859-1:['bh\\xc3\\xa9\\xe2\\x82\\xac']

    Note:
    /data/prj/python/python3-3.8/python '-X' 'faulthandler' '-X' 'utf8=0' '-c' 'import locale, sys; print("%s:%s" % (locale.getpreferredencoding(), ascii(sys.
    argv[1:])))', 'h\udcc3\udca9\udce2\udc82\udcac'
    ISO8859-1:['h\\udcc3\\udca9\\udce2\\udc82\\udcac']

    /data/prj/python/python3-3.8/python '-X' 'faulthandler' '-X' 'utf8=0' '-c' 'import locale, sys; print("%s:%s" % (locale.getpreferredencoding(), ascii(sys.
    argv[1:])))', b'h\udcc3\udca9\udce2\udc82\udcac'
    ISO8859-1:['bh\\udcc3\\udca9\\udce2\\udc82\\udcac']

    root@x066:[/data/prj/python/python3-3.8]/data/prj/python/python3-3.8/python '-X' 'faulthandler' '-X' 'utf8' '-c' 'import locale, sys; print("%s:%s" % (>
    UTF-8:['bh\\udcc3\\udca9\\udce2\\udc82\\udcac']

    Summary:
    a) concerned about how b'h....' becomes 'bh....'
    b) whatever argv[1] is, is very close to what is returned - so whatever happens during the transformation from
    self.get_output('-X', utf8_opt, '-c', code, arg, **kw)
    determines the output and the (failed) comparison.


    @aixtools
    Contributor Author

    p.s. also tried:
    michael@x071:[/data/prj/python/git/python3-3.8]/data/prj/python/python3-3.8/python
    '-X' 'faulthandler' '-X' 'utf8=0' '-c' 'import locale, sys;
    print("%s:%s" % (locale.getpreferredencoding(), ascii(sys.argv[1:])))',
    'h\xe9\u20ac'.encode\('utf-8'\)
    ISO8859-1:['h\\xe9\\u20ac.encode(utf-8)']
    michael@x071:[/data/prj/python/git/python3-3.8]/data/prj/python/python3-3.8/python
    '-X' 'faulthandler' '-X' 'utf8=1' '-c' 'import locale, sys;
    print("%s:%s" % (locale.getpreferredencoding(), ascii(sys.argv[1:])))',
    'h\xe9\u20ac'.encode\('utf-8'\)
    UTF-8:['h\\xe9\\u20ac.encode(utf-8)']

    Really unclear to me what this test is trying to verify. The CLI seems
    to just 'echo' what it is provided.

    @aixtools
    Contributor Author

    Solution much simpler than I thought:

    not arg.decode('ascii', 'surrogateescape'), but arg.decode('iso-8859-1')
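
A sketch of how that suggestion could slot into the platform branches of the test. The helper name and signature are hypothetical, and the merged change itself (GH-8923) is not reproduced here:

```python
def expected_c_locale_arg(arg: bytes, platform: str, is_android: bool = False) -> str:
    """Expected sys.argv entry when the child runs with -X utf8=0 and LC_ALL=C."""
    if platform == 'darwin' or is_android:
        return arg.decode('utf-8')
    if platform.startswith('aix'):
        # The AIX log reports ISO8859-1 for the C locale, so bytes map 1:1.
        return arg.decode('iso-8859-1')
    return arg.decode('ascii', 'surrogateescape')

arg = 'h\xe9\u20ac'.encode('utf-8')
for platform in ('linux', 'aix7', 'darwin'):
    print(platform, ascii([expected_c_locale_arg(arg, platform)]))
```

For the AIX branch this yields the same text the child printed in the failing run, so the assertion would compare equal.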

    @michael-o
    Mannequin

    michael-o mannequin commented Aug 25, 2018

    This is a very thorough analysis. Kudos to that.

    @vstinner
    Member

    New changeset 7ef1697 by Victor Stinner (Michael Felt) in branch 'master':
    bpo-34347: Fix test_utf8_mode.test_cmd_line for AIX (GH-8923)
    7ef1697

    @michael-o
    Mannequin

    michael-o mannequin commented Aug 27, 2018

    Interestingly, the very same approach does not work for HP-UX, even if I swap the params for the HP-UX ones:

    $ ./python -m test test_utf8_mode
    Run tests sequentially
    0:00:00 [1/1] test_utf8_mode
    test test_utf8_mode failed -- Traceback (most recent call last):
      File "/var/osipovmi/cpython/Lib/test/test_utf8_mode.py", line 226, in test_cmd_line
        check('utf8=0', [c_arg], LC_ALL='C')
      File "/var/osipovmi/cpython/Lib/test/test_utf8_mode.py", line 217, in check
        self.assertEqual(args, ascii(expected), out)
    AssertionError: "['h\\xc3\\xa9\\xe2\\x82\\xac']" != "['h\\xfb\\u02cb\\xe3\\x82\\u02dc']"
    - ['h\xc3\xa9\xe2\x82\xac']
    + ['h\xfb\u02cb\xe3\x82\u02dc']
     : roman8:['h\xc3\xa9\xe2\x82\xac']
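
In this traceback the child printed the raw byte values as code points even though it reported roman8 as the preferred encoding. A sketch comparing the two decodings, using Python's hp-roman8 codec as a stand-in for the HP-UX locale encoding (an assumption):

```python
arg = 'h\xe9\u20ac'.encode('utf-8')

via_roman8 = arg.decode('hp-roman8')    # expectation built from the reported encoding
via_latin1 = arg.decode('iso-8859-1')   # byte values kept as code points

print(ascii([via_roman8]))
print(ascii([via_latin1]))
```

Only the second form matches what the child printed on HP-UX. This suggests that the reported encoding name and the decoding actually applied to argv differ there, though the thread does not confirm the cause.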

    @aixtools
    Contributor Author

    The buildbots seem happy. This may be closed.

    @vstinner
    Member

    > The buildbots seem happy. This may be closed.

    Cool, thank you for checking, and thanks for your fix! I am closing the issue.

    @aixtools
    Contributor Author

    Could this be backported to version 3.7?

    @vstinner
    Member

    New changeset 15e7d24 by Victor Stinner (Michael Felt) in branch '3.7':
    [3.7] bpo-34347: Fix test_utf8_mode.test_cmd_line for AIX (GH-8923) (GH-14233)
    15e7d24

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022