AIX: test_utf8_mode.test_cmd_line fails #78528
The test fails because byte_str.decode('ascii', 'surrogateescape') is not what ascii(byte_str) returns when called from the command line. Assumption: since check('utf8', [arg_utf8]) succeeds, I assume the parsing of the command line is correct.
DETAILS
>>> arg = 'h\xe9\u20ac'.encode('utf-8')
>>> arg
b'h\xc3\xa9\xe2\x82\xac'
>>> arg.decode('ascii', 'surrogateescape')
'h\udcc3\udca9\udce2\udc82\udcac'
I am having a difficult time getting the syntax correct for all the "escapes", so I added a print statement in the check routine:
test_cmd_line (test.test_utf8_mode.UTF8ModeTests) ...
code:import locale, sys; print("%s:%s" % (locale.getpreferredencoding(), ascii(sys.argv[1:]))) arg:b'h\xc3\xa9\xe2\x82\xac'
Test code with my debug statement (to generate the above):
def test_cmd_line(self):
    arg = 'h\xe9\u20ac'.encode('utf-8')
    arg_utf8 = arg.decode('utf-8')
    arg_ascii = arg.decode('ascii', 'surrogateescape')
    code = 'import locale, sys; print("%s:%s" % (locale.getpreferredencoding(), ascii(sys.argv[1:])))'
    def check(utf8_opt, expected, **kw):
        out = self.get_output('-X', utf8_opt, '-c', code, arg, **kw)
        print("\ncode:%s arg:%s\nout:%s" % (code, arg, out))
        args = out.partition(':')[2].rstrip()
        self.assertEqual(args, ascii(expected), out)
    check('utf8', [arg_utf8])
    if sys.platform == 'darwin' or support.is_android:
        c_arg = arg_utf8
    else:
        c_arg = arg_ascii
    check('utf8=0', [c_arg], LC_ALL='C')
So the first check succeeds: check('utf8', [arg_utf8])
But the second does not:
FAIL: test_cmd_line (test.test_utf8_mode.UTF8ModeTests)
Traceback (most recent call last):
File "/data/prj/python/src/python3-3.7.0/Lib/test/test_utf8_mode.py", line 225, in test_cmd_line
check('utf8=0', [c_arg], LC_ALL='C')
File "/data/prj/python/src/python3-3.7.0/Lib/test/test_utf8_mode.py", line 218, in check
self.assertEqual(args, ascii(expected), out)
AssertionError: "['h\\xc3\\xa9\\xe2\\x82\\xac']" != "['h\\udcc3\\udca9\\udce2\\udc82\\udcac']"
- ['h\xc3\xa9\xe2\x82\xac']
+ ['h\udcc3\udca9\udce2\udc82\udcac']
: ISO8859-1:['h\xc3\xa9\xe2\x82\xac']
I tried passing arg as the "expected" value, but arg is still a bytes object, while the command-line result is not (it is printed as a str):
AssertionError: "['h\\xc3\\xa9\\xe2\\x82\\xac']" != "[b'h\\xc3\\xa9\\xe2\\x82\\xac']"
|
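One possible reading of the two mismatches above, offered as a sketch rather than a diagnosis: ascii() does not decode anything, it only produces an escaped repr. Passing the raw bytes as "expected" therefore yields a b'...' repr, while the child process prints a list of str. The variable names below are illustrative and are not part of the test suite.

```python
# ascii() is repr() with non-ASCII characters escaped; it never decodes bytes.
arg = 'h\xe9\u20ac'.encode('utf-8')  # b'h\xc3\xa9\xe2\x82\xac'

# repr of a list holding a bytes object: keeps the b prefix
as_bytes = ascii([arg])

# every byte mapped to the code point with the same value (U+00XX)
as_latin1 = ascii([arg.decode('iso-8859-1')])

# bytes >= 0x80 mapped to lone surrogates (U+DCXX)
as_surrogates = ascii([arg.decode('ascii', 'surrogateescape')])

print(as_bytes)
print(as_latin1)
print(as_surrogates)
```

The three printed forms correspond to the three strings that appear in the assertion errors: the b'...' form, the \x form reported under ISO8859-1, and the \udc form the test expects.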
In short, I do not understand how this passes on Linux.
This is python3-3.4.6 on sles12:
>>> 'h\xe9\u20ac'.encode('utf-8')
b'h\xc3\xa9\xe2\x82\xac'
>>> ascii('h\xe9\u20ac'.encode('utf-8'))
"b'h\\xc3\\xa9\\xe2\\x82\\xac'"
>>> 'h\xe9\u20ac'.encode('utf-8').decode('us-ascii', 'surrogateescape')
'h\udcc3\udca9\udce2\udc82\udcac'
>>>
This is python3-3.7.0 on AIX:
>>> 'h\xe9\u20ac'.encode('utf-8')
b'h\xc3\xa9\xe2\x82\xac'
>>> ascii('h\xe9\u20ac'.encode('utf-8'))
"b'h\\xc3\\xa9\\xe2\\x82\\xac'"
>>> 'h\xe9\u20ac'.encode('utf-8').decode('us-ascii', 'surrogateescape')
'h\udcc3\udca9\udce2\udc82\udcac'
If I am missing something essential here, please be blunt! |
Also seeing the same with Windows.
C:\Users\MICHAELFelt>python
Python 3.7.0 (v3.7.0:1bf9cc5093, Jun 27 2018, 04:06:47) [MSC v.1914 32
bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> 'h\xe9\u20ac'.encode('utf-8')
b'h\xc3\xa9\xe2\x82\xac'
>>> ascii('h\xe9\u20ac'.encode('utf-8'))
"b'h\\xc3\\xa9\\xe2\\x82\\xac'"
>>> 'h\xe9\u20ac'.encode('utf-8').decode('ascii','surrogateescape')
'h\udcc3\udca9\udce2\udc82\udcac'
>>>
|
Come on, "experts" - feedback needed!
Original:
test test_utf8_mode failed -- Traceback (most recent call last):
File "/data/prj/python/git/python3-3.8/Lib/test/test_utf8_mode.py", line 225, in test_cmd_line
check('utf8=0', [c_arg], LC_ALL='C')
File "/data/prj/python/git/python3-3.8/Lib/test/test_utf8_mode.py", line 217, in check
self.assertEqual(args, ascii(expected), out)
AssertionError: "['h\\xc3\\xa9\\xe2\\x82\\xac']" != "['h\\udcc3\\udca9\\udce2\\udc82\\udcac']"
- ['h\xc3\xa9\xe2\x82\xac']
+ ['h\udcc3\udca9\udce2\udc82\udcac']
: ISO8859-1:['h\xc3\xa9\xe2\x82\xac']
Modification #1: Result:
Modification #2: AssertionError: "['h\\xc3\\xa9\\xe2\\x82\\xac']" != "[b'h\\xc3\\xa9\\xe2\\x82\\xac']"
The "expected" value continues to be a bytes object, while the CLI code returns a str. \udc is common (i.e., I see it frequently in googled results on other things). Should something in ascii() be changed to output \udc rather than \x? Thx! |
Starting this discussion again. Please take time to read. I have spent hours trying to understand what is failing. Please spend a few minutes reading. Sadly, there is a lot of text, but I do not know what I could leave out without damaging the process of discovery.
The failing result is:
AssertionError: "['h\\xc3\\xa9\\xe2\\x82\\xac']" != "['h\\udcc3\\udca9\\udce2\\udc82\\udcac']"
The test code is:
Question 1: why is Windows excluded? Because it does not use UTF-8 as its default (its default is CP1252).
Question 2: It seems that what the test is 'checking' is that object.encode('utf-8') gets decoded by ascii() based on the utf8_mode set.
+215 out = self.get_output('-X', utf8_opt, '-c', code, arg, **kw)
rewrites (less indent) as:
or
Finally, in Lib/test/support/script_helper.py we have
Which gives:
['/data/prj/python/python3-3.8/python', '-X', 'faulthandler', '-X', 'utf8', '-c', 'import locale, sys; print("%s:%s" % (locale.getpreferredencoding(), ascii(sys.argv[1:])))', b'h\xc3\xa9\xe2\x82\xac']
Above - utf8=1 - is successful.
['/data/prj/python/python3-3.8/python', '-X', 'faulthandler', '-X', 'utf8=0', '-c', 'import locale, sys; print("%s:%s" % (locale.getpreferredencoding(), ascii(sys.argv[1:])))', b'h\xc3\xa9\xe2\x82\xac']
Here: utf8=0 fails. The arg to the CLI is equal in both cases.
## Going back to check() and what it has:
+214 def check(utf8_opt, expected, **kw):
For utf8 mode true, it works:
+221 check('utf8', [arg_utf8])
But not for utf8=0.
## re: expected and ascii(expected)
When utf8=0 there is no difference in "arg1" passed to "code".
In short, when utf8=1 ascii(b'h\xc3\xa9\xe2\x82\xac') becomes ['h\xe9\u20ac' which is what is desired. But when utf8=0 ascii(b'h\xc3\xa9\xe2\x82\xac') is b'h\xc3\xa9\xe2\x82\xac' not 'h\udcc3\udca9\udce2\udc82\udcac'
Finally, when I run the command from the command line (after rewrites):
What passes: encoding is UTF-8, but the result of ascii(argv[1]) is the same as argv[1]
./python '-X' 'faulthandler' '-X' 'utf8=0' '-c' 'import locale, sys; print("%s:%s" % (locale.getpreferredencoding(), ascii(
ISO8859-1:['bh\\xc3\\xa9\\xe2\\x82\\xac']
Here, the only difference in the output is that the "UTF-8" has been changed to "ISO8859-1", i.e., I was expecting a difference in the result of ascii('bh\\xc3\\xa9\\xe2\\x82\\xac'). Instead, I see "bytes obj in", "bytes obj out" -- apparently unchanged.
HOWEVER, the result returned by get_output is always different, even if it is just limited to removing the 'b' quality.
Again: test result includes:
So, I feel the issue is not with the test, but within what happens after:
+127 proc = subprocess.Popen(cmd_line, stdin=subprocess.PIPE,
Specifically: here.
+130 with proc:
PASS:
FAIL:
Seems the 'b' quality disappears somehow with:
So, maybe it is in the test - in that line. However, this goes well beyond my comprehension of Python's internal workings. Hope this helps. Please comment. |
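A minimal sketch of why the 'b' quality is gone on the child's side, assuming a POSIX system where subprocess accepts bytes arguments: the bytes are handed to the operating system unchanged, and the child interpreter decodes them into str before sys.argv is built. The helper name child_argv_repr is hypothetical and only loosely mirrors what get_output does in the test.

```python
import subprocess
import sys


def child_argv_repr(arg: bytes, *interp_opts: str) -> str:
    """Run a child interpreter and return ascii(sys.argv[1:]) as the child sees it."""
    code = 'import sys; print(ascii(sys.argv[1:]))'
    cmd = [sys.executable, *interp_opts, '-c', code, arg]
    proc = subprocess.run(cmd, stdout=subprocess.PIPE, check=True)
    # ascii() guarantees the child's output is pure ASCII.
    return proc.stdout.decode('ascii').strip()


arg = 'h\xe9\u20ac'.encode('utf-8')

# With -X utf8 the child decodes argv as UTF-8 regardless of the locale.
utf8_mode_result = child_argv_repr(arg, '-X', 'utf8')
print(utf8_mode_result)
```

The printed value is a list of str, never bytes. Which str comes out when UTF-8 mode is off depends on the platform and locale, which is the part the test is sensitive to.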
I have no idea what's going on here yet but just wanted to report that we are seeing this issue on one FreeBSD buildbot, too: https://buildbot.python.org/all/#/builders/124/builds/508/steps/4/logs/stdio I can also reproduce on CentOS 7. Could this be related to LC_ALL= or related environment variables? |
I fixed bpo-34207. |
Your issue is about decoding command-line arguments, which is done in the main() function. That code doesn't use Python codecs, but functions like Py_DecodeLocale().
Windows uses wmain(), which gets command-line arguments as wchar_t* strings: Unicode. No decoding is needed. |
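As a rough illustration of the point above, the following sketch emulates locale decoding with Python codecs. This is an approximation only: the real decoding happens in C via Py_DecodeLocale(), and the function name emulate_argv_decoding and the list of encodings are assumptions made for the example. It shows how the same bytes turn into different str values depending on the encoding the C library reports.

```python
def emulate_argv_decoding(raw: bytes, locale_encoding: str) -> str:
    # Stand-in for the C-level decoding; undecodable bytes become
    # lone surrogates U+DC80..U+DCFF via the surrogateescape handler.
    return raw.decode(locale_encoding, 'surrogateescape')


raw = b'h\xc3\xa9\xe2\x82\xac'
results = {}
for enc in ('utf-8', 'ascii', 'iso-8859-1'):
    text = emulate_argv_decoding(raw, enc)
    # surrogateescape lets the original bytes be recovered exactly.
    assert text.encode(enc, 'surrogateescape') == raw
    results[enc] = ascii(text)
    print(enc, results[enc])
```

Under a 7-bit ASCII locale every high byte is escaped to \udcXX, while under an 8-bit encoding such as ISO8859-1 every byte is a valid character and nothing is escaped.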
On 23/08/2018 12:51, STINNER Victor wrote:
Situation A (original)
I tried saying the "expected" is arg, but arg is still a byte object, the cmd_line result is not (printed as such).
Situation B
After further digging - to understand why it was coming as "\x encoding rather than \udc" - I looked at what was happening here:
out = self.get_output('-X', utf8_opt, '-c', code, arg, **kw)
becomes
out = self.get_output('-X', utf8_opt, '-c', code, 'h\xe9\u20ac'.encode('utf-8'), **kw)
becomes
out = self.get_output('-X', utf8_opt, '-c', code, b'h\xc3\xa9\xe2\x82\xac', **kw)
And finally, at the CLI becomes:
/data/prj/python/python3-3.8/python '-X' 'faulthandler' '-X' 'utf8' '-c' 'import locale, sys; print("%s:%s" % (locale.getpreferredencoding(), ascii(sys.ar
/data/prj/python/python3-3.8/python '-X' 'faulthandler' '-X' 'utf8=0' '-c' 'import locale, sys; print("%s:%s" % (locale.getpreferredencoding(), ascii(sys.
Note:
/data/prj/python/python3-3.8/python '-X' 'faulthandler' '-X' 'utf8=0' '-c' 'import locale, sys; print("%s:%s" % (locale.getpreferredencoding(), ascii(sys.
root@x066:[/data/prj/python/python3-3.8]/data/prj/python/python3-3.8/python '-X' 'faulthandler' '-X' 'utf8' '-c' 'import locale, sys; print("%s:%s" % (>
Summary:
|
On 23/08/2018 19:14, Michael Felt wrote:
Really unclear to me what this test is trying to verify. The CLI seems
to just 'echo' what it is provided.
>>> Question 1: why is Windows excluded? Because it does not use UTF-8 as its default (its default is CP1252)
>> Windows uses wmain() which gets command line arguments as wchar_t* strings: Unicode. No decoding is needed.
|
Solution much simpler than I thought: not arg.decode('ascii', 'surrogateescape'), but arg.decode('iso-8859-1') |
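A sketch of how the expected value could be chosen per platform, based only on what is said in this thread. The helper expected_c_locale_arg is hypothetical, and the 'aix' prefix check is an assumption about how the platform would be detected; it is not presented as the actual patch.

```python
def expected_c_locale_arg(arg: bytes, platform: str, is_android: bool = False) -> str:
    """Return the str the child is expected to see for arg under LC_ALL=C, utf8=0."""
    if platform == 'darwin' or is_android:
        # The existing test already expects UTF-8 decoding on these platforms.
        return arg.decode('utf-8')
    if platform.startswith('aix'):
        # Assumption from this thread: the C locale on AIX reports ISO8859-1,
        # an 8-bit encoding in which every byte is a valid character.
        return arg.decode('iso-8859-1')
    # Default used by the existing test: 7-bit ASCII, high bytes escaped.
    return arg.decode('ascii', 'surrogateescape')


arg = 'h\xe9\u20ac'.encode('utf-8')
for name in ('linux', 'aix7', 'darwin'):
    print(name, ascii(expected_c_locale_arg(arg, name)))
```

In the test itself, the platform string would presumably come from sys.platform, and the result would be passed to check('utf8=0', [c_arg], LC_ALL='C').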
This is a very thorough analysis. Kudos for that. |
Interestingly, the very same approach does not work for HP-UX, even if I swap out the params for HP-UX:
$ ./python -m test test_utf8_mode
Run tests sequentially
0:00:00 [1/1] test_utf8_mode
test test_utf8_mode failed -- Traceback (most recent call last):
File "/var/osipovmi/cpython/Lib/test/test_utf8_mode.py", line 226, in test_cmd_line
check('utf8=0', [c_arg], LC_ALL='C')
File "/var/osipovmi/cpython/Lib/test/test_utf8_mode.py", line 217, in check
self.assertEqual(args, ascii(expected), out)
AssertionError: "['h\\xc3\\xa9\\xe2\\x82\\xac']" != "['h\\xfb\\u02cb\\xe3\\x82\\u02dc']"
- ['h\xc3\xa9\xe2\x82\xac']
+ ['h\xfb\u02cb\xe3\x82\u02dc']
: roman8:['h\xc3\xa9\xe2\x82\xac'] |
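The HP-UX failure above can be partly reproduced with plain codecs. This sketch assumes the swapped-in parameter was the roman8 encoding (Python's hp_roman8 codec), which is inferred from the "roman8" in the output and is not stated outright. It compares that decoding with the ISO8859-1 style result the child actually printed.

```python
arg = 'h\xe9\u20ac'.encode('utf-8')  # b'h\xc3\xa9\xe2\x82\xac'

# What the test would expect if argv were decoded with the reported encoding.
as_roman8 = arg.decode('hp_roman8')

# What the child printed: each byte mapped to the code point of the same value.
as_latin1 = arg.decode('iso-8859-1')

print(ascii(as_roman8))
print(ascii(as_latin1))
```

If this reading is right, the child on HP-UX reports roman8 as its preferred encoding but decodes argv byte for byte, so neither the surrogateescape expectation nor a roman8 expectation matches.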
The buildbots seem happy. This may be closed. |
Cool, thank you for checking, and thanks for your fix! I am closing the issue. |
Could this be backported to version 3.7? |