test_utf8_mode.test_cmd_line() fails on HP-UX due to false assumptions #78584
Running from 3.7 branch on HP-UX 11.31 ia64, 32 bit, big endian.
> Traceback (most recent call last):
> File "/var/osipovmi/cpython/Lib/test/test_utf8_mode.py", line 230, in test_cmd_line
> check('utf8=0', [c_arg], LC_ALL='C')
> File "/var/osipovmi/cpython/Lib/test/test_utf8_mode.py", line 223, in check
> self.assertEqual(args, ascii(expected), out)
> AssertionError: "['h\\xc3\\xa9\\xe2\\x82\\xac']" != "['h\\udcc3\\udca9\\udce2\\udc82\\udcac']"
> - ['h\xc3\xa9\xe2\x82\xac']
> + ['h\udcc3\udca9\udce2\udc82\udcac']
> : roman8:['h\xc3\xa9\xe2\x82\xac']
>
>
I tried to understand the issue, but my Python knowledge is too limited. In particular, I do not understand why the byte array "arg = 'h\xe9\u20ac'.encode('utf-8')" is passed as one arg to the forked process. I strongly suspect that this is related to the non-standard default character encoding on HP-UX: https://en.wikipedia.org/wiki/HP_Roman#HP_Roman-8 (roman8), a stupid 8-bit encoding. The very same snippet on FreeBSD says:
Willing to test and modify if someone tells what to do. |
You might get more information asking questions on python-list. |
Thanks, I'll do that. Hopefully I can provide a patch for it. Though, I am convinced that I have to write a custom codec for roman8 to make all that stuff work flawlessly. |
Although the defaults differ (roman8 on HP-UX versus latin1 (iso8859-1) on AIX, much like cp1252 on Windows), this issue and bpo-33347 are related. As I mentioned in https://bugs.python.org/issue34347#msg323319, the string seen by self.get_output() is not the same string as "expected". If I recall, there may be a way to almost get the two to be the same, except that "expected" is a bytes object and the value returned as CLI output is a regular string. I am thinking, maybe the "easy" way will be to add AIX, HP-UX, and others to skip this test. Rather than hard-coding, do a query to see what the default is, and if it is not UTF-8, skip the test. In any case, it seems to be broken for any system that does not have UTF-8 as default. |
It might be as simple as what I saw for AIX:
diff --git a/Lib/test/test_utf8_mode.py b/Lib/test/test_utf8_mode.py
index 26e2e13ec5..3e918fd54c 100644
--- a/Lib/test/test_utf8_mode.py
+++ b/Lib/test/test_utf8_mode.py
@@ -219,6 +219,8 @@ class UTF8ModeTests(unittest.TestCase):
         check('utf8', [arg_utf8])
         if sys.platform == 'darwin' or support.is_android:
             c_arg = arg_utf8
+        elif sys.platform.startswith("aix"):
+            c_arg = arg.decode('iso-8859-1')
         else:
             c_arg = arg_ascii
         check('utf8=0', [c_arg], LC_ALL='C')
so, adding below might be all that is needed: |
As the AIX complaint is (was once the PR merges):
And the HP-UX complaint is:
File "/var/osipovmi/cpython/Lib/test/test_utf8_mode.py", line 223, in check
self.assertEqual(args, ascii(expected), out)
AssertionError: "['h\\xc3\\xa9\\xe2\\x82\\xac']" != "['h\\udcc3\\udca9\\udce2\\udc82\\udcac']"
- ['h\xc3\xa9\xe2\x82\xac']
+ ['h\udcc3\udca9\udce2\udc82\udcac']
Maybe a change such as:
--- a/Lib/test/test_utf8_mode.py
+++ b/Lib/test/test_utf8_mode.py
@@ -219,6 +219,9 @@ class UTF8ModeTests(unittest.TestCase):
         check('utf8', [arg_utf8])
         if sys.platform == 'darwin' or support.is_android:
             c_arg = arg_utf8
+        elif (platform.system() == "AIX" or
+              sys.platform.startswith("hp-ux")):
+            c_arg = arg.decode('iso-8859-1')
         else:
             c_arg = arg_ascii
         check('utf8=0', [c_arg], LC_ALL='C')
I mention this because it seems neither roman8 nor roman9 has an 'official' ISO name or alias (correct me if I am wrong). |
I think you are absolutely right.
You likely mean ASCII. Python assumes that LANG=C is ASCII which is not the case for AIX and HP-UX. Your patch looks reasonable, I will try this on Monday. The problem is that there is no roman8 codec in Python. Maybe ISO-8859-1 will do it for the test, but I am still eager to add one.
There are no ISO names because this is not an ISO encoding. This is an HP invention aka hp-roman8 (roman8, ibm-1051, r8, Cp1051). Edit: there is roman8 support: https://github.com/python/cpython/blob/e42b705188271da108de42b55d9344642170aa2b/Lib/encodings/hp_roman8.py as well as aliases. There are a few aliases missing: cp1051, ibm1051 and hp-roman8. This needs an additional PR. |
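To see which of the names mentioned above a given interpreter actually resolves, a minimal sketch along these lines may help. The helper name `resolve` and the candidate list are assumptions made for illustration; which aliases resolve depends on the Python version in use, so no particular output is claimed here.

```python
import codecs


def resolve(names):
    """Map each encoding name to its canonical codec name, or None if unknown."""
    resolved = {}
    for name in names:
        try:
            resolved[name] = codecs.lookup(name).name
        except LookupError:
            # This interpreter has no codec or alias registered under that name.
            resolved[name] = None
    return resolved


# Names taken from the comment above; results vary between Python versions.
candidates = ['hp_roman8', 'roman8', 'r8', 'hp-roman8', 'cp1051', 'ibm1051']
for name, codec in resolve(candidates).items():
    print(name, '->', codec)
```

A name that maps to None is one for which an alias would still have to be added.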
So I changed the test code to:
diff --git a/Lib/test/test_utf8_mode.py b/Lib/test/test_utf8_mode.py
index 26e2e13ec5..d9f8a3ed8b 100644
--- a/Lib/test/test_utf8_mode.py
+++ b/Lib/test/test_utf8_mode.py
@@ -208,7 +208,7 @@ class UTF8ModeTests(unittest.TestCase):
     def test_cmd_line(self):
         arg = 'h\xe9\u20ac'.encode('utf-8')
         arg_utf8 = arg.decode('utf-8')
-        arg_ascii = arg.decode('ascii', 'surrogateescape')
+        arg_ascii = arg.decode('roman8', 'surrogateescape')
         code = 'import locale, sys; print("%s:%s" % (locale.getpreferredencoding(), ascii(sys.argv[1:])))'
         def check(utf8_opt, expected, **kw):
and the output is:
Traceback (most recent call last):
File "/var/osipovmi/cpython/Lib/test/test_utf8_mode.py", line 224, in test_cmd_line
check('utf8=0', [c_arg], LC_ALL='C')
File "/var/osipovmi/cpython/Lib/test/test_utf8_mode.py", line 217, in check
self.assertEqual(args, ascii(expected), out)
AssertionError: "['h\\xc3\\xa9\\xe2\\x82\\xac']" != "['h\\xfb\\u02cb\\xe3\\x82\\u02dc']"
- ['h\xc3\xa9\xe2\x82\xac']
+ ['h\xfb\u02cb\xe3\x82\u02dc']
: roman8:['h\xc3\xa9\xe2\x82\xac']
I still don't understand that. I believe that surrogate escape only works for ASCII and nothing else. If so, this test must be skipped on HP-UX and AIX. |
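On the surrogateescape question: the error handler is not tied to ASCII, but it only takes effect for bytes the chosen codec cannot decode. The sketch below is an illustration, not part of any patch in this thread. It decodes the test's byte string with three codecs; the helper `has_surrogates` is a hypothetical name introduced here.

```python
# Bytes from the test: 'h', U+00E9 and U+20AC encoded as UTF-8.
arg = 'h\xe9\u20ac'.encode('utf-8')


def has_surrogates(text):
    """Return True if text contains escaped bytes (U+DC80..U+DCFF)."""
    return any(0xDC80 <= ord(ch) <= 0xDCFF for ch in text)


# ASCII cannot decode bytes >= 0x80, so each one is escaped as U+DC00 + byte.
as_ascii = arg.decode('ascii', 'surrogateescape')

# Latin-1 maps every byte to a code point, so the handler is never invoked.
as_latin1 = arg.decode('latin-1', 'surrogateescape')

# hp_roman8 is a charmap codec that has a mapping for these particular bytes,
# so nothing is escaped here either.
as_roman8 = arg.decode('hp_roman8', 'surrogateescape')

for name, text in [('ascii', as_ascii), ('latin-1', as_latin1),
                   ('hp_roman8', as_roman8)]:
    # ascii() is used because lone surrogates cannot be printed directly.
    print(name, ascii(text), has_surrogates(text))
```

Only the ASCII result contains \udcXX escapes, which matches the two different "expected" strings seen in the tracebacks above.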
Maybe skipping the test is the best thing:
 MS_WINDOWS = (sys.platform == 'win32')
-
+HPUX = (sys.platform.startswith('hp-ux'))
 class UTF8ModeTests(unittest.TestCase):
     DEFAULT_ENV = {
@@ -205,6 +205,7 @@ class UTF8ModeTests(unittest.TestCase):
         self.assertEqual(out, 'UTF-8 UTF-8')
+    @unittest.skipIf(HPUX, 'test specific to Unix with ASCII default locale') |
Maybe Victor Stinner has some insights here. |
On 27/08/2018 15:22, Michael Osipov wrote:
> Traceback (most recent call last):
> File "/var/osipovmi/cpython/Lib/test/test_utf8_mode.py", line 224, in test_cmd_line
> check('utf8=0', [c_arg], LC_ALL='C')
> File "/var/osipovmi/cpython/Lib/test/test_utf8_mode.py", line 217, in check
> self.assertEqual(args, ascii(expected), out)
> AssertionError: "['h\\xc3\\xa9\\xe2\\x82\\xac']" != "['h\\xfb\\u02cb\\xe3\\x82\\u02dc']"
> - ['h\xc3\xa9\xe2\x82\xac']
> + ['h\xfb\u02cb\xe3\x82\u02dc']
> : roman8:['h\xc3\xa9\xe2\x82\xac']
>
> I still don't understand that.
Something I found helpful was to change: check('utf8=0', [c_arg], LC_ALL='C') to
This also fails, but it shows what is being executed.
Further, my 'understanding' is that ascii(whatever) is much smarter than
And, while you might still consider it a 'bug', did you try using c_arg
Michael (F)
|
Wow, this is pretty surprising. The very same patch for AIX works on HP-UX flawlessly:
$ ./python -m test test_utf8_mode
Run tests sequentially
0:00:00 [1/1] test_utf8_mode
== Tests result: SUCCESS ==
1 test OK.
Total duration: 2 sec 769 ms
I still don't really understand why, because decode() and ascii() look like apples and oranges to me. Michael, since you provided a decent solution, would you mind extending your patch for HP-UX? You deserve the credit. |
Now I know why this cannot work with Roman 8: it contains chars which are multibyte in Unicode (UTF-8) and which cannot be mapped into a 7-bit/8-bit encoding. Therefore CP1252 does not work either, because it has such Unicode chars too. ISO-8859-1 solely consists of single-byte chars. This test needs to be skipped on HP-UX. I will provide a patch for that. |
Hi, I'm the author of the UTF-8 Mode PEP (PEP-540) and its implementation. I wrote test_utf8_mode. I wasn't sure that it was a good idea to hardcode the locale encoding depending on the platform. The fact that AIX and HP-UX use different locale encodings confirms that it was a bad choice. My PR 8967 gets the locale encoding at runtime instead of hardcoding it. It should fix the test on AIX and HP-UX. To fix the test on HP-UX, I also removed the euro sign (U+20AC: €) from the test string. There is no need to test a large code point: a single non-ASCII character is enough to validate the code. Michael Osipov: would you mind testing my PR on HP-UX, please? |
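The idea of querying the locale encoding at runtime rather than hardcoding it per platform can be sketched as follows. This is only an illustration under stated assumptions, not the code of PR 8967: the helper names `c_locale_encoding` and `expected_argv` are hypothetical, and the sketch assumes the name reported by the child is one that Python's codec registry knows.

```python
import os
import subprocess
import sys


def c_locale_encoding():
    """Ask a child interpreter which locale encoding it sees under LC_ALL=C."""
    # Disable C locale coercion and UTF-8 mode so the raw locale answer is seen.
    env = dict(os.environ, LC_ALL='C', PYTHONCOERCECLOCALE='0')
    env.pop('PYTHONUTF8', None)
    code = 'import locale; print(locale.getpreferredencoding(False))'
    proc = subprocess.run(
        [sys.executable, '-X', 'utf8=0', '-c', code],
        env=env, stdout=subprocess.PIPE, universal_newlines=True, check=True)
    return proc.stdout.strip()


def expected_argv(raw, encoding):
    """Decode raw argument bytes the way the child is assumed to decode them."""
    return raw.decode(encoding, 'surrogateescape')


encoding = c_locale_encoding()
# ascii() keeps the output printable even if the result contains surrogates.
print(encoding, ascii(expected_argv(b'h\xa7\xe9', encoding)))
```

As the rest of this thread shows, the announced encoding is not necessarily the one the C library uses in practice, so a value obtained this way is only a starting point.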
Victor, looking to... |
It unfortunately does not:
> osipovmi@blnn724x:/var/osipovmi/cpython []
> $ git branch
> 3.6
> 3.7
> bpo-14568
> bpo-34401
> bpo-34403
> bpo-34412
> bpo-34448
> bpo-34449
> bpo-34519
> master
> test_c_locale_coercion_hpux
> * utf8_cmd_line
> $ ./python -m test test_utf8_mode
> Run tests sequentially
> 0:00:00 [1/1] test_utf8_mode
> test test_utf8_mode failed -- Traceback (most recent call last):
> File "/var/osipovmi/cpython/Lib/test/test_utf8_mode.py", line 231, in test_cmd_line
> check('utf8=0', [c_arg], LC_ALL='C')
> File "/var/osipovmi/cpython/Lib/test/test_utf8_mode.py", line 218, in check
> self.assertEqual(args, ascii(expected), out)
> AssertionError: "['h\\xc2\\xa7\\xc3\\xa9']" != "['h\\xf4\\xcf\\xfb\\u02cb']"
> - ['h\xc2\xa7\xc3\xa9']
> + ['h\xf4\xcf\xfb\u02cb']
> : roman8:['h\xc2\xa7\xc3\xa9']
>
> test_utf8_mode failed
>
> == Tests result: FAILURE ==
>
> 1 test failed:
> test_utf8_mode
>
> Total duration: 2 sec 921 ms
> Tests result: FAILURE |
Running off: 217af1d
> $ ./python -m test test_utf8_mode
> Run tests sequentially
> 0:00:00 [1/1] test_utf8_mode
> test test_utf8_mode failed -- Traceback (most recent call last):
> File "/var/osipovmi/cpython/Lib/test/test_utf8_mode.py", line 235, in test_cmd_line
> LC_ALL='C')
> File "/var/osipovmi/cpython/Lib/test/test_utf8_mode.py", line 214, in check
> self.assertEqual(args, ascii(expected), out)
> AssertionError: "['h\\xa7\\xe9']" != "['h\\xcf\\xd5']"
> - ['h\xa7\xe9']
> + ['h\xcf\xd5']
> : roman8:['h\xa7\xe9']
>
> test_utf8_mode failed
>
> == Tests result: FAILURE ==
>
> 1 test failed:
> test_utf8_mode
>
> Total duration: 2 sec 997 ms
> Tests result: FAILURE |
Hum, it looks like a bug in the C library of HP-UX. It announces that the locale encoding is "roman8", but the mbstowcs() function decodes from the Latin1 encoding. The updated test uses the byte string b'h\xa7\xe9'. The OS announces the encoding roman8, so the test expects the Unicode string b'h\xa7\xe9'.decode('roman8') == 'h\xcf\xd5', but it gets 'h\xa7\xe9', which looks more like the byte string has been decoded from Latin1: b'h\xa7\xe9'.decode('latin1') == 'h\xa7\xe9'. Michael: would you mind compiling and running the attached c_locale.c test program? It sets the LC_ALL locale to C, dumps the locales (LC_ALL, LC_CTYPE, nl_langinfo(CODESET)), and then decodes all bytes from the locale encoding (LC_CTYPE). The output should help me to understand what the *effective* encoding of HP-UX is for the C locale. You may modify c_locale.c to replace "C" with "POSIX", to see if the behaviour is different. |
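The reasoning above, comparing what the child printed against what each candidate codec would have produced, can be reproduced with a short sketch. The helper `matching_encodings` is a hypothetical name used for illustration and is unrelated to the attached c_locale.c program.

```python
def matching_encodings(raw, observed, candidates):
    """Return the candidate encodings that decode raw to exactly observed."""
    matches = []
    for name in candidates:
        try:
            decoded = raw.decode(name, 'surrogateescape')
        except LookupError:
            continue  # codec not available in this Python build
        if decoded == observed:
            matches.append(name)
    return matches


raw = b'h\xa7\xe9'
candidates = ['ascii', 'latin-1', 'hp_roman8']

# String the child process actually printed in the failing run.
print(matching_encodings(raw, 'h\xa7\xe9', candidates))
# String the test expected, based on the announced "roman8" encoding.
print(matching_encodings(raw, 'h\xcf\xd5', candidates))
```

The first call matches only latin-1 and the second only hp_roman8, which is the mismatch described in the comment.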
Please see here:
If you think this is a bug, I can happily report this to HPE. |
...
Well, it confirms what I expected: nl_langinfo(CODESET) announces "roman8", but mbstowcs() uses the Latin1 encoding in practice. So I wrote PR 8969, which forces the ASCII encoding in that case. I'm not sure how test_utf8_mode is supposed to be fixed in that case. Michael: you can try to apply PR 8969, and then manually apply the PR 8967 patch: But I expect that with both patches, test_utf8_mode will still fail on test_cmd_line(). You can try to modify test_cmd_line() to force the encoding to "ascii". What are the values of sys.getfilesystemencoding() and locale.getpreferredencoding() with the C locale with PR 8969? I expect "roman8", which can cause issues in os.fsencode()/os.fsdecode(). Maybe Python should also force ASCII here? |
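The values asked for above can be gathered in one go with a small sketch like the following; the function names `encoding_report` and `fs_roundtrips` are assumptions made for illustration. Running it with LC_ALL=C and -X utf8=0 shows what the interpreter settled on for that locale.

```python
import locale
import os
import sys


def encoding_report():
    """Collect the encodings Python settled on for this process."""
    return {
        'filesystem': sys.getfilesystemencoding(),
        'fs_errors': sys.getfilesystemencodeerrors(),
        'preferred': locale.getpreferredencoding(False),
    }


def fs_roundtrips(raw):
    """Check whether raw bytes survive os.fsdecode() followed by os.fsencode()."""
    try:
        return os.fsencode(os.fsdecode(raw)) == raw
    except UnicodeError:
        # Raised when the filesystem error handler cannot escape the bytes.
        return False


print(encoding_report())
print(fs_roundtrips(b'h\xa7\xe9'))
```

A mismatch between the 'filesystem' and 'preferred' entries, or a failed round trip, would point at the kind of os.fsencode()/os.fsdecode() problem the comment anticipates.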
Here is the output to your questions:
> osipovmi@blnn724x:/var/osipovmi/cpython []
> $ git checkout hpux_force_ascii
> Branch 'hpux_force_ascii' set up to track remote branch 'hpux_force_ascii' from 'vstinner'.
> Switched to a new branch 'hpux_force_ascii'
> osipovmi@blnn724x:/var/osipovmi/cpython []
> $ git cherry-pick 217af1d38db3e1e875180c6fa160f0fc80e46003
> [hpux_force_ascii 7ce2927185] bpo-34403, bpo-34207: Fix test_utf8_mode.test_cmd_line()
> Author: Victor Stinner <vstinner@redhat.com>
> Date: Tue Aug 28 09:35:25 2018 +0200
> 1 file changed, 20 insertions(+), 11 deletions(-)
> osipovmi@blnn724x:/var/osipovmi/cpython []
> $ export CC=/opt/aCC/bin/cc ; \
> export CXX=/opt/aCC/bin/aCC ; \
> export LDFLAGS=-L/usr/local/lib/hpux32 ; \
> export UNIX_STD=1998 ; \
> ./configure --prefix=/var/osipovmi/python37-testing --without-gcc --with-system-expat --with-pydebug --with-openssl=/opt/openssl
> ...
> osipovmi@blnn724x:/var/osipovmi/cpython []
> $ gmake -j 8
> ...
> osipovmi@blnn724x:/var/osipovmi/cpython []
> $ ./python -m test test_utf8_mode
> Run tests sequentially
> 0:00:00 [1/1] test_utf8_mode
> test test_utf8_mode failed -- Traceback (most recent call last):
> File "/var/osipovmi/cpython/Lib/test/test_utf8_mode.py", line 235, in test_cmd_line
> LC_ALL='C')
> File "/var/osipovmi/cpython/Lib/test/test_utf8_mode.py", line 214, in check
> self.assertEqual(args, ascii(expected), out)
> AssertionError: "['h\\udca7\\udce9']" != "['h\\xcf\\xd5']"
> - ['h\udca7\udce9']
> + ['h\xcf\xd5']
> : roman8:['h\udca7\udce9']
>
> test_utf8_mode failed
>
> == Tests result: FAILURE ==
>
> 1 test failed:
> test_utf8_mode
>
> Total duration: 3 sec 58 ms
> Tests result: FAILURE
> osipovmi@blnn724x:/var/osipovmi/cpython []
> $ git diff
> diff --git a/Lib/test/test_utf8_mode.py b/Lib/test/test_utf8_mode.py
> index 5af35aed61..89c1f92615 100644
> --- a/Lib/test/test_utf8_mode.py
> +++ b/Lib/test/test_utf8_mode.py
> @@ -231,7 +231,7 @@ class UTF8ModeTests(unittest.TestCase):
>
> # Check that the command line is decoded from the locale encoding
> with self.subTest(encoding=encoding):
> - check('utf8=0', [arg.decode(encoding, 'surrogateescape')],
> + check('utf8=0', [arg.decode('ascii', 'surrogateescape')],
> LC_ALL='C')
>
> def test_optim_level(self):
> osipovmi@blnn724x:/var/osipovmi/cpython []
> $ ./python -m test test_utf8_mode
> Run tests sequentially
> 0:00:00 [1/1] test_utf8_mode
>
> == Tests result: SUCCESS ==
>
> 1 test OK.
>
> Total duration: 3 sec 65 ms
> Tests result: SUCCESS
>
> osipovmi@blnn724x:/var/osipovmi/cpython []
> $ LC_ALL=C ./python -X utf8=0
> Python 3.8.0a0 (heads/hpux_force_ascii:7ce2927185, Aug 28 2018, 12:43:04) [C] on hp-ux11
> Type "help", "copyright", "credits" or "license" for more information.
> >>> import locale ; import sys
> >>> sys.getfilesystemencoding() ; locale.getpreferredencoding()
> 'hp-roman8'
> 'roman8'
> >>>
> osipovmi@blnn724x:/var/osipovmi/cpython []
> $
I cannot give a qualified answer on
|
Good, it works. I updated my PR 8969 to properly implement my idea. With this PR, on HP-UX with the C or POSIX locale, Python now uses ASCII for its "filesystem encoding": sys.getfilesystemencoding() returns "ascii". Michael: can you please try my updated PR 8969?
You may also test with the current locale: ./python -m test -j0 -r
If everything is good on your side, I will merge my PR. |
No time to compile for a couple of days. Stress from others wins instead. Maybe on Friday.
|
Victor, this looks good to me:
The test_utf8_mode passes. Some other tests likely fail due to this Roman8 stuff: test_re and friends. I am analyzing the failures step by step and have already a few fixes around. Waiting for other PRs to be merged first. |
Can we backport this to 3.7 at least? |
On 28/08/2018 13:20, STINNER Victor wrote:
Seems to work well as far as AIX and test_utf8_mode (as you had already Attached is the output with LC_ALL=C in the prefix. If you were hoping Perhaps also noteworthy: root@x066:[/data/prj/python/git/python3-3.8]set | grep LC |
On 28/08/2018 20:43, Michael Felt wrote:
Previous mail ended with:
== Tests result: FAILURE ==
375 tests OK.
13 tests failed:
30 tests skipped:
Total duration: 14 min 53 sec
Without LC_ALL=C summary is (different):
== Tests result: FAILURE ==
376 tests OK.
10 tests failed:
32 tests skipped:
Total duration: 11 min 1 sec
And, rather than dangling processes, I see BrokenBarrierErrors
FYI |
My policy is to focus on the master branch to support a new platform, then add a buildbot and find a core developer to maintain this platform. See PEP-11 for details. I would prefer to see the full test suite passing before discussing which changes should or should not be backported. I would also prefer to first see a more general discussion about who is going to support HP-UX. IMHO HP-UX is not officially supported today. My list of supported platforms:
Since test_utf8_mode now passes on HP-UX, I am closing the issue. Please open more specific issues for other failures. You might open a meta issue to track all HP-UX issues. |
Please close, issue fixed. Thank you very much. |
You're welcome ;-) |
On 28/08/2018 23:14, STINNER Victor wrote:
So I was not the one asking. IMHO - as the PEP was new, if I understood However, like you - my goal is to get the tests passing on master, and
|
Oh, I didn't notice that you two have the same first name :-)
My position hasn't changed since my last comment; it is the same for AIX and HP-UX: msg324289. I also updated my website to write down this policy: By the way, please don't comment on issues that are closed. |
Victor, looks good to me: 0:00:26 [ 23/419/3] test_utf8_mode passed. I don't know whether it is related, but test_unicode crash dumps here: Current thread 0x00000001 (most recent call first): Is that related to your PEP? |
Please open a new issue to track this bug. |