classification
Title: Many regtest failures on Windows with non-ASCII account name
Type: behavior Stage:
Components: Tests, Unicode, Windows Versions: Python 3.11, Python 3.10, Python 3.9
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: serhiy.storchaka Nosy List: eryksun, ezio.melotti, minghua, paul.moore, serhiy.storchaka, steve.dower, tim.golden, vstinner, zach.ware
Priority: normal Keywords:

Created on 2021-09-12 09:50 by minghua, last changed 2021-09-21 16:32 by steve.dower.

Messages (9)
msg401659 - (view) Author: Ming Hua (minghua) Date: 2021-09-12 09:50
Background:
Since at least Windows 8, it is possible to invoke the input method engine (IME) when installing Windows and creating accounts.  So at least among simplified Chinese users, it's not uncommon to have a Chinese account name.

Issue:
After successful installation using the 64-bit .exe installer for Windows, just to be paranoid (and to get familiar with Python's test framework), I decided to run the bundled regression tests.  To my surprise I got many failures.  The following is the summary of "python.exe -m test" with 3.8 some months ago (likely 3.8.6):

371 tests OK.

11 tests failed:
    test_cmd_line_script test_compileall test_distutils test_doctest
    test_locale test_mimetypes test_py_compile test_tabnanny
    test_urllib test_venv test_zipimport_support

43 tests skipped:
    test_asdl_parser test_check_c_globals test_clinic test_curses
    test_dbm_gnu test_dbm_ndbm test_devpoll test_epoll test_fcntl
    test_fork1 test_gdb test_grp test_ioctl test_kqueue
    test_multiprocessing_fork test_multiprocessing_forkserver test_nis
    test_openpty test_ossaudiodev test_pipes test_poll test_posix
    test_pty test_pwd test_readline test_resource test_smtpnet
    test_socketserver test_spwd test_syslog test_threadsignals
    test_timeout test_tix test_tk test_ttk_guionly test_urllib2net
    test_urllibnet test_wait3 test_wait4 test_winsound test_xmlrpc_net
    test_xxtestfuzz test_zipfile64

Total duration: 59 min 49 sec
Tests result: FAILURE

The failures all look similar, though: it seems Python on Windows assumes that the home directory of the user, "C:\Users\<username>\", is encoded in either ASCII or UTF-8, while it is actually in the Windows native code page, in my case cp936 for simplified Chinese (zh-CN).

To take a couple of examples (these are from recent testing with 3.10.0 rc2):

> python.exe -m test -W test_cmd_line_script
0:00:03 Run tests sequentially
0:00:03 [1/1] test_cmd_line_script
[...]
test_consistent_sys_path_for_direct_execution (test.test_cmd_line_script.CmdLineTest) ... ERROR
[...]
test_directory_error (test.test_cmd_line_script.CmdLineTest) ... FAIL
[...]
ERROR: test_consistent_sys_path_for_direct_execution (test.test_cmd_line_script.CmdLineTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "C:\Programs\Python\python310\lib\test\test_cmd_line_script.py", line 677, in test_consistent_sys_path_for_direct_execution
    out_by_name = kill_python(p).decode().splitlines()
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xbb in position 9: invalid start byte
[...]
FAIL: test_directory_error (test.test_cmd_line_script.CmdLineTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "C:\Programs\Python\python310\lib\test\test_cmd_line_script.py", line 268, in test_directory_error
    self._check_import_error(script_dir, msg)
  File "C:\Programs\Python\python310\lib\test\test_cmd_line_script.py", line 151, in _check_import_error
    self.assertIn(expected_msg.encode('utf-8'), err)
AssertionError: b"can't find '__main__' module in 'C:\\\\Users\\\\\xe5<5 bytes redacted>\\\\AppData\\\\Local\\\\Temp\\\\tmpcwkfn9ct'" not found in b"C:\\Programs\\Python\\python310\\python.exe: can't find '__main__' module in 'C:\\\\Users\\\\\xbb<3 bytes redacted>\\\\AppData\\\\Local\\\\Temp\\\\tmpcwkfn9ct'\r\n"
[...]
----------------------------------------------------------------------
Ran 44 tests in 29.769s

FAILED (failures=2, errors=5)
test test_cmd_line_script failed
test_cmd_line_script failed (5 errors, 2 failures) in 30.4 sec

== Tests result: FAILURE ==

In the above test_directory_error AssertionError message I redacted part of the path as my account name is my real name.  Hope the issue is clear enough despite the redaction, since the "\xe5<5 bytes redacted>" part is 6 bytes and apparently in UTF-8 (for two Chinese characters) and the "\xbb<3 bytes redacted>" part is 4 bytes and apparently in cp936.
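
A minimal sketch of the byte-length difference described above, assuming a hypothetical two-character name (the real account name is redacted, so "\u4e2d\u6587" is only a stand-in):

```python
# Hypothetical two-character Chinese name; NOT the reporter's real account name.
name = "\u4e2d\u6587"

utf8_bytes = name.encode("utf-8")    # 3 bytes per character for these code points
cp936_bytes = name.encode("cp936")   # 2 bytes per character in the legacy code page

print(len(utf8_bytes), utf8_bytes)
print(len(cp936_bytes), cp936_bytes)

# Bytes produced with cp936 are not valid UTF-8, which is the kind of
# UnicodeDecodeError seen in the tracebacks.
try:
    cp936_bytes.decode("utf-8")
    decoded_ok = True
except UnicodeDecodeError as exc:
    decoded_ok = False
    print("decode failed:", exc)

# A UTF-8 encoded expectation can never be found inside cp936 output,
# which mirrors the assertIn() failure.
found = utf8_bytes in (b"C:\\Users\\" + cp936_bytes + b"\\AppData")
print("expected bytes found:", found)
```

The exact byte values depend on the characters involved; only the 6-byte versus 4-byte pattern is meant to match the redacted output.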

Postscript:
As I've said above, I discovered this issue some time ago, but only now have time to report it.  I believe I've seen these failures in 3.8.2/6, 3.9.7, and 3.10.0 rc2.  It shouldn't be hard to reproduce for people who have a way to create an account with a non-ASCII name on Windows.  If reproducing turns out to be difficult though, I'm happy to provide more information and/or run more tests.
msg401663 - (view) Author: Eryk Sun (eryksun) * (Python triager) Date: 2021-09-12 11:30
In Windows, the standard I/O encoding of the spawn_python() child defaults to the process active code page, i.e. GetACP(). In Windows 10, the active code page can be set to UTF-8 at the system or application level, but most systems and applications still use a legacy code page. Python's default can be overridden to UTF-8 for standard I/O via PYTHONIOENCODING, or for all I/O via PYTHONUTF8 or "-X utf8=1". I would recommend using one of these UTF-8 options instead of trying to make a test work with the legacy code page. There is no guarantee, and should be no guarantee, that a filesystem path, which is Unicode, can be encoded using a legacy code page.
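
A rough sketch of the UTF-8 option applied to a child interpreter, using plain subprocess rather than the test suite's spawn_python() helper. The sample text and variable names are assumptions for illustration:

```python
import subprocess
import sys

# Hypothetical non-ASCII text standing in for part of a path.
text = "\u4e2d\u6587"

# %a produces an ASCII-only repr, so the command line itself carries no
# non-ASCII characters; only the child's stdout does.
code = "import sys; sys.stdout.write(%a)" % text

# "-X utf8" makes the child use UTF-8 for standard I/O regardless of the
# active code page, so the parent can decode the output as UTF-8.
proc = subprocess.run(
    [sys.executable, "-X", "utf8", "-c", code],
    capture_output=True,
    check=True,
)
decoded = proc.stdout.decode("utf-8")
print(decoded == text)
```

Without the "-X utf8" option, the bytes on stdout would follow the child's default I/O encoding, and decoding them as UTF-8 could fail or produce mojibake.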
msg401665 - (view) Author: Ming Hua (minghua) Date: 2021-09-12 11:57
Eryk Sun (eryksun) posted:
> Python's default can be overridden to UTF-8 for standard I/O via PYTHONIOENCODING, or for all I/O via PYTHONUTF8 or "-X utf8=1".

FWIW, I did test with "-X utf8" option and it wasn't any better.  Just tested "python.exe -X utf8=1 -m test -W test_cmd_line_script" with 3.10.0 rc2 again, and got 6 errors and 2 failures this way (1 more error than without "-X utf8=1").  There is also this new error message:

0:00:01 [1/1] test_cmd_line_script
Warning -- Uncaught thread exception: UnicodeDecodeError
Exception in thread Thread-60 (_readerthread):
Traceback (most recent call last):
  File "C:\Programs\Python\python310\lib\threading.py", line 1009, in _bootstrap_inner
    self.run()
  File "C:\Programs\Python\python310\lib\threading.py", line 946, in run
    self._target(*self._args, **self._kwargs)
  File "C:\Programs\Python\python310\lib\subprocess.py", line 1494, in _readerthread
    buffer.append(fh.read())
  File "C:\Programs\Python\python310\lib\codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xbb in position 69: invalid start byte
msg401667 - (view) Author: Eryk Sun (eryksun) * (Python triager) Date: 2021-09-12 12:32
> FWIW, I did test with "-X utf8" option

I was suggesting modifying the tests to use the UTF-8 mode option in the spawn_python() command line. It doesn't help to run the parent process in UTF-8 mode since the mode isn't inherited. It could be inherited via PYTHONUTF8, but it turns out that environment variables won't help in this case due to the use of the -E and -I command-line options.
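
A small sketch of that interaction, assuming plain subprocess instead of spawn_python(); the helper name child_utf8_mode is hypothetical. With -I, the value of PYTHONUTF8 makes no difference to the child, while "-X utf8" on the command line still takes effect:

```python
import os
import subprocess
import sys

CHILD_CODE = "import sys; print(sys.flags.utf8_mode)"

def child_utf8_mode(args, env_value=None):
    """Return sys.flags.utf8_mode as reported by a child started with `args`.

    Hypothetical helper for illustration; not part of the test suite.
    """
    env = dict(os.environ)
    env.pop("PYTHONUTF8", None)
    if env_value is not None:
        env["PYTHONUTF8"] = env_value
    proc = subprocess.run(
        [sys.executable, *args, "-c", CHILD_CODE],
        env=env,
        capture_output=True,
        check=True,
    )
    return int(proc.stdout.decode("ascii"))

# In isolated mode the environment variable is ignored, so both runs
# report the same mode, whatever the platform default happens to be.
isolated_env_on = child_utf8_mode(["-I"], "1")
isolated_env_off = child_utf8_mode(["-I"], "0")

# The command-line option is still honored in isolated mode.
isolated_flag = child_utf8_mode(["-I", "-X", "utf8"])

print(isolated_env_on, isolated_env_off, isolated_flag)
```

The absolute values of the first two results depend on the platform and Python version, which is why the sketch only compares them with each other.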
msg402250 - (view) Author: Steve Dower (steve.dower) * (Python committer) Date: 2021-09-20 15:57
I'd guess that these tests are assuming that sys.executable contains only ASCII characters. All the tests run in a non-ASCII working directory, so it's only the runtime that is not tested properly here.

The easiest way for Ming Hua to test this is to install for all users (into Program Files), and run tests with the same user account.

If that's the case, we probably have to just go through the tests and make them Unicode-aware.
msg402288 - (view) Author: Ming Hua (minghua) Date: 2021-09-21 06:30
Steve Dower (steve.dower) posted:
> I'd guess that these tests are assuming that sys.executable contains only ASCII characters. All the tests run in a non-ASCII working directory, so it's only the runtime that is not tested properly here.
> 
> The easiest way for Ming Hua to test this is to install for all users (into Program Files), and run tests with the same user account.

I've already installed for all users, just not into the default "C:\Program Files\", but instead "C:\Programs\Python\".  I don't think it's the executable's path that is problematic, but the temporary directory where the tests are run (%LOCALAPPDATA%\Temp\tmpcwkfn9ct, where %LOCALAPPDATA% is C:\Users\<account name>\AppData\Local and therefore contains non-ASCII characters).

Both of these paths are shown in the error/failure logs posted in the first message.

I doubt installing into "C:\Program Files\" would make a difference.
msg402305 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2021-09-21 09:53
Not only sys.executable. Sources of non-ASCII paths:

* sys.executable
* __file__ of the stdlib or test modules
* the current working directory
* the temporary directory

The last one is the most common in these failures.

Tests fail when a non-ASCII path is written to stdout or to a file using the default encoding (which differs from the filesystem encoding) and is then read back assuming one of:

* the ASCII encoding
* the UTF-8 encoding
* the filesystem encoding

Fixing tests is not enough, because it is often an issue of scripts which write paths to the stdout. This problem does not have a simple and general solution.
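
The write-then-read mismatch can be sketched with a temporary file. The path text is hypothetical, and cp936 is named explicitly here to stand in for a legacy default encoding:

```python
import os
import tempfile

# Hypothetical non-ASCII path; only the text matters, the directory need not exist.
path_text = "C:\\Users\\\u4e2d\u6587\\AppData\\Local\\Temp"

outcomes = {}
with tempfile.TemporaryDirectory() as tmp:
    log = os.path.join(tmp, "paths.txt")

    # The writer uses a legacy code page (explicit here for reproducibility).
    with open(log, "w", encoding="cp936") as f:
        f.write(path_text)

    # The reader assumes some other encoding and either fails or succeeds.
    for enc in ("ascii", "utf-8", "cp936"):
        try:
            with open(log, encoding=enc) as f:
                outcomes[enc] = f.read()
        except UnicodeDecodeError:
            outcomes[enc] = None

for enc, value in outcomes.items():
    print(enc, "->", "decode error" if value is None else "round-trips")
```

Only the reader that assumes the same encoding as the writer recovers the path; the others fail, which matches the pattern in the failing tests.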
msg402314 - (view) Author: Eryk Sun (eryksun) * (Python triager) Date: 2021-09-21 11:54
I see no problem with changing a test -- such as test_consistent_sys_path_for_direct_execution() -- to spawn the child interpreter with `-X utf8` when the I/O encoding itself is irrelevant to the test -- except for forcing a common Unicode encoding to ensure the integrity of test data (i.e. no mojibake) and prevent encoding/decoding failures.
msg402324 - (view) Author: Steve Dower (steve.dower) * (Python committer) Date: 2021-09-21 16:32
> I've already installed for all users, just not into the default "C:\Program Files\", but instead "C:\Programs\Python\"

Ah yes, that indeed rules out my first suspicion.

> Fixing tests is not enough, because it is often an issue of scripts which write paths to the stdout.

Sure, but the ones currently failing here are ours, so we are the ones who need to fix them :) And they all seem to be in our test suite.

Fixing the tests doesn't make all the problems go away, just the specific ones we are responsible for on this issue.
History
Date User Action Args
2021-09-21 16:32:42 steve.dower set messages: + msg402324
2021-09-21 11:54:47 eryksun set messages: + msg402314
2021-09-21 09:53:19 serhiy.storchaka set messages: + msg402305
2021-09-21 06:30:42 minghua set messages: + msg402288
2021-09-20 15:57:12 steve.dower set messages: + msg402250
2021-09-18 01:21:16 terry.reedy set nosy: + zach.ware, paul.moore, tim.golden, steve.dower; components: + Windows
2021-09-12 12:32:37 eryksun set messages: + msg401667
2021-09-12 11:57:48 minghua set messages: + msg401665
2021-09-12 11:30:27 eryksun set nosy: + eryksun; messages: + msg401663
2021-09-12 11:18:03 serhiy.storchaka set assignee: serhiy.storchaka
2021-09-12 10:21:49 serhiy.storchaka set nosy: + ezio.melotti, vstinner, serhiy.storchaka; type: behavior; components: + Unicode; versions: + Python 3.11, - Python 3.8
2021-09-12 09:50:10 minghua create