classification
Title: distutils._msvccompiler._get_vc_env() fails with UnicodeDecodeError if an env var is not encodable
Type: behavior Stage: resolved
Components: Distutils, Library (Lib), Unicode, Windows Versions: Python 3.6, Python 3.5
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: steve.dower Nosy List: dstufft, ebarry, eric.araujo, eryksun, ezio.melotti, martin.panter, paul.moore, python-dev, r.david.murray, steve.dower, tim.golden, vstinner, zach.ware
Priority: normal Keywords: patch

Created on 2016-05-17 20:09 by ebarry, last changed 2016-06-17 19:38 by eryksun. This issue is now closed.

Files
File name Uploaded Description Edit
subprocess_errors_simple_1.patch ebarry, 2016-05-17 20:09 review
subprocess_distutils_errors_1.patch ebarry, 2016-05-17 20:09 review
subprocess_errors_simple_2.patch ebarry, 2016-05-17 20:38 review
subprocess_distutils_errors_2.patch ebarry, 2016-05-17 20:38 review
subprocess_distutils_errors_3.patch ebarry, 2016-05-17 21:27 review
distutils_windows_non_ascii_1.patch ebarry, 2016-05-17 23:21 review
Messages (18)
msg265774 - (view) Author: Emanuel Barry (ebarry) * (Python triager) Date: 2016-05-17 20:09
I got a random UnicodeDecodeError while trying to install a module with distutils. Traced it back and it being my name having a non-ascii character floating around in my environment.

I'm including two patches: subprocess_errors_simple_1.patch simply adds 'errors="replace"' to the stdin, stdout and stderr, while subprocess_distutils_errors_1.patch fixes this a bit more thoroughly.

I haven't added any tests yet, I'll add them to the proper patch when a concensus on which one to use is reached.
msg265775 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2016-05-17 20:33
I would not think errors=replace would be the right answer...that would lose information, wouldn't it?  What is the data being used for that causes the problem?  Normally in situations like this we'd use surrogateescape at the system boundary, but I don't know enough about distutils to know if that is the right answer.
msg265776 - (view) Author: Emanuel Barry (ebarry) * (Python triager) Date: 2016-05-17 20:38
The data that causes the issue is my name (Émanuel), which is present in a couple of places in my environment. I also completely forgot about surrogateescape, my bad. Here are two new patches with the surrogateescape error handler instead of the replace one.
msg265777 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-05-17 21:17
Hi,

> I got a random UnicodeDecodeError while trying to install a module with distutils.

Please provide more context to your issue:

* What is your operating system?
* What is your locale? (locale encoding on UNIX, ANSI and OEM code pages on Windows)
* Where does the data come from?
* etc.

> subprocess_errors_simple_1.patch 

I don't think that it's ok to change the *default* error handler in the subprocess module.

Adding encoding/errors parameters to subprocess.Popen is an old topic: see for example the issue #6135.
msg265778 - (view) Author: Emanuel Barry (ebarry) * (Python triager) Date: 2016-05-17 21:27
Oops, I thought I had written that - must have accidentally deleted it. I'm on Windows 7, my locale is fr-CA / cp1252, and the non-ascii data (my name =) comes from my environment.

I discussed a bit with R. David Murray off-issue, and in the end it seems that it would be better to have subprocess accept parameters, but not do anything by default, and then have distutils pass these parameters.

Here is a new patch.
msg265781 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-05-17 22:30
Copy of https://gist.github.com/Vgr255/67fc0245056454d6a68608790a9d417b

E:\LGP_Test\LGP-0.1>py setup.py install
running install
running build
running build_ext
building 'lgp' extension
Traceback (most recent call last):
  File "setup.py", line 10, in <module>
    Extension("lgp", ["src/lgpmodule.c"]),
  File "E:\Python\3.5.0\lib\distutils\core.py", line 148, in setup
    dist.run_commands()
  File "E:\Python\3.5.0\lib\distutils\dist.py", line 955, in run_commands
    self.run_command(cmd)
  File "E:\Python\3.5.0\lib\distutils\dist.py", line 974, in run_command
    cmd_obj.run()
  File "E:\Python\3.5.0\lib\distutils\command\install.py", line 539, in run
    self.run_command('build')
  File "E:\Python\3.5.0\lib\distutils\cmd.py", line 313, in run_command
    self.distribution.run_command(command)
  File "E:\Python\3.5.0\lib\distutils\dist.py", line 974, in run_command
    cmd_obj.run()
  File "E:\Python\3.5.0\lib\distutils\command\build.py", line 135, in run
    self.run_command(cmd_name)
  File "E:\Python\3.5.0\lib\distutils\cmd.py", line 313, in run_command
    self.distribution.run_command(command)
  File "E:\Python\3.5.0\lib\distutils\dist.py", line 974, in run_command
    cmd_obj.run()
  File "E:\Python\3.5.0\lib\distutils\command\build_ext.py", line 338, in run
    self.build_extensions()
  File "E:\Python\3.5.0\lib\distutils\command\build_ext.py", line 447, in build_
extensions
    self._build_extensions_serial()
  File "E:\Python\3.5.0\lib\distutils\command\build_ext.py", line 472, in _build
_extensions_serial
    self.build_extension(ext)
  File "E:\Python\3.5.0\lib\distutils\command\build_ext.py", line 532, in build_
extension
    depends=ext.depends)
  File "E:\Python\3.5.0\lib\distutils\_msvccompiler.py", line 317, in compile
    self.initialize()
  File "E:\Python\3.5.0\lib\distutils\_msvccompiler.py", line 210, in initialize

    vc_env = _get_vc_env(plat_spec)
  File "E:\Python\3.5.0\lib\distutils\_msvccompiler.py", line 92, in _get_vc_env

    universal_newlines=True,
  File "E:\Python\3.5.0\lib\subprocess.py", line 629, in check_output
    **kwargs).stdout
  File "E:\Python\3.5.0\lib\subprocess.py", line 698, in run
    stdout, stderr = process.communicate(input, timeout=timeout)
  File "E:\Python\3.5.0\lib\subprocess.py", line 1055, in communicate
    stdout = self.stdout.read()
  File "E:\Python\3.5.0\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 49: chara
cter maps to <undefined>
msg265783 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-05-17 22:37
>  File "E:\Python\3.5.0\lib\distutils\_msvccompiler.py", line 92, in _get_vc_env
>    universal_newlines=True,

This function was added in Python 3.5 by the issue #23970: "Adds distutils._msvccompiler for new Visual Studio versions." (VS 2015, I guess). The module distutils._msvccompiler doesn't exist in Python 3.5.
msg265784 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-05-17 22:41
> The module distutils._msvccompiler doesn't exist in Python 3.5.

Oops, I mean: it *only* exists in Python 3.5 (and newer, it doesn't exist in Python 3.4).
msg265786 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-05-17 22:51
Python 3.4 has a similar function: query_vcvarsall() in Lib/distutils/msvc9compiler.py.
msg265788 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2016-05-17 23:07
IMO adding a Popen(errors=...) parameter would be a new feature for 3.6+ only. Also, the patch does not take the error handler into account when encoding and decoding multiple PIPEs in communicate().

I think it would be better to fix this bug in Lib/distutils/_msvccompiler.py only. The equivalent bug fix would look like:

out = subprocess.check_output(
    ...,
    stderr=subprocess.STDOUT,
    # No universal_newlines!
)
...
out = io.TextIOWrapper(io.BytesIO(out), errors="surrogateescape").read()

Or maybe a simpler version is sufficient: (I’m not familiar enough with the use case or Windows to say.)

out = out.decode("ascii", "surrogateescape")
msg265790 - (view) Author: Emanuel Barry (ebarry) * (Python triager) Date: 2016-05-17 23:21
You are right Martin, this also fixes it, and it seems much less controversial :) Here you go.
msg265791 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-05-17 23:27
Ok, I now understood the issue: the distutils._msvccompiler._get_vc_env() functions runs the "C:\Program Files (x86)\Microsoft visual studio 14.0\VC\vcvarsall.bat" program in a shell and then run the "set" command in the same shell to retrieve environment variables set by the batch script.

Problem: The "set" program encodes Unicode environment variables to bytes to write them into stdout, but at least one environment variable is not encodable. (It's unclear to me which encoding is used: ANSI code page? OEM code page? It doesn't matter much.)

Windows provides a nice Unicode API to retrieve environment varaibles. Using the "set" program is the wrong approach to retrieve environment varaibles.

I suggest to replace the "set" program with a script Python script which dumps all environment variables as UTF-8. The script would use os.environ which uses the Windows Unicode API to retrieve environment variables directly as Unicode.

--

Using an error handler is the wrong option here. The root problem is that the "set" program replaces unencodable characters with junk. Trying to replace or ignore junk doesn't solve the problem. The Visual Studio environment variables can contains unencodable characters, we should be able to retrieve them as Unicode without using a very limited encoding like cp1252 codepage or any OEM code page.

--

Another option is to still use the set program, but completly skip variables which contain undecodable bytes. I dislike this option :-/
msg266062 - (view) Author: Eryk Sun (eryksun) * (Python triager) Date: 2016-05-22 09:18
When writing to a file or pipe, cmd's internal commands default to either the current codepage of the attached console or, for a detached process, the system locale codepage. It uses a best-fit encoding that tries to map characters to similar characters in the codepage. For example:

    >>> win_unicode_console.enable()
    >>> os.environ['TEST©'] = '®'
    >>> out = subprocess.check_output('cmd /c set test')
    >>> out.decode(sys.__stdout__.encoding).strip()
    'TESTc=r'

You can force it to use Unicode (UTF-16LE) with the /u option:

    >>> out = subprocess.check_output('cmd /u /c set test')
    >>> out.decode('utf-16le').strip()
    'TEST©=®'

subprocess could use "/u /c" instead of "/c". This would only affect the output of internal shell commands such as "dir" and "set".
msg266071 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-05-22 12:18
2016-05-22 11:18 GMT+02:00 Eryk Sun <report@bugs.python.org>:
> subprocess could use "/u /c" instead of "/c". This would only affect the output of internal shell commands such as "dir" and "set".

Oh, I didn't know the interesting "cmd /u" command. It looks like a
short and working fix for the "set" issue.

As a side project, we can also evaluate using "cmd /u" by default when
shell=True is used with subprocess on Windows.
msg268732 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2016-06-17 16:33
New changeset 15900c612ca7 by Steve Dower in branch '3.5':
Issue #27048: Prevents distutils failing on Windows when environment variables contain non-ASCII characters
https://hg.python.org/cpython/rev/15900c612ca7

New changeset bb22ae1e1bcd by Steve Dower in branch 'default':
Issue #27048: Prevents distutils failing on Windows when environment variables contain non-ASCII characters
https://hg.python.org/cpython/rev/bb22ae1e1bcd
msg268733 - (view) Author: Steve Dower (steve.dower) * (Python committer) Date: 2016-06-17 16:36
I went ahead and updated the subprocess call just in _msvccompiler to use cmd /u, as I like that fix.

Not so keen on doing it for all subprocess(shell=True) calls, since we can't reliably predict whether the output will actually respect the option.
msg268734 - (view) Author: Steve Dower (steve.dower) * (Python committer) Date: 2016-06-17 16:37
Oh, and before anyone asks, I used "errors='replace'" because we get all the env variables but don't use most of them. If we do end up needing one that can't be decoded, this should make it obvious, but there's no point failing early because of a variable that we're not going to use.
msg268737 - (view) Author: Eryk Sun (eryksun) * (Python triager) Date: 2016-06-17 19:38
vcvarsall.bat is mostly `set` and `echo` statements, which print using UTF-16LE with "/u". It also runs reg.exe, but with stdout and stderr redirected to nul, so that's no problem. The final `set` command prints cmd's UTF-16LE environment. Using errors='replace' shouldn't be necessary. It doesn't hurt, however.
History
Date User Action Args
2016-06-17 19:38:41eryksunsetmessages: + msg268737
2016-06-17 16:37:47steve.dowersetmessages: + msg268734
2016-06-17 16:36:07steve.dowersetstatus: open -> closed
messages: + msg268733

assignee: steve.dower
resolution: fixed
stage: patch review -> resolved
2016-06-17 16:33:51python-devsetnosy: + python-dev
messages: + msg268732
2016-05-22 12:18:32vstinnersetmessages: + msg266071
2016-05-22 09:18:54eryksunsetnosy: + eryksun
messages: + msg266062
2016-05-17 23:27:45vstinnersetmessages: + msg265791
2016-05-17 23:21:38ebarrysetfiles: + distutils_windows_non_ascii_1.patch

messages: + msg265790
2016-05-17 23:20:30vstinnersettitle: Breakages in subprocess.Popen with non-ascii characters in environment -> distutils._msvccompiler._get_vc_env() fails with UnicodeDecodeError if an env var is not encodable
2016-05-17 23:07:38martin.pantersetnosy: + martin.panter
messages: + msg265788
2016-05-17 22:51:45vstinnersetmessages: + msg265786
2016-05-17 22:41:13vstinnersetmessages: + msg265784
2016-05-17 22:38:04vstinnersetnosy: + dstufft, paul.moore, tim.golden, eric.araujo, zach.ware, steve.dower

components: + Distutils, Windows
versions: + Python 3.5
2016-05-17 22:37:48vstinnersetmessages: + msg265783
2016-05-17 22:30:16vstinnersetmessages: + msg265781
2016-05-17 21:27:48ebarrysetfiles: + subprocess_distutils_errors_3.patch

messages: + msg265778
2016-05-17 21:17:13vstinnersetmessages: + msg265777
2016-05-17 20:38:34ebarrysetfiles: + subprocess_distutils_errors_2.patch
2016-05-17 20:38:24ebarrysetfiles: + subprocess_errors_simple_2.patch

messages: + msg265776
2016-05-17 20:33:28r.david.murraysetnosy: + r.david.murray
messages: + msg265775
2016-05-17 20:09:20ebarrysetfiles: + subprocess_distutils_errors_1.patch
2016-05-17 20:09:12ebarrycreate