New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
distutils._msvccompiler._get_vc_env() fails with UnicodeDecodeError if an env var is not encodable #71235
Comments
I got a random UnicodeDecodeError while trying to install a module with distutils. Traced it back and it being my name having a non-ascii character floating around in my environment. I'm including two patches: subprocess_errors_simple_1.patch simply adds 'errors="replace"' to the stdin, stdout and stderr, while subprocess_distutils_errors_1.patch fixes this a bit more thoroughly. I haven't added any tests yet, I'll add them to the proper patch when a concensus on which one to use is reached. |
I would not think errors=replace would be the right answer...that would lose information, wouldn't it? What is the data being used for that causes the problem? Normally in situations like this we'd use surrogateescape at the system boundary, but I don't know enough about distutils to know if that is the right answer. |
The data that causes the issue is my name (Émanuel), which is present in a couple of places in my environment. I also completely forgot about surrogateescape, my bad. Here are two new patches with the surrogateescape error handler instead of the replace one. |
Hi,
Please provide more context to your issue:
I don't think that it's ok to change the *default* error handler in the subprocess module. Adding encoding/errors parameters to subprocess.Popen is an old topic: see for example the issue bpo-6135. |
Oops, I thought I had written that - must have accidentally deleted it. I'm on Windows 7, my locale is fr-CA / cp1252, and the non-ascii data (my name =) comes from my environment. I discussed a bit with R. David Murray off-issue, and in the end it seems that it would be better to have subprocess accept parameters, but not do anything by default, and then have distutils pass these parameters. Here is a new patch. |
Copy of https://gist.github.com/Vgr255/67fc0245056454d6a68608790a9d417b E:\LGP_Test\LGP-0.1>py setup.py install
running install
running build
running build_ext
building 'lgp' extension
Traceback (most recent call last):
File "setup.py", line 10, in <module>
Extension("lgp", ["src/lgpmodule.c"]),
File "E:\Python\3.5.0\lib\distutils\core.py", line 148, in setup
dist.run_commands()
File "E:\Python\3.5.0\lib\distutils\dist.py", line 955, in run_commands
self.run_command(cmd)
File "E:\Python\3.5.0\lib\distutils\dist.py", line 974, in run_command
cmd_obj.run()
File "E:\Python\3.5.0\lib\distutils\command\install.py", line 539, in run
self.run_command('build')
File "E:\Python\3.5.0\lib\distutils\cmd.py", line 313, in run_command
self.distribution.run_command(command)
File "E:\Python\3.5.0\lib\distutils\dist.py", line 974, in run_command
cmd_obj.run()
File "E:\Python\3.5.0\lib\distutils\command\build.py", line 135, in run
self.run_command(cmd_name)
File "E:\Python\3.5.0\lib\distutils\cmd.py", line 313, in run_command
self.distribution.run_command(command)
File "E:\Python\3.5.0\lib\distutils\dist.py", line 974, in run_command
cmd_obj.run()
File "E:\Python\3.5.0\lib\distutils\command\build_ext.py", line 338, in run
self.build_extensions()
File "E:\Python\3.5.0\lib\distutils\command\build_ext.py", line 447, in build_
extensions
self._build_extensions_serial()
File "E:\Python\3.5.0\lib\distutils\command\build_ext.py", line 472, in _build
_extensions_serial
self.build_extension(ext)
File "E:\Python\3.5.0\lib\distutils\command\build_ext.py", line 532, in build_
extension
depends=ext.depends)
File "E:\Python\3.5.0\lib\distutils\_msvccompiler.py", line 317, in compile
self.initialize()
File "E:\Python\3.5.0\lib\distutils\_msvccompiler.py", line 210, in initialize vc_env = _get_vc_env(plat_spec)
File "E:\Python\3.5.0\lib\distutils\_msvccompiler.py", line 92, in _get_vc_env
universal_newlines=True,
File "E:\Python\3.5.0\lib\subprocess.py", line 629, in check_output
**kwargs).stdout
File "E:\Python\3.5.0\lib\subprocess.py", line 698, in run
stdout, stderr = process.communicate(input, timeout=timeout)
File "E:\Python\3.5.0\lib\subprocess.py", line 1055, in communicate
stdout = self.stdout.read()
File "E:\Python\3.5.0\lib\encodings\cp1252.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 49: chara
cter maps to <undefined> |
This function was added in Python 3.5 by the issue bpo-23970: "Adds distutils._msvccompiler for new Visual Studio versions." (VS 2015, I guess). The module distutils._msvccompiler doesn't exist in Python 3.5. |
Oops, I mean: it *only* exists in Python 3.5 (and newer, it doesn't exist in Python 3.4). |
Python 3.4 has a similar function: query_vcvarsall() in Lib/distutils/msvc9compiler.py. |
IMO adding a Popen(errors=...) parameter would be a new feature for 3.6+ only. Also, the patch does not take the error handler into account when encoding and decoding multiple PIPEs in communicate(). I think it would be better to fix this bug in Lib/distutils/_msvccompiler.py only. The equivalent bug fix would look like: out = subprocess.check_output(
...,
stderr=subprocess.STDOUT,
# No universal_newlines!
)
...
out = io.TextIOWrapper(io.BytesIO(out), errors="surrogateescape").read() Or maybe a simpler version is sufficient: (I’m not familiar enough with the use case or Windows to say.) out = out.decode("ascii", "surrogateescape") |
You are right Martin, this also fixes it, and it seems much less controversial :) Here you go. |
Ok, I now understood the issue: the distutils._msvccompiler._get_vc_env() functions runs the "C:\Program Files (x86)\Microsoft visual studio 14.0\VC\vcvarsall.bat" program in a shell and then run the "set" command in the same shell to retrieve environment variables set by the batch script. Problem: The "set" program encodes Unicode environment variables to bytes to write them into stdout, but at least one environment variable is not encodable. (It's unclear to me which encoding is used: ANSI code page? OEM code page? It doesn't matter much.) Windows provides a nice Unicode API to retrieve environment varaibles. Using the "set" program is the wrong approach to retrieve environment varaibles. I suggest to replace the "set" program with a script Python script which dumps all environment variables as UTF-8. The script would use os.environ which uses the Windows Unicode API to retrieve environment variables directly as Unicode. -- Using an error handler is the wrong option here. The root problem is that the "set" program replaces unencodable characters with junk. Trying to replace or ignore junk doesn't solve the problem. The Visual Studio environment variables can contains unencodable characters, we should be able to retrieve them as Unicode without using a very limited encoding like cp1252 codepage or any OEM code page. -- Another option is to still use the set program, but completly skip variables which contain undecodable bytes. I dislike this option :-/ |
When writing to a file or pipe, cmd's internal commands default to either the current codepage of the attached console or, for a detached process, the system locale codepage. It uses a best-fit encoding that tries to map characters to similar characters in the codepage. For example: >>> win_unicode_console.enable()
>>> os.environ['TEST©'] = '®'
>>> out = subprocess.check_output('cmd /c set test')
>>> out.decode(sys.__stdout__.encoding).strip()
'TESTc=r' You can force it to use Unicode (UTF-16LE) with the /u option: >>> out = subprocess.check_output('cmd /u /c set test')
>>> out.decode('utf-16le').strip()
'TEST©=®' subprocess could use "/u /c" instead of "/c". This would only affect the output of internal shell commands such as "dir" and "set". |
2016-05-22 11:18 GMT+02:00 Eryk Sun <report@bugs.python.org>:
Oh, I didn't know the interesting "cmd /u" command. It looks like a As a side project, we can also evaluate using "cmd /u" by default when |
New changeset 15900c612ca7 by Steve Dower in branch '3.5': New changeset bb22ae1e1bcd by Steve Dower in branch 'default': |
I went ahead and updated the subprocess call just in _msvccompiler to use cmd /u, as I like that fix. Not so keen on doing it for all subprocess(shell=True) calls, since we can't reliably predict whether the output will actually respect the option. |
Oh, and before anyone asks, I used "errors='replace'" because we get all the env variables but don't use most of them. If we do end up needing one that can't be decoded, this should make it obvious, but there's no point failing early because of a variable that we're not going to use. |
vcvarsall.bat is mostly |
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
Show more details
GitHub fields:
bugs.python.org fields:
The text was updated successfully, but these errors were encountered: