Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

distutils._msvccompiler._get_vc_env() fails with UnicodeDecodeError if an env var is not encodable #71235

Closed
Vgr255 mannequin opened this issue May 17, 2016 · 18 comments
Closed
Assignees
Labels
OS-windows stdlib Python modules in the Lib dir topic-unicode type-bug An unexpected behavior, bug, or error

Comments

@Vgr255
Copy link
Mannequin

Vgr255 mannequin commented May 17, 2016

BPO 27048
Nosy @pfmoore, @vstinner, @tjguk, @ezio-melotti, @merwok, @bitdancer, @vadmium, @zware, @eryksun, @zooba, @dstufft, @Vgr255
Files
  • subprocess_errors_simple_1.patch
  • subprocess_distutils_errors_1.patch
  • subprocess_errors_simple_2.patch
  • subprocess_distutils_errors_2.patch
  • subprocess_distutils_errors_3.patch
  • distutils_windows_non_ascii_1.patch
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = 'https://github.com/zooba'
    closed_at = <Date 2016-06-17.16:36:07.104>
    created_at = <Date 2016-05-17.20:09:12.783>
    labels = ['type-bug', 'library', 'expert-unicode', 'OS-windows']
    title = 'distutils._msvccompiler._get_vc_env() fails with UnicodeDecodeError if an env var is not encodable'
    updated_at = <Date 2016-06-17.19:38:41.701>
    user = 'https://github.com/Vgr255'

    bugs.python.org fields:

    activity = <Date 2016-06-17.19:38:41.701>
    actor = 'eryksun'
    assignee = 'steve.dower'
    closed = True
    closed_date = <Date 2016-06-17.16:36:07.104>
    closer = 'steve.dower'
    components = ['Distutils', 'Library (Lib)', 'Unicode', 'Windows']
    creation = <Date 2016-05-17.20:09:12.783>
    creator = 'abarry'
    dependencies = []
    files = ['42880', '42881', '42882', '42883', '42884', '42885']
    hgrepos = []
    issue_num = 27048
    keywords = ['patch']
    message_count = 18.0
    messages = ['265774', '265775', '265776', '265777', '265778', '265781', '265783', '265784', '265786', '265788', '265790', '265791', '266062', '266071', '268732', '268733', '268734', '268737']
    nosy_count = 13.0
    nosy_names = ['paul.moore', 'vstinner', 'tim.golden', 'ezio.melotti', 'eric.araujo', 'r.david.murray', 'python-dev', 'martin.panter', 'zach.ware', 'eryksun', 'steve.dower', 'dstufft', 'abarry']
    pr_nums = []
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue27048'
    versions = ['Python 3.5', 'Python 3.6']

    @Vgr255
    Copy link
    Mannequin Author

    Vgr255 mannequin commented May 17, 2016

    I got a random UnicodeDecodeError while trying to install a module with distutils. Traced it back and it being my name having a non-ascii character floating around in my environment.

    I'm including two patches: subprocess_errors_simple_1.patch simply adds 'errors="replace"' to the stdin, stdout and stderr, while subprocess_distutils_errors_1.patch fixes this a bit more thoroughly.

    I haven't added any tests yet, I'll add them to the proper patch when a concensus on which one to use is reached.

    @Vgr255 Vgr255 mannequin added stdlib Python modules in the Lib dir topic-unicode type-bug An unexpected behavior, bug, or error labels May 17, 2016
    @bitdancer
    Copy link
    Member

    I would not think errors=replace would be the right answer...that would lose information, wouldn't it? What is the data being used for that causes the problem? Normally in situations like this we'd use surrogateescape at the system boundary, but I don't know enough about distutils to know if that is the right answer.

    @Vgr255
    Copy link
    Mannequin Author

    Vgr255 mannequin commented May 17, 2016

    The data that causes the issue is my name (Émanuel), which is present in a couple of places in my environment. I also completely forgot about surrogateescape, my bad. Here are two new patches with the surrogateescape error handler instead of the replace one.

    @vstinner
    Copy link
    Member

    Hi,

    I got a random UnicodeDecodeError while trying to install a module with distutils.

    Please provide more context to your issue:

    • What is your operating system?
    • What is your locale? (locale encoding on UNIX, ANSI and OEM code pages on Windows)
    • Where does the data come from?
    • etc.

    subprocess_errors_simple_1.patch

    I don't think that it's ok to change the *default* error handler in the subprocess module.

    Adding encoding/errors parameters to subprocess.Popen is an old topic: see for example the issue bpo-6135.

    @Vgr255
    Copy link
    Mannequin Author

    Vgr255 mannequin commented May 17, 2016

    Oops, I thought I had written that - must have accidentally deleted it. I'm on Windows 7, my locale is fr-CA / cp1252, and the non-ascii data (my name =) comes from my environment.

    I discussed a bit with R. David Murray off-issue, and in the end it seems that it would be better to have subprocess accept parameters, but not do anything by default, and then have distutils pass these parameters.

    Here is a new patch.

    @vstinner
    Copy link
    Member

    Copy of https://gist.github.com/Vgr255/67fc0245056454d6a68608790a9d417b

    E:\LGP_Test\LGP-0.1>py setup.py install
    running install
    running build
    running build_ext
    building 'lgp' extension
    Traceback (most recent call last):
      File "setup.py", line 10, in <module>
        Extension("lgp", ["src/lgpmodule.c"]),
      File "E:\Python\3.5.0\lib\distutils\core.py", line 148, in setup
        dist.run_commands()
      File "E:\Python\3.5.0\lib\distutils\dist.py", line 955, in run_commands
        self.run_command(cmd)
      File "E:\Python\3.5.0\lib\distutils\dist.py", line 974, in run_command
        cmd_obj.run()
      File "E:\Python\3.5.0\lib\distutils\command\install.py", line 539, in run
        self.run_command('build')
      File "E:\Python\3.5.0\lib\distutils\cmd.py", line 313, in run_command
        self.distribution.run_command(command)
      File "E:\Python\3.5.0\lib\distutils\dist.py", line 974, in run_command
        cmd_obj.run()
      File "E:\Python\3.5.0\lib\distutils\command\build.py", line 135, in run
        self.run_command(cmd_name)
      File "E:\Python\3.5.0\lib\distutils\cmd.py", line 313, in run_command
        self.distribution.run_command(command)
      File "E:\Python\3.5.0\lib\distutils\dist.py", line 974, in run_command
        cmd_obj.run()
      File "E:\Python\3.5.0\lib\distutils\command\build_ext.py", line 338, in run
        self.build_extensions()
      File "E:\Python\3.5.0\lib\distutils\command\build_ext.py", line 447, in build_
    extensions
        self._build_extensions_serial()
      File "E:\Python\3.5.0\lib\distutils\command\build_ext.py", line 472, in _build
    _extensions_serial
        self.build_extension(ext)
      File "E:\Python\3.5.0\lib\distutils\command\build_ext.py", line 532, in build_
    extension
        depends=ext.depends)
      File "E:\Python\3.5.0\lib\distutils\_msvccompiler.py", line 317, in compile
        self.initialize()
      File "E:\Python\3.5.0\lib\distutils\_msvccompiler.py", line 210, in initialize
        vc_env = _get_vc_env(plat_spec)
      File "E:\Python\3.5.0\lib\distutils\_msvccompiler.py", line 92, in _get_vc_env
    
        universal_newlines=True,
      File "E:\Python\3.5.0\lib\subprocess.py", line 629, in check_output
        **kwargs).stdout
      File "E:\Python\3.5.0\lib\subprocess.py", line 698, in run
        stdout, stderr = process.communicate(input, timeout=timeout)
      File "E:\Python\3.5.0\lib\subprocess.py", line 1055, in communicate
        stdout = self.stdout.read()
      File "E:\Python\3.5.0\lib\encodings\cp1252.py", line 23, in decode
        return codecs.charmap_decode(input,self.errors,decoding_table)[0]
    UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 49: chara
    cter maps to <undefined>

    @vstinner
    Copy link
    Member

    File "E:\Python\3.5.0\lib\distutils\_msvccompiler.py", line 92, in _get_vc_env
    universal_newlines=True,

    This function was added in Python 3.5 by the issue bpo-23970: "Adds distutils._msvccompiler for new Visual Studio versions." (VS 2015, I guess). The module distutils._msvccompiler doesn't exist in Python 3.5.

    @vstinner vstinner added stdlib Python modules in the Lib dir OS-windows labels May 17, 2016
    @vstinner
    Copy link
    Member

    The module distutils._msvccompiler doesn't exist in Python 3.5.

    Oops, I mean: it *only* exists in Python 3.5 (and newer, it doesn't exist in Python 3.4).

    @vstinner
    Copy link
    Member

    Python 3.4 has a similar function: query_vcvarsall() in Lib/distutils/msvc9compiler.py.

    @vadmium
    Copy link
    Member

    vadmium commented May 17, 2016

    IMO adding a Popen(errors=...) parameter would be a new feature for 3.6+ only. Also, the patch does not take the error handler into account when encoding and decoding multiple PIPEs in communicate().

    I think it would be better to fix this bug in Lib/distutils/_msvccompiler.py only. The equivalent bug fix would look like:

    out = subprocess.check_output(
        ...,
        stderr=subprocess.STDOUT,
        # No universal_newlines!
    )
    ...
    out = io.TextIOWrapper(io.BytesIO(out), errors="surrogateescape").read()

    Or maybe a simpler version is sufficient: (I’m not familiar enough with the use case or Windows to say.)

    out = out.decode("ascii", "surrogateescape")

    @vstinner vstinner changed the title Breakages in subprocess.Popen with non-ascii characters in environment distutils._msvccompiler._get_vc_env() fails with UnicodeDecodeError if an env var is not encodable May 17, 2016
    @Vgr255
    Copy link
    Mannequin Author

    Vgr255 mannequin commented May 17, 2016

    You are right Martin, this also fixes it, and it seems much less controversial :) Here you go.

    @vstinner
    Copy link
    Member

    Ok, I now understood the issue: the distutils._msvccompiler._get_vc_env() functions runs the "C:\Program Files (x86)\Microsoft visual studio 14.0\VC\vcvarsall.bat" program in a shell and then run the "set" command in the same shell to retrieve environment variables set by the batch script.

    Problem: The "set" program encodes Unicode environment variables to bytes to write them into stdout, but at least one environment variable is not encodable. (It's unclear to me which encoding is used: ANSI code page? OEM code page? It doesn't matter much.)

    Windows provides a nice Unicode API to retrieve environment varaibles. Using the "set" program is the wrong approach to retrieve environment varaibles.

    I suggest to replace the "set" program with a script Python script which dumps all environment variables as UTF-8. The script would use os.environ which uses the Windows Unicode API to retrieve environment variables directly as Unicode.

    --

    Using an error handler is the wrong option here. The root problem is that the "set" program replaces unencodable characters with junk. Trying to replace or ignore junk doesn't solve the problem. The Visual Studio environment variables can contains unencodable characters, we should be able to retrieve them as Unicode without using a very limited encoding like cp1252 codepage or any OEM code page.

    --

    Another option is to still use the set program, but completly skip variables which contain undecodable bytes. I dislike this option :-/

    @eryksun
    Copy link
    Contributor

    eryksun commented May 22, 2016

    When writing to a file or pipe, cmd's internal commands default to either the current codepage of the attached console or, for a detached process, the system locale codepage. It uses a best-fit encoding that tries to map characters to similar characters in the codepage. For example:

        >>> win_unicode_console.enable()
        >>> os.environ['TEST©'] = '®'
        >>> out = subprocess.check_output('cmd /c set test')
        >>> out.decode(sys.__stdout__.encoding).strip()
        'TESTc=r'

    You can force it to use Unicode (UTF-16LE) with the /u option:

        >>> out = subprocess.check_output('cmd /u /c set test')
        >>> out.decode('utf-16le').strip()
        'TEST©=®'

    subprocess could use "/u /c" instead of "/c". This would only affect the output of internal shell commands such as "dir" and "set".

    @vstinner
    Copy link
    Member

    2016-05-22 11:18 GMT+02:00 Eryk Sun <report@bugs.python.org>:

    subprocess could use "/u /c" instead of "/c". This would only affect the output of internal shell commands such as "dir" and "set".

    Oh, I didn't know the interesting "cmd /u" command. It looks like a
    short and working fix for the "set" issue.

    As a side project, we can also evaluate using "cmd /u" by default when
    shell=True is used with subprocess on Windows.

    @python-dev
    Copy link
    Mannequin

    python-dev mannequin commented Jun 17, 2016

    New changeset 15900c612ca7 by Steve Dower in branch '3.5':
    Issue bpo-27048: Prevents distutils failing on Windows when environment variables contain non-ASCII characters
    https://hg.python.org/cpython/rev/15900c612ca7

    New changeset bb22ae1e1bcd by Steve Dower in branch 'default':
    Issue bpo-27048: Prevents distutils failing on Windows when environment variables contain non-ASCII characters
    https://hg.python.org/cpython/rev/bb22ae1e1bcd

    @zooba
    Copy link
    Member

    zooba commented Jun 17, 2016

    I went ahead and updated the subprocess call just in _msvccompiler to use cmd /u, as I like that fix.

    Not so keen on doing it for all subprocess(shell=True) calls, since we can't reliably predict whether the output will actually respect the option.

    @zooba zooba closed this as completed Jun 17, 2016
    @zooba zooba self-assigned this Jun 17, 2016
    @zooba
    Copy link
    Member

    zooba commented Jun 17, 2016

    Oh, and before anyone asks, I used "errors='replace'" because we get all the env variables but don't use most of them. If we do end up needing one that can't be decoded, this should make it obvious, but there's no point failing early because of a variable that we're not going to use.

    @eryksun
    Copy link
    Contributor

    eryksun commented Jun 17, 2016

    vcvarsall.bat is mostly set and echo statements, which print using UTF-16LE with "/u". It also runs reg.exe, but with stdout and stderr redirected to nul, so that's no problem. The final set command prints cmd's UTF-16LE environment. Using errors='replace' shouldn't be necessary. It doesn't hurt, however.

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    OS-windows stdlib Python modules in the Lib dir topic-unicode type-bug An unexpected behavior, bug, or error
    Projects
    None yet
    Development

    No branches or pull requests

    5 participants