classification
Title: pip: open() uses the locale encoding to parse Python script, instead of the encoding cookie
Type: crash
Stage: resolved
Components: Unicode
Versions: Python 3.3
process
Status: closed
Resolution: not a bug
Dependencies:
Superseder:
Assigned To:
Nosy List: GreenKey, ezio.melotti, ncoghlan, vstinner
Priority: normal
Keywords:

Created on 2013-11-21 21:21 by GreenKey, last changed 2013-11-22 13:03 by ncoghlan. This issue is now closed.

Messages (4)
msg203673 - Author: Curtis Doty (GreenKey) Date: 2013-11-21 21:21
I first stumbled across this bug while attempting to install using pip's cool editable mode:

$ pip install -e git+git://github.com/appliedsec/pygeoip.git#egg=pygeoip
Obtaining pygeoip from git+git://github.com/appliedsec/pygeoip.git#egg=pygeoip
  Cloning git://github.com/appliedsec/pygeoip.git to ./src/pygeoip
  Running setup.py egg_info for package pygeoip
    Traceback (most recent call last):
      File "<string>", line 16, in <module>
      File "/home/curtis/python/3.3.3/lib/python3.3/encodings/ascii.py", line 26, in decode
        return codecs.ascii_decode(input, self.errors)[0]
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1098: ordinal not in range(128)
    Complete output from command python setup.py egg_info:
    Traceback (most recent call last):

  File "<string>", line 16, in <module>

  File "/home/curtis/python/3.3.3/lib/python3.3/encodings/ascii.py", line 26, in decode

    return codecs.ascii_decode(input, self.errors)[0]

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1098: ordinal not in range(128)

----------------------------------------
Cleaning up...
Command python setup.py egg_info failed with error code 1 in /home/curtis/python/2013-11-20/src/pygeoip
Storing complete log in /home/curtis/.pip/pip.log


It turns out this is related to a local LANG=C environment. If I set LANG=en_US.UTF-8, the problem goes away. But it seems pip/python3 open() should handle this more intelligently.

Worse, the file in this case, https://github.com/appliedsec/pygeoip/blob/master/setup.py, already has an encoding cookie *declaring* it as utf-8.
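
For reference, the utf-8 declaration at the top of a script can be read directly from the file's bytes with the standard library's tokenize.detect_encoding(). The following is only a minimal sketch: the sample bytes are made up for illustration and are not the actual pygeoip setup.py.

```python
import io
import tokenize

# Hypothetical source bytes with a coding declaration on the first line.
source = b"# -*- coding: utf-8 -*-\nauthor = 'Jos\xc3\xa9'\n"

# detect_encoding() takes a readline callable that returns bytes and
# inspects at most the first two lines for a BOM or a coding cookie.
encoding, first_lines = tokenize.detect_encoding(io.BytesIO(source).readline)

print(encoding)                 # encoding named by the cookie
print(source.decode(encoding))  # decodes cleanly with the declared encoding
```

A plain open() in text mode does not look at this declaration; it only applies when Python itself compiles the file, or when a tool asks for it explicitly.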

An ugly workaround patch is to force pip to always use UTF-8 when reading setup.py:

--- pip.orig/req.py	2013-11-19 15:53:49.000000000 -0800
+++ pip/req.py	2013-11-20 16:37:23.642656132 -0800
@@ -281,7 +281,7 @@ def replacement_run(self):
             writer(self, ep.name, os.path.join(self.egg_info,ep.name))
     self.find_sources()
 egg_info.egg_info.run = replacement_run
-exec(compile(open(__file__).read().replace('\\r\\n', '\\n'), __file__, 'exec'))
+exec(compile(open(__file__,encoding='utf_8').read().replace('\\r\\n', '\\n'), __file__, 'exec'))
 """
 
     def egg_info_data(self, filename):
@@ -687,7 +687,7 @@ exec(compile(open(__file__).read().repla
             ## FIXME: should we do --install-headers here too?
             call_subprocess(
                 [sys.executable, '-c',
-                 "import setuptools; __file__=%r; exec(compile(open(__file__).read().replace('\\r\\n', '\\n'), __file__, 'exec'))" % self.setup_py]
+                 "import setuptools; __file__=%r; exec(compile(open(__file__,encoding='utf_8').read().replace('\\r\\n', '\\n'), __file__, 'exec'))" % self.setup_py]
                 + list(global_options) + ['develop', '--no-deps'] + list(install_options),
 
                 cwd=self.source_dir, filter_stdout=self._filter_install,


But that only treats the symptom. The root cause appears to be in python3, as demonstrated by this simple script:

wrong-codec.py:
#!/usr/bin/env python3
from urllib.request import urlretrieve
urlretrieve('https://raw.github.com/appliedsec/pygeoip/master/setup.py', filename='setup.py')

# if LANG=C then locale.py:getpreferredencoding()->'ANSI_X3.4-1968'
foo = open('setup.py')

# bang! ascii_decode() cannot handle the non-ASCII bytes
bar = foo.read()


This does not occur in python2. Is this a bug in pip or python3?
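
The failure in the script above can probably be reproduced without network access or changing LANG. This is a rough sketch under two assumptions: the file contents are invented for illustration, and encoding="ascii" is passed explicitly to stand in for the default that open() would pick under LANG=C on Python 3.3.

```python
import locale
import os
import tempfile

# Hypothetical file contents: a UTF-8 encoded script with a non-ASCII character.
data = "# -*- coding: utf-8 -*-\nauthor = 'José'\n".encode("utf-8")

with tempfile.NamedTemporaryFile(suffix=".py", delete=False) as tmp:
    tmp.write(data)
    path = tmp.name

# The locale's preferred encoding; this is what the issue reports as
# 'ANSI_X3.4-1968' (ASCII) when LANG=C.
print(locale.getpreferredencoding(False))

try:
    # Explicit encoding="ascii" simulates the LANG=C default.
    with open(path, encoding="ascii") as f:
        f.read()
    failed = False
except UnicodeDecodeError as exc:
    failed = True
    print(exc)

os.remove(path)
```

The coding cookie on the first line makes no difference here, because open() in text mode treats the file as ordinary text and never inspects it.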
msg203679 - Author: STINNER Victor (vstinner) * (Python committer) Date: 2013-11-21 22:02
pip is not part of the Python standard library; you should report it upstream:
https://github.com/pypa/pip
msg203682 - Author: STINNER Victor (vstinner) * (Python committer) Date: 2013-11-21 22:07
-exec(compile(open(__file__).read().replace('\\r\\n', '\\n'), __file__, 'exec'))
+exec(compile(open(__file__,encoding='utf_8').read().replace('\\r\\n', '\\n'), __file__, 'exec'))

The fix is not correct: the script may use a different encoding.

Replace open() with tokenize.open(), available since Python 3.2.

.replace('\\r\\n', '\\n') is probably useless in Python 3 which uses universal newlines by default.
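
A minimal sketch of the tokenize.open() suggestion, assuming a made-up script saved as Latin-1 with a matching coding cookie (the file contents and variable names are illustrative only, not taken from pip or pygeoip). It also shows why hard-coding utf_8 would not work for such a file.

```python
import os
import tempfile
import tokenize

# Hypothetical script saved as Latin-1, with a matching coding cookie.
text = "# -*- coding: latin-1 -*-\nauthor = 'José'\n"
raw = text.encode("latin-1")

with tempfile.NamedTemporaryFile(suffix=".py", delete=False) as tmp:
    tmp.write(raw)
    path = tmp.name

# tokenize.open() opens the file read-only in text mode, using the encoding
# found by tokenize.detect_encoding() rather than the locale encoding.
with tokenize.open(path) as f:
    decoded = f.read()

os.remove(path)

# Forcing UTF-8 on the same bytes fails, since 0xe9 is not valid UTF-8 here.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

print(decoded == text, utf8_ok)
```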
msg203754 - Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2013-11-22 13:03
Upstream: https://github.com/pypa/pip/pull/816
History
Date                 User      Action  Args
2013-11-22 13:03:53  ncoghlan  set     status: open -> closed
                                       resolution: not a bug
                                       messages: + msg203754
                                       stage: resolved
2013-11-21 22:52:49  vstinner  set     title: open() fails to autodetect utf-8 if LANG=C -> pip: open() uses the locale encoding to parse Python script, instead of the encoding cookie
2013-11-21 22:07:40  vstinner  set     messages: + msg203682
2013-11-21 22:02:16  vstinner  set     nosy: + ncoghlan
                                       messages: + msg203679
2013-11-21 21:21:02  GreenKey  create