classification
Title: distutils command build_scripts fails with UnicodeDecodeError
Type: behavior Stage: resolved
Components: Distutils, Distutils2 Versions: Python 3.1, Python 3.2, Python 3.3, Python 2.7, 3rd party
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: tarek Nosy List: Arfrever, alexis, benjamin.peterson, eric.araujo, georg.brandl, hagen, haypo, lemburg, mgorny, python-dev, tarek
Priority: release blocker Keywords: patch

Created on 2010-11-14 20:32 by hagen, last changed 2011-05-19 13:18 by python-dev. This issue is now closed.

Files
File name Uploaded Description
surrogateescape.patch hagen, 2010-11-14 20:32 use surrogateescape for reading and writing script files
build_scripts-binary_mode.patch Arfrever, 2011-04-28 14:29 Use binary mode for reading and writing script files
Messages (20)
msg121207 - (view) Author: Hagen Fürstenau (hagen) Date: 2010-11-14 20:32
As suggested in issue 9561, I'm creating a new bug for the encoding problem in build_scripts: If a script file can't be decoded with the (locale-dependent) standard encoding, then "build_scripts" fails with UnicodeDecodeError. Reproducible e.g. with LANG=C and a script file containing non-ASCII characters near the beginning (so that they're read on a single readline()).

Attaching a patch that uses "surrogateescape", as proposed for issue 6011.
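
For illustration, a minimal sketch of the failure and of the "surrogateescape" round trip. The script contents are made up, and this is not the attached patch; it only shows that undecodable bytes survive a decode/encode cycle unchanged when the error handler is used on both sides.

```python
# Hypothetical script contents: a shebang line followed by a Latin-1 encoded
# comment. The byte 0xE9 is not valid ASCII.
raw = b"#!python\n# caf\xe9\nprint('hello')\n"

# Strict ASCII decoding (what LANG=C implies) fails on the byte 0xE9.
strict_failed = False
try:
    raw.decode("ascii")
except UnicodeDecodeError:
    strict_failed = True

# With surrogateescape, each undecodable byte is mapped to a lone surrogate
# code point (0xE9 becomes U+DCE9) and restored when encoding again.
text = raw.decode("ascii", errors="surrogateescape")
round_tripped = text.encode("ascii", errors="surrogateescape")
```

The same handler has to be used for both reading and writing; encoding the decoded text with the strict handler would raise UnicodeEncodeError on the surrogate.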
msg134630 - (view) Author: Éric Araujo (eric.araujo) * (Python committer) Date: 2011-04-27 23:38
I’m not sure how I feel about using surrogateescape.  The distutils source is very similar across 2.7, 3.1, 3.2 and default, especially after the Great Revert and freeze last year to restore buggy-but-known behavior while the distutils2 project was created and allowed to fix things and break stuff.  Haypo added a fix using surrogateescape in 3.2, so it couldn’t be backported to all stable branches.  You may say that at least it was fixed in one version, which is something good.  I don’t know if I’d prefer to apply the patch (if a test is provided) or to raise an exception instead of silently changing behavior.
msg134661 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2011-04-28 08:20
I think this patch should be applied to all 3.x versions, since
all of them are affected by the same problem: reading a file with
unknown encoding, adding a shebang and writing it back again.

Python shouldn't really care about the script file's encoding and
since the "surrogateescape" error handler is the only way to
more or less cleanly get around the problem, I'm +1 on adding the
patch to the 3.x series.

I don't think this is needed for 2.7, since Python 2.x's open()
doesn't care about the file encoding anyway.
msg134678 - (view) Author: Arfrever Frehtes Taifersar Arahesis (Arfrever) * Date: 2011-04-28 14:29
Alternatively it's possible to use binary mode. I'm attaching the patch, which shows this possibility.
msg134680 - (view) Author: Éric Araujo (eric.araujo) * (Python committer) Date: 2011-04-28 14:48
Was the patch tested in 2.7 only?  I think the first_line_re needs to be changed to bytes too.  (3.x would have disallowed mixing bytes and str for a regex.)
msg134681 - (view) Author: Arfrever Frehtes Taifersar Arahesis (Arfrever) * Date: 2011-04-28 14:52
Which patch do you mean?
(My patch already changes first_line_re to bytes. My patch was tested only with 3.2. Lib/distutils/command/build_scripts.py is currently identical in 3.1, 3.2 and 3.3.)
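
A rough sketch of the binary-mode idea, assuming a bytes version of the shebang regex; adjust_shebang and its parameters are hypothetical names for this example, not the code from build_scripts-binary_mode.patch. Because everything stays bytes, the script's encoding never matters.

```python
import io
import re

# Bytes pattern modeled on distutils' first_line_re (assumption: same regex,
# compiled from bytes instead of str).
first_line_re = re.compile(rb'^#!.*python[0-9.]*([ \t].*)?$')


def adjust_shebang(src, dst, executable):
    """Copy a script from src to dst (binary file objects), rewriting a
    Python shebang line to point at `executable` (bytes)."""
    first_line = src.readline()
    match = first_line_re.match(first_line)
    if match:
        # Keep interpreter options such as " -u" that follow the program name.
        post_interp = match.group(1) or b""
        dst.write(b"#!" + executable + post_interp + b"\n")
    else:
        dst.write(first_line)
    # The rest of the file is copied byte for byte, whatever its encoding.
    dst.write(src.read())


# Usage with in-memory files and a made-up interpreter path.
source = io.BytesIO(b"#!/usr/bin/env python -u\n# caf\xe9\n")
target = io.BytesIO()
adjust_shebang(source, target, b"/opt/py/bin/python3")
```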
msg134773 - (view) Author: Éric Araujo (eric.araujo) * (Python committer) Date: 2011-04-29 15:19
Indeed, I missed those two lines.
msg134894 - (view) Author: Arfrever Frehtes Taifersar Arahesis (Arfrever) * Date: 2011-04-30 23:43
Apparently setuptools.command.easy_install.get_script_header() imports distutils.command.build_scripts.first_line_re and checks if this regex matches a str object, which results in TypeError. If breaking compatibility is not acceptable, then the surrogateescape patch should be applied.
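
The incompatibility can be reproduced in isolation. This sketch uses a stand-in bytes pattern rather than importing anything from distutils or setuptools:

```python
import re

# Stand-in for a bytes version of first_line_re (assumption: same regex text).
bytes_first_line_re = re.compile(rb'^#!.*python[0-9.]*([ \t].*)?$')

error = None
try:
    # Third-party code that still passes a str to the regex...
    bytes_first_line_re.match("#!/usr/bin/python\n")
except TypeError as exc:
    # ...fails on Python 3, which refuses to mix bytes patterns and str input.
    error = exc
```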
msg134934 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2011-05-01 22:21
Hey, I already had this bug and I also wrote a patch: copy_script-2.patch attached to #6011. It is very similar to build_scripts-binary_mode.patch (read the file in binary mode to avoid the encode/decode dance). But it also checks that the path to the Python program is decodable from UTF-8 and from the script encoding.
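
A sketch of what such a check could look like, assuming the script encoding is taken from its coding cookie via tokenize.detect_encoding(); check_shebang is a hypothetical helper written for this example, not code copied from copy_script-2.patch.

```python
import io
import tokenize


def check_shebang(shebang, script_bytes):
    """Raise ValueError if the shebang line (bytes) cannot be decoded from
    UTF-8 or from the encoding declared by the script itself."""
    # detect_encoding() reads at most two lines and honours a coding cookie
    # or BOM; it falls back to UTF-8 when neither is present.
    encoding, _ = tokenize.detect_encoding(io.BytesIO(script_bytes).readline)
    for codec in ("utf-8", encoding):
        try:
            shebang.decode(codec)
        except UnicodeDecodeError:
            raise ValueError(
                "The shebang ({!r}) is not decodable from {}".format(shebang, codec))
    return encoding


# Made-up Latin-1 script with a coding cookie on the first line.
script = b"# -*- coding: latin-1 -*-\nprint('caf\xe9')\n"
detected = check_shebang(b"#!/usr/bin/python3\n", script)
```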

Éric Araujo doesn't want to apply copy_script-2.patch on Python 3 before distutils2 is ported to Python 3 and included in Python (3.3): read msg124648. Five months later: distutils2 is not yet included in Python 3, the patch is not committed yet, and we have now a duplicate issue (and 3 patches for a single bug) :-)

This situation sucks. How can we move forward? What is the status of distutils2? Is it ported to Python3? Is it ready for an inclusion into Python3?

When distutils2 is part of Python 3.3, should we fix distutils bugs or not? I suppose that few people use Python 3.3, maybe because it will not be released before August 2012 (PEP 398) :-) So users will continue to have this bug until everybody moves to 3.3 (or later)...

I think that we should fix this bug today. I don't really care about distutils2 today because it is not yet part of Python.
msg134936 - (view) Author: Éric Araujo (eric.araujo) * (Python committer) Date: 2011-05-01 22:27
> Apparently setuptools.command.easy_install.get_script_header() imports
> distutils.command.build_scripts.first_line_re and checks if this regex
> matches a str object, which results in TypeError. If breaking
> compatibility is not acceptable, then the surrogateescape patch should
> be applied.

Setuptools is not compatible with 3.x TTBOMK; distribute is, but could
be fixed quickly, so there is no compat problem with this (these)
library(ries).  However, the public/private status of first_line_re is
unclear, so there could be other projects out there depending on its
type.  Given that there is already one patch in distutils that uses
surrogateescape, I think we could accept another similar patch.
msg134937 - (view) Author: Éric Araujo (eric.araujo) * (Python committer) Date: 2011-05-01 22:35
> the patch is not committed yet,
> and we have now a duplicate issue (and 3 patches for a single bug) :-)

Feel free to close duplicate issues.

Looks like you’re not following PyCon reports, or Tarek’s mails to
python-dev.  distutils2 has been ported to 3.3 under the name
“packaging”; there is a repo on bitbucket (tarek/cpython) with this
code.  Tarek will produce a patch from this repo and push it to the main
repository soon.

Yes: we’ll fix bugs in packaging and distutils.  Packaging releases will
be backported for 2.4-3.2 under the name “distutils2”.
msg134971 - (view) Author: Arfrever Frehtes Taifersar Arahesis (Arfrever) * Date: 2011-05-02 13:51
copy_script-2.patch uses os.fsencode(), which doesn't exist in Python 3.1.
msg134972 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2011-05-02 13:53
> copy_script-2.patch uses os.fsencode(), which doesn't exist in Python 3.1.

Correct. With Python 3.1, you can use filename.encode(sys.getfilesystemencoding(), 'surrogateescape'). But you must use os.fsencode() with Python >= 3.2, because on Windows you cannot use surrogateescape with MBCS (you should use the strict error handler).
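
A small compatibility sketch of that advice; encode_filename is a hypothetical helper name, and the hasattr() test is just one way to pick between the two spellings.

```python
import os
import sys


def encode_filename(name):
    """Encode a file name (str) to bytes for writing into a shebang line."""
    if hasattr(os, "fsencode"):
        # Python >= 3.2: os.fsencode() chooses the appropriate error handler
        # for the platform.
        return os.fsencode(name)
    # Python 3.1 fallback: os.fsencode() does not exist yet.
    return name.encode(sys.getfilesystemencoding(), "surrogateescape")


encoded = encode_filename("python3")
```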
msg135374 - (view) Author: Arfrever Frehtes Taifersar Arahesis (Arfrever) * Date: 2011-05-06 22:15
Please commit any patch before releases of Python 3.1.4 and 3.2.1. (3.2.1 rc1 is planned on 2011-05-14.)
msg135749 - (view) Author: Roundup Robot (python-dev) Date: 2011-05-10 22:15
New changeset 6ad356525381 by Victor Stinner in branch 'default':
Close #10419, issue #6011: build_scripts command of distutils handles correctly
http://hg.python.org/cpython/rev/6ad356525381
msg135752 - (view) Author: Roundup Robot (python-dev) Date: 2011-05-10 22:32
New changeset 47236a0cfb15 by Victor Stinner in branch '3.2':
Close #10419, issue #6011: build_scripts command of distutils handles correctly
http://hg.python.org/cpython/rev/47236a0cfb15
msg135754 - (view) Author: Roundup Robot (python-dev) Date: 2011-05-10 22:59
New changeset fd7d4639dae2 by Victor Stinner in branch '3.1':
Issue #10419: Fix build_scripts command of distutils to handle correctly
http://hg.python.org/cpython/rev/fd7d4639dae2
msg135756 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2011-05-10 23:06
Issue fixed in Python 3.1, 3.2, 3.3.

Thanks to Arfrever, I realized that this issue concerns not only the compilation of Python itself with a non-ASCII prefix (issue #6011), but also the installation of any Python script containing a non-ASCII character. So I also fixed it in Python 3.1. I replaced os.fsencode(name) with name.encode(sys.getfilesystemencoding(), 'surrogateescape') in 3.1.
msg135786 - (view) Author: Arfrever Frehtes Taifersar Arahesis (Arfrever) * Date: 2011-05-11 17:10
I have committed the fix for Distribute:
https://bitbucket.org/tarek/distribute/changeset/97f12f8f6bf1

(However, Distribute would fail to create entry point scripts if sys.executable contained unencodable characters.)
msg136289 - (view) Author: Roundup Robot (python-dev) Date: 2011-05-19 13:18
New changeset cc5cfeaa4a8d by Victor Stinner in branch 'default':
Issue #10419, issue #6011: port 6ad356525381 fix from distutils to packaging
http://hg.python.org/cpython/rev/cc5cfeaa4a8d
History
Date                 User         Action  Args
2011-05-19 13:18:46  python-dev   set     messages: + msg136289
2011-05-11 17:10:22  Arfrever     set     messages: + msg135786
2011-05-10 23:06:21  haypo        set     messages: + msg135756
2011-05-10 22:59:44  python-dev   set     messages: + msg135754
2011-05-10 22:32:15  python-dev   set     messages: + msg135752
2011-05-10 22:15:39  python-dev   set     status: open -> closed
                                          nosy: + python-dev
                                          messages: + msg135749
                                          resolution: fixed
                                          stage: resolved
2011-05-07 09:49:32  haypo        set     priority: normal -> release blocker
                                          nosy: + benjamin.peterson, georg.brandl
2011-05-06 22:15:08  Arfrever     set     messages: + msg135374
2011-05-02 13:53:21  haypo        set     messages: + msg134972
2011-05-02 13:51:24  Arfrever     set     messages: + msg134971
2011-05-01 22:35:59  eric.araujo  set     messages: + msg134937
2011-05-01 22:27:47  eric.araujo  set     messages: + msg134936
2011-05-01 22:21:05  haypo        set     messages: + msg134934
2011-04-30 23:43:37  Arfrever     set     messages: + msg134894
2011-04-29 15:19:19  eric.araujo  set     messages: + msg134773
2011-04-28 14:52:40  Arfrever     set     messages: + msg134681
2011-04-28 14:48:59  eric.araujo  set     messages: + msg134680
2011-04-28 14:29:28  Arfrever     set     files: + build_scripts-binary_mode.patch
                                          messages: + msg134678
                                          title: distutils command build_scripts fails with UnicodeDecodeError -> distutils command build_scripts fails with UnicodeDecodeError
2011-04-28 08:20:33  lemburg      set     nosy: + lemburg
                                          title: distutils command build_scripts fails with UnicodeDecodeError -> distutils command build_scripts fails with UnicodeDecodeError
                                          messages: + msg134661
2011-04-27 23:38:56  eric.araujo  set     versions: + 3rd party, Python 2.7
                                          nosy: + alexis
                                          messages: + msg134630
                                          components: + Distutils2
2011-04-27 17:16:50  Arfrever     set     nosy: + haypo, Arfrever
                                          versions: + Python 3.3
2011-02-04 03:44:00  belopolsky   set     nosy: tarek, eric.araujo, hagen, mgorny
                                          type: crash -> behavior
2010-11-15 08:13:31  mgorny       set     nosy: + mgorny
2010-11-14 20:32:31  hagen        create