classification
Title: distutils: set encoding to utf-8 for input and output files
Type: behavior Stage: patch review
Components: Distutils, Distutils2, Unicode Versions: Python 3.3, Python 3.2, Python 2.7, 3rd party
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: eric.araujo Nosy List: Arfrever, a.badger, eric.araujo, hagen, haypo, ikelos, lemburg, mgorny, python-dev, tarek
Priority: normal Keywords: patch

Created on 2010-08-10 17:51 by haypo, last changed 2013-01-03 01:26 by haypo. This issue is now closed.

Files
File name Uploaded Description Edit
pkginfo_utf8.patch haypo, 2011-06-30 14:54
packaging_pkginfo_utf8.patch haypo, 2011-06-30 14:55
Messages (25)
msg113552 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2010-08-10 17:51
While working on #9425 (support non-ascii characters in python directory name with ascii locale), I wrote a patch for distutils.file_util(): set encoding to utf-8 and errors to surrogateescape. See the patch with comments at:
http://codereview.appspot.com/1874048/patch/1/9

(the patch is not enough, it should also patch *all* functions reading files)

I discussed this with tarek, who told me that it is documented that distutils files have to be utf-8. I couldn't find the documentation. I checked read_manifest() in the sdist command: in Python2 and Python3, it uses the open(name) syntax. This means that Python2 uses the binary API (bytes), whereas Python3 uses the text API (unicode characters) and relies on the open() (TextIOWrapper) heuristic to *guess* the file encoding.

I think that it would be better to specify the encoding in Python3, and maybe use the text API in Python2.

Anyway, before going further (working on patches), I would like the approval of the distutils maintainer(s).
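
A minimal sketch of the idea above (explicit utf-8 plus the surrogateescape error handler), not the actual patch. The helper name roundtrip_bytes and the file names are made up for illustration. It shows that bytes which are not valid UTF-8 survive a read/write cycle through the text API:

```python
import os
import tempfile


def roundtrip_bytes(raw: bytes) -> bytes:
    """Write raw bytes, read them as UTF-8 text, and write the text back.

    errors="surrogateescape" maps each undecodable byte to a lone
    surrogate on read and restores the original byte on write.
    """
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "MANIFEST")
        dst = os.path.join(tmp, "MANIFEST.copy")
        with open(src, "wb") as f:
            f.write(raw)
        # Explicit encoding: the result no longer depends on the locale.
        with open(src, encoding="utf-8", errors="surrogateescape", newline="") as f:
            text = f.read()
        with open(dst, "w", encoding="utf-8", errors="surrogateescape", newline="") as f:
            f.write(text)
        with open(dst, "rb") as f:
            return f.read()


# b"\xe9" is "é" in Latin-1 and is not valid UTF-8 on its own.
original = b"napol\xe9on.txt\n"
copied = roundtrip_bytes(original)
```

Without errors="surrogateescape", the read step would raise UnicodeDecodeError on the same input.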
msg113584 - (view) Author: Éric Araujo (eric.araujo) * (Python committer) Date: 2010-08-11 03:21
There are different kinds of files created by write_file:

- PKG-INFO (METADATA in distutils2), that already uses a trick to support Unicode, but your change would replace it in a better way;

- MANIFEST, which with your fix would gain the ability to handle non-ASCII paths, which is a feature or a bugfix depending on your point of view;

- .def files, used by the compilers for the C linking step; I don’t know if it’s appropriate to allow UTF-8 there.

- RPM spec files, which use ASCII or UTF-8 according to http://en.opensuse.org/openSUSE:Specfile_guidelines#Specfile_Encoding but it’s not confirmed in http://www.rpm.org/max-rpm/s1-rpm-build-creating-spec-file.html (linked from the LSB site), so there’s no guarantee this works for all RPM platforms. This sort of platform-specific thing is the reason why RPM support has been removed in distutils2.

- record and .pth files created by the install command.

I agree that there is something to be fixed, but I don’t know if they can be fixed in distutils. Unicode in PKG-INFO is unrelated to files, whereas there are files or directories in MANIFEST, spec, record and .pth. If this is going to be fixed, write_file should not use UTF-8 unconditionally but grow a keyword argument IMO, so that use cases requiring ASCII continue to work.

When you say “patch *all* functions reading files”, I guess you mean all functions that read distutils files, i.e. MANIFEST and PKG-INFO.

Tarek, is this a bug fix or a feature? Could it break third-party tools?
msg113725 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2010-08-13 00:18
> - PKG-INFO (METADATA in distutils2), that already uses a trick to support
> Unicode, but your change would replace it in a better way;

Which "trick"?

> - MANIFEST, which with your fix would gain the ability to handle non-ASCII
> paths, which is a feature or a bugfix depending on your point of view;

Wait. Non-encodable bytes are a separate issue. I would like to work on the 
first problem: distutils in Python3 uses open() without an encoding argument, 
so the encoding depends on the user's locale. Said differently: if you produce 
a file with distutils on one computer, you cannot be sure that the file can be 
read with the same version of Python on another computer (if the locale encoding 
is different). E.g. Windows uses the mbcs encoding whereas utf-8 is the preferred 
encoding on Linux.
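
To illustrate the portability problem without touching distutils, here is a small sketch. It assumes cp1252 as a stand-in for a Windows locale encoding (the mbcs codec itself only exists on Windows), and the variable names are illustrative only:

```python
text = "napoléon"

# cp1252 stands in for a Windows locale encoding here (an assumption;
# the "mbcs" codec itself only exists on Windows).
written_on_windows = text.encode("cp1252")
written_on_linux = text.encode("utf-8")

# The same text yields different bytes under each locale encoding.
same_bytes = written_on_windows == written_on_linux

# Reading the cp1252 file on a UTF-8 machine fails outright.
try:
    written_on_windows.decode("utf-8")
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True

# The reverse direction does not fail but silently produces mojibake.
misread = written_on_linux.decode("cp1252")
```

Either outcome (an exception or silently wrong text) is what an explicit encoding argument is meant to avoid.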

What is the encoding of the MANIFEST file?

> - .def files, used by the compilers for the C linking step; I don’t know if
> it’s appropriate to allow UTF-8 there.

I don't know these files.

> - RPM spec files, which use ASCII or UTF-8 according to
> http://en.opensuse.org/openSUSE:Specfile_guidelines#Specfile_Encoding but
> it’s not confirmed in
> http://www.rpm.org/max-rpm/s1-rpm-build-creating-spec-file.html (linked
> from the LSB site), so there’s no guarantee this works for all RPM
> platforms. This sort of platform-specific thing is the reason why RPM
> support has been removed in distutils2.

UTF-8 is a superset of ASCII. If you use utf-8 but only write ascii 
characters, your output file will be written as utf-8... but it will also be 
valid ascii. It's magical :-)
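
A quick way to check the superset claim, as a sketch; the sample metadata text is invented for the example:

```python
ascii_only = "Name: example\nVersion: 1.0\n"

# ASCII-only text encodes to the same bytes under both codecs,
# so a UTF-8 writer produces a file that an ASCII reader accepts.
utf8_bytes = ascii_only.encode("utf-8")
ascii_bytes = ascii_only.encode("ascii")
identical = utf8_bytes == ascii_bytes

# An ASCII file can therefore always be read back as UTF-8.
decoded = ascii_bytes.decode("utf-8")
```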

> - record and .pth files created by the install command.

.pth files contain directory names, which can be non-ASCII.

> I agree that there is something to be fixed, but I don’t know if they can
> be fixed in distutils. Unicode in PKG-INFO is unrelated to files, whereas
> there are files or directories in MANIFEST, spec, record and .pth.

You can use non-ASCII characters for things other than filenames, e.g. in the 
description of a package :-)

> If this is going to be fixed, write_file should not use UTF-8 unconditionally
> but grow a keyword argument IMO, so that use cases requiring ASCII 
> continue to work.

As written before, UTF-8 is a superset of ASCII. If you read a file using the utf-8 
encoding, you will be able to read ascii files. But if you use utf-8 and write 
non-ascii characters, old versions of distutils using ascii or another encoding 
will not be able to read these files.

Anyway, I think that in most cases, all files only contain ASCII text. So it 
doesn't really matter.

About the keyword solution: yes, it would be a smooth way to fix this issue.

> When you say “patch *all* functions reading files”, I guess you mean all
> functions that read distutils files, i.e. MANIFEST and PKG-INFO.

I don't know distutils well enough to answer my own question.
msg116061 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2010-09-10 23:33
I attached a patch to #6011 to set the encoding to read the Makefile.
msg116260 - (view) Author: Éric Araujo (eric.araujo) * (Python committer) Date: 2010-09-13 00:49
[Toshio, I made you nosy for a question about RPM .spec files]

>> - PKG-INFO (METADATA in distutils2), that already uses a trick to support
>> Unicode, but your change would replace it in a better way;
> Which "trick"?

Some values are explicitly allowed to use Unicode and are encoded to UTF-8
when queried.

>> - MANIFEST, which with your fix would gain the ability to handle non-ASCII
>> paths, which is a feature or a bugfix depending on your point of view;
> Wait. Non-encodable bytes are a separate issue. I would like to work on the
> first problem: distutils in Python3 uses open() without an encoding argument,
> so the encoding depends on the user's locale. Said differently: if you produce
> a file with distutils on one computer, you cannot be sure that the file can be
> read with the same version of Python on another computer (if the locale encoding
> is different). E.g. Windows uses the mbcs encoding whereas utf-8 is the preferred
> encoding on Linux.
>
> What is the encoding of the MANIFEST file?

Python’s default encoding, unfortunately.  Try listing “napoléon” in a MANIFEST
file and you’ll get a UnicodeEncodeError because the file wants ASCII.
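
The failure can be reproduced in isolation. This sketch uses an in-memory file restricted to ASCII as a stand-in for a MANIFEST opened under an ASCII default encoding; the path docs/napoléon.txt is made up and distutils is not involved:

```python
import io

# An in-memory text file that only accepts ASCII, standing in for a
# MANIFEST opened under an ASCII default encoding (an assumption made
# for illustration; distutils itself is not involved here).
buffer = io.BytesIO()
manifest = io.TextIOWrapper(buffer, encoding="ascii")

try:
    manifest.write("docs/napoléon.txt\n")
    manifest.flush()
    error = None
except UnicodeEncodeError as exc:
    error = exc
```

After this runs, error holds the UnicodeEncodeError raised for the "é" character.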

>> - .def files, used by the compilers for the C linking step; I don’t know if
>> it’s appropriate to allow UTF-8 there.
>
> I don't know these files.

So we’ll have to get advice from someone well-versed in C linking.

>> - RPM spec files, which use ASCII or UTF-8 according to
>> http://en.opensuse.org/openSUSE:Specfile_guidelines#Specfile_Encoding but
>> it’s not confirmed in
>> http://www.rpm.org/max-rpm/s1-rpm-build-creating-spec-file.html (linked
>> from the LSB site), so there’s no guarantee this works for all RPM
>> platforms. This sort of platform-specific thing is the reason why RPM
>> support has been removed in distutils2.
> UTF-8 is a superset of ASCII. If you use utf-8 but only write ascii
> characters, your output file will be written as utf-8... but it will also be
> valid ascii. It's magical :-)

I know that, but it does not answer the question:  Is it okay for these files
to use UTF-8?

>> - record and .pth files created by the install command.
> .pth files contain directory names, which can be non-ASCII.

Agreed.

>> I agree that there is something to be fixed, but I don’t know if they can
>> be fixed in distutils. Unicode in PKG-INFO is unrelated to files, whereas
>> there are files or directories in MANIFEST, spec, record and .pth.
> You can use non-ASCII characters for things other than filenames, e.g. in the
> description of a package :-)

See above: The description of a distribution is in UTF-8.  Note that I don’t
really understand my comment anymore; I now think that this should be fixed
in distutils with the least intrusive change possible.

>> If this is going to be fixed, write_file should not use UTF-8 unconditionally
>> but grow a keyword argument IMO, so that use cases requiring ASCII
>> continue to work.
> As written before, UTF-8 is a superset of ASCII. If you read a file using the utf-8
> encoding, you will be able to read ascii files. But if you use utf-8 and write
> non-ascii characters, old versions of distutils using ascii or another encoding
> will not be able to read these files.

That’s what I meant: Don’t make write_file always use UTF-8 since some use cases are restricted to ASCII.

> About the keyword solution: yes, it would be a smooth way to fix this issue.

Let’s do it.  (Make sys.getdefaultencoding() its default value for compat.)
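
A rough sketch of what such a keyword could look like. This is a hypothetical stand-alone function, not the distutils implementation: the encoding parameter and its fallback are assumptions based on the suggestion above. Note that on Python 3 sys.getdefaultencoding() always returns "utf-8", so the compatibility default mostly matters on Python 2:

```python
import os
import sys
import tempfile


def write_file(filename, contents, encoding=None):
    """Write a sequence of lines (without terminators) to filename.

    Hypothetical variant of a write_file helper: `encoding` is the new
    keyword and falls back to sys.getdefaultencoding() when omitted.
    """
    if encoding is None:
        encoding = sys.getdefaultencoding()
    with open(filename, "w", encoding=encoding) as f:
        for line in contents:
            f.write(line + "\n")


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "PKG-INFO")
    # Callers that need non-ASCII text opt in explicitly.
    write_file(path, ["Author: Éric Araujo"], encoding="utf-8")
    with open(path, "rb") as f:
        data = f.read()
```

Callers whose output must stay ASCII (for example .def files, if that turns out to be required) could pass encoding="ascii" and get an error instead of a file the consumer cannot read.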

>> When you say “patch *all* functions reading files”, I guess you mean all
>> functions that read distutils files, i.e. MANIFEST and PKG-INFO.
> I don't know distutils well enough to answer my own question.

You patch writing files, I’ll handle reading files :)
msg116261 - (view) Author: Éric Araujo (eric.araujo) * (Python committer) Date: 2010-09-13 00:51
Note that any change requires a test.
msg116317 - (view) Author: Toshio Kuratomi (a.badger) * Date: 2010-09-13 15:37
>>> - RPM spec files, which use ASCII or UTF-8 according to
>>> http://en.opensuse.org/openSUSE:Specfile_guidelines#Specfile_Encoding but
>>> it’s not confirmed in
>>> http://www.rpm.org/max-rpm/s1-rpm-build-creating-spec-file.html (linked
>>> from the LSB site)
>> UTF-8 is a superset of ASCII. If you use utf-8 but only write ascii
>> characters, your output file will be written as utf-8... but it will also be
>> valid ascii. It's magical :-)
>
> I know that, but it does not answer the question:  Is it okay for these files
> to use UTF-8?

rpm spec files are encoding agnostic similar to POSIX filesystems.  This causes no end of troubles for people writing python code that deals with python of course, as they cannot rely on the bytes that they are dealing with from one package to another to have the same encoding (Remember that things like dependency solvers have to compare the information from multiple packages to make their decisions).

Individual distributions will have different policies about encoding and the use of unicode in spec files to try and mitigate the problems.  For instance, Fedora specifies utf-8 in the spec files and additionally specifies that package names must be ascii.  (So if there's a package name: python-café, we would likely transcribe it as python-cafe when we made a package for it).
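
For illustration only, one mechanical way to get from python-café to python-cafe. Distributions may well do this by hand; to_ascii_name is a hypothetical helper and not part of any packaging tool:

```python
import unicodedata


def to_ascii_name(name: str) -> str:
    """Strip accents by decomposing characters and dropping non-ASCII."""
    decomposed = unicodedata.normalize("NFKD", name)
    return decomposed.encode("ascii", "ignore").decode("ascii")


package = to_ascii_name("python-café")
```

Characters without an ASCII decomposition are simply dropped by this approach, so it is only a rough approximation of a real transcription policy.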

utf-8 is a good default for locales on POSIX systems so it's a good default for encoding spec files, but I know there are some people out there who make their own packages that aren't utf-8.  I haven't checked but I also wouldn't be surprised if some Asian countries (where the bytes-per-character with utf-8 is high) have local distributions that use non-utf-8 encodings as well.  Whether either of these use cases needs to be catered to in distutils (when the support is going away in distutils2) I'll leave to someone else to decide.  My personal gut instinct is no, but I'm not one of the people using a non-utf-8 locale.
msg120689 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2010-11-07 14:49
This issue might be split into multiple issues: one issue per file type (e.g. Makefile, RPM spec file, etc.).
msg121215 - (view) Author: Hagen Fürstenau (hagen) Date: 2010-11-15 07:15
Created issue 10419 for the encoding problem in "build_scripts".
msg136812 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2011-05-24 23:46
I started to patch packaging to fix this issue in the packaging module: issue #12112. We might leave distutils unchanged and improve the packaging module instead (because previous experiments proved that distutils should not be touched or it breaks random stuff!).
msg136974 - (view) Author: Éric Araujo (eric.araujo) * (Python committer) Date: 2011-05-26 15:55
Definitely.  We can fix real bugs in distutils, but sometimes it’s best to avoid disruptive changes, leave distutils with its buggy behavior, and let the packaging module have the best behavior.
msg136975 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2011-05-26 16:04
Éric Araujo wrote:
> 
> Éric Araujo <merwok@netwok.org> added the comment:
> 
> Definitely.  We can fix real bugs in distutils, but sometimes it’s best to avoid disruptive changes, leave distutils with its buggy behavior, and let the packaging module have the best behavior.

This is a real bug, since we agreed long ago that distutils should
read and write files using the UTF-8 encoding.
msg138820 - (view) Author: Michał Górny (mgorny) Date: 2011-06-22 11:19
Now that installing scripts with unicode characters was fixed, shall I open a separate bug for writing egg files with utf8 chars in author name?
msg138822 - (view) Author: Arfrever Frehtes Taifersar Arahesis (Arfrever) * Date: 2011-06-22 13:17
Please file a separate issue.
msg139483 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2011-06-30 14:54
pkginfo_utf8.patch: distutils uses UTF-8 to write PKG-INFO and .egg-info, instead of the locale encoding. It should be applied to 2.7, 3.2 and 3.3.

packaging_pkginfo_utf8.patch: packaging tests use UTF-8 to write PKG-INFO files, instead of the locale encoding (cosmetic change, the file content is an empty string :-)). It should only be applied to 3.3 (packaging has been introduced in Python 3.3).
msg139487 - (view) Author: Éric Araujo (eric.araujo) * (Python committer) Date: 2011-06-30 15:08
> pkginfo_utf8.patch: distutils uses UTF-8 to write PKG-INFO and
> .egg-info, instead of the locale encoding. It should be applied to
> 2.7, 3.2 and 3.3.

Okay.  I guess you’ll use codecs.open in 2.7; please make sure there is no bootstrapping issue for the build of CPython itself.

It would be a good thing to have non-ASCII in the PKG-INFO/METADATA files in the tests; it’s how we caught #12320.
msg139528 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2011-06-30 22:04
> Okay.  I guess you’ll use codecs.open in 2.7

Oh, Python 2.7... DistributionMetadata of distutils encodes most values to byte strings (get_xxx() methods call self._encode_field). It would be possible to use codecs.open(), but a Unicode file expects Unicode strings. The problem is that the user may provide arbitrary byte strings, I mean strings not encoded to PKG_INFO_ENCODING. Even if such strings are *wrong* (not correctly encoded), is it a good idea to be more strict in a minor version (2.7.x)?

I don't want to be responsible for such a tricky change, I prefer to leave distutils unchanged in Python 2.7 (at least for PKG-INFO).

> please make sure there is no bootstrapping issue
> for the build of CPython itself.

I checked, there is no bootstrap issue.
msg139656 - (view) Author: Éric Araujo (eric.araujo) * (Python committer) Date: 2011-07-02 14:32
>> Okay.  I guess you’ll use codecs.open in 2.7
> Oh, Python 2.7... DistributionMetadata of distutils encodes most
> values to byte strings (get_xxx() methods call self._encode_field).
I forgot that.  No change is needed in 2.7.

> I checked, there is no bootstrap issue.
I was talking about bootstrapping if a change to use codecs was made.
msg141569 - (view) Author: Michał Górny (mgorny) Date: 2011-08-02 15:28
Ping. What's the progress on this? Will this ever be fixed?
msg141773 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2011-08-08 11:42
> Ping. What's the progress on this? Will this ever be fixed?

Some functions have been fixed in the new packaging module, but not in 
distutils yet.
msg143567 - (view) Author: Roundup Robot (python-dev) Date: 2011-09-05 21:50
New changeset fb4d2e6d393e by Victor Stinner in branch '3.2':
Issue #9561: distutils now reads and writes egg-info files using UTF-8
http://hg.python.org/cpython/rev/fb4d2e6d393e

New changeset 3c080bf75342 by Victor Stinner in branch 'default':
Merge 3.2: Issue #9561: distutils now reads and writes egg-info files using UTF-8
http://hg.python.org/cpython/rev/3c080bf75342
msg143569 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2011-09-05 22:04
I applied pkginfo_utf8.patch to Python 3.2 and 3.3. Python 2.7 is not affected, it already encodes Unicode to UTF-8.
msg143570 - (view) Author: Roundup Robot (python-dev) Date: 2011-09-05 22:11
New changeset 56ab3257ca13 by Victor Stinner in branch 'default':
Issue #9561: packaging now writes egg-info files using UTF-8
http://hg.python.org/cpython/rev/56ab3257ca13
msg143611 - (view) Author: Éric Araujo (eric.araujo) * (Python committer) Date: 2011-09-06 15:19
> I applied pkginfo_utf8.patch to Python 3.2 and 3.3.

If you apply patches to distutils, please add tests for the fixed behavior.  (Sorry if I wasn’t reactive on this one.)
msg144271 - (view) Author: Éric Araujo (eric.araujo) * (Python committer) Date: 2011-09-19 13:34
I backported your last change to distutils2 as f5a74b1f9473.
History
Date User Action Args
2013-01-03 01:26:51 haypo set status: open -> closed; resolution: fixed
2012-02-15 17:05:31 eric.araujo link issue14021 superseder
2011-09-19 13:34:58 eric.araujo set messages: + msg144271
2011-09-06 15:19:05 eric.araujo set messages: + msg143611
2011-09-05 22:11:39 python-dev set messages: + msg143570
2011-09-05 22:04:52 haypo set messages: + msg143569
2011-09-05 21:50:27 python-dev set nosy: + python-dev; messages: + msg143567
2011-08-08 11:42:57 haypo set messages: + msg141773
2011-08-02 15:28:53 mgorny set messages: + msg141569
2011-07-02 14:32:31 eric.araujo set messages: + msg139656
2011-06-30 22:04:56 haypo set messages: + msg139528
2011-06-30 15:08:31 eric.araujo set messages: + msg139487
2011-06-30 14:55:03 haypo set files: + packaging_pkginfo_utf8.patch
2011-06-30 14:54:49 haypo set files: + pkginfo_utf8.patch; keywords: + patch; messages: + msg139483; versions: - Python 3.1
2011-06-22 13:17:23 Arfrever set nosy: + Arfrever; messages: + msg138822
2011-06-22 11:19:35 mgorny set messages: + msg138820
2011-05-26 16:04:01 lemburg set nosy: + lemburg; messages: + msg136975
2011-05-26 15:55:39 eric.araujo set messages: + msg136974; versions: + Python 3.3
2011-05-24 23:46:05 haypo set messages: + msg136812
2010-11-15 07:15:59 hagen set messages: + msg121215
2010-11-07 14:49:29 haypo set messages: + msg120689
2010-10-31 14:12:07 ikelos set nosy: + ikelos
2010-10-18 18:25:28 mgorny set nosy: + mgorny
2010-10-18 18:08:42 eric.araujo link issue10051 superseder
2010-09-29 23:47:02 eric.araujo set versions: + 3rd party
2010-09-21 11:08:24 hagen set nosy: + hagen
2010-09-21 10:33:34 eric.araujo link issue9887 superseder
2010-09-13 15:37:05 a.badger set messages: + msg116317
2010-09-13 00:51:39 eric.araujo set versions: + Python 3.1, Python 2.7; messages: + msg116261; assignee: tarek -> eric.araujo; type: behavior; stage: patch review
2010-09-13 00:49:11 eric.araujo set nosy: + a.badger; messages: + msg116260
2010-09-10 23:33:34 haypo set messages: + msg116061
2010-08-13 00:18:57 haypo set messages: + msg113725
2010-08-11 03:21:27 eric.araujo set messages: + msg113584
2010-08-10 17:51:37 haypo create