distutils: set encoding to utf-8 for input and output files #53770

vstinner · 2010-08-10T17:51:38Z

BPO	9561
Nosy	@malemburg, @vstinner, @tarekziade, @merwok, @abadger, @ikelos, @mgorny
Files	pkginfo_utf8.patch packaging_pkginfo_utf8.patch

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


GitHub fields:

assignee = 'https://github.com/merwok'
closed_at = <Date 2013-01-03.01:26:51.398>
created_at = <Date 2010-08-10.17:51:37.803>
labels = ['type-bug', 'library', 'expert-unicode']
title = 'distutils: set encoding to utf-8 for input and output files'
updated_at = <Date 2013-01-03.01:26:51.398>
user = 'https://github.com/vstinner'

bugs.python.org fields:

activity = <Date 2013-01-03.01:26:51.398>
actor = 'vstinner'
assignee = 'eric.araujo'
closed = True
closed_date = <Date 2013-01-03.01:26:51.398>
closer = 'vstinner'
components = ['Distutils', 'Unicode', 'Distutils2']
creation = <Date 2010-08-10.17:51:37.803>
creator = 'vstinner'
dependencies = []
files = ['22523', '22524']
hgrepos = []
issue_num = 9561
keywords = ['patch']
message_count = 25.0
messages = ['113552', '113584', '113725', '116061', '116260', '116261', '116317', '120689', '121215', '136812', '136974', '136975', '138820', '138822', '139483', '139487', '139528', '139656', '141569', '141773', '143567', '143569', '143570', '143611', '144271']
nosy_count = 10.0
nosy_names = ['lemburg', 'vstinner', 'tarek', 'eric.araujo', 'hagen', 'a.badger', 'Arfrever', 'ikelos', 'mgorny', 'python-dev']
pr_nums = []
priority = 'normal'
resolution = 'fixed'
stage = 'patch review'
status = 'closed'
superseder = None
type = 'behavior'
url = 'https://bugs.python.org/issue9561'
versions = ['3rd party', 'Python 2.7', 'Python 3.2', 'Python 3.3']

vstinner · 2010-08-10T17:51:37Z

While working on bpo-9425 (support non-ascii characters in python directory name with ascii locale), I wrote a patch for distutils.file_util.write_file(): set encoding to utf-8 and errors to surrogateescape. See the patch with comments at:
http://codereview.appspot.com/1874048/patch/1/9

(the patch is not enough, it should also patch *all* functions reading files)

I discussed this with Tarek, who told me that it is documented that distutils files have to be utf-8. I couldn't find the documentation. I checked read_manifest() in the sdist command: in Python2 and Python3, it uses the open(name) syntax. This means that Python2 uses the binary API (bytes), whereas Python3 uses the text API (unicode characters) and relies on the open() (TextIOWrapper) heuristic to *guess* the file encoding.

I think that it would be better to specify the encoding in Python3, and maybe use the text API in Python2.
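A minimal sketch of what "utf-8 plus surrogateescape" means in practice, assuming Python 3. The file name and the MANIFEST path are made up for illustration; this is not the patch itself.

```python
import os
import tempfile

# A file name containing a byte that is not valid UTF-8 (Latin-1 "é").
raw_name = b"caf\xe9.txt"

# surrogateescape maps the undecodable byte to a lone surrogate code point...
name = raw_name.decode("utf-8", errors="surrogateescape")

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "MANIFEST")
    # ...and passing encoding= explicitly makes the result independent
    # of the locale encoding that open() would otherwise pick.
    with open(path, "w", encoding="utf-8", errors="surrogateescape") as f:
        f.write(name + "\n")
    with open(path, "rb") as f:
        on_disk = f.read()
```

With both arguments given, the original byte is written back unchanged, so the name survives the round trip even though it is not valid UTF-8.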

Anyway, before going further (working on patches), I would like the approval of the distutils maintainer(s).

merwok · 2010-08-11T03:21:26Z

There are different kinds of files created by write_file:

- PKG-INFO (METADATA in distutils2), that already uses a trick to support Unicode, but your change would replace it in a better way;
- MANIFEST, which with your fix would gain the ability to handle non-ASCII paths, which is a feature or a bugfix depending on your point of view;
- .def files, used by the compilers for the C linking step; I don’t know if it’s appropriate to allow UTF-8 there.
- RPM spec files, which use ASCII or UTF-8 according to http://en.opensuse.org/openSUSE:Specfile_guidelines#Specfile_Encoding but it’s not confirmed in http://www.rpm.org/max-rpm/s1-rpm-build-creating-spec-file.html (linked from the LSB site), so there’s no guarantee this works for all RPM platforms. This sort of platform-specific thing is the reason why RPM support has been removed in distutils2.
- record and .pth files created by the install command.

I agree that there is something to be fixed, but I don’t know if they can be fixed in distutils. Unicode in PKG-INFO is unrelated to files, whereas there are files or directories in MANIFEST, spec, record and .pth. If this is going to be fixed, write_file should not use UTF-8 unconditionally but grow a keyword argument IMO, so that use cases requiring ASCII continue to work.

When you say “patch *all* functions reading files”, I guess you mean all functions that read distutils files, i.e. MANIFEST and PKG-INFO.

Tarek, is this a bug fix or a feature? Could it break third-party tools?

vstinner · 2010-08-13T00:18:53Z

> - PKG-INFO (METADATA in distutils2), that already uses a trick to support
> Unicode, but your change would replace it in a better way;

Which "trick"?

> - MANIFEST, which with your fix would gain the ability to handle non-ASCII
> paths, which is a feature or a bugfix depending on your point of view;

Wait. Non-encodable bytes are a separate issue. I would like to work on the
first problem: distutils in Python3 uses open() without an encoding argument, so
the encoding depends on the user's locale. Said differently: if you produce
a file with distutils on one computer, you cannot be sure that the file can be
read with the same version of Python on another computer (if the locale encoding
is different). E.g. Windows uses the mbcs encoding whereas utf-8 is the preferred
encoding on Linux.
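The portability problem can be reproduced without touching the locale. In this sketch cp1252 is only a stand-in for whatever "mbcs" resolves to on a given Windows machine (an assumption made for illustration):

```python
text = "napoléon"

# Bytes as a machine with a legacy code page might write them
# (cp1252 is used here as a stand-in for the "mbcs" locale encoding).
written = text.encode("cp1252")

# A reader whose locale encoding is UTF-8 cannot decode those bytes.
try:
    written.decode("utf-8")
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False

# With an encoding agreed on by both sides, the round trip is lossless.
agreed = text.encode("utf-8").decode("utf-8")
```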

What is the encoding of the MANIFEST file?

> - .def files, used by the compilers for the C linking step; I don’t know if
> it’s appropriate to allow UTF-8 there.

I don't know these files.

> - RPM spec files, which use ASCII or UTF-8 according to
> http://en.opensuse.org/openSUSE:Specfile_guidelines#Specfile_Encoding but
> it’s not confirmed in
> http://www.rpm.org/max-rpm/s1-rpm-build-creating-spec-file.html (linked
> from the LSB site), so there’s no guarantee this works for all RPM
> platforms. This sort of platform-specific thing is the reason why RPM
> support has been removed in distutils2.

UTF-8 is a superset of ASCII. If you use utf-8 but only write ascii
characters, your output file will be written as utf-8... but it will also be
valid ascii. It's magical :-)

> - record and .pth files created by the install command.

.pth files contain directory names, which can be non-ASCII.

> I agree that there is something to be fixed, but I don’t know if they can
> be fixed in distutils. Unicode in PKG-INFO is unrelated to files, whereas
> there are files or directories in MANIFEST, spec, record and .pth.

You can use non-ASCII characters for other things than filenames. E.g. in a
description of a package :-)

> If this is going to be fixed, write_file should not use UTF-8 unconditionally
> but grow a keyword argument IMO, so that use cases requiring ASCII
> continue to work.

As written before, UTF-8 is a superset of ASCII. If you read a file using the utf-8
encoding, you will be able to read ascii files. But if you use utf-8 and write
non-ascii characters, old versions of distutils using ascii or another encoding
will not be able to read these files.
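The asymmetry can be checked directly on bytes. The metadata lines below are invented for the example:

```python
ascii_bytes = "Author: Eric\n".encode("ascii")
utf8_bytes = "Author: Éric\n".encode("utf-8")

# ASCII-only text produces the same bytes under both encodings.
same_bytes = "Author: Eric\n".encode("utf-8") == ascii_bytes

# A UTF-8 reader accepts a file that was written as ASCII.
read_by_utf8 = ascii_bytes.decode("utf-8")

# An ASCII-only reader rejects UTF-8 bytes outside the ASCII range.
try:
    utf8_bytes.decode("ascii")
    old_reader_ok = True
except UnicodeDecodeError:
    old_reader_ok = False
```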

Anyway, I think that in most cases, all files only contain ASCII text. So it
doesn't really matter.

About the keyword solution: yes, it would be a smooth way to fix this issue.

> When you say “patch *all* functions reading files”, I guess you mean all
> functions that read distutils files, i.e. MANIFEST and PKG-INFO.

I don't know distutils well enough to answer my own question.

vstinner · 2010-09-10T23:33:35Z

I attached a patch to bpo-6011 to set the encoding to read the Makefile.

merwok · 2010-09-13T00:49:09Z

[Toshio, I made you nosy for a question about RPM .spec files]

>> - PKG-INFO (METADATA in distutils2), that already uses a trick to support
>> Unicode, but your change would replace it in a better way;
> Which "trick"?

Some values are explicitly allowed to use Unicode and are encoded to UTF-8
when queried.

>> - MANIFEST, which with your fix would gain the ability to handle non-ASCII
>> paths, which is a feature or a bugfix depending on your point of view;
> Wait. Non-encodable bytes are a separate issue. I would like to work on the
> first problem: distutils in Python3 uses open() without an encoding argument, so
> the encoding depends on the user's locale. Said differently: if you produce
> a file with distutils on one computer, you cannot be sure that the file can be
> read with the same version of Python on another computer (if the locale encoding
> is different). E.g. Windows uses the mbcs encoding whereas utf-8 is the preferred
> encoding on Linux.
>
> What is the encoding of the MANIFEST file?

Python’s default encoding, unfortunately. Try listing “napoléon” in a MANIFEST
file and you’ll get a UnicodeEncodeError because the file wants ASCII.
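The failure mode can be simulated in Python 3 with an in-memory text file forced to ASCII. This is only an approximation of what happens when the default encoding is ASCII; the path is invented:

```python
import io

# Simulate a text file whose encoding is ASCII, as happens when the
# default or locale encoding is ASCII and no encoding is specified.
raw = io.BytesIO()
manifest = io.TextIOWrapper(raw, encoding="ascii")

try:
    manifest.write("docs/napoléon.txt\n")
    manifest.flush()
    failed = False
except UnicodeEncodeError:
    failed = True
```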

>> - .def files, used by the compilers for the C linking step; I don’t know if
>> it’s appropriate to allow UTF-8 there.
> I don't know these files.

So we’ll have to get advice from someone well-versed in C linking.

>> - RPM spec files, which use ASCII or UTF-8 according to
>> http://en.opensuse.org/openSUSE:Specfile_guidelines#Specfile_Encoding but
>> it’s not confirmed in
>> http://www.rpm.org/max-rpm/s1-rpm-build-creating-spec-file.html (linked
>> from the LSB site), so there’s no guarantee this works for all RPM
>> platforms. This sort of platform-specific thing is the reason why RPM
>> support has been removed in distutils2.
> UTF-8 is a superset of ASCII. If you use utf-8 but only write ascii
> characters, your output file will be written as utf-8... but it will also be
> valid ascii. It's magical :-)

I know that, but it does not answer the question: Is it okay for these files
to use UTF-8?

>> - record and .pth files created by the install command.
> .pth files contain directory names, which can be non-ASCII.

Agreed.

>> I agree that there is something to be fixed, but I don’t know if they can
>> be fixed in distutils. Unicode in PKG-INFO is unrelated to files, whereas
>> there are files or directories in MANIFEST, spec, record and .pth.
> You can use non-ASCII characters for other things than filenames. E.g. in a
> description of a package :-)

See above: The description of a distribution is in UTF-8. Note that I don’t
really understand my comment anymore; I now think that this should be fixed
in distutils with the least intrusive change possible.

>> If this is going to be fixed, write_file should not use UTF-8 unconditionally
>> but grow a keyword argument IMO, so that use cases requiring ASCII
>> continue to work.
> As written before, UTF-8 is a superset of ASCII. If you read a file using the utf-8
> encoding, you will be able to read ascii files. But if you use utf-8 and write
> non-ascii characters, old versions of distutils using ascii or another encoding
> will not be able to read these files.

That’s what I meant: Don’t make write_file always use UTF-8 since some use cases are restricted to ASCII.

> About the keyword solution: yes, it would be a smooth way to fix this issue.

Let’s do it. (Make sys.getdefaultencoding() its default value for compat.)

>> When you say “patch *all* functions reading files”, I guess you mean all
>> functions that read distutils files, i.e. MANIFEST and PKG-INFO.
> I don't know distutils well enough to answer my own question.

You patch writing files, I’ll handle reading files :)
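A rough sketch of the keyword-argument idea, assuming Python 3. The `encoding` parameter is the proposal from this discussion, not an argument of the real distutils.file_util.write_file(); the file names and contents are invented.

```python
import os
import sys
import tempfile


def write_file(filename, contents, encoding=None):
    """Write a sequence of lines to filename, one per line.

    Sketch only: the `encoding` keyword is the proposal from this thread,
    not part of the real distutils.file_util.write_file().
    """
    if encoding is None:
        # Default suggested in the discussion, for compatibility.
        encoding = sys.getdefaultencoding()
    with open(filename, "w", encoding=encoding) as f:
        for line in contents:
            f.write(line + "\n")


with tempfile.TemporaryDirectory() as tmp:
    # A caller that wants UTF-8 asks for it explicitly.
    pkg_info = os.path.join(tmp, "PKG-INFO")
    write_file(pkg_info, ["Author: Éric"], encoding="utf-8")
    with open(pkg_info, "rb") as f:
        data = f.read()

    # A caller restricted to ASCII gets an error instead of silent UTF-8.
    spec = os.path.join(tmp, "example.spec")
    try:
        write_file(spec, ["Name: python-café"], encoding="ascii")
        ascii_rejected = False
    except UnicodeEncodeError:
        ascii_rejected = True
```

Callers that must stay ASCII keep a hard failure on non-ASCII input, while other callers can opt in to UTF-8 without changing the default.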

merwok · 2010-09-13T00:51:39Z

Note that any change requires a test.

abadger · 2010-09-13T15:37:05Z

>>> - RPM spec files, which use ASCII or UTF-8 according to
>>> http://en.opensuse.org/openSUSE:Specfile_guidelines#Specfile_Encoding but
>>> it’s not confirmed in
>>> http://www.rpm.org/max-rpm/s1-rpm-build-creating-spec-file.html (linked
>>> from the LSB site)
>> UTF-8 is a superset of ASCII. If you use utf-8 but only write ascii
>> characters, your output file will be written as utf-8... but it will also be
>> valid ascii. It's magical :-)
>
> I know that, but it does not answer the question: Is it okay for these files
> to use UTF-8?

rpm spec files are encoding agnostic similar to POSIX filesystems. This causes no end of troubles for people writing python code that deals with python of course, as they cannot rely on the bytes that they are dealing with from one package to another to have the same encoding (Remember that things like dependency solvers have to compare the information from multiple packages to make their decisions).

Individual distributions will have different policies about encoding and the use of unicode in spec files to try and mitigate the problems. For instance, Fedora specifies utf-8 in the spec files and additionally specifies that package names must be ascii. (So if there's a package name: python-café, we would likely transcribe it as python-cafe when we made a package for it).
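One way such a transcription could be done is by stripping diacritics. This is only an illustration of the idea; it is not Fedora's actual tooling, and `to_ascii_name` is a hypothetical helper.

```python
import unicodedata


def to_ascii_name(name):
    """Strip diacritics so that a package name fits in ASCII.

    Illustrative only: this is not Fedora's actual tooling or policy.
    """
    # NFKD splits "é" into "e" followed by a combining accent...
    decomposed = unicodedata.normalize("NFKD", name)
    # ...which is then dropped because it has no ASCII representation.
    return decomposed.encode("ascii", "ignore").decode("ascii")


print(to_ascii_name("python-café"))  # python-cafe
```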

utf-8 is a good default for locales on POSIX systems so it's a good default for encoding spec files but I know there's some people out there who make their own packages that aren't utf-8. I haven't checked but I also wouldn't be surprised if some Asian countries (where the bytes-per-character with utf-8 is high) have local distributions that use non-utf-8 encoding as well. Whether either of these use cases needs to be catered to in distutils (when the support is going away in distutils2) I'll leave to someone else to decide. My personal gut instinct is no but I'm not one of the people using a non-utf-8 locale.

vstinner · 2010-11-07T14:49:29Z

This issue could be split into multiple issues: one issue per file type (e.g. Makefile, RPM spec file, etc.).

hagen · 2010-11-15T07:15:59Z

Created bpo-10419 for the encoding problem in "build_scripts".

vstinner · 2011-05-24T23:46:05Z

I started to patch packaging to fix this issue in the packaging module: issue bpo-12112. We might leave distutils unchanged and improve the packaging module instead (because previous experiments proved that distutils should not be touched or it breaks random stuff!).

merwok · 2011-05-26T15:55:39Z

Definitely. We can fix real bugs in distutils, but sometimes it’s best to avoid disruptive changes, leave distutils with its buggy behavior, and let the packaging module have the best behavior.

malemburg · 2011-05-26T16:04:01Z

Éric Araujo wrote:

> Éric Araujo <merwok@netwok.org> added the comment:
>
> Definitely. We can fix real bugs in distutils, but sometimes it’s best to avoid disruptive changes, leave distutils with its buggy behavior, and let the packaging module have the best behavior.

This is a real bug, since we agreed long ago that distutils should
read and write files using the UTF-8 encoding.

mgorny · 2011-06-22T11:19:35Z

Now that installing scripts with unicode characters was fixed, shall I open a separate bug for writing egg files with utf8 chars in author name?

Arfrever · 2011-06-22T13:17:24Z

Please file a separate issue.

vstinner · 2011-06-30T14:54:49Z

pkginfo_utf8.patch: distutils uses UTF-8 to write PKG-INFO and .egg-info, instead of the locale encoding. It should be applied to 2.7, 3.2 and 3.3.

packaging_pkginfo_utf8.patch: packaging tests use UTF-8 to write PKG-INFO files, instead of the locale encoding (cosmetic change, the file content is an empty string :-)). It should only be applied to 3.3 (packaging has been introduced in Python 3.3).

merwok · 2011-06-30T15:08:32Z

> pkginfo_utf8.patch: distutils uses UTF-8 to write PKG-INFO and
> .egg-info, instead of the locale encoding. It should be applied to
> 2.7, 3.2 and 3.3.

Okay. I guess you’ll use codecs.open in 2.7; please make sure there is no bootstrapping issue for the build of CPython itself.

It would be a good thing to have non-ASCII in the PKG-INFO/METADATA files in the tests; it’s how we caught bpo-12320.
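A test along those lines might look like the sketch below. `write_pkg_info` is a hypothetical stand-in for the code under test, not a distutils function, and the field values are made up.

```python
import os
import tempfile
import unittest


def write_pkg_info(path, fields):
    """Hypothetical stand-in for the code under test.

    Writes "Key: value" lines using UTF-8 regardless of the locale.
    """
    with open(path, "w", encoding="utf-8") as f:
        for key, value in fields:
            f.write("%s: %s\n" % (key, value))


class PkgInfoEncodingTest(unittest.TestCase):
    def test_non_ascii_author(self):
        with tempfile.TemporaryDirectory() as tmp:
            path = os.path.join(tmp, "PKG-INFO")
            write_pkg_info(path, [("Name", "example"),
                                  ("Author", "Éric Araujo")])
            with open(path, "rb") as f:
                data = f.read()
        # The bytes on disk must decode as UTF-8 whatever the locale is.
        self.assertIn("Author: Éric Araujo", data.decode("utf-8"))


# Run the test case quietly and keep the result for inspection.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(PkgInfoEncodingTest)
result = unittest.TestResult()
suite.run(result)
```

Reading the file back in binary mode keeps the check independent of the locale encoding of the machine running the tests.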

vstinner · 2011-06-30T22:04:56Z

> Okay. I guess you’ll use codecs.open in 2.7

Oh, Python 2.7... DistributionMetadata of distutils encodes most values to byte strings (get_xxx() methods call self._encode_field). It would be possible to use codecs.open(), but a Unicode file expects Unicode strings. The problem is that the user may provide arbitrary byte strings, I mean strings not encoded to PKG_INFO_ENCODING. Even if such strings are *wrong* (not correctly encoded), is it a good idea to be more strict in a minor version (2.7.x)?

I don't want to be responsible for such a tricky change; I prefer to leave distutils unchanged in Python 2.7 (at least for PKG-INFO).

> please make sure there is no bootstrapping issue
> for the build of CPython itself.

I checked, there is no bootstrap issue.

merwok · 2011-07-02T14:32:31Z

>> Okay. I guess you’ll use codecs.open in 2.7
> Oh, Python 2.7... DistributionMetadata of distutils encodes most
> values to byte strings (get_xxx() methods call self._encode_field).

I forgot that. No change is needed in 2.7.

> I checked, there is no bootstrap issue.

I was talking about bootstrapping if a change to use codecs was made.

mgorny · 2011-08-02T15:28:54Z

Ping. What's the progress on this? Will this ever be fixed?

vstinner · 2011-08-08T11:42:58Z

> Ping. What's the progress on this? Will this ever be fixed?

Some functions have been fixed in the new packaging module, but not in
distutils yet.

python-dev · 2011-09-05T21:50:28Z

New changeset fb4d2e6d393e by Victor Stinner in branch '3.2':
Issue bpo-9561: distutils now reads and writes egg-info files using UTF-8
http://hg.python.org/cpython/rev/fb4d2e6d393e

New changeset 3c080bf75342 by Victor Stinner in branch 'default':
Merge 3.2: Issue bpo-9561: distutils now reads and writes egg-info files using UTF-8
http://hg.python.org/cpython/rev/3c080bf75342

vstinner · 2011-09-05T22:04:52Z

I applied pkginfo_utf8.patch to Python 3.2 and 3.3. Python 2.7 is not affected: it already encodes Unicode to UTF-8.

python-dev · 2011-09-05T22:11:40Z

New changeset 56ab3257ca13 by Victor Stinner in branch 'default':
Issue bpo-9561: packaging now writes egg-info files using UTF-8
http://hg.python.org/cpython/rev/56ab3257ca13

merwok · 2011-09-06T15:19:05Z

> I applied pkginfo_utf8.patch to Python 3.2 and 3.3.

If you apply patches to distutils, please add tests for the fixed behavior. (Sorry if I wasn’t responsive on this one.)

merwok · 2011-09-19T13:34:59Z

I backported your last change to distutils2 as f5a74b1f9473.

vstinner assigned tarekziade Aug 10, 2010

vstinner added the stdlib (Python modules in the Lib dir) and topic-unicode labels Aug 10, 2010

merwok assigned merwok and unassigned tarekziade Sep 13, 2010

merwok added the type-bug (An unexpected behavior, bug, or error) label Sep 13, 2010

vstinner closed this as completed Jan 3, 2013

ezio-melotti transferred this issue from another repository Apr 10, 2022
