distutils: set encoding to utf-8 for input and output files #53770

Closed
vstinner opened this issue Aug 10, 2010 · 25 comments

Comments

@vstinner
Member

BPO 9561
Nosy @malemburg, @vstinner, @tarekziade, @merwok, @abadger, @ikelos, @mgorny
Files
  • pkginfo_utf8.patch
  • packaging_pkginfo_utf8.patch
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = 'https://github.com/merwok'
    closed_at = <Date 2013-01-03.01:26:51.398>
    created_at = <Date 2010-08-10.17:51:37.803>
    labels = ['type-bug', 'library', 'expert-unicode']
    title = 'distutils: set encoding to utf-8 for input and output files'
    updated_at = <Date 2013-01-03.01:26:51.398>
    user = 'https://github.com/vstinner'

    bugs.python.org fields:

    activity = <Date 2013-01-03.01:26:51.398>
    actor = 'vstinner'
    assignee = 'eric.araujo'
    closed = True
    closed_date = <Date 2013-01-03.01:26:51.398>
    closer = 'vstinner'
    components = ['Distutils', 'Unicode', 'Distutils2']
    creation = <Date 2010-08-10.17:51:37.803>
    creator = 'vstinner'
    dependencies = []
    files = ['22523', '22524']
    hgrepos = []
    issue_num = 9561
    keywords = ['patch']
    message_count = 25.0
    messages = ['113552', '113584', '113725', '116061', '116260', '116261', '116317', '120689', '121215', '136812', '136974', '136975', '138820', '138822', '139483', '139487', '139528', '139656', '141569', '141773', '143567', '143569', '143570', '143611', '144271']
    nosy_count = 10.0
    nosy_names = ['lemburg', 'vstinner', 'tarek', 'eric.araujo', 'hagen', 'a.badger', 'Arfrever', 'ikelos', 'mgorny', 'python-dev']
    pr_nums = []
    priority = 'normal'
    resolution = 'fixed'
    stage = 'patch review'
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue9561'
    versions = ['3rd party', 'Python 2.7', 'Python 3.2', 'Python 3.3']

    @vstinner
    Member Author

    While working on bpo-9425 (support non-ascii characters in python directory name with ascii locale), I wrote a patch for distutils.file_util(): set encoding to utf-8 and errors to surrogateescape. See the patch with comments at:
    http://codereview.appspot.com/1874048/patch/1/9

    (the patch is not enough, it should also patch *all* functions reading files)
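A minimal sketch of what the surrogateescape error handler mentioned above does, using only built-in codecs. The sample path is made up for illustration and is not taken from the patch.

```python
# Bytes that are not valid UTF-8 (here: a Latin-1 encoded "é" inside a path).
raw = b"caf\xe9/setup.py"

# Strict UTF-8 decoding rejects the invalid byte.
try:
    raw.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# surrogateescape maps each undecodable byte to a lone surrogate (U+DC80..U+DCFF)...
text = raw.decode("utf-8", errors="surrogateescape")

# ...and encoding with the same handler restores the original bytes exactly.
roundtrip = text.encode("utf-8", errors="surrogateescape")
```

The point of the handler is that undecodable bytes survive a read/write round trip instead of raising an error.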

    I discussed with Tarek, who told me that it is documented that distutils files have to be utf-8. I couldn't find the documentation. I checked read_manifest() in the sdist command: in Python2 and Python3, it uses the open(name) syntax. It means that Python2 uses the binary API (bytes), whereas Python3 uses the text API (unicode characters) and relies on the open() (TextIOWrapper) heuristic to *guess* the file encoding.

    I think that it will be better to specify the encoding in Python3, and maybe use the text API in Python2.
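As a rough illustration of "specify the encoding" on Python 3: the sketch below writes and reads a file with an explicit encoding, so the result no longer depends on the locale. The file name and content are invented for the example; on the Python 3 versions discussed here, open() without an encoding argument falls back to the locale's preferred encoding.

```python
import locale
import os
import tempfile

# The encoding open() would pick in text mode if none were given
# (outside of UTF-8 mode); it varies from machine to machine.
locale_encoding = locale.getpreferredencoding(False)

path = os.path.join(tempfile.mkdtemp(), "MANIFEST")

# Explicit encoding on the writing side...
with open(path, "w", encoding="utf-8") as f:
    f.write("data/café.txt\n")

# ...and the same explicit encoding on the reading side.
with open(path, encoding="utf-8") as f:
    lines = f.read().splitlines()
```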

    Anyway, before going further (work on patches), I would like the approval of the distutils maintainer(s).

    @vstinner vstinner added stdlib Python modules in the Lib dir topic-unicode labels Aug 10, 2010
    @merwok
    Member

    merwok commented Aug 11, 2010

    There are different kinds of files created by write_file:

    • PKG-INFO (METADATA in distutils2), that already uses a trick to support Unicode, but your change would replace it in a better way;

    • MANIFEST, which with your fix would gain the ability to handle non-ASCII paths, which is a feature or a bugfix depending on your point of view;

    • .def files, used by the compilers for the C linking step; I don’t know if it’s appropriate to allow UTF-8 there.

    • RPM spec files, which use ASCII or UTF-8 according to http://en.opensuse.org/openSUSE:Specfile_guidelines#Specfile_Encoding but it’s not confirmed in http://www.rpm.org/max-rpm/s1-rpm-build-creating-spec-file.html (linked from the LSB site), so there’s no guarantee this works for all RPM platforms. This sort of platform-specific thing is the reason why RPM support has been removed in distutils2.

    • record and .pth files created by the install command.

    I agree that there is something to be fixed, but I don’t know if they can be fixed in distutils. Unicode in PKG-INFO is unrelated to files, whereas there are files or directories in MANIFEST, spec, record and .pth. If this is going to be fixed, write_file should not use UTF-8 unconditionally but grow a keyword argument IMO, so that use cases requiring ASCII continue to work.
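The keyword-argument idea could look roughly like the sketch below. This is a hypothetical variant, not the actual distutils function: the real write_file(filename, contents) takes no encoding parameter, and the file names and contents here are made up.

```python
import os
import tempfile


def write_file(filename, contents, encoding=None):
    """Hypothetical variant of distutils.file_util.write_file.

    'contents' is a sequence of lines without terminators.
    encoding=None keeps open()'s default, so existing callers are unaffected.
    """
    with open(filename, "w", encoding=encoding) as f:
        for line in contents:
            f.write(line + "\n")


tmp = tempfile.mkdtemp()

# A caller that wants UTF-8 asks for it explicitly.
pkg_info = os.path.join(tmp, "PKG-INFO")
write_file(pkg_info, ["Author: Éric"], encoding="utf-8")

# A caller restricted to ASCII can say so and gets an error on bad input.
def_file = os.path.join(tmp, "example.def")
write_file(def_file, ["EXPORTS", "initexample"], encoding="ascii")
```

With this shape, use cases that require ASCII fail loudly on non-ASCII input instead of silently producing a file in another encoding.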

    When you say “patch *all* functions reading files”, I guess you mean all functions that read distutils files, i.e. MANIFEST and PKG-INFO.

    Tarek, is this a bug fix or a feature? Could it break third-party tools?

    @vstinner
    Member Author

    > • PKG-INFO (METADATA in distutils2), that already uses a trick to support
    >   Unicode, but your change would replace it in a better way;

    Which "trick"?

    > • MANIFEST, which with your fix would gain the ability to handle non-ASCII
    >   paths, which is a feature or a bugfix depending on your point of view;

    Wait. Non-encodable bytes are a separate issue. I would like to work on the
    first problem: distutils in Python3 uses open() without encoding argument and
    so the encoding depends on the user's locale. Said differently: if you produce
    a file with distutils on a computer, you cannot be sure that the file can be
    read with the same version of Python on another computer (if the locale encoding
    is different). Eg. Windows uses mbcs encoding whereas utf-8 is the preferred
    encoding on Linux.
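A small sketch of that portability problem, simulated with in-memory bytes rather than two real machines. cp1252 is used as a stand-in for a Western European Windows "mbcs" code page; that choice and the sample text are assumptions for the example.

```python
text = "Author: Éric Araujo\n"

# "Machine A": the file is written with a Windows code page.
data = text.encode("cp1252")

# "Machine B": a UTF-8 locale cannot decode those bytes.
try:
    data.decode("utf-8")
    readable = True
except UnicodeDecodeError:
    readable = False

# The opposite direction does not fail, but silently yields mojibake.
mojibake = text.encode("utf-8").decode("cp1252")
```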

    What is the encoding of the MANIFEST file?

    > • .def files, used by the compilers for the C linking step; I don’t know if
    >   it’s appropriate to allow UTF-8 there.

    I don't know these files.

    UTF-8 is a superset of ASCII. If you use utf-8 but only write ascii
    characters, your output file will be encoded in utf-8... but it will also
    be valid ascii. It's magical :-)

    > • record and .pth files created by the install command.

    .pth files contain directory names, which can be non-ASCII.

    > I agree that there is something to be fixed, but I don’t know if they can
    > be fixed in distutils. Unicode in PKG-INFO is unrelated to files, whereas
    > there are files or directories in MANIFEST, spec, record and .pth.

    You can use non-ASCII characters for other topics than filenames. Eg. in a
    description of a package :-)

    > If this is going to be fixed, write_file should not use UTF-8 unconditionally
    > but grow a keyword argument IMO, so that use cases requiring ASCII
    > continue to work.

    As written before, UTF-8 is a superset of ASCII. If you read a file using utf-8
    encoding, you will be able to read ascii files. But if you use utf-8 and write
    non-ascii characters, old versions of distutils using ascii or other encodings
    will not be able to read these files.
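Both halves of that argument can be checked with the built-in codecs. A minimal sketch with made-up sample strings:

```python
# ASCII-only text produces identical bytes under both codecs.
name = "setup.py"
same_bytes = name.encode("utf-8") == name.encode("ascii")

# A UTF-8 reader therefore accepts any pure-ASCII file.
read_back = name.encode("ascii").decode("utf-8")

# The reverse does not hold: an ASCII reader rejects non-ASCII UTF-8 data.
try:
    "café".encode("utf-8").decode("ascii")
    ascii_reader_ok = True
except UnicodeDecodeError:
    ascii_reader_ok = False
```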

    Anyway, I think that in most cases, all files only contain ASCII text. So it
    doesn't really matter.

    About the keyword solution: yes, it would be a smooth way to fix this issue.

    > When you say “patch *all* functions reading files”, I guess you mean all
    > functions that read distutils files, i.e. MANIFEST and PKG-INFO.

    I don't know distutils well enough to answer my own question.

    @vstinner
    Member Author

    I attached a patch to bpo-6011 to set the encoding to read the Makefile.

    @merwok
    Member

    merwok commented Sep 13, 2010

    [Toshio, I made you nosy for a question about RPM .spec files]

    >> - PKG-INFO (METADATA in distutils2), that already uses a trick to support
    >> Unicode, but your change would replace it in a better way;
    > Which "trick"?

    Some values are explicitly allowed to use Unicode and are encoded to UTF-8
    when queried.

    >> - MANIFEST, which with your fix would gain the ability to handle non-ASCII
    >> paths, which is a feature or a bugfix depending on your point of view;
    > Wait. Non-encodable bytes are a separate issue. I would like to work on the
    > first problem: distutils in Python3 uses open() without encoding argument and
    > so the encoding depends on the user's locale. Said differently: if you produce
    > a file with distutils on a computer, you cannot be sure that the file can be
    > read with the same version of Python on another computer (if the locale encoding
    > is different). Eg. Windows uses mbcs encoding whereas utf-8 is the preferred
    > encoding on Linux.
    >
    > What is the encoding of the MANIFEST file?

    Python’s default encoding, unfortunately. Try listing “napoléon” in a MANIFEST
    file and you’ll get a UnicodeEncodeError because the file wants ASCII.
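The failure described here can be reproduced on Python 3 by forcing the ASCII codec on a text file. A minimal sketch; the temporary path and the file entry are invented for the example:

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "MANIFEST")

# A text file opened with the ASCII codec refuses the "é".
try:
    with open(path, "w", encoding="ascii") as f:
        f.write("docs/napoléon.txt\n")
    error = None
except UnicodeEncodeError as exc:
    error = exc
```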

    >> - .def files, used by the compilers for the C linking step; I don’t know if
    >> it’s appropriate to allow UTF-8 there.
    >
    > I don't know these files.

    So we’ll have to get advice from someone well-versed in C linking.

    >> - RPM spec files, which use ASCII or UTF-8 according to
    >> http://en.opensuse.org/openSUSE:Specfile_guidelines#Specfile_Encoding but
    >> it’s not confirmed in
    >> http://www.rpm.org/max-rpm/s1-rpm-build-creating-spec-file.html (linked
    >> from the LSB site), so there’s no guarantee this works for all RPM
    >> platforms. This sort of platform-specific thing is the reason why RPM
    >> support has been removed in distutils2.
    > UTF-8 is a superset of ASCII. If you use utf-8 but only write ascii
    > characters, your output file will be encoded in utf-8... but it will also
    > be valid ascii. It's magical :-)

    I know that, but it does not answer the question: Is it okay for these files
    to use UTF-8?

    >> - record and .pth files created by the install command.
    > .pth files contain directory names, which can be non-ASCII.

    Agreed.

    >> I agree that there is something to be fixed, but I don’t know if they can
    >> be fixed in distutils. Unicode in PKG-INFO is unrelated to files, whereas
    >> there are files or directories in MANIFEST, spec, record and .pth.
    > You can use non-ASCII characters for other topics than filenames. Eg. in a
    > description of a package :-)

    See above: The description of a distribution is in UTF-8. Note that I don’t
    really understand my comment anymore; I now think that this should be fixed
    in distutils with the least intrusive change possible.

    >> If this is going to be fixed, write_file should not use UTF-8 unconditionally
    >> but grow a keyword argument IMO, so that use cases requiring ASCII
    >> continue to work.
    > As written before, UTF-8 is a superset of ASCII. If you read a file using utf-8
    > encoding, you will be able to read ascii files. But if you use utf-8 and write
    > non-ascii characters, old versions of distutils using ascii or other encodings
    > will not be able to read these files.

    That’s what I meant: Don’t make write_file always use UTF-8 since some use cases are restricted to ASCII.

    > About the keyword solution: yes, it would be a smooth way to fix this issue.

    Let’s do it. (Make sys.getdefaultencoding() its default value for compat.)
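For readers on Python 3, note that this suggested default and the encoding open() picks on its own are two different values. A small sketch to inspect both on a given machine; nothing here is part of distutils. On Python 2.7, which is also discussed in this thread, sys.getdefaultencoding() is normally "ascii" instead.

```python
import locale
import sys

# Encoding used for implicit str/bytes conversions; always "utf-8" on Python 3.
default_encoding = sys.getdefaultencoding()

# Encoding that open() uses in text mode when none is given (outside UTF-8 mode).
locale_encoding = locale.getpreferredencoding(False)

print(default_encoding, locale_encoding)
```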

    >> When you say “patch *all* functions reading files”, I guess you mean all
    >> functions that read distutils files, i.e. MANIFEST and PKG-INFO.
    > I don't know distutils well enough to answer my own question.

    You patch writing files, I’ll handle reading files :)

    @merwok
    Member

    merwok commented Sep 13, 2010

    Note that any change requires a test.

    @merwok merwok assigned merwok and unassigned tarekziade Sep 13, 2010
    @merwok merwok added the type-bug An unexpected behavior, bug, or error label Sep 13, 2010
    @abadger
    Mannequin

    abadger mannequin commented Sep 13, 2010

    >> - RPM spec files, which use ASCII or UTF-8 according to
    >> http://en.opensuse.org/openSUSE:Specfile_guidelines#Specfile_Encoding but
    >> it’s not confirmed in
    >> http://www.rpm.org/max-rpm/s1-rpm-build-creating-spec-file.html (linked
    >> from the LSB site)
    > UTF-8 is a superset of ASCII. If you use utf-8 but only write ascii
    > characters, your output file will be written to utf-8... but it will be also
    > encoded to ascii. It's magical :-)

    I know that, but it does not answer the question: Is it okay for these files
    to use UTF-8?

    rpm spec files are encoding agnostic similar to POSIX filesystems. This causes no end of troubles for people writing python code that deals with python of course, as they cannot rely on the bytes that they are dealing with from one package to another to have the same encoding (Remember that things like dependency solvers have to compare the information from multiple packages to make their decisions).

    Individual distributions will have different policies about encoding and the use of unicode in spec files to try and mitigate the problems. For instance, Fedora specifies utf-8 in the spec files and additionally specifies that package names must be ascii. (So if there's a package name: python-café, we would likely transcribe it as python-cafe when we made a package for it).
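One way such a transcription could be automated is sketched below. This is only an illustration of the python-café to python-cafe example; to_ascii_name is a hypothetical helper, not a Fedora tool or policy, and it simply drops any character that has no ASCII decomposition.

```python
import unicodedata


def to_ascii_name(name):
    """Hypothetical helper: strip diacritics to get an ASCII-only name."""
    # NFKD splits "é" into "e" followed by a combining acute accent.
    decomposed = unicodedata.normalize("NFKD", name)
    # Dropping everything outside ASCII removes the combining marks.
    return decomposed.encode("ascii", "ignore").decode("ascii")


ascii_name = to_ascii_name("python-café")
```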

    utf-8 is a good default for locales on POSIX systems so it's a good default for encoding spec files but I know there's some people out there who make their own packages that aren't utf-8. I haven't checked but I also wouldn't be surprised if some Asian countries (where the bytes-per-character with utf-8 is high) have local distributions that use non-utf-8 encoding as well. Whether either of these use cases needs to be catered to in distutils (when the support is going away in distutils2) I'll leave to someone else to decide. My personal gut instinct is no but I'm not one of the people using a non-utf-8 locale.

    @vstinner
    Member Author

    vstinner commented Nov 7, 2010

    This issue might be split into multiple issues: one issue per file type (eg. Makefile, RPM spec file, etc.).

    @hagen
    Mannequin

    hagen mannequin commented Nov 15, 2010

    Created bpo-10419 for the encoding problem in "build_scripts".

    @vstinner
    Member Author

    I started to patch packaging to fix this issue in the packaging module: issue bpo-12112. We might leave distutils unchanged and improve the packaging module instead (because previous experiments proved that distutils should not be touched or it breaks random stuff!).

    @merwok
    Member

    merwok commented May 26, 2011

    Definitely. We can fix real bugs in distutils, but sometimes it’s best to avoid disruptive changes, leave distutils with its buggy behavior, and let the packaging module have the best behavior.

    @malemburg
    Member

    Éric Araujo wrote:

    > Definitely. We can fix real bugs in distutils, but sometimes it’s best to avoid disruptive changes, leave distutils with its buggy behavior, and let the packaging module have the best behavior.

    This is a real bug, since we agreed long ago that distutils should
    read and write files using the UTF-8 encoding.

    @mgorny
    Mannequin

    mgorny mannequin commented Jun 22, 2011

    Now that installing scripts with unicode characters was fixed, shall I open a separate bug for writing egg files with utf8 chars in author name?

    @Arfrever
    Mannequin

    Arfrever mannequin commented Jun 22, 2011

    Please file a separate issue.

    @vstinner
    Member Author

    pkginfo_utf8.patch: distutils uses UTF-8 to write PKG-INFO and .egg-info, instead of the locale encoding. It should be applied to 2.7, 3.2 and 3.3.

    packaging_pkginfo_utf8.patch: packaging tests use UTF-8 to write PKG-INFO files, instead of the locale encoding (cosmetic change, the file content is an empty string :-)). It should only be applied to 3.3 (packaging has been introduced in Python 3.3).

    @merwok
    Member

    merwok commented Jun 30, 2011

    > pkginfo_utf8.patch: distutils uses UTF-8 to write PKG-INFO and
    > .egg-info, instead of the locale encoding. It should be applied to
    > 2.7, 3.2 and 3.3.

    Okay. I guess you’ll use codecs.open in 2.7; please make sure there is no bootstrapping issue for the build of CPython itself.

    It would be a good thing to have non-ASCII in the PKG-INFO/METADATA files in the tests; it’s how we caught bpo-12320.

    @vstinner
    Member Author

    > Okay. I guess you’ll use codecs.open in 2.7

    Oh, Python 2.7... DistributionMetadata of distutils encodes most values to byte strings (get_xxx() methods call self._encode_field). It would be possible to use codecs.open(), but a Unicode file expects Unicode strings. The problem is that the user may provide arbitrary byte strings, I mean strings not encoded to PKG_INFO_ENCODING. Even if such strings are *wrong* (not correctly encoded), is it a good idea to be more strict in a minor version (2.7.x)?

    I don't want to be responsible for such a tricky change, I prefer to leave distutils unchanged in Python 2.7 (at least for PKG-INFO).

    > please make sure there is no bootstrapping issue
    > for the build of CPython itself.

    I checked, there is no bootstrap issue.

    @merwok
    Member

    merwok commented Jul 2, 2011

    >> Okay. I guess you’ll use codecs.open in 2.7
    > Oh, Python 2.7... DistributionMetadata of distutils encodes most
    > values to byte strings (get_xxx() methods call self._encode_field).

    I forgot that. No change is needed in 2.7.

    > I checked, there is no bootstrap issue.

    I was talking about bootstrapping if a change to use codecs was made.

    @mgorny
    Mannequin

    mgorny mannequin commented Aug 2, 2011

    Ping. What's the progress on this? Will this ever be fixed?

    @vstinner
    Member Author

    vstinner commented Aug 8, 2011

    > Ping. What's the progress on this? Will this ever be fixed?

    Some functions have been fixed in the new packaging module, but not in
    distutils yet.

    @python-dev
    Mannequin

    python-dev mannequin commented Sep 5, 2011

    New changeset fb4d2e6d393e by Victor Stinner in branch '3.2':
    Issue bpo-9561: distutils now reads and writes egg-info files using UTF-8
    http://hg.python.org/cpython/rev/fb4d2e6d393e

    New changeset 3c080bf75342 by Victor Stinner in branch 'default':
    Merge 3.2: Issue bpo-9561: distutils now reads and writes egg-info files using UTF-8
    http://hg.python.org/cpython/rev/3c080bf75342

    @vstinner
    Member Author

    vstinner commented Sep 5, 2011

    I applied pkginfo_utf8.patch to Python 3.2 and 3.3. Python 2.7 is not affected, it does already encode Unicode to UTF-8.

    @python-dev
    Mannequin

    python-dev mannequin commented Sep 5, 2011

    New changeset 56ab3257ca13 by Victor Stinner in branch 'default':
    Issue bpo-9561: packaging now writes egg-info files using UTF-8
    http://hg.python.org/cpython/rev/56ab3257ca13

    @merwok
    Member

    merwok commented Sep 6, 2011

    > I applied pkginfo_utf8.patch to Python 3.2 and 3.3.

    If you apply patches to distutils, please add tests for the fixed behavior. (Sorry if I wasn’t reactive on this one.)

    @merwok
    Member

    merwok commented Sep 19, 2011

    I backported your last change to distutils2 as f5a74b1f9473.

    @vstinner vstinner closed this as completed Jan 3, 2013
    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022