distutils: set encoding to utf-8 for input and output files #53770
Comments
While working on bpo-9425 (support non-ASCII characters in the Python directory name with an ASCII locale), I wrote a patch for distutils.file_util(): set encoding to utf-8 and errors to surrogateescape. See the patch with comments at: (the patch is not enough, it should also patch *all* functions reading files)
I discussed with Tarek, who told me that it is documented that distutils files have to be utf-8. I didn't find the documentation.
I checked read_manifest() in the sdist command: in Python2 and Python3, it uses the open(name) syntax. It means that Python2 uses the binary API (bytes), whereas Python3 uses the text API (unicode characters) and relies on the open() (TextIOWrapper) heuristic to *guess* the file encoding. I think that it would be better to specify the encoding in Python3, and maybe use the text API in Python2.
Anyway, before going further (work on patches), I would like the approval of the distutils maintainer(s). |
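For illustration, here is a minimal Python 3 sketch of the idea in this comment: open text files with an explicit `encoding="utf-8"` and `errors="surrogateescape"` instead of relying on the locale. The helper names `read_lines` and `write_lines` and the sample file content are hypothetical; this is not the attached patch.

```python
import os
import tempfile


def read_lines(path, encoding="utf-8", errors="surrogateescape"):
    """Read a text file with an explicit encoding (hypothetical helper)."""
    # newline="" disables newline translation so bytes round-trip on any OS.
    with open(path, "r", encoding=encoding, errors=errors, newline="") as f:
        return [line.rstrip("\n") for line in f]


def write_lines(path, lines, encoding="utf-8", errors="surrogateescape"):
    """Write lines with an explicit encoding (hypothetical helper)."""
    with open(path, "w", encoding=encoding, errors=errors, newline="") as f:
        for line in lines:
            f.write(line + "\n")


# A file holding valid UTF-8 ("café") plus one byte (0xff) that is not valid UTF-8.
raw = b"caf\xc3\xa9\nbad\xff\n"
tmpdir = tempfile.mkdtemp()
src = os.path.join(tmpdir, "MANIFEST")
with open(src, "wb") as f:
    f.write(raw)

# The undecodable byte becomes a lone surrogate instead of raising an error.
lines = read_lines(src)

# Writing the same strings back restores the original bytes.
dst = os.path.join(tmpdir, "MANIFEST.out")
write_lines(dst, lines)
with open(dst, "rb") as f:
    roundtrip = f.read()
```

With surrogateescape, bytes that are not valid UTF-8 survive a read/write cycle unchanged, which is why it pairs well with file names of unknown encoding.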
There are different kinds of files created by write_file:
I agree that there is something to be fixed, but I don’t know if they can be fixed in distutils. Unicode in PKG-INFO is unrelated to files, whereas there are files or directories in MANIFEST, spec, record and .pth. If this is going to be fixed, write_file should not use UTF-8 unconditionally but grow a keyword argument IMO, so that use cases requiring ASCII continue to work. When you say “patch *all* functions reading files”, I guess you mean all functions that read distutils files, i.e. MANIFEST and PKG-INFO. Tarek, is this a bug fix or a feature? Could it break third-party tools? |
Which "trick"?
Wait. Non-encodable bytes are a separate issue. I would like to work on the What is the encoding of the MANIFEST file?
I don't know these files.
UTF-8 is a superset of ASCII. If you use utf-8 but only write ascii
.pth files contain directory names, which can be non-ASCII.
You can use non-ASCII characters for other topics than filenames. Eg. in a
As written before, UTF-8 is a superset of ASCII. If you read a file using utf-8 Anyway, I think that in most cases, all files only contain ASCII text. So it About the keyword solution: yes, it would be a smooth way to fix this issue.
I don't know distutils well enough to answer my own question. |
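The "UTF-8 is a superset of ASCII" argument can be checked directly. The sketch below uses made-up sample strings: ASCII-only text produces identical bytes under both codecs, while non-ASCII text written as UTF-8 cannot be read back by a strict ASCII consumer.

```python
# ASCII-only content: the UTF-8 and ASCII encodings yield the same bytes.
ascii_text = "include README.txt"
same_bytes = ascii_text.encode("utf-8") == ascii_text.encode("ascii")

# Non-ASCII content: valid UTF-8, but a strict ASCII reader rejects it.
non_ascii_text = "napol\u00e9on"  # "napoléon"
utf8_bytes = non_ascii_text.encode("utf-8")
try:
    utf8_bytes.decode("ascii")
    ascii_reader_ok = True
except UnicodeDecodeError:
    ascii_reader_ok = False
```

So switching a writer to UTF-8 changes nothing for files that only ever contain ASCII. The compatibility question only arises for tools that must read ASCII and are handed non-ASCII content.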
I attached a patch to bpo-6011 to set the encoding to read the Makefile. |
[Toshio, I made you nosy for a question about RPM .spec files]
Some values are explicitly allowed to use Unicode and are encoded to UTF-8
Python’s default encoding, unfortunately. Try listing “napoléon” in a MANIFEST
So we’ll have to get advice from someone well-versed in C linking.
I know that, but it does not answer the question: Is it okay for these files
Agreed.
See above: The description of a distribution is in UTF-8. Note that I don’t
That’s what I meant: Don’t make write_file always use UTF-8 since some use cases are restricted to ASCII.
Let’s do it. (Make sys.getdefaultencoding() its default value for compat.)
You patch writing files, I’ll handle reading files :) |
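A rough sketch of the keyword-argument approach discussed above, written for Python 3. The `encoding` and `errors` parameters are assumptions for illustration and are not part of the real `distutils.file_util.write_file(filename, contents)` signature. The default follows the suggestion of falling back to `sys.getdefaultencoding()`.

```python
import os
import sys
import tempfile


def write_file(filename, contents, encoding=None, errors="strict"):
    """Write a sequence of lines to filename.

    `encoding` and `errors` are hypothetical additions; callers that
    need ASCII-only output can pass encoding="ascii" explicitly.
    """
    if encoding is None:
        # Default suggested in the thread, for compatibility.
        encoding = sys.getdefaultencoding()
    with open(filename, "w", encoding=encoding, errors=errors, newline="\n") as f:
        for line in contents:
            f.write(line + "\n")


tmpdir = tempfile.mkdtemp()

# A metadata-like file where UTF-8 is wanted.
utf8_path = os.path.join(tmpdir, "PKG-INFO")
write_file(utf8_path, ["Author: \u00c9ric"], encoding="utf-8")
with open(utf8_path, "rb") as f:
    utf8_bytes = f.read()

# A use case restricted to ASCII fails loudly instead of writing bad data.
ascii_path = os.path.join(tmpdir, "ascii-only.txt")
try:
    write_file(ascii_path, ["Name: python-caf\u00e9"], encoding="ascii")
    ascii_rejected = False
except UnicodeEncodeError:
    ascii_rejected = True
```

Keeping the encoding as a parameter lets each caller state its own constraint, rather than having one hard-coded choice for every file type.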
Note that any change requires a test. |
rpm spec files are encoding agnostic, similar to POSIX filesystems. This causes no end of troubles for people writing python code that deals with python of course, as they cannot rely on the bytes that they are dealing with from one package to another to have the same encoding (remember that things like dependency solvers have to compare the information from multiple packages to make their decisions). Individual distributions will have different policies about encoding and the use of unicode in spec files to try and mitigate the problems. For instance, Fedora specifies utf-8 in the spec files and additionally specifies that package names must be ascii. (So if there's a package named python-café, we would likely transcribe it as python-cafe when we made a package for it.) utf-8 is a good default for locales on POSIX systems, so it's a good default for encoding spec files, but I know there are some people out there who make their own packages that aren't utf-8. I haven't checked, but I also wouldn't be surprised if some Asian countries (where the bytes-per-character with utf-8 is high) have local distributions that use non-utf-8 encodings as well. Whether either of these use cases needs to be catered to in distutils (when the support is going away in distutils2) I'll leave to someone else to decide. My personal gut instinct is no, but I'm not one of the people using a non-utf-8 locale. |
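As an illustration of the python-café to python-cafe transcription mentioned above, one possible approach is to decompose accented characters and drop the combining marks. The helper `to_ascii_name` is hypothetical and is not how Fedora tooling is known to do it. It also silently drops any character with no ASCII decomposition.

```python
import unicodedata


def to_ascii_name(name):
    """Strip accents to get an ASCII-only name (hypothetical helper)."""
    # NFKD splits "é" into "e" plus a combining acute accent (U+0301).
    decomposed = unicodedata.normalize("NFKD", name)
    # Dropping non-ASCII code points removes the combining marks.
    return decomposed.encode("ascii", "ignore").decode("ascii")


ascii_name = to_ascii_name("python-caf\u00e9")
```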
This issue might be split into multiple issues: one issue per file type (e.g. Makefile, RPM spec file, etc.). |
Created bpo-10419 for the encoding problem in "build_scripts". |
I started to patch packaging to fix this issue in the packaging module: issue bpo-12112. We might leave distutils unchanged and improve the packaging module instead (because previous experiments proved that distutils should not be touched or it breaks random things!). |
Definitely. We can fix real bugs in distutils, but sometimes it’s best to avoid disruptive changes, leave distutils with its buggy behavior, and let the packaging module have the best behavior. |
Éric Araujo wrote:
This is a real bug, since we agreed long ago that distutils should |
Now that installing scripts with unicode characters was fixed, shall I open a separate bug for writing egg files with utf8 chars in author name? |
Please file a separate issue. |
pkginfo_utf8.patch: distutils uses UTF-8 to write PKG-INFO and .egg-info, instead of the locale encoding. It should be applied to 2.7, 3.2 and 3.3. packaging_pkginfo_utf8.patch: packaging tests use UTF-8 to write PKG-INFO files, instead of the locale encoding (cosmetic change, the file content is an empty string :-)). It should only be applied to 3.3 (packaging has been introduced in Python 3.3). |
Okay. I guess you’ll use codecs.open in 2.7; please make sure there is no bootstrapping issue for the build of CPython itself. It would be a good thing to have non-ASCII in the PKG-INFO/METADATA files in the tests; it’s how we caught bpo-12320. |
Oh, Python 2.7... DistributionMetadata of distutils encodes most values to byte strings (the get_xxx() methods call self._encode_field). It would be possible to use codecs.open(), but a Unicode file expects Unicode strings. The problem is that the user may provide arbitrary byte strings, I mean strings not encoded to PKG_INFO_ENCODING. Even if such strings are *wrong* (not correctly encoded), is it a good idea to be more strict in a minor version (2.7.x)? I don't want to be responsible for such a tricky change, so I prefer to leave distutils unchanged in Python 2.7 (at least for PKG-INFO).
I checked, there is no bootstrap issue. |
|
Ping. What's the progress on this? Will this ever be fixed? |
Some functions have been fixed in the new packaging module, but not in |
New changeset fb4d2e6d393e by Victor Stinner in branch '3.2': New changeset 3c080bf75342 by Victor Stinner in branch 'default': |
I applied pkginfo_utf8.patch to Python 3.2 and 3.3. Python 2.7 is not affected, it does already encode Unicode to UTF-8. |
New changeset 56ab3257ca13 by Victor Stinner in branch 'default': |
If you apply patches to distutils, please add tests for the fixed behavior. (Sorry if I wasn’t reactive on this one.) |
I backported your last change to distutils2 as f5a74b1f9473. |