classification
Title: distutils sdist command fails to create MANIFEST if any filenames are undecodable
Type: enhancement Stage: resolved
Components: Distutils Versions: Python 3.9, Python 3.8
process
Status: closed Resolution: out of date
Dependencies: Superseder:
Assigned To: Nosy List: a.badger, dstufft, eric.araujo, jaraco, steve.dower
Priority: normal Keywords: patch

Created on 2019-05-21 17:33 by a.badger, last changed 2021-02-03 18:10 by steve.dower. This issue is now closed.

Files
File name Uploaded Description Edit
test-case.tar.gz a.badger, 2019-05-21 23:32
Pull Requests
URL Status Linked Edit
PR 13467 closed a.badger, 2019-05-21 17:41
Messages (13)
msg343074 - (view) Author: Toshio Kuratomi (a.badger) * Date: 2019-05-21 17:33
An sdist may contain files whose names are undecodable in the current locale.  For instance, the sdist might include some files for testing whose filenames are undecodable because that's the format of the input for that application.

Currently, trying to create the sdist fails with output similar to this:

Traceback (most recent call last):
  File "setup.py", line 330, in <module>
    main()
  File "setup.py", line 325, in main
    setup(**setup_params)
  File "/home/badger/.local/lib/python3.5/site-packages/setuptools/__init__.py", line 145, in setup
    return distutils.core.setup(**attrs)
  File "/usr/lib/python3.5/distutils/core.py", line 148, in setup
    dist.run_commands()
  File "/usr/lib/python3.5/distutils/dist.py", line 955, in run_commands
    self.run_command(cmd)
  File "/usr/lib/python3.5/distutils/dist.py", line 974, in run_command
    cmd_obj.run()
  File "setup.py", line 137, in run
    SDist.run(self)
  File "/usr/lib/python3.5/distutils/command/sdist.py", line 158, in run
    self.get_file_list()
  File "/usr/lib/python3.5/distutils/command/sdist.py", line 214, in get_file_list
    self.write_manifest()
  File "/usr/lib/python3.5/distutils/command/sdist.py", line 362, in write_manifest
    "writing manifest file '%s'" % self.manifest)
  File "/usr/lib/python3.5/distutils/cmd.py", line 336, in execute
    util.execute(func, args, msg, dry_run=self.dry_run)
  File "/usr/lib/python3.5/distutils/util.py", line 301, in execute
    func(*args)
  File "/usr/lib/python3.5/distutils/file_util.py", line 236, in write_file
    f.write(line + "\n")
UnicodeEncodeError: 'ascii' codec can't encode characters in position 45-46: ordinal not in range(128)

(I replicated the failure case by setting my locale to POSIX and using a standard utf-8 filename but this also applies to having a filename that is not actually text in any locale... as I said, filenames used for testing can run the gamut of odd choices).

This traceback is interesting as it occurs during writing of the MANIFEST.  That shows that the undecodable file is read in correctly.  It's only when writing the file that it fails.  Some further debugging showed me that the filename is read in using the surrogateescape error handler.  So we can round trip the filename by using the surrogateescase error handler when writing it out.

I tested making the following change:

-    f = open(filename, "w")
+    f = open(filename, "w", errors="surrogateescape")

and sure enough, the sdist is now created correctly.

I'll submit a PR to fix this.
msg343101 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-05-21 21:40
The traceback comes from Python 3.5. Python 3.7 now uses UTF-8 when the LC_CTYPE locale is "C" or "POSIX", thanks to C locale coercion (PEP 538) and UTF-8 Mode (PEP 540):
https://vstinner.github.io/posix-locale.html
https://vstinner.github.io/python37-new-utf8-mode.html

Can you please retry your test using Python 3.7?

I would prefer to not start to use errors="surrogateescape" everywhere, especially if the issue is already fixed.

errors="surrogateescape" basically means: "I want to produce a file that other people will unable to read...

---

def write_file (filename, contents):
    """Create a file with the specified name and write 'contents' (a
    sequence of strings without line terminators) to it.
    """
    f = open(filename, "w")
    ...

For me, the first obvious bug is that the encoding is not specified. I don't see how an application on a different computer is supposed to read a text file if the encoding is not well defined...
msg343102 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-05-21 21:41
IMHO MANIFEST should always be encoded to UTF-8 on all platforms. It means always *decode* it from UTF-8.

For backward compatibility, we might first try to decode it from UTF-8, but falls back on the locale encoding if the UTF-8 decoder raises a UnicodeDecodeError.
msg343107 - (view) Author: Toshio Kuratomi (a.badger) * Date: 2019-05-21 22:21
I like the idea of defaulting to UTF-8 (although I think you'll have to build consensus as to whether that's the right thing to do here) but it won't handle the use case here.  There's a need to handle files which are undecodable and encoding to utf-8 won't fix that.
msg343115 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-05-21 22:38
> There's a need to handle files which are undecodable and encoding to utf-8 won't fix that.

Are you talking about the initial issue, get_file_list() which fails on writing into the MANIFEST file?

For me, a package should be "portable": it should work on most if not all platforms, and so use common characters which are encodable by most locale encodings, no?

What is your use case? Package something on Linux and then install it on Linux? What is your locale?

Did you try Python 3.7 or newer which speaks UTF-8 when the locale is not set?
msg343123 - (view) Author: Toshio Kuratomi (a.badger) * Date: 2019-05-21 22:58
From my initial description:

"An sdist may contain files whose names are undecodable in the current locale.  For instance, the sdist might include some files for testing whose filenames are undecodable because that's the format of the input for that application."
msg343126 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-05-21 23:18
√Čric Araujo asked me to review https://github.com/python/cpython/pull/13467. I gave my feedback. I now remove myself from this issue, since I'm not sure that I would like to be involved in a packaging module (distutils) :-) I let others to take a decision on these topics ;-)
msg343127 - (view) Author: Toshio Kuratomi (a.badger) * Date: 2019-05-21 23:32
Uploading a minimal test case.

$ tar -xzvf test-case.tar.gz
$ python3.7 setup.py sdist
running sdist
running check
warning: sdist: standard file not found: should have one of README, README.txt, README.rst

reading manifest template 'MANIFEST.in'
writing manifest file 'MANIFEST'
Traceback (most recent call last):
  File "setup.py", line 27, in <module>
    packages=['hello'],
  File "/usr/lib64/python3.7/distutils/core.py", line 148, in setup
    dist.run_commands()
  File "/usr/lib64/python3.7/distutils/dist.py", line 966, in run_commands
    self.run_command(cmd)
  File "/usr/lib64/python3.7/distutils/dist.py", line 985, in run_command
    cmd_obj.run()
  File "/usr/lib64/python3.7/distutils/command/sdist.py", line 152, in run
    self.get_file_list()
  File "/usr/lib64/python3.7/distutils/command/sdist.py", line 208, in get_file_list
    self.write_manifest()
  File "/usr/lib64/python3.7/distutils/command/sdist.py", line 390, in write_manifest
    "writing manifest file '%s'" % self.manifest)
  File "/usr/lib64/python3.7/distutils/cmd.py", line 335, in execute
    util.execute(func, args, msg, dry_run=self.dry_run)
  File "/usr/lib64/python3.7/distutils/util.py", line 291, in execute
    func(*args)
  File "/usr/lib64/python3.7/distutils/file_util.py", line 236, in write_file
    f.write(line + "\n")
UnicodeEncodeError: 'utf-8' codec can't encode character '\udcff' in position 11: surrogates not allowed

With PR applied:

$ ../../python setup.py sdist
running sdist
running check
warning: sdist: standard file not found: should have one of README, README.txt, README.rst

reading manifest template 'MANIFEST.in'
writing manifest file 'MANIFEST'
creating hello-1.0
creating hello-1.0/hello
creating hello-1.0/tests
creating hello-1.0/tests/data
making hard links in hello-1.0...
hard linking setup.py -> hello-1.0
hard linking hello/__init__.py -> hello-1.0/hello
hard linking tests/test_cases.py -> hello-1.0/tests
hard linking tests/data/1.bin -> hello-1.0/tests/data
hard linking tests/data/\udcff.bin -> hello-1.0/tests/data
Creating tar archive
removing 'hello-1.0' (and everything under it)

Making this minimal test case, though, I found that there's another error somewhere when MANIFEST has already been created (ie: the patched version works for the initial generation of MANIFEST but it doesn't work to *regenerate* the MANIFEST).  Looking into that now.
msg343129 - (view) Author: Toshio Kuratomi (a.badger) * Date: 2019-05-21 23:39
Okay, pushed a fix for regenerating the MANIFEST as well.
msg343132 - (view) Author: Toshio Kuratomi (a.badger) * Date: 2019-05-21 23:54
Are the distutils unittests disabled?  And if so is there a reason?  I was looking to add test cases to my PR and found that I couldn't get them (or indeed any distutils unittests) to run when trying to only target the distutils unittests.

From looking at the code my guess is that the Python test suite was ported to use the load_test protocol sometime after Python-3.2 but distutils was missed.  I can submit a PR to change that unless there's a reason it is the way it is.
msg343135 - (view) Author: Toshio Kuratomi (a.badger) * Date: 2019-05-22 00:14
Figured out the answer to my last question while looking into fixing it.  The devguide documents both running tests via regrtest and running them via unittest test discovery.  regrtest works:

  ./python -m test -v distutils.test

But unittest doesn't:
  ./python -m unittest -v test.test_distutils
  ./python -m unittest -v distutils.test.test_file_utils
  # etc

I'll submit a separate PR to get that working.
msg351920 - (view) Author: Jason R. Coombs (jaraco) * (Python committer) Date: 2019-09-11 15:03
Thanks all for investigating and discussing this issue. I'm reviewing the PR, and I can't help but wonder, does this same issue exist when building with setuptools? Do you care about older Python versions? Would addressing these issues in setuptools address your needs?

Additionally, the main use-case you describe is someone wishes to create an sdist referencing invalid filenames. That seems like an undesirable case. Could such a project not instead generate these invalid filenames on demand rather than materializing them in a distribution package?

I'm also concerned that by altering `write_file`, we're applying a change for a specific use-case to a utility probably shared by several use-cases.

Overall, I'm leaning -1 on this right now. Can you convince me this change is the best thing for the system?
msg386273 - (view) Author: Steve Dower (steve.dower) * (Python committer) Date: 2021-02-03 18:10
Distutils is now deprecated (see PEP 632) and all tagged issues are being closed. From now until removal, only release blocking issues will be considered for distutils.

If this issue does not relate to distutils, please remove the component and reopen it. If you believe it still requires a fix, most likely the issue should be re-reported at https://github.com/pypa/setuptools
History
Date User Action Args
2021-02-03 18:10:28steve.dowersetstatus: open -> closed

nosy: + steve.dower
messages: + msg386273

resolution: out of date
stage: patch review -> resolved
2019-09-11 15:03:36jaracosetversions: - Python 3.5, Python 3.6, Python 3.7
nosy: + jaraco

messages: + msg351920

type: enhancement
2019-05-22 00:14:24a.badgersetmessages: + msg343135
2019-05-22 00:03:03vstinnersetnosy: - vstinner
2019-05-21 23:54:58a.badgersetnosy: + vstinner
messages: + msg343132
2019-05-21 23:43:21vstinnersetnosy: - vstinner
2019-05-21 23:39:22a.badgersetmessages: + msg343129
2019-05-21 23:32:03a.badgersetfiles: + test-case.tar.gz

messages: + msg343127
2019-05-21 23:18:16vstinnersetnosy: vstinner, eric.araujo, a.badger, dstufft
messages: + msg343126
2019-05-21 22:58:48a.badgersetmessages: + msg343123
2019-05-21 22:38:13vstinnersetmessages: + msg343115
2019-05-21 22:21:13a.badgersetmessages: + msg343107
2019-05-21 21:41:25vstinnersetmessages: + msg343102
2019-05-21 21:40:03vstinnersetnosy: + vstinner
messages: + msg343101
2019-05-21 17:41:00a.badgersetkeywords: + patch
stage: patch review
pull_requests: + pull_request13378
2019-05-21 17:33:06a.badgercreate