"python -m pydoc -w" fails in nondecodable directory #69371

serhiy-storchaka · 2015-09-19T21:07:25Z

BPO	25184
Nosy	@orsenthil, @pitrou, @vstinner, @vadmium, @serhiy-storchaka
Files	pydoc_undecodabple_path.patch; pydoc_undecodabple_path_2.patch: Using pathlib; pydoc_iri.patch: IRI file: link

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

GitHub fields:

assignee = None
closed_at = None
created_at = <Date 2015-09-19.21:07:25.093>
labels = ['3.7', '3.8', 'type-bug', 'library']
title = '"python -m pydoc -w" fails in nondecodable directory'
updated_at = <Date 2018-12-20.12:28:32.594>
user = 'https://github.com/serhiy-storchaka'

bugs.python.org fields:

activity = <Date 2018-12-20.12:28:32.594>
actor = 'serhiy.storchaka'
assignee = 'none'
closed = False
closed_date = None
closer = None
components = ['Library (Lib)']
creation = <Date 2015-09-19.21:07:25.093>
creator = 'serhiy.storchaka'
dependencies = []
files = ['40534', '40600', '40636']
hgrepos = []
issue_num = 25184
keywords = ['patch']
message_count = 9.0
messages = ['251117', '251212', '251213', '251226', '251271', '251721', '251731', '251985', '332221']
nosy_count = 6.0
nosy_names = ['orsenthil', 'pitrou', 'vstinner', 'Arfrever', 'martin.panter', 'serhiy.storchaka']
pr_nums = []
priority = 'normal'
resolution = None
stage = 'patch review'
status = 'open'
superseder = None
type = 'behavior'
url = 'https://bugs.python.org/issue25184'
versions = ['Python 3.7', 'Python 3.8']

serhiy-storchaka · 2015-09-19T21:07:25Z

$ pwd
/home/serhiy/py/cpy�thon-3.5
$ ./python -m pydoc -w pydoc
Traceback (most recent call last):
  File "/home/serhiy/py/cpy\udcffthon-3.5/Lib/runpy.py", line 170, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/serhiy/py/cpy\udcffthon-3.5/Lib/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/serhiy/py/cpy\udcffthon-3.5/Lib/pydoc.py", line 2648, in <module>
    cli()
  File "/home/serhiy/py/cpy\udcffthon-3.5/Lib/pydoc.py", line 2611, in cli
    writedoc(arg)
  File "/home/serhiy/py/cpy\udcffthon-3.5/Lib/pydoc.py", line 1642, in writedoc
    page = html.page(describe(object), html.document(object, name))
  File "/home/serhiy/py/cpy\udcffthon-3.5/Lib/pydoc.py", line 370, in document
    if inspect.ismodule(object): return self.docmodule(*args)
  File "/home/serhiy/py/cpy\udcffthon-3.5/Lib/pydoc.py", line 651, in docmodule
    url = urllib.parse.quote(path)
  File "/home/serhiy/py/cpy\udcffthon-3.5/Lib/urllib/parse.py", line 706, in quote
    string = string.encode(encoding, errors)
UnicodeEncodeError: 'utf-8' codec can't encode character '\udcff' in position 19: surrogates not allowed
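The failure can be reproduced without pydoc. The following is a minimal sketch, assuming the path string below stands in for what Python reports as the module's `__file__` when the directory name contains the undecodable byte 0xFF (surrogate-escaped to `'\udcff'`):

```python
import urllib.parse

# Assumed stand-in for the module path: byte 0xFF surrogate-escaped to U+DCFF.
path = "/home/serhiy/py/cpy\udcffthon-3.5/Lib/pydoc.py"

error = None
try:
    # quote() encodes str input as strict UTF-8 by default,
    # and lone surrogates are not encodable in strict UTF-8.
    urllib.parse.quote(path)
except UnicodeEncodeError as exc:
    error = exc

print(error)
```

The exception points at index 19, the position of the escaped byte in the path, which matches the traceback.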

vadmium · 2015-09-21T07:50:33Z

Seems to be caused by the Python directory being non-decodable; the current working directory does not matter. What is going on is “pydoc” is trying to make a link to a module’s source code, such as

<a href="file:/usr/lib/python3.4/pydoc.py">/usr/lib/python3.4/pydoc.py</a>

For non-decodable paths, the following would work in Firefox:

<a href="file:/home/serhiy/py/cpy%FFthon-3.5/Lib/pydoc.py">/home/serhiy/py/cpy�thon-3.5/Lib/pydoc.py</a>

but since URL percent encoding already uses UTF-8, this scheme isn’t foolproof (e.g. a UTF-8 sequence when the locale is ASCII would be ambiguous). A simpler and more consistent way forward would be to substitute something like the following: decode the surrogate-escaped bytes with the “replace” error handler and suppress the HTML link:

/home/serhiy/py/cpy�thon-3.5/Lib/pydoc.py (invalid filename encoding)
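The fallback proposed in this comment could look roughly like the sketch below. The helper names `display_name` and `source_link` are hypothetical and are not part of pydoc. The sketch also assumes a UTF-8 filesystem encoding and spells the encoding out explicitly instead of calling `os.fsencode()`.

```python
import html
import urllib.parse


def display_name(path):
    """Return (text, is_clean) for a path. Hypothetical helper."""
    try:
        path.encode("utf-8")
    except UnicodeEncodeError:
        # Recover the original bytes, then decode them again with the
        # "replace" handler so each undecodable byte becomes U+FFFD.
        raw = path.encode("utf-8", "surrogateescape")
        text = raw.decode("utf-8", "replace")
        return text + " (invalid filename encoding)", False
    return path, True


def source_link(path):
    """Build the source-file link, or plain text if the path is undecodable."""
    text, is_clean = display_name(path)
    if not is_clean:
        # Suppress the link entirely for troublesome paths.
        return html.escape(text)
    url = urllib.parse.quote(path)
    return '<a href="file:%s">%s</a>' % (url, html.escape(text))


good = source_link("/usr/lib/python3.4/pydoc.py")
bad = source_link("/home/serhiy/py/cpy\udcffthon-3.5/Lib/pydoc.py")
```

With this approach the generated page never contains a lone surrogate, at the cost of losing the clickable link for such paths.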

vstinner · 2015-09-21T08:00:28Z

Technically, I think that it's possible to put bytes in a URL using %HH format. I didn't check if we can retrieve the "raw bytes".

serhiy-storchaka · 2015-09-21T13:37:35Z

We could use url = urllib.parse.quote_from_bytes(os.fsencode(path)) on Posix systems, but I heard that on Windows os.fsencode() can irreversibly spoil file names (replace unencodable characters with '?'). On the other hand, I'm not sure that a Windows unicode path can't contain lone surrogates. In this case we should use the 'surrogatepass' error handler (at least it allows restoring the path in principle).
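A minimal sketch of both variants mentioned above. The `os.fsencode()` result is platform dependent, so the comment on that line describes the POSIX behaviour only; the 'surrogatepass' variant uses an explicit UTF-8 encoding and behaves the same everywhere.

```python
import os
import urllib.parse

path = "/home/serhiy/py/cpy\udcffthon-3.5/Lib/pydoc.py"

# POSIX: os.fsencode() uses the surrogateescape handler, so U+DCFF
# turns back into the original byte 0xFF, which is then quoted as %FF.
fs_url = urllib.parse.quote_from_bytes(os.fsencode(path))

# Explicit UTF-8 with 'surrogatepass': the lone surrogate is encoded as a
# three-byte sequence, so the path can be restored from the URL afterwards.
sp_url = urllib.parse.quote(path, errors="surrogatepass")
restored = urllib.parse.unquote_to_bytes(sp_url).decode("utf-8", "surrogatepass")
```

The 'surrogatepass' form is not valid UTF-8 once unquoted, so it only round-trips with a decoder that uses the same error handler.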

Here is a patch that tries to handle undecodable and unencodable paths. Need to test it on Windows.

vadmium · 2015-09-21T22:31:22Z

Serhiy’s patch essentially uses the local filesystem encoding and then percent encoding, rather than the current behaviour of strict UTF-8 encoding and percent encoding. This is similar to what the “pathlib” make_uri() methods do, so maybe we could let “pathlib” do the work instead.
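For comparison, here is a short sketch of what pathlib produces through its public `as_uri()` method. It assumes a POSIX system, where `pathlib.Path` is a `PosixPath` and a path starting with `/` is absolute; the second result also depends on the filesystem encoding using the surrogateescape handler.

```python
import pathlib

# An ordinary path: only the "file://" prefix is added.
plain_uri = pathlib.Path("/usr/lib/python3.4/pydoc.py").as_uri()

# A surrogate-escaped path: on POSIX the escaped byte 0xFF is restored
# by the filesystem encoding and then percent-encoded.
bad_uri = pathlib.Path("/home/serhiy/py/cpy\udcffthon-3.5/Lib/pydoc.py").as_uri()

print(plain_uri)
print(bad_uri)
```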

This draft RFC discusses encoding “file:” URLs:

https://tools.ietf.org/html/draft-ietf-appsawg-file-scheme-03#section-4

It suggests leaving Unicode characters alone (in IRIs) if possible, or using UTF-8 and percent encoding even if the filesystem uses a non-UTF-8 encoding. Perhaps we could leave the filename in the HTML as Unicode characters without percent encoding, and only percent encode the undecodable (surrogate-escaped) bytes.

This “IRI” scheme is also recommended by <http://blogs.msdn.com/b/ie/archive/2006/12/06/file-uris-in-windows.aspx>, which says on Windows, “in file URIs, percent-encoded octets are interpreted as a byte in the user’s current codepage”. This contradicts the draft RFC and the “pathlib” implementation, which both use UTF-8.
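The IRI idea from this comment (leave Unicode characters alone and percent-encode only what must be encoded) can be sketched as follows. The function name `file_iri` is hypothetical and this is not the code of any attached patch. It assumes undecodable bytes appear as surrogates in the range U+DC80 to U+DCFF, which is what the surrogateescape handler produces.

```python
import urllib.parse


def file_iri(path):
    """Build a "file:" IRI from a possibly surrogate-escaped path (sketch)."""
    parts = []
    for ch in path:
        code = ord(ch)
        if 0xDC80 <= code <= 0xDCFF:
            # Surrogate-escaped byte: emit the original byte as %HH.
            parts.append("%%%02X" % (code - 0xDC00))
        elif code < 0x80:
            # ASCII: quote() keeps unreserved characters and "/" as they are
            # and percent-encodes the rest (space, "%", "#", ...).
            parts.append(urllib.parse.quote(ch))
        else:
            # Any other Unicode character is left alone. Lone surrogates
            # outside U+DC80..U+DCFF are not handled by this sketch.
            parts.append(ch)
    return "file:" + "".join(parts)


iri_bad = file_iri("/home/serhiy/py/cpy\udcffthon-3.5/Lib/pydoc.py")
iri_unicode = file_iri("/tmp/caf\xe9 menu.py")
```

The ambiguity noted earlier in the thread remains: a consumer cannot tell whether `%FF` means a raw byte or part of a UTF-8 sequence.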

serhiy-storchaka · 2015-09-27T19:56:00Z

Yes, perhaps using pathlib is more correct. The updated patch uses pathlib. Can you please test it on Windows?

Can filenames on Windows contain lone surrogates?

vadmium · 2015-09-27T22:26:30Z

I don’t have much to do with Windows, but I understand we don’t support surrogate-escaped bytes there. E.g. see <https://docs.python.org/dev/library/os.html#os.fsdecode> and sys.getfilesystemencoding(). However I suspect your first patch would have failed on Windows doing os.fsencode(TESTFN_UNENCODABLE); apparently it cannot represent all possible file names in bytes. The second patch doesn’t call fsencode() so this shouldn’t be a problem.

Your tests do not test that the output is valid Unicode without surrogates. With your first patch applied, when pydoc wrote the HTML to a UTF-8 disk file, I got the error:

  File "/media/disk/home/proj/python/cpy\udcffthon/Lib/pydoc.py", line 2626, in cli
    writedoc(arg)
  File "/media/disk/home/proj/python/cpy\udcffthon/Lib/pydoc.py", line 1659, in writedoc
    file.write(page)
UnicodeEncodeError: 'utf-8' codec can't encode character '\udcff' in position 674: surrogates not allowed
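A test could guard against this by checking that the generated page is encodable before it is written. The sketch below uses a hypothetical helper `is_encodable` and made-up page fragments; it is not taken from the pydoc test suite.

```python
def is_encodable(text, encoding="utf-8"):
    """Return True if text can be written with the given strict encoding."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


# Made-up page fragments: one still carries the surrogate-escaped byte,
# the other has it replaced by U+FFFD.
page_with_surrogate = "<html>/media/disk/home/proj/python/cpy\udcffthon</html>"
page_replaced = "<html>/media/disk/home/proj/python/cpy\ufffdthon</html>"

checks = (is_encodable(page_with_surrogate), is_encodable(page_replaced))
```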

I have been working on an alternative patch using my IRI (Unicode URLs) proposal for “file:” links, and “surrogatepass” for HTTP links. But I am also trying to fix some related problems with the built-in HTTP server, and the unit tests are a bit tricky.

vadmium · 2015-10-01T00:49:37Z

Here is a patch that implements my IRI (Unicode URL) proposal. Differences compared to Serhiy’s patches:

“file:” URLs use Unicode if possible. Percent encoding is only used for reserved ASCII characters and undecodable bytes.
HTTP URLs use UTF-8 and retain surrogate escaping. This means that the links generated by “getobj” pages to “getfile” pages work with troublesome paths.
Displayed file names get an ASCII question mark thanks to the “replace” error handler. This means that the generated HTML is now encodable to UTF-8. Serhiy’s path would be displayed something like:

/home/serhiy/py/cpy?thon-3.5/Lib/pydoc.py (invalid filename encoding)

Added some missing html.escape() calls.

serhiy-storchaka · 2018-12-20T12:28:17Z

Could you please create a PR for your patch, Martin?

serhiy-storchaka added the stdlib (Python modules in the Lib dir) and type-bug (An unexpected behavior, bug, or error) labels Sep 19, 2015

serhiy-storchaka added the 3.7 (EOL, end of life) and 3.8 (only security fixes) labels Dec 20, 2018

ezio-melotti transferred this issue from another repository Apr 10, 2022
