classification
Title: urllib should fsdecode percent-encoded parts of file URIs on Unix
Type: behavior
Stage:
Components: Library (Lib), Unicode
Versions:
process
Status: open
Resolution:
Dependencies:
Superseder:
Assigned To:
Nosy List: ezio.melotti, mjacob
Priority: normal
Keywords:

Created on 2020-06-17 00:19 by mjacob, last changed 2022-04-11 14:59 by admin.

Messages (1)
msg371702 - Author: Manuel Jacob (mjacob) Date: 2020-06-17 00:19
On Unix, file names are bytes. Python mostly prefers to use unicode for file names. On the Python <-> system boundary, os.fsencode() / os.fsdecode() are used.
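
To illustrate that boundary, here is a minimal sketch. It assumes a Unix system, where os.fsdecode() uses the "surrogateescape" error handler; the byte string is only an illustrative name that is not valid UTF-8.

```python
import os

# Illustrative file name containing a byte that is not valid UTF-8.
raw = b"/tmp/a\xe4"

# bytes -> str at the system boundary. On Unix, bytes that cannot be
# decoded are smuggled through via surrogateescape instead of being lost.
name = os.fsdecode(raw)

# str -> bytes restores the original bytes exactly.
roundtripped = os.fsencode(name)
print(roundtripped == raw)
```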

In URIs, bytes can be percent-encoded. On Unix, most applications pass the percent-decoded bytes in file URIs to the file system unchanged. The remainder of this issue description is about Unix, except for the last paragraph.

pathlib fsencodes the path when making a file URI, so the original bytes (here passed as a command-line argument) round-trip into the URI:
% python3 -c 'import pathlib, sys; print(pathlib.Path(sys.argv[1]).as_uri())' /tmp/a$(echo -e '\xE4')
file:///tmp/a%E4

Example with curl using this URL:
% echo 'Hello, World!' > /tmp/a$(echo -e '\xE4')
% curl file:///tmp/a%E4
Hello, World!

Python 2’s urllib works the same:
% python2 -c 'from urllib import urlopen; print(repr(urlopen("file:///tmp/a%E4").read()))'
'Hello, World!\n'

However, Python 3’s urllib fails:
% python3 -c 'from urllib.request import urlopen; print(repr(urlopen("file:///tmp/a%E4").read()))' 
Traceback (most recent call last):
  File "/usr/lib/python3.8/urllib/request.py", line 1507, in open_local_file
    stats = os.stat(localfile)
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/a�'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/usr/lib/python3.8/urllib/request.py", line 222, in urlopen
    return opener.open(url, data, timeout)
  File "/usr/lib/python3.8/urllib/request.py", line 525, in open
    response = self._open(req, data)
  File "/usr/lib/python3.8/urllib/request.py", line 542, in _open
    result = self._call_chain(self.handle_open, protocol, protocol +
  File "/usr/lib/python3.8/urllib/request.py", line 502, in _call_chain
    result = func(*args)
  File "/usr/lib/python3.8/urllib/request.py", line 1485, in file_open
    return self.open_local_file(req)
  File "/usr/lib/python3.8/urllib/request.py", line 1524, in open_local_file
    raise URLError(exp)
urllib.error.URLError: <urlopen error [Errno 2] No such file or directory: '/tmp/a�'>

urllib.request.url2pathname() is the function that converts the path of the file URI to a file name. On Unix, it uses urllib.parse.unquote() with the default settings (UTF-8 encoding and the "replace" error handler).
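
The effect of those defaults can be seen by calling urllib.parse.unquote() directly on the path from the example above. This is only a sketch of the decoding step, not of the whole url2pathname() code path.

```python
from urllib.parse import unquote

# Default settings: encoding="utf-8", errors="replace".
decoded = unquote("/tmp/a%E4")

# The lone byte 0xE4 is not valid UTF-8, so it is replaced by U+FFFD and
# the original byte can no longer be recovered from the result.
print(ascii(decoded))
```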

I think that on Unix, the settings from os.fsdecode() should be used, so that it roundtrips with pathlib.Path.as_uri() and so that the percent-decoded bytes are passed to the file system as-is.
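
A rough sketch of what I mean, assuming Unix. The helper name file_uri_path_to_name is hypothetical and only for illustration; it is not an existing urllib function and not a proposed patch.

```python
import os
from urllib.parse import unquote_to_bytes


def file_uri_path_to_name(path):
    """Hypothetical helper: percent-decode to bytes, then fsdecode.

    On Unix, os.fsdecode() uses the file system encoding with the
    surrogateescape error handler, so no byte is lost.
    """
    return os.fsdecode(unquote_to_bytes(path))


name = file_uri_path_to_name("/tmp/a%E4")

# Encoding the name again yields exactly the percent-decoded bytes,
# which is what would be passed to the file system.
print(os.fsencode(name))
```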

On Windows, I couldn’t run any experiments, but using UTF-8 seems like the right thing (according to https://en.wikipedia.org/wiki/File_URI_scheme#Windows_2). I’m not sure that the "replace" error handler is a good idea. I prefer "errors should never pass silently" from the Zen of Python, but I don’t have a strong opinion on this.
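
For comparison, this is how the same input behaves if the "strict" error handler is passed to urllib.parse.unquote() instead of the default. It is only meant to show the difference between the two handlers, not to suggest a specific change.

```python
from urllib.parse import unquote

raised = False
try:
    # With errors="strict", invalid UTF-8 is reported instead of
    # being silently replaced by U+FFFD.
    unquote("/tmp/a%E4", errors="strict")
except UnicodeDecodeError as exc:
    raised = True
    print("rejected:", exc.reason)
```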
History
Date                 User      Action  Args
2022-04-11 14:59:32  admin     set     github: 85168
2020-06-17 10:14:42  vstinner  set     nosy: - vstinner
2020-06-17 00:19:10  mjacob    create