Issue 40996: urllib should fsdecode percent-encoded parts of file URIs on Unix

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/85168

classification

Title:	urllib should fsdecode percent-encoded parts of file URIs on Unix
Type:	behavior	Stage:
Components:	Library (Lib), Unicode	Versions:

process

Status:	open	Resolution:
Dependencies:		Superseder:
Assigned To:		Nosy List:	ezio.melotti, mjacob
Priority:	normal	Keywords:

Created on 2020-06-17 00:19 by mjacob, last changed 2022-04-11 14:59 by admin.

Messages (1)
msg371702 - (view)	Author: Manuel Jacob (mjacob) *	Date: 2020-06-17 00:19
On Unix, file names are bytes. Python mostly prefers to use unicode for file names. On the Python <-> system boundary, os.fsencode() / os.fsdecode() are used. In URIs, bytes can be percent-encoded. On Unix, most applications pass the percent-decoded bytes in file URIs to the file system unchanged. The remainder of this issue description is about Unix, except for the last paragraph. Pathlib fsencodes the path when making a file URI, roundtripping the bytes e.g. passed as an argument: % python3 -c 'import pathlib, sys; print(pathlib.Path(sys.argv[1]).as_uri())' /tmp/a$(echo -e '\xE4') file:///tmp/a%E4 Example with curl using this URL: % echo 'Hello, World!' > /tmp/a$(echo -e '\xE4') % curl file:///tmp/a%E4 Hello, World! Python 2’s urllib works the same: % python2 -c 'from urllib import urlopen; print(repr(urlopen("file:///tmp/a%E4").read()))' 'Hello, World!\n' However, Python 3’s urllib fails: % python3 -c 'from urllib.request import urlopen; print(repr(urlopen("file:///tmp/a%E4").read()))' Traceback (most recent call last): File "/usr/lib/python3.8/urllib/request.py", line 1507, in open_local_file stats = os.stat(localfile) FileNotFoundError: [Errno 2] No such file or directory: '/tmp/a�' During handling of the above exception, another exception occurred: Traceback (most recent call last): File "<string>", line 1, in <module> File "/usr/lib/python3.8/urllib/request.py", line 222, in urlopen return opener.open(url, data, timeout) File "/usr/lib/python3.8/urllib/request.py", line 525, in open response = self._open(req, data) File "/usr/lib/python3.8/urllib/request.py", line 542, in _open result = self._call_chain(self.handle_open, protocol, protocol + File "/usr/lib/python3.8/urllib/request.py", line 502, in _call_chain result = func(*args) File "/usr/lib/python3.8/urllib/request.py", line 1485, in file_open return self.open_local_file(req) File "/usr/lib/python3.8/urllib/request.py", line 1524, in open_local_file raise URLError(exp) urllib.error.URLError: <urlopen error [Errno 2] No such file or directory: '/tmp/a�'> urllib.request.url2pathname() is the function converting the path of the file URI to a file name. On Unix, it uses urllib.parse.unquote() with the default settings (UTF-8 encoding and the "replace" error handler). I think that on Unix, the settings from os.fsdecode() should be used, so that it roundtrips with pathlib.Path.as_uri() and so that the percent-decoded bytes are passed to the file system as-is. On Windows, I couldn’t do experiments, but using UTF-8 seems like the right thing (according to https://en.wikipedia.org/wiki/File_URI_scheme#Windows_2). I’m not sure that the "replace" error handler is a good idea. I prefer "errors should never pass silently" from the Zen of Python, but I don’t a have a strong opinion on this.

msg371702 - (view)

Author: Manuel Jacob (mjacob) *

Date: 2020-06-17 00:19

On Unix, file names are bytes. Python mostly prefers to use unicode for file names. On the Python <-> system boundary, os.fsencode() / os.fsdecode() are used.

In URIs, bytes can be percent-encoded. On Unix, most applications pass the percent-decoded bytes in file URIs to the file system unchanged. The remainder of this issue description is about Unix, except for the last paragraph.

Pathlib fsencodes the path when making a file URI, roundtripping the bytes e.g. passed as an argument:
% python3 -c 'import pathlib, sys; print(pathlib.Path(sys.argv[1]).as_uri())' /tmp/a$(echo -e '\xE4')
file:///tmp/a%E4

Example with curl using this URL:
% echo 'Hello, World!' > /tmp/a$(echo -e '\xE4')
% curl file:///tmp/a%E4
Hello, World!

Python 2’s urllib works the same:
% python2 -c 'from urllib import urlopen; print(repr(urlopen("file:///tmp/a%E4").read()))'
'Hello, World!\n'

However, Python 3’s urllib fails:
% python3 -c 'from urllib.request import urlopen; print(repr(urlopen("file:///tmp/a%E4").read()))' 
Traceback (most recent call last):
  File "/usr/lib/python3.8/urllib/request.py", line 1507, in open_local_file
    stats = os.stat(localfile)
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/a�'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/usr/lib/python3.8/urllib/request.py", line 222, in urlopen
    return opener.open(url, data, timeout)
  File "/usr/lib/python3.8/urllib/request.py", line 525, in open
    response = self._open(req, data)
  File "/usr/lib/python3.8/urllib/request.py", line 542, in _open
    result = self._call_chain(self.handle_open, protocol, protocol +
  File "/usr/lib/python3.8/urllib/request.py", line 502, in _call_chain
    result = func(*args)
  File "/usr/lib/python3.8/urllib/request.py", line 1485, in file_open
    return self.open_local_file(req)
  File "/usr/lib/python3.8/urllib/request.py", line 1524, in open_local_file
    raise URLError(exp)
urllib.error.URLError: <urlopen error [Errno 2] No such file or directory: '/tmp/a�'>

urllib.request.url2pathname() is the function converting the path of the file URI to a file name. On Unix, it uses urllib.parse.unquote() with the default settings (UTF-8 encoding and the "replace" error handler).

I think that on Unix, the settings from os.fsdecode() should be used, so that it roundtrips with pathlib.Path.as_uri() and so that the percent-decoded bytes are passed to the file system as-is.

On Windows, I couldn’t do experiments, but using UTF-8 seems like the right thing (according to https://en.wikipedia.org/wiki/File_URI_scheme#Windows_2). I’m not sure that the "replace" error handler is a good idea. I prefer "errors should never pass silently" from the Zen of Python, but I don’t a have a strong opinion on this.

History
Date	User	Action	Args
2022-04-11 14:59:32	admin	set	github: 85168
2020-06-17 10:14:42	vstinner	set	nosy: - vstinner
2020-06-17 00:19:10	mjacob	create