classification
Title: urllib.request.url2pathname() unconditionally uses utf-8 encoding and "replace" error handler
Type:
Stage: resolved
Components: Library (Lib)
Versions:
process
Status: closed
Resolution: duplicate
Dependencies:
Superseder:
Assigned To:
Nosy List: mjacob
Priority: normal
Keywords:

Created on 2020-06-15 11:23 by mjacob, last changed 2020-06-17 00:32 by mjacob. This issue is now closed.

Messages (2)
msg371537 - Author: Manuel Jacob (mjacob) Date: 2020-06-15 11:23
On Python 2, it was possible to recover a percent-encoded byte:
>>> from urllib import url2pathname
>>> url2pathname('%ff')
'\xff'

On Python 3, the byte is decoded using the utf-8 encoding and the "replace" error handler, so the original byte cannot be recovered:
>>> from urllib.request import url2pathname
>>> url2pathname('%ff')
'�'

For my use case (getting the pathname as bytes), it would be sufficient to be able to specify a different encoding (e.g. latin-1) or a different error handler (e.g. surrogateescape). The original byte could then be recovered by encoding the result of url2pathname() with the same encoding and error handler that url2pathname() used internally to decode the percent-encoded bytes.
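
To illustrate the round trip described above, here is a minimal sketch that uses urllib.parse.unquote() directly, since url2pathname() itself exposes no encoding or error-handler parameter. The variable names are only for illustration, and this is not a proposal for how url2pathname() should be implemented:

```python
from urllib.parse import unquote, unquote_to_bytes

quoted = '%ff'

# Reference value: the raw byte behind the percent-escape.
raw = unquote_to_bytes(quoted)

# Option 1: latin-1 maps every byte to exactly one code point,
# so encoding the result with latin-1 gives the byte back.
via_latin1 = unquote(quoted, encoding='latin-1')
recovered_latin1 = via_latin1.encode('latin-1')

# Option 2: utf-8 with surrogateescape keeps undecodable bytes as
# lone surrogates, which the same handler turns back into bytes.
via_escape = unquote(quoted, encoding='utf-8', errors='surrogateescape')
recovered_escape = via_escape.encode('utf-8', 'surrogateescape')

# The default (utf-8 with "replace") loses the byte.
lossy = unquote(quoted)
```

Both recovered values equal b'\xff', while the lossy result is the single replacement character U+FFFD.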

I’m not simply sending a patch, because this might point to a deeper issue. Suppose there’s the following script:

import sys
from pathlib import Path
from urllib.request import urlopen
path = Path(sys.argv[1])
path.write_text('Hello, World!')
with urlopen(path.as_uri()) as resp:
    print(resp.read())

If I call this script with b'/tmp/\xff' as the argument, it fails with the following traceback:

Traceback (most recent call last):
  File "/usr/lib/python3.8/urllib/request.py", line 1507, in open_local_file
    stats = os.stat(localfile)
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/�'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "test_url2pathname.py", line 6, in <module>
    with urlopen(path.as_uri()) as resp:
  File "/usr/lib/python3.8/urllib/request.py", line 222, in urlopen
    return opener.open(url, data, timeout)
  File "/usr/lib/python3.8/urllib/request.py", line 525, in open
    response = self._open(req, data)
  File "/usr/lib/python3.8/urllib/request.py", line 542, in _open
    result = self._call_chain(self.handle_open, protocol, protocol +
  File "/usr/lib/python3.8/urllib/request.py", line 502, in _call_chain
    result = func(*args)
  File "/usr/lib/python3.8/urllib/request.py", line 1485, in file_open
    return self.open_local_file(req)
  File "/usr/lib/python3.8/urllib/request.py", line 1524, in open_local_file
    raise URLError(exp)
urllib.error.URLError: <urlopen error [Errno 2] No such file or directory: '/tmp/�'>

So maybe urllib.request.url2pathname() should use the same encoding and error handler as os.fsencode() / os.fsdecode().
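
A rough sketch of what that could look like on Unix, assuming the percent-encoded path is first turned into bytes and then passed through os.fsdecode(). The helper name url_path_to_fs_path is hypothetical and not part of urllib; the real url2pathname() also handles platform-specific details that are ignored here:

```python
import os
from urllib.parse import quote_from_bytes, unquote_to_bytes


def url_path_to_fs_path(url_path: str) -> str:
    """Hypothetical helper: decode a percent-encoded URL path the same
    way os.fsdecode() decodes file names (surrogateescape on Unix)."""
    return os.fsdecode(unquote_to_bytes(url_path))


raw_path = b'/tmp/\xff'

# Percent-encode the raw bytes; '/' is kept as-is by default.
url_path = quote_from_bytes(raw_path)

# Decode to a str that os.stat(), open() etc. would accept on Unix.
fs_path = url_path_to_fs_path(url_path)

# Encoding the str back with os.fsencode() restores the original bytes.
round_tripped = os.fsencode(fs_path)
```

On Unix, round_tripped equals raw_path, so a file named b'/tmp/\xff' could be located again. On Windows, os.fsdecode() uses a different error handler, so this sketch does not apply there unchanged.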
msg371706 - Author: Manuel Jacob (mjacob) Date: 2020-06-17 00:32
I’ve created issue40996, which suggests that urllib should fsdecode percent-encoded parts of file URIs on Unix. Since the two tickets are closely related and I’d prefer the issue to be solved more generally for the whole module, I’m closing this one as a duplicate.
History
Date                 User    Action  Args
2020-06-17 00:32:27  mjacob  set     status: open -> closed; resolution: duplicate; messages: + msg371706; stage: resolved
2020-06-16 11:16:10  mjacob  set     title: Can’t configure encoding used by urllib.request.url2pathname() -> urllib.request.url2pathname() unconditionally uses utf-8 encoding and "replace" error handler
2020-06-15 11:23:42  mjacob  create