This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

classification
Title: urllib.request.urlopen doesn't handle UNC paths produced by pathlib's as_uri() (but can handle UNC paths with additional slashes)
Type: behavior Stage: needs patch
Components: Library (Lib) Versions: Python 3.11, Python 3.10, Python 3.9
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: Alex.Willmer, asvetlov, barneygale, barry, docs@python, dstufft, eric.araujo, eryksun, ezio.melotti, ikelos, koobs, ladykraken, larry, lys.nikolaou, mrabarnett, ned.deily, pablogsal, paul.moore, r.david.murray, ronaldoussoren, steve.dower, terry.reedy, tim.golden, vstinner, yselivanov, zach.ware
Priority: normal Keywords:

Created on 2022-02-05 22:27 by ikelos, last changed 2022-04-11 14:59 by admin.

Messages (11)
msg412602 - (view) Author: Mike Auty (ikelos) Date: 2022-02-05 22:27
I've found open to have difficulty with a resolved pathlib path:

Example code of:

   import pathlib
   path = "Z:\\test.py"
   with open(path) as fp:
       print("Stock open: works")
       data = fp.read()
   with open(pathlib.Path(path).resolve().as_uri()) as fp:
       print("Pathlib resolve open")
       data = fp.read()

Results in:

Z:\> python test.py
Stock open: works
Traceback (most recent call last):
  File "Z:\test.py", line 12, in <module>
    with open(pathlib.Path(path).resolve().as_uri()) as fp:
FileNotFoundError: [Errno 2] No such file or directory: "file://machine/share/test.py"

Interestingly, I've found that open("file:////machine/share/test.py") succeeds, but this isn't what pathlib's resolve() produces. It appears as though file_open only supports hosts that are local, but will open UNC paths on Windows with the additional slashes. This is quite confusing behaviour, and it's not clear why file://host/share/file won't work but file:////host/share/file does.

I imagine this is a long-standing issue and a decision has already been reached on why file_open doesn't support such URIs, but I couldn't find the answer anywhere, just issue 32442, which was resolved without clarifying the situation...
msg412603 - (view) Author: Barney Gale (barneygale) * Date: 2022-02-05 23:07
Why are you adding `.as_uri()`?
msg412605 - (view) Author: Mike Auty (ikelos) Date: 2022-02-05 23:32
> Why are you adding `.as_uri()`?

The API we provide accepts URIs, so whilst the example seems a little contrived, the code itself expects a URI and then calls open (making use of the ability to add open handlers).

> Builtin open() calls C open().

As best I can tell, the file handler is defined in urllib/request.py as file_open. This appears to do some preprocessing to remove the file scheme (and explicitly throws an exception if there's a host that isn't localhost) before it gets to the C open(). I wondered why it didn't check whether it was on Windows and, if so, construct an appropriate path (since I don't think the quadruple slash adheres to the URI RFC, but it seems to open correctly)?
msg412606 - (view) Author: Mike Auty (ikelos) Date: 2022-02-05 23:34
My bad, sorry, I realized I was conflating open with urllib.request.urlopen.  I believe the issue still exists though, sorry for the confusion.
msg412607 - (view) Author: Mike Auty (ikelos) Date: 2022-02-05 23:45
Here's the revised code sample:

    import pathlib
    import urllib.request
    
    path = "Z:\\test.py"
    
    print(f"Stock open: {pathlib.Path(path).as_uri()}")
    with urllib.request.urlopen(pathlib.Path(path).as_uri()) as fp:
        data = fp.read()
    
    print(f"Pathlib resolved open: {pathlib.Path(path).resolve().as_uri()}")
    with urllib.request.urlopen(pathlib.Path(path).resolve().as_uri()) as fp:
        data = fp.read()

and here's the output:

    Z:\> python test.py
    Stock open: file:///Z:/test.py
    Pathlib resolved open: file://host/share/test.py
    Traceback (most recent call last):
    File "C:\Program Files\Python310\lib\urllib\request.py", line 1505, in open_local_file
        stats = os.stat(localfile)
    FileNotFoundError: [WinError 2] The system cannot find the file specified: '\\share\\test.py'

    During handling of the above exception, another exception occurred:

    Traceback (most recent call last):
    File "Z:\test.py", line 14, in <module>
        with urllib.request.urlopen(pathlib.Path(path).resolve().as_uri()) as fp:
    File "C:\Program Files\Python310\lib\urllib\request.py", line 216, in urlopen
        return opener.open(url, data, timeout)
    File "C:\Program Files\Python310\lib\urllib\request.py", line 519, in open
        response = self._open(req, data)
    File "C:\Program Files\Python310\lib\urllib\request.py", line 536, in _open
        result = self._call_chain(self.handle_open, protocol, protocol +
    File "C:\Program Files\Python310\lib\urllib\request.py", line 496, in _call_chain
        result = func(*args)
    File "C:\Program Files\Python310\lib\urllib\request.py", line 1483, in file_open
        return self.open_local_file(req)
    File "C:\Program Files\Python310\lib\urllib\request.py", line 1522, in open_local_file
        raise URLError(exp)
    urllib.error.URLError: <urlopen error [WinError 2] The system cannot find the file specified: '\\share\\test.py'>
msg412608 - (view) Author: Barney Gale (barneygale) * Date: 2022-02-06 00:10
urllib uses nturl2path under the hood. On my system it seems to return reasonable results for both two and four leading slashes:

    >>> nturl2path.url2pathname('////host/share/test.py')
    '\\\\host\\share\\test.py'
    >>> nturl2path.url2pathname('//host/share/test.py')
    '\\\\host\\share\\test.py'

(note that urllib strips the `file:` prefix before calling this function).
msg412609 - (view) Author: Mike Auty (ikelos) Date: 2022-02-06 01:31
I can confirm that url2pathname works with either number of slashes, and that open_file appears to have had the file: removed.

However, even if the check in open_file were bypassed, it calls open_local_file, which then strips any host before calling url2pathname, meaning the host will never be included if only two slashes are used.

        host, file = _splithost(url)
        localname = url2pathname(file)

This is what seems to cause the issue when attempting to open file://server/host/file.ext on Windows, even though file:////server/host/file.ext opens just fine.

The problem that I found, and that was raised in bug #32442, is that pathlib only ever returns two slashes, which, despite being a valid and correctly formed URL, can't be opened by urllib.request.urlopen(). Since there doesn't seem to be an issue with opening these files (given it works for file:////server...) and since url2pathname will produce the correct result, it feels as though open_file should have special code on Windows to allow servers to be accepted by the file handler (open_local_file should probably stay as is to not change the API too much).
msg412610 - (view) Author: Eryk Sun (eryksun) * (Python triager) Date: 2022-02-06 01:45
In FileHandler.file_open(), req.host is the host name, which is either None or an empty string for a local drive path such as, respectively, "file:/Z:/test.py" or "file:///Z:/test.py". The value of req.selector never starts with "//", for which file_open() checks, but rather a single slash, such as "/Z:/test.py" or "/share/test.py". This is a bug in file_open(). Due to this bug, it always calls self.open_local_file(req), even if req.host isn't local. The distinction shouldn't matter in Windows, which supports UNC paths, but POSIX has to open a path on the local machine (possibly a mount point for a remote path, but that's irrelevant). In POSIX, if the local machine coincidentally has the req.selector path, then the os.stat() and open() calls will succeed with a bogus result.

For "file://host/share/test.py", req.selector is "/share/test.py". In Windows, url2pathname() converts this to r"\share\test.py", which is relative to the drive of the process current working directory. This is a bug in open_local_file() on Windows. For it to work correctly, req.host has to be joined back with req.selector as the UNC path "//host/share/test.py". Of course, this need not be a local file in Windows, so Windows should be exempted from the local file limitation in file_open().
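As a rough illustration of the point about joining req.host back with req.selector, the sketch below uses ntpath (which applies Windows path rules on any platform) to compare the host-less path with the rejoined UNC path. The helper name unc_from_host_and_selector is hypothetical and is not part of urllib; the host and selector values are simply the ones from the example above.

```python
import ntpath  # Windows path rules, importable on any platform
from urllib.parse import unquote


def unc_from_host_and_selector(host, selector):
    """Hypothetical helper: rebuild a UNC path from req.host and req.selector."""
    # "//host" + "/share/test.py" -> "//host/share/test.py", then use backslashes.
    return unquote(f"//{host}{selector}").replace("/", "\\")


# The path that results when the host is dropped: no drive, no server.
rootless = "/share/test.py".replace("/", "\\")
print(ntpath.splitdrive(rootless))  # ('', '\\share\\test.py')

# With the host joined back in, the "drive" is the UNC server and share.
unc = unc_from_host_and_selector("host", "/share/test.py")
print(unc)                          # \\host\share\test.py
print(ntpath.splitdrive(unc))       # ('\\\\host\\share', '\\test.py')
```

The empty drive in the first result is why the path ends up being resolved against the drive of the current working directory, while the second result keeps the server and share together.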
msg412612 - (view) Author: Eryk Sun (eryksun) * (Python triager) Date: 2022-02-06 02:42
> file://server/host/file.ext on windows, even though 
> file:////server/host/file.ext open just fine.

For r"\\host\share\test.py", the two slash conversion "file://host/share/test.py" is correct according to RFC8089 "E.3.1. <file> URI with Authority" [1]. In this case, req.host is "host", and req.selector is "/share/test.py".

The four slash version "file:////host/share/test.py" is a known variant for a converted UNC path, as noted in RFC8089 "E.3.2. <file> URI with UNC Path". In this case, req.host is an empty string, and req.selector is "//host/share/test.py". There's another variant that uses 5 slashes for a UNC path, but urllib (or url2pathname) doesn't support it.
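The host and selector values described above can be checked on any platform with a short sketch like the following, which only constructs urllib.request.Request objects and never opens anything. The list of URIs is taken from the examples in this thread.

```python
from urllib.request import Request

uris = [
    "file:/Z:/test.py",
    "file:///Z:/test.py",
    "file://host/share/test.py",
    "file:////host/share/test.py",
]

# Map each URI to the (host, selector) pair that Request derives from it.
parsed = {}
for uri in uris:
    req = Request(uri)
    parsed[uri] = (req.host, req.selector)
    print(f"{uri:30} host={req.host!r:8} selector={req.selector!r}")
```

Only the two-slash UNC form yields a non-empty host; in the four-slash form the server name is carried inside the selector instead.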

---
[1] https://datatracker.ietf.org/doc/html/rfc8089
msg412613 - (view) Author: Barney Gale (barneygale) * Date: 2022-02-06 02:47
Agree with the previous analysis. Just noting that:

    >>> nturl2path.pathname2url('\\\\host\\share\\test.py')
    '////host/share/test.py'

So four slashes are produced by the urllib code, whereas pathlib only produces two.

According to Wikipedia, both the two- and four-slash variants are in active use. As we can see within Python itself! :P
msg412619 - (view) Author: Eryk Sun (eryksun) * (Python triager) Date: 2022-02-06 09:17
> The value of req.selector never starts with "//", for which file_open() 
> checks, but rather a single slash, such as "/Z:/test.py" or 
> "/share/test.py".

To correct myself, actually req.selector will start with "//" for a "file:////" URI, such as "file:////host/share/test.py". For this example, req.host is an empty string, so file_open() still ends up calling open_local_file(), which will open "//host/share/test.py". In Linux, "//host/share" is the same as "/host/share". In Cygwin and MSYS2 it's a UNC path. I guess this case should be allowed, even though the meaning of a "//" root isn't specifically defined in POSIX.

Unless I'm overlooking something, file_open() only has to check the value of req.host. In POSIX, it should require opening a 'local' path, i.e. if req.host isn't None, empty, or a local host, raise URLError.

In Windows, my tests show that the shell API special cases "localhost" (case insensitive) in "file:" URIs. For example, the following are all equivalent: "file:/C:/Temp", "file:///C:/Temp", and "file://localhost/C:/Temp". The shell API does not special case the real local host name or any of its IP addresses, such as 127.0.0.1. They're all handled as UNC paths.

Here's what I've experimented with thus far, which passes the existing urllib tests in Linux and Windows:

    class FileHandler(BaseHandler):
        def file_open(self, req):
            if not self._is_local_path(req):
                if sys.platform == 'win32':
                    path = url2pathname(f'//{req.host}{req.selector}')
                else:
                    raise URLError("In POSIX, the file:// scheme is only "
                                   "supported for local file paths.")
            else:
                path = url2pathname(req.selector)
            return self._common_open_file(req, path)


        def _is_local_path(self, req):
            if req.host:
                host, port = _splitport(req.host)
                if port:
                    raise URLError(f"the host cannot have a port: {req.host}")
                if host.lower() != 'localhost':
                    # In Windows, all other host names are UNC.
                    if sys.platform == 'win32':
                        return False
                    # In POSIX, support all names for the local host.
                    if _safe_gethostbyname(host) not in self.get_names():
                        return False
            return True


        # names for the localhost
        names = None
        def get_names(self):
            if FileHandler.names is None:
                try:
                    FileHandler.names = tuple(
                        socket.gethostbyname_ex('localhost')[2] +
                        socket.gethostbyname_ex(socket.gethostname())[2])
                except socket.gaierror:
                    FileHandler.names = (socket.gethostbyname('localhost'),)
            return FileHandler.names


        def open_local_file(self, req):
            if not self._is_local_path(req):
                raise URLError('file not on local host')
            return self._common_open_file(req, url2pathname(req.selector))


        def _common_open_file(self, req, path):
            import email.utils
            import mimetypes
            host = req.host
            filename = req.selector
            try:
                if host:
                    origurl = f'file://{host}{filename}'
                else:
                    origurl = f'file://{filename}'
                stats = os.stat(path)
                size = stats.st_size
                modified = email.utils.formatdate(stats.st_mtime, usegmt=True)
                mtype = mimetypes.guess_type(filename)[0] or 'text/plain'
                headers = email.message_from_string(
                            f'Content-type: {mtype}\n'
                            f'Content-length: {size}\n'
                            f'Last-modified: {modified}\n')
                return addinfourl(open(path, 'rb'), headers, origurl)
            except OSError as exp:
                raise URLError(exp)


Unfortunately nturl2path.url2pathname() parses some UNC paths incorrectly. For example, the following path should be an invalid UNC path, since "C:" is an invalid name, but instead it gets converted into an unrelated local path.

    >>> nturl2path.url2pathname('//host/C:/Temp/spam.txt')
    'C:\\Temp\\spam.txt'

This goof depends on finding ":" or "|" in the path. It's arguably worse if the last component has a named data stream (allowed by RFC 8089):

    >>> nturl2path.url2pathname('//host/share/spam.txt:eggs')
    'T:\\eggs'

Drive "T:" is from "t:" in "t:eggs", due to simplistic path parsing.
History
Date                 User        Action  Args
2022-04-11 14:59:55  admin       set     github: 90812
2022-02-06 22:09:17  eryksun     set     assignee: docs@python ->
                                         stage: needs patch
                                         type: performance -> behavior
                                         components: - Build, Demos and Tools, Distutils, Documentation, Extension Modules, IDLE, Installation, Interpreter Core, macOS, Regular Expressions, Tests, Tkinter, Unicode, Windows, XML, 2to3 (2.x to 3.x conversion tool), ctypes, IO, Cross-Build, email, asyncio, Argument Clinic, FreeBSD, SSL, C API, Subinterpreters, Parser
                                         versions: - Python 3.7, Python 3.8
2022-02-06 19:54:48  ladykraken  set     versions: + Python 3.7, Python 3.8, Python 3.9, Python 3.10, Python 3.11
                                         nosy: + terry.reedy, larry, paul.moore, pablogsal, dstufft, asvetlov, ezio.melotti, koobs, yselivanov, zach.ware, steve.dower, ladykraken, ned.deily, barry, Alex.Willmer, eric.araujo, ronaldoussoren, lys.nikolaou, r.david.murray, docs@python, vstinner, tim.golden, mrabarnett
                                         assignee: docs@python
                                         components: + Build, Demos and Tools, Distutils, Documentation, Extension Modules, IDLE, Installation, Interpreter Core, Library (Lib), macOS, Regular Expressions, Tests, Tkinter, Unicode, Windows, XML, 2to3 (2.x to 3.x conversion tool), ctypes, IO, Cross-Build, email, asyncio, Argument Clinic, FreeBSD, SSL, C API, Subinterpreters, Parser
                                         type: performance
2022-02-06 09:17:11  eryksun     set     messages: + msg412619
2022-02-06 02:49:03  barneygale  set     title: urllib.request.urlopen doesn't handle UNC paths produced by pathlib's resolve() (but can handle UNC paths with additional slashes) -> urllib.request.urlopen doesn't handle UNC paths produced by pathlib's as_uri() (but can handle UNC paths with additional slashes)
2022-02-06 02:47:25  barneygale  set     messages: + msg412613
2022-02-06 02:42:03  eryksun     set     messages: + msg412612
2022-02-06 01:45:25  eryksun     set     messages: + msg412610
2022-02-06 01:33:53  ikelos      set     title: file_open doesn't handle UNC paths produced by pathlib's resolve() (but can handle UNC paths with additional slashes) -> urllib.request.urlopen doesn't handle UNC paths produced by pathlib's resolve() (but can handle UNC paths with additional slashes)
2022-02-06 01:31:58  ikelos      set     messages: + msg412609
2022-02-06 00:10:12  barneygale  set     messages: + msg412608
2022-02-06 00:01:02  eryksun     set     messages: - msg412604
2022-02-05 23:45:49  ikelos      set     messages: + msg412607
2022-02-05 23:34:51  ikelos      set     messages: + msg412606
2022-02-05 23:32:56  ikelos      set     messages: + msg412605
2022-02-05 23:20:21  eryksun     set     nosy: + eryksun
                                         messages: + msg412604
2022-02-05 23:07:17  barneygale  set     nosy: + barneygale
                                         messages: + msg412603
2022-02-05 22:27:39  ikelos      create