changeset: 41834:dc1f0a224f83 branch: trunk tag: tip user: Éric Araujo date: Sat Apr 25 22:14:46 2009 +0200 files: Doc/library/urlparse.rst Lib/urlparse.py description: * Removed references to obsolete RFCs in urlparse (issue 5650); * corrected “URL” to “URI” in the documentation and added a link to the explanatory RFC; * changed “URL-encoding” to “percent-encoding” according to Wikipedia. diff --git a/Doc/library/urlparse.rst b/Doc/library/urlparse.rst --- a/Doc/library/urlparse.rst +++ b/Doc/library/urlparse.rst @@ -1,16 +1,19 @@ -:mod:`urlparse` --- Parse URLs into components -============================================== +:mod:`urlparse` --- Parse and build URIs +======================================== .. module:: urlparse - :synopsis: Parse URLs into or assemble them from components. + :synopsis: Parse URIs into or assemble them from components. .. index:: single: WWW single: World Wide Web single: URL + single: URI pair: URL; parsing + pair: URI; parsing pair: relative; URL + pair: relative; URI .. note:: The :mod:`urlparse` module is renamed to :mod:`urllib.parse` in Python 3.0. @@ -18,17 +21,17 @@ your sources to 3.0. -This module defines a standard interface to break Uniform Resource Locator (URL) -strings up in components (addressing scheme, network location, path etc.), to -combine the components back into a URL string, and to convert a "relative URL" -to an absolute URL given a "base URL." +This module defines a standard interface to break Uniform Resource Identifiers +(URIs, formerly named URLs) strings up in components (scheme, network location, +path, etc.), to combine the components back into a URI string, and to convert +a relative URI to an absolute URI given a base URI. -The module has been designed to match the Internet RFC on Relative Uniform -Resource Locators (and discovered a bug in an earlier draft!).
It supports the -following URL schemes: ``file``, ``ftp``, ``gopher``, ``hdl``, ``http``, -``https``, ``imap``, ``mailto``, ``mms``, ``news``, ``nntp``, ``prospero``, -``rsync``, ``rtsp``, ``rtspu``, ``sftp``, ``shttp``, ``sip``, ``sips``, -``snews``, ``svn``, ``svn+ssh``, ``telnet``, ``wais``. +The module has been designed to match the Internet RFCs on Uniform Resource +Locators and Uniform Resource Identifiers. It supports the following URI +schemes: ``file``, ``ftp``, ``gopher``, ``hdl``, ``http``, ``https``, +``imap``, ``mailto``, ``mms``, ``news``, ``nntp``, ``prospero``, ``rsync``, +``rtsp``, ``rtspu``, ``sftp``, ``shttp``, ``sip``, ``sips``, ``snews``, +``svn``, ``svn+ssh``, ``telnet``, ``wais``. .. versionadded:: 2.5 Support for the ``sftp`` and ``sips`` schemes. @@ -38,8 +41,8 @@ The :mod:`urlparse` module defines the f .. function:: urlparse(urlstring[, default_scheme[, allow_fragments]]) - Parse a URL into six components, returning a 6-tuple. This corresponds to the - general structure of a URL: ``scheme://netloc/path;parameters?query#fragment``. + Parse a URI into six components, returning a 6-tuple. This corresponds to the + general structure of a URI: ``scheme://netloc/path;parameters?query#fragment``. Each tuple item is a string, possibly empty. The components are not broken up in smaller parts (for example, the network location is a single string), and % escapes are not expanded. The delimiters as shown above are not part of the @@ -59,11 +62,11 @@ The :mod:`urlparse` module defines the f 'http://www.cwi.nl:80/%7Eguido/Python.html' If the *default_scheme* argument is specified, it gives the default addressing - scheme, to be used only if the URL does not specify one. The default value for + scheme, to be used only if the URI does not specify one. The default value for this argument is the empty string. If the *allow_fragments* argument is false, fragment identifiers are not - allowed, even if the URL's addressing scheme normally does support them. 
The + allowed, even if the URI's addressing scheme normally does support them. The default value for this argument is :const:`True`. The return value is actually an instance of a subclass of :class:`tuple`. This @@ -72,7 +75,7 @@ The :mod:`urlparse` module defines the f +------------------+-------+--------------------------+----------------------+ | Attribute | Index | Value | Value if not present | +==================+=======+==========================+======================+ - | :attr:`scheme` | 0 | URL scheme specifier | empty string | + | :attr:`scheme` | 0 | URI scheme specifier | empty string | +------------------+-------+--------------------------+----------------------+ | :attr:`netloc` | 1 | Network location part | empty string | +------------------+-------+--------------------------+----------------------+ @@ -109,10 +112,10 @@ The :mod:`urlparse` module defines the f values are lists of values for each name. The optional argument *keep_blank_values* is a flag indicating whether blank - values in URL encoded queries should be treated as blank strings. A true value - indicates that blanks should be retained as blank strings. The default false - value indicates that blank values are to be ignored and treated as if they were - not included. + values in percent-encoded queries should be treated as blank strings. A true + value indicates that blanks should be retained as blank strings. The default + false value indicates that blank values are to be ignored and treated as if they + were not included. The optional argument *strict_parsing* is a flag indicating what to do with parsing errors. If false (the default), errors are silently ignored. If true, @@ -129,10 +132,10 @@ The :mod:`urlparse` module defines the f name, value pairs. The optional argument *keep_blank_values* is a flag indicating whether blank - values in URL encoded queries should be treated as blank strings. A true value - indicates that blanks should be retained as blank strings. 
The default false - value indicates that blank values are to be ignored and treated as if they were - not included. + values in percent-encoded queries should be treated as blank strings. A true + value indicates that blanks should be retained as blank strings. The default + false value indicates that blank values are to be ignored and treated as if they + were not included. The optional argument *strict_parsing* is a flag indicating what to do with parsing errors. If false (the default), errors are silently ignored. If true, @@ -143,19 +146,19 @@ The :mod:`urlparse` module defines the f .. function:: urlunparse(parts) - Construct a URL from a tuple as returned by ``urlparse()``. The *parts* argument + Construct a URI from a tuple as returned by ``urlparse()``. The *parts* argument can be any six-item iterable. This may result in a slightly different, but - equivalent URL, if the URL that was parsed originally had unnecessary delimiters + equivalent URI, if the URI that was parsed originally had unnecessary delimiters (for example, a ? with an empty query; the RFC states that these are equivalent). .. function:: urlsplit(urlstring[, default_scheme[, allow_fragments]]) - This is similar to :func:`urlparse`, but does not split the params from the URL. - This should generally be used instead of :func:`urlparse` if the more recent URL + This is similar to :func:`urlparse`, but does not split the params from the URI. + This should generally be used instead of :func:`urlparse` if the more recent URI syntax allowing parameters to be applied to each segment of the *path* portion - of the URL (see :rfc:`2396`) is wanted. A separate function is needed to + of the URI (see :rfc:`3986`) is wanted. A separate function is needed to separate the path segments and parameters. This function returns a 5-tuple: (addressing scheme, network location, path, query, fragment identifier). 
@@ -165,7 +168,7 @@ The :mod:`urlparse` module defines the f +------------------+-------+-------------------------+----------------------+ | Attribute | Index | Value | Value if not present | +==================+=======+=========================+======================+ - | :attr:`scheme` | 0 | URL scheme specifier | empty string | + | :attr:`scheme` | 0 | URI scheme specifier | empty string | +------------------+-------+-------------------------+----------------------+ | :attr:`netloc` | 1 | Network location part | empty string | +------------------+-------+-------------------------+----------------------+ @@ -197,8 +200,8 @@ The :mod:`urlparse` module defines the f .. function:: urlunsplit(parts) Combine the elements of a tuple as returned by :func:`urlsplit` into a complete - URL as a string. The *parts* argument can be any five-item iterable. This may - result in a slightly different, but equivalent URL, if the URL that was parsed + URI as a string. The *parts* argument can be any five-item iterable. This may + result in a slightly different, but equivalent URI, if the URI that was parsed originally had unnecessary delimiters (for example, a ? with an empty query; the RFC states that these are equivalent). @@ -207,10 +210,10 @@ The :mod:`urlparse` module defines the f .. function:: urljoin(base, url[, allow_fragments]) - Construct a full ("absolute") URL by combining a "base URL" (*base*) with - another URL (*url*). Informally, this uses components of the base URL, in + Construct a full ("absolute") URI by combining a "base URI" (*base*) with + another URI (*url*). Informally, this uses components of the base URI, in particular the addressing scheme, the network location and (part of) the path, - to provide missing components in the relative URL. For example: + to provide missing components in the relative URI. 
For example: >>> from urlparse import urljoin >>> urljoin('http://www.cwi.nl/%7Eguido/Python.html', 'FAQ.html') @@ -221,7 +224,7 @@ The :mod:`urlparse` module defines the f .. note:: - If *url* is an absolute URL (that is, starting with ``//`` or ``scheme://``), + If *url* is an absolute URI (that is, starting with ``//`` or ``scheme://``), the *url*'s host name and/or scheme will be present in the result. For example: .. doctest:: @@ -249,12 +252,16 @@ The :mod:`urlparse` module defines the f :rfc:`1808` - Relative Uniform Resource Locators This Request For Comments includes the rules for joining an absolute and a - relative URL, including a fair number of "Abnormal Examples" which govern the + relative URI, including a fair number of "Abnormal Examples" which govern the treatment of border cases. - :rfc:`2396` - Uniform Resource Identifiers (URI): Generic Syntax - Document describing the generic syntactic requirements for both Uniform Resource - Names (URNs) and Uniform Resource Locators (URLs). + :rfc:`3986` - Uniform Resource Identifier (URI): Generic Syntax + Document describing the generic syntactic requirements for Uniform Resource + Identifiers. + + :rfc:`3305` - Uniform Resource Identifiers (URIs), URLs, and Uniform Resource + Names (URNs): Clarifications and Recommendations + Informational document clarifying the difference between URLs and URIs. .. _urlparse-result-object: @@ -269,8 +276,8 @@ described in those functions, as well as .. method:: ParseResult.geturl() - Return the re-combined version of the original URL as a string. This may differ - from the original URL in that the scheme will always be normalized to lower case + Return the re-combined version of the original URI as a string. This may differ + from the original URI in that the scheme will always be normalized to lower case and empty components may be dropped. Specifically, empty parameters, queries, and fragment identifiers will be removed.
diff --git a/Lib/urlparse.py b/Lib/urlparse.py --- a/Lib/urlparse.py +++ b/Lib/urlparse.py @@ -1,7 +1,13 @@ -"""Parse (absolute and relative) URLs. +"""Parse and build URIs. -See RFC 1808: "Relative Uniform Resource Locators", by R. Fielding, -UC Irvine, June 1995. +See RFC 1738, "Uniform Resource Locators", by T. Berners-Lee, +L. Masinter and M. McCahill, December 1994, and RFC 3986, "Uniform +Resource Identifiers", by T. Berners-Lee, R. Fielding and L. Masinter, +January 2005. + +For historical reasons, "url" is used instead of "uri" throughout the +code, but the documentation correctly tells whether a function can work +with any URI. See RFC 3305 for more information. """ __all__ = ["urlparse", "urlunparse", "urljoin", "urldefrag", @@ -100,7 +106,7 @@ class ParseResult(namedtuple('ParseResul def urlparse(url, scheme='', allow_fragments=True): - """Parse a URL into 6 components: + """Parse a URI into 6 components: <scheme>://<netloc>/<path>;<params>?<query>#<fragment> Return a 6-tuple: (scheme, netloc, path, params, query, fragment). Note that we don't break the components up in smaller bits @@ -131,7 +137,7 @@ def _splitnetloc(url, start=0): return url[start:delim], url[delim:] # return (domain, rest) def urlsplit(url, scheme='', allow_fragments=True): - """Parse a URL into 5 components: + """Parse a URI into 5 components: <scheme>://<netloc>/<path>?<query>#<fragment> Return a 5-tuple: (scheme, netloc, path, query, fragment). Note that we don't break the components up in smaller bits @@ -174,8 +180,8 @@ def urlsplit(url, scheme='', allow_fragm return v def urlunparse(data): - """Put a parsed URL back together again. This may result in a - slightly different, but equivalent URL, if the URL that was parsed + """Put a parsed URI back together again. This may result in a + slightly different, but equivalent URI, if the URI that was parsed originally had redundant delimiters, e.g. a ?
with an empty query (the draft states that these are equivalent).""" scheme, netloc, url, params, query, fragment = data @@ -197,7 +203,7 @@ def urlunsplit(data): return url def urljoin(base, url, allow_fragments=True): - """Join a base URL and a possibly relative URL to form an absolute + """Join a base URI and a possibly relative URI to form an absolute interpretation of the latter.""" if not base: return url @@ -254,10 +260,10 @@ def urljoin(base, url, allow_fragments=T params, query, fragment)) def urldefrag(url): - """Removes any existing fragment from URL. + """Removes any existing fragment from URI. - Returns a tuple of the defragmented URL and the fragment. If - the URL contained no fragments, the second element is the + Returns a tuple of the defragmented URI and the fragment. If + the URI contained no fragments, the second element is the empty string. """ if '#' in url: @@ -269,7 +275,7 @@ def urldefrag(url): # unquote method for parse_qs and parse_qsl # Cannot use directly from urllib as it would create circular reference. -# urllib uses urlparse methods ( urljoin) +# urllib uses urlparse methods (urljoin) _hextochr = dict(('%02x' % i, chr(i)) for i in range(256)) _hextochr.update(('%02X' % i, chr(i)) for i in range(256)) @@ -292,10 +298,10 @@ def parse_qs(qs, keep_blank_values=0, st Arguments: - qs: URL-encoded query string to be parsed + qs: percent-encoded query string to be parsed keep_blank_values: flag indicating whether blank values in - URL encoded queries should be treated as blank strings. + percent-encoded queries should be treated as blank strings. A true value indicates that blanks should be retained as blank strings. 
The default false value indicates that blank values are to be ignored and treated as if they were @@ -318,10 +324,10 @@ def parse_qsl(qs, keep_blank_values=0, s Arguments: - qs: URL-encoded query string to be parsed + qs: percent-encoded query string to be parsed keep_blank_values: flag indicating whether blank values in - URL encoded queries should be treated as blank strings. A + percent-encoded queries should be treated as blank strings. A true value indicates that blanks should be retained as blank strings. The default false value indicates that blank values are to be ignored and treated as if they were not included.
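Usage sketch (not part of the changeset): the behaviour described in the patched documentation can be exercised roughly as follows. This is a minimal illustration, assuming the module is importable either as ``urlparse`` (Python 2) or under its renamed form ``urllib.parse`` (Python 3). The sample URI and the variable names ``uri``, ``parts``, ``default_query``, ``kept_query`` and ``joined`` are hypothetical and chosen only for this example.

```python
# The module is named urlparse on Python 2 and urllib.parse on Python 3.
try:
    from urlparse import urlsplit, urljoin, parse_qs
except ImportError:
    from urllib.parse import urlsplit, urljoin, parse_qs

# Illustrative URI (hypothetical query and fragment added to the doc's example host).
uri = 'HTTP://www.cwi.nl:80/%7Eguido/Python.html?lang=en&empty=#top'

# urlsplit returns a 5-tuple: scheme, netloc, path, query, fragment.
# The scheme is lower-cased; percent escapes in the path are left unexpanded.
parts = urlsplit(uri)
print(parts.scheme, parts.netloc, parts.path, parts.query, parts.fragment)

# keep_blank_values controls whether 'empty=' survives in the parsed query.
default_query = parse_qs(parts.query)
kept_query = parse_qs(parts.query, keep_blank_values=True)
print(default_query)
print(kept_query)

# urljoin resolves a relative URI against a base URI.
joined = urljoin('http://www.cwi.nl/%7Eguido/Python.html', 'FAQ.html')
print(joined)
```

With the default false value of *keep_blank_values*, the ``empty`` key is dropped from the parsed query. Passing a true value retains it with an empty-string value.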