Issue 1462525: URI parsing library

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/43140

classification

Title:	URI parsing library
Type:	behavior	Stage:	resolved
Components:	Library (Lib)	Versions:	Python 2.7

process

Status:	closed	Resolution:	fixed
Dependencies:		Superseder:
Assigned To:	orsenthil	Nosy List:	dalke, facundobatista, ijmorlan, jjlee, orsenthil, paulj, skip.montanaro, vila, vincentk
Priority:	normal	Keywords:	patch

Created on 2006-04-01 03:30 by paulj, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Files
File name	Uploaded	Description
uriparse.py	paulj, 2006-04-01 03:30	URI parsing library
uriparse.py	paulj, 2006-04-02 21:19	URI parsing library v22
uriparse.py	paulj, 2006-04-03 17:13	URI parsing library v23
uriparse.py	paulj, 2006-06-04 01:09	URI parsing library v32
testurlparse.py	ijmorlan, 2007-10-29 16:38

Messages (19)
msg49925	Author: Paul Jimenez (paulj)	Date: 2006-04-01 03:30
Per the original discussion at http://mail.python.org/pipermail/python-dev/2005-November/058301.html I'm submitting a library meant to deprecate the existing urlparse library. Questions and comments welcome.
msg49926	Author: John J Lee (jjlee)	Date: 2006-04-01 23:07
This certainly seems needed (though I still haven't properly read RFCs 3986 and 3987, and am not sure how IRIs fit in with everything else). Perhaps a bit late for 2.5. -1 on the name: it makes it seem the difference between urlparse and uriparse is something to do with the already murky distinction between URIs and URLs. How about rfc3986? Prosaic, but hits the nail on the head. Must read those RFCs and review this...
msg49927	Author: John J Lee (jjlee)	Date: 2006-04-02 00:20
Some mostly stylistic / minor comments on the patch from a quick skim (I hope to post some comments on the trickier issues later):

Follow PEP 8. Some issues I noticed:

- Inconsistent use of case: URI vs. Uri.
- Triple-quoted docstrings should use " not ' for editor-friendliness.
- Strings should not be abused as comments: if you mean to use a docstring, use a docstring; otherwise, use a comment (I'm referring here to your use of strings immediately before def statements).
- import usage like import posixpath as ppath is usually frowned upon: just import posixpath.
- Use of whitespace in e.g. dict displays and listcomps is non-standard. [x for x in y], not [ x for x in y ]
- Indentation in docstrings is non-standard.
- Docstring-writing conventions are non-standard.

Other things:

- Having read your original python-dev post, I still think UrlParser / URIParser could be simpler. I'll try and supply an actual suggested patch later.
- MailToURIParser appears to support a different interface to all the others. If this is really necessary for standards or pragmatic reasons, those parse and unparse methods should just be separate functions.
- Documentation for the module is missing. This would document the API and perhaps briefly explain the background (what's changed to require this new module) and correct usage, briefly explaining terms like "URI reference". Some well-chosen examples are always good, of course.
- The tests should go in a separate module test/test_<modulename>.py and follow the conventions there.
- Would be very nice to explicitly reference RFC 3986 section numbers in the code. I'll try and do this when I review it properly.
- Use of URI vs. URL distinction is incorrect.

Finally, just BTW: http://en.wikipedia.org/wiki/Uniform_Resource_Identifier

"""
The contemporary point of view among the working group that oversees URIs is that the terms URL and URN are context-dependent aspects of URI and rarely need to be distinguished.
"""

Heh, spot on! Still, like I said, I agree terms like "URI reference" deserve to be adopted.
msg49928	Author: John J Lee (jjlee)	Date: 2006-04-02 00:32
Just a quick note listing some of the things I intend to worry about <wink>:

1. IRIs
2. Python unicode strings
3. Percent-encoding. See 1. and 2.
4. Interaction with other stdlib modules
5. RFC 3986 compliance (duh :-)

It certainly seemed from a brief email discussion with Mike Brown a while back (who knows all this 10 times better than me) that 1., 2. and 3. are not so easily brushed under the carpet as you hope, but I'm very glad if you're right!-) I think these things need to be at least thought through by a few people before rushing a new module into the stdlib: we already have two modules containing outdated URL parsing code, we don't want to end up with a third one. Don't want to sound negative though, it's great that you wrote this!
msg49929	Author: Paul Jimenez (paulj)	Date: 2006-04-02 21:19
Naming: I also considered urlparse2 (a la urllib2) but liked having a name without a version number attached. rfc3986 would also work I suppose, but seems a bit... clunky.

MailtoURIParser: You seem to have missed the point (probably due to my poor documentation): none of the *URIParser classes are meant to be directly used; they're just the default population of an extensible structure that URIParser uses to do the work of parsing.

Let's move discussion to python-dev. I'll put changed/fixed/upgraded versions here as I adjust them due to feedback. Here's the first (adjusted due to your feedback).
msg49930	Author: Paul Jimenez (paulj)	Date: 2006-04-03 17:13
Oops. Fixed some editing bugs.
msg49931	Author: Andrew Dalke (dalke) *	Date: 2006-11-06 10:37
# new
>>> uriparse.urljoin("http://spam/", "foo/bar")
'http://spam//foo/bar'
>>>

# existing
>>> urlparse.urljoin("http://spam/", "foo/bar")
'http://spam/foo/bar'
>>>

Should not have the "//" again in your code.

>>> import urlparse
>>> import uriparse
>>> urlparse.urljoin("http://blah", "/spam/")
'http://blah/spam/'
>>> uriparse.urljoin("http://blah", "/spam/")
'http://blah/spam'
>>>

join 'http://www.guardian.co.uk/' u' '
urlparse: u'http://www.guardian.co.uk/ ' != uriparse: u'http://www.guardian.co.uk// '

join 'http://boingboing.net/' u' http://www.newsalloy.com/subrss4.gif'
(yes, with a leading space in the relative URL)
urlparse: u'http://boingboing.net/ http://www.newsalloy.com/subrss4.gif' != uriparse: u' http://www.newsalloy.com/subrss4.gif'

I'll add a script to test wild web pages and compare urlparse and uriparse's respective urljoin methods.

ALSO: Need an __all__ which excludes those *URIParser classes.
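For comparison, the first two joins above can be checked against the Python 3 standard library, where urlparse lives on as urllib.parse. This is only a minimal sketch assuming Python 3: uriparse was never part of the standard library, so just the urlparse side is shown, and the helper name join_examples is made up for illustration.

```python
from urllib.parse import urljoin  # Python 3 home of the old urlparse.urljoin


def join_examples():
    """Return (base, reference, joined) triples for the two cases discussed above."""
    cases = [
        ("http://spam/", "foo/bar"),  # relative path against a base ending in "/"
        ("http://blah", "/spam/"),    # absolute path with a trailing slash
    ]
    return [(base, ref, urljoin(base, ref)) for base, ref in cases]


for base, ref, joined in join_examples():
    print(repr(base), repr(ref), "->", repr(joined))
```

Both results match the "existing" urlparse output quoted in the message: no doubled slash in the first case, and the trailing slash is kept in the second.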
msg49932	Author: Andrew Dalke (dalke) *	Date: 2006-11-06 10:41
Can't figure out how to add a file to this @#$%*@#%$ bug reporting system. Here's a checker to compare urljoin from urlparse and uriparse:

import urllib2
import urlparse
import uriparse
import BeautifulSoup

for url in (
    "http://python.org/",
    "http://www.perl.org/",
##    "http://aspn.activestate.com/ASPN/Cookbook/Python", # they have \n in urls!
    "http://slashdot.org/",
    "http://cnn.com/",
    "http://bbc.co.uk/",
    "http://www.foxnews.com/",
    "http://reddit.com/",
    "http://yahoo.com/",
    "http://planetpython.org/",
    "http://www.slate.com/",
    "http://anarchaia.org/index.html",
    "http://www.ensembl.org/index.html",
    ):
    print "Processing", url
    f = urllib2.urlopen(url)
    soup = BeautifulSoup.BeautifulSoup(f)
    rel_url_list = []
    for a in soup.findAll("a", href=True):
        rel_url_list.append(a["href"])
    for img in soup.findAll("img", src=True):
        rel_url_list.append(img["src"])
    for rel_url in rel_url_list:
        rel_url = rel_url.strip()
        url_joined = urlparse.urljoin(url, rel_url)
        uri_joined = uriparse.urljoin(url, rel_url)
        if url_joined != uri_joined:
            # urijoin can add an extra '/'
##            if url_joined == uri_joined+"/":
##                continue
##            if url_joined.replace("//", "/") == uri_joined.replace("//", "/"):
##                continue
##            # 'http://cnn.com/' u'/cnnsi/scorecard/?cnn=yes'
##            # url_joined == u'http://cnn.com/cnnsi/scorecard/?cnn=yes'
##            # uri_joined == u'http://cnn.com/cnnsi/scorecard?cnn=yes'
##            if url_joined.replace("/?", "?") == uri_joined:
##                continue
            print repr(url), repr(rel_url)
            print "  ", repr(url_joined), "!=", repr(uri_joined)
msg56909	Author: Isaac Morland (ijmorlan)	Date: 2007-10-29 16:38
This is probably overkill, but I've created a Python script (attached) that runs all the tests given in Section 5.4 of RFC 3986. It reports the following:

baseurl=http://a/b/c/d;p?q
failed for ?y: got http://a/b/c/?y, expected http://a/b/c/d;p?y
failed for ../../../g: got http://a/../g, expected http://a/g
failed for ../../../../g: got http://a/../../g, expected http://a/g
failed for /./g: got http://a/./g, expected http://a/g
failed for /../g: got http://a/../g, expected http://a/g
failed for http:g: got http://a/b/c/g, expected http:g

The last of these is sanctioned by the RFC as acceptable for backward compatibility, so I'll ignore that. The remainder suggest that in addition to the query-relative bug, there is a problem with not reducing "/./" to just "/", and with dropping excess occurrences of ".." that would go above the root. On the other hand, these additional issues are listed in the RFC as "abnormal" so I'm not sure if people are going to want to put in the time to address them.
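A check of this kind can be sketched in a few lines. The version below is not the attached testurlparse.py; it is a minimal illustration assuming Python 3 (urllib.parse instead of urlparse), and the names NORMAL_CASES and check_joins are invented here. It covers only a handful of the "normal" examples from RFC 3986 Section 5.4.1, including the query-only reference "?y" reported above.

```python
from urllib.parse import urljoin

BASE = "http://a/b/c/d;p?q"  # base URI used throughout RFC 3986 Section 5.4

# A few of the "normal" examples from RFC 3986 Section 5.4.1:
# relative reference -> expected target URI.
NORMAL_CASES = {
    "g": "http://a/b/c/g",
    "./g": "http://a/b/c/g",
    "g/": "http://a/b/c/g/",
    "/g": "http://a/g",
    "?y": "http://a/b/c/d;p?y",
    "../g": "http://a/b/g",
    "../..": "http://a/",
}


def check_joins(base, cases, join=urljoin):
    """Return a list of (reference, got, expected) for every mismatch."""
    failures = []
    for ref, expected in cases.items():
        got = join(base, ref)
        if got != expected:
            failures.append((ref, got, expected))
    return failures


# Report mismatches in the same style as the output quoted above.
for ref, got, expected in check_joins(BASE, NORMAL_CASES):
    print("failed for %s: got %s, expected %s" % (ref, got, expected))
```

Passing a different callable as join makes it possible to run the same table against another implementation, which is how two urljoin candidates could be compared side by side.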
msg57705	Author: vincent kraeutler (vincentk)	Date: 2007-11-20 18:57
Quite like urlparse, uriparse does not fail on input which does not represent valid URIs. At least not early or reliably enough. Specifically, I noticed that urisplit does not fail on input strings with a missing scheme, such as "foo.com/bar". I see no (straightforward) solution to this problem, short of using a proper parser library such as Haskell's Parsec (I unfortunately know of no Python equivalent), but I thought I might want to report this issue nevertheless.

The following might work as a quick-fix. Replace regex.match(foo,bar).groups() with something like:

mm = re.match(regex, uri)
sp = mm.span()
if (-1 in sp) or (sp[1] - sp[0] != len(uri)):
    raise ValueError, "uri regex did not match complete input"
p = mm.groups()
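The idea of rejecting partial matches can be sketched as follows, assuming Python 3 and the component-splitting regex from RFC 3986 Appendix B. The function name strict_urisplit is hypothetical and this is not uriparse's code. One caveat: every component in the Appendix B regex is optional, so a full-match test alone still accepts "foo.com/bar" (it is treated as a path); the sketch therefore adds an explicit scheme check, which is an extra assumption beyond the quick-fix quoted above.

```python
import re

# Component-splitting regex from RFC 3986, Appendix B.
URI_RE = re.compile(
    r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?"
)


def strict_urisplit(uri):
    """Split uri into (scheme, authority, path, query, fragment).

    Raises ValueError if the regex does not cover the whole input or if
    no scheme is present. This is a structural check only, not validation.
    """
    match = URI_RE.fullmatch(uri)
    if match is None:
        raise ValueError("uri regex did not match complete input")
    scheme, authority, path, query, fragment = match.group(2, 4, 5, 7, 9)
    if scheme is None:
        raise ValueError("missing scheme: %r" % uri)
    return scheme, authority, path, query, fragment
```

With this sketch, strict_urisplit("foo.com/bar") raises ValueError, while a reference with a scheme is split into its five components. It still does not check that each component is well formed.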
msg57736	Author: vincent kraeutler (vincentk)	Date: 2007-11-21 10:29
Some more notes.

a) RFC3986 explicitly states that the presented regex (which you use)
"""
is the regular expression for breaking-down a well-formed URI reference into its components.
"""
(Emphasis added). I am not sure this is a particularly good starting point for parsing potentially security-critical data.

b) The parser fails on URIs containing numerical IPv6 addresses (e.g. "http://[::1]:88/path"). Specifically, the following code in split_authority is broken:

if hostport and ':' in hostport:
    host, port = hostport.split(':', 1)

Clearly, if the authority may contain a ":" in the host's IP field, you cannot simply split() off the port part. Again, I am afraid I have no simple solution.

Hate to sound so negative.
Kind regards, v.
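One possible way to split host and port while respecting bracketed IPv6 literals is sketched below, assuming Python 3. The function split_hostport is a hypothetical stand-in, not the split_authority code from uriparse, and it does not validate the address or the port. The last line shows that urllib.parse.urlsplit in the standard library already handles the example from the message.

```python
from urllib.parse import urlsplit


def split_hostport(hostport):
    """Split 'host[:port]' into (host, port), keeping IPv6 brackets on the host.

    Hypothetical helper for illustration; port is returned as a string or None.
    """
    if hostport.startswith("["):
        end = hostport.find("]")
        if end == -1:
            raise ValueError("unterminated IPv6 literal: %r" % hostport)
        host, rest = hostport[:end + 1], hostport[end + 1:]
        if rest and not rest.startswith(":"):
            raise ValueError("unexpected text after IPv6 literal: %r" % hostport)
        return host, (rest[1:] or None)
    # No brackets: the last ":" (if any) separates host and port.
    host, sep, port = hostport.rpartition(":")
    if not sep:
        return hostport, None
    return host, (port or None)


# The standard library exposes the same pieces as attributes.
parts = urlsplit("http://[::1]:88/path")
print(parts.hostname, parts.port)
```

Note that urlsplit strips the brackets from hostname and converts the port to an integer, whereas the sketch keeps the brackets and leaves the port as a string.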
msg57756	Author: vincent kraeutler (vincentk)	Date: 2007-11-22 11:21
In the meantime, I have found a very nice parser combinator library for Python (pyparse) and have implemented a validating parser for RFC 3986 URI's by more or less simply converting the complete ABNF grammar found in the RFC. Obviously, this will never make it into the stdlib, due to a dependency on an external library (pyparse), but it might be useful to other people as well. It's available here: http://www.kraeutler.net/vincent/pub/netaddress/netaddress-0.1.tar.gz
msg69227	Author: Facundo Batista (facundobatista) *	Date: 2008-07-03 19:10
Senthil, we should incorporate the tests from RFC 3986 into the test suite, what do you think? Could we integrate the effort from Paul Jimenez with the current urlparse and achieve an RFC-compliant library? Should we handle this compliance in this bug?
msg71919	Author: Senthil Kumaran (orsenthil) *	Date: 2008-08-25 13:14
Hello Paul,
Have you been keeping track of urlparse changes in Python 2.6? I analyzed your patch and read through RFC 3986 and have the following comments:

1) The usage of this module is very different from the current urlparse module's usage. It might be that this module was designed to co-exist with urlparse, providing certain additional functionality. But in order to replace urlparse, I find this module is "Backward Incompatible with the code base".

Some comments on the extra features provided by / claims of this module:

2) The module provides a URI handling framework that includes default URI parsers for common URI schemes.
- RFC 3986 specifies that the scheme handling part is left to the separate RFC describing each scheme.
- The uriparse library attempts that by providing a default port and default hostname for certain schemes, but that can be made available as a patch to urlparse rather than as a new library. The need for such a change in urlparse needs to be analyzed, as no requirement has been raised so far for providing a default port or default host for schemes where applicable.

3) urlsplit, urlunsplit, and splitting the authority into sub-components are available in the current urlparse library itself and are RFC 3986 conformant.

4) urljoin in the current urlparse (patched with fixes) is currently RFC 3986 conformant.

What urlparse further requires, and this patch also lacks, is (as commented by John J Lee):
1) Handling of IRIs.
2) Python Unicode strings.
3) Percent-encoding for IRIs and Python Unicode strings.
(There is a discussion going on about quote and unquote of unicode, and that would basically be extended to the above points as well.)

- If required, we can adopt the default host and port provision mechanisms from this patch into the current urlparse.

Other than that, I see that urlparse currently has all the changes mentioned in this patch, which makes the attached patch obsolete.

Please let me know your comments / thoughts.
Thanks.
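Point 3 can be illustrated with the Python 3 descendant of urlparse, urllib.parse. This is a minimal sketch assuming Python 3 rather than the Python 2.6 code discussed in the message; the "foo" scheme is deliberately unregistered, and the helper name describe is made up for illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def describe(url):
    """Return the components urlsplit reports for url, plus the round-trip result."""
    parts = urlsplit(url)
    return {
        "scheme": parts.scheme,
        "netloc": parts.netloc,
        "path": parts.path,
        "query": parts.query,
        "fragment": parts.fragment,
        # Sub-components of the authority, exposed as attributes.
        "username": parts.username,
        "hostname": parts.hostname,
        "port": parts.port,
        "roundtrip": urlunsplit(parts),
    }


info = describe("foo://user@example.com:8042/over/there?name=ferret#nose")
for key, value in info.items():
    print("%-9s %r" % (key, value))
```

The authority is split into user, host, and port even though "foo" is not a scheme the module knows about, and urlunsplit reassembles the original string.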
msg93512	Author: Paul Jimenez (paulj)	Date: 2009-10-03 22:37
Senthil wrote:
> > Senthil <orsenthil@gmail.com> added the comment:
> >
> > Hello Paul,
> > Have you been keeping track of urlparse changes in Python 2.6?

No - do you have pointers to the particular changes you're referring to? I've done a bit of trying to catch up by searching the mailing list, but want to make sure I know what you're referring to in particular.

> > I analyzed your patch and read through RFC 3986 and have the
> > following comments:
> >
> > 1) The usage of this module is very different from the current
> > urlparse module's usage. It might be that this module was designed to
> > co-exist with urlparse, providing certain additional functionality.
> > But in order to replace urlparse, I find this module is "Backward
> > Incompatible with the code base".
> >
> > Some comments on the extra features provided by / claims of this module:

Yes, there was no design goal of backward compatibility.

> > 2) The module provides a URI handling framework that includes default
> > URI parsers for common URI schemes.
> > - RFC 3986 specifies that the scheme handling part is left to the
> > separate RFC describing each scheme.
> > - The uriparse library attempts that by providing a default port and
> > default hostname for certain schemes, but that can be made available
> > as a patch to urlparse rather than as a new library. The need for such a
> > change in urlparse needs to be analyzed, as no requirement has been
> > raised so far for providing a default port or default host for
> > schemes where applicable.

Okay; it just seemed completist to provide said defaults.

> > 3) urlsplit, urlunsplit, and splitting the authority into sub-components
> > are available in the current urlparse library itself and are RFC 3986
> > conformant.

Ah... it used to not do this for unknown schemes, which was my original impetus for this.

> > 4) urljoin in the current urlparse (patched with fixes) is currently
> > RFC 3986 conformant.
> >
> > What urlparse further requires, and this patch also lacks, is (as
> > commented by John J Lee):
> > 1) Handling of IRIs.
> > 2) Python Unicode strings.
> > 3) Percent-encoding for IRIs and Python Unicode strings.
> > (There is a discussion going on about quote and unquote of unicode, and
> > that would basically be extended to the above points as well.)
> >
> > - If required, we can adopt the default host and port provision
> > mechanisms from this patch into the current urlparse.
> >
> > Other than that, I see that urlparse currently has all the changes
> > mentioned in this patch, which makes the attached patch obsolete.
> >
> > Please let me know your comments / thoughts.

It seems that urlparse now works for the case that caused me to rewrite this (see the first comment on this bug for a link to the python-dev archives where I posted about the 'itch' this code 'scratched'), so it's fine with me if it just goes away now.
msg104298	Author: Paul Jimenez (paulj)	Date: 2010-04-27 06:48
Since no one else has commented on this in over a year, and the new (2.6+) code works fine, I'll just close this to help clean things up.
msg104477	Author: Senthil Kumaran (orsenthil) *	Date: 2010-04-29 02:16
Should we close this as out-of-date? I was inclined to see it as fixed, as urlparse has undergone changes in the direction suggested by the issue. Sorry, Paul, for the lack of response. Regarding this issue, I plan to use the test cases provided in the patch in the stdlib test suite and fix things there, or comment/document it in the code where a parsing conflict arises. This will help us keep track too. I ran the test cases in the patch against the current trunk and I see 4 tests failing (borderline parsing scenarios). I shall take it up, commit the test cases to the trunk and fix it.
msg104688	Author: Paul Jimenez (paulj)	Date: 2010-05-01 02:19
That sounds great - at least something useful will come out of this, even if it's just more tests for urlparse :)
msg105183	Author: Senthil Kumaran (orsenthil) *	Date: 2010-05-07 04:31
Committed the tests in r80908, r80909, r80910 and r80911. The backward-incompatible test cases, i.e. those whose parsing requirements have changed since the previous RFC, have been commented out. There were 4 abnormal scenarios and one strict parsing requirement (overridden by the relaxed parsing requirement use case). With respect to the commented-out tests, if we come across applications relying on that parsing behavior, we can address it then. Thanks Paul and others for tracking this issue.

History
Date	User	Action	Args
2022-04-11 14:56:16	admin	set	github: 43140
2010-07-11 05:18:40	orsenthil	unlink	issue1500504 dependencies
2010-05-07 04:31:54	orsenthil	set	status: open -> closed type: enhancement -> behavior messages: + msg105183 resolution: accepted -> fixed stage: test needed -> resolved
2010-05-01 02:19:52	paulj	set	messages: + msg104688
2010-04-29 02:16:17	orsenthil	set	status: closed -> open assignee: facundobatista -> orsenthil resolution: out of date -> accepted messages: + msg104477
2010-04-28 17:03:16	r.david.murray	set	resolution: out of date
2010-04-27 06:48:04	paulj	set	status: open -> closed messages: + msg104298
2009-10-03 22:37:33	paulj	set	messages: + msg93512
2009-04-22 17:24:49	ajaksu2	link	issue1500504 dependencies
2009-02-13 01:40:20	ajaksu2	set	stage: test needed type: enhancement versions: + Python 2.7
2009-02-13 01:36:07	ajaksu2	link	issue1591035 dependencies
2008-08-25 13:14:36	orsenthil	set	messages: + msg71919
2008-07-03 19:10:38	facundobatista	set	assignee: facundobatista messages: + msg69227 nosy: + facundobatista, orsenthil
2008-01-05 11:56:20	vila	set	nosy: + vila
2007-11-22 11:21:43	vincentk	set	messages: + msg57756
2007-11-21 10:29:37	vincentk	set	messages: + msg57736
2007-11-20 18:57:05	vincentk	set	nosy: + vincentk messages: + msg57705
2007-10-29 16:38:16	ijmorlan	set	files: + testurlparse.py nosy: + ijmorlan messages: + msg56909
2007-08-30 22:26:17	skip.montanaro	set	nosy: + skip.montanaro
2006-04-01 03:30:42	paulj	create