classification
Title: URI parsing library
Type: behavior Stage: resolved
Components: Library (Lib) Versions: Python 2.7
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: orsenthil Nosy List: dalke, facundobatista, ijmorlan, jjlee, orsenthil, paulj, skip.montanaro, vila, vincentk
Priority: normal Keywords: patch

Created on 2006-04-01 03:30 by paulj, last changed 2010-05-07 04:31 by orsenthil. This issue is now closed.

Files
File name Uploaded Description Edit
uriparse.py paulj, 2006-04-01 03:30 URI parsing library
uriparse.py paulj, 2006-04-02 21:19 URI parsing library v22
uriparse.py paulj, 2006-04-03 17:13 URI parsing library v23
uriparse.py paulj, 2006-06-04 01:09 URI parsing library v32
testurlparse.py ijmorlan, 2007-10-29 16:38
Messages (19)
msg49925 - (view) Author: Paul Jimenez (paulj) Date: 2006-04-01 03:30
Per the original discussion at
http://mail.python.org/pipermail/python-dev/2005-November/058301.html
I'm submitting a library meant to deprecate the
existing urlparse library.  Questions and comments welcome.
msg49926 - (view) Author: John J Lee (jjlee) Date: 2006-04-01 23:07
This certainly seems needed (though I still haven't properly
read 3986 and 3987, and not sure how IRIs fit in with
everything else).  Perhaps a bit late for 2.5.

-1 on the name: makes it seem the difference between
urlparse and uriparse is something to do with the already
murky distinction between URIs and URLs.  How about rfc3986?
 Prosaic, but hits the nail on the head.

Must read those RFCs and review this...
msg49927 - (view) Author: John J Lee (jjlee) Date: 2006-04-02 00:20
Some mostly-stylistic / minor comments on the patch from a
quick skim (I hope to post some comments on the trickier
issues later):

Follow PEP 8.  Some issues I noticed:

- Inconsistent use of case: URI vs. Uri.
- Triple-quoted docstrings should use " not ' for
editor-friendliness.
- Strings should not be abused as comments: If you mean to
use a docstring, use a docstring; otherwise, use a comment
(I'm referring here to your use of strings immediately
*before* def statements).
- import usage like import posixpath as ppath is usually
frowned upon: just import posixpath.
- Use of whitespace in e.g. dict displays and listcomps is
non-standard.  [x for x in y], not [ x for x in y ]
- Indentation in docstrings is non-standard.
- Docstring-writing conventions are non-standard.


Other things:

- Having read your original python-dev post, I still think
UrlParser / URIParser could be simpler.  I'll try and supply
an actual suggested patch later.
- MailToURIParser appears to support a different interface
to all the others.  If this is really necessary for
standards or pragmatic reasons, those parse and unparse
methods should just be separate functions.
- Documentation for the module is missing.  This would
document the API and perhaps briefly explain the background
(what's changed to require this new module) and correct
usage, briefly explaining terms like "URI reference".  Some
well-chosen examples are always good, of course.
- The tests should go in a separate module
test/test_<modulename>.py and follow the conventions there.
- Would be very nice to explicitly reference RFC 3986
section numbers in the code.  I'll try and do this when I
review it properly.
- Use of URI vs. URL distinction is incorrect.


Finally, just BTW:

http://en.wikipedia.org/wiki/Uniform_Resource_Identifier
"""
The contemporary point of view among the working group that
oversees URIs is that the terms URL and URN are
context-dependent aspects of URI and rarely need to be
distinguished.
"""

Heh, spot on!  Still, like I said, I agree terms like "URI
reference" deserve to be adopted.
msg49928 - (view) Author: John J Lee (jjlee) Date: 2006-04-02 00:32
Just a quick note listing some of the things I intend to
worry about <wink>:

1. IRIs

2. Python unicode strings

3. Percent-encoding.  See 1. and 2.

4. Interaction with other stdlib modules

5. RFC 3986 compliance (duh :-)

It certainly seemed from a brief email discussion with Mike
Brown a while back (who knows all this 10 times better than
me) that 1., 2. and 3. are not so easily brushed under the
carpet as you hope, but I'm very glad if you're right!-)

I think these things need to be at least thought through by
a few people before rushing a new module into the stdlib: we
already have two modules containing outdated URL parsing
code, we don't want to end up with a third one.

Don't want to sound negative though, it's great that you
wrote this!
msg49929 - (view) Author: Paul Jimenez (paulj) Date: 2006-04-02 21:19
Naming:
  I also considered urlparse2 (ala urllib2) but liked having
a name without a version number attached.  rfc3986 would
also work I suppose, but seems a bit... clunky.

MailtoURIParser:
  You seem to have missed the point (probably due to my poor
documentation): none of the *URIParser classes are meant to
be directly used; they're just the default population of an
extensible structure that URIParser uses to do the work of
parsing.  

Let's move discussion to python-dev.  I'll put
changed/fixed/upgraded versions here as I adjust them due to
feedback.  Here's the first (adjusted due to your feedback).
msg49930 - (view) Author: Paul Jimenez (paulj) Date: 2006-04-03 17:13
Oops. fix some editing bugs. 
msg49931 - (view) Author: Andrew Dalke (dalke) * (Python committer) Date: 2006-11-06 10:37
# new
>>> uriparse.urljoin("http://spam/", "foo/bar")
'http://spam//foo/bar'
>>> 

# existing
>>> urlparse.urljoin("http://spam/", "foo/bar")
'http://spam/foo/bar'
>>> 

Should not have the "//" again in your code.


>>> import urlparse
>>> import uriparse
>>> urlparse.urljoin("http://blah", "/spam/")
'http://blah/spam/'
>>> uriparse.urljoin("http://blah", "/spam/")
'http://blah/spam'
>>> 

join 'http://www.guardian.co.uk/' u' '
urlparse: u'http://www.guardian.co.uk/ ' !=
uriparse: u'http://www.guardian.co.uk// '

join 'http://boingboing.net/' u' http://www.newsalloy.com/subrss4.gif'
  (yes, with a leading space in the relative URL)
urlparse: u'http://boingboing.net/ http://www.newsalloy.com/subrss4.gif' !=
uriparse: u' http://www.newsalloy.com/subrss4.gif'

I'll add a script to test wild web pages and compare
urlparse and uriparse's
respective urljoin methods.

ALSO: Need an __all__ which excludes those *URIParser classes.
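
The urljoin discrepancies reported above lend themselves to a small regression harness. The following is only a minimal sketch and not the checker attached to this issue. It assumes Python 3, where urllib.parse.urljoin is the successor of urlparse.urljoin, and it uses a hypothetical naive_join function as a stand-in for a join that emits the doubled slash, since uriparse itself is not assumed to be importable.

```python
from urllib.parse import urljoin


def naive_join(base, ref):
    """Hypothetical buggy join: always glues base and reference with '/'."""
    return base + "/" + ref


def find_mismatches(cases, candidate, reference=urljoin):
    """Return (base, ref, expected, got) for every case where candidate
    disagrees with the reference implementation."""
    mismatches = []
    for base, ref in cases:
        expected = reference(base, ref)
        got = candidate(base, ref)
        if got != expected:
            mismatches.append((base, ref, expected, got))
    return mismatches


# The two base/reference pairs quoted in the message above.
CASES = [
    ("http://spam/", "foo/bar"),
    ("http://blah", "/spam/"),
]

for base, ref, expected, got in find_mismatches(CASES, naive_join):
    print(repr(base), repr(ref))
    print("  ", repr(expected), "!=", repr(got))
```

Passing a real candidate (for example a port of uriparse.urljoin) in place of naive_join would turn the same harness into a direct comparison.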
msg49932 - (view) Author: Andrew Dalke (dalke) * (Python committer) Date: 2006-11-06 10:41
Can't figure out how to add a file to this @#$%*@#%$ bug
reporting system.

Here's a checker to compare urljoin from urlparse and uriparse

import urllib2
import urlparse
import uriparse
import BeautifulSoup

for url in (
    "http://python.org/",
    "http://www.perl.org/",
##    "http://aspn.activestate.com/ASPN/Cookbook/Python", # they have \n in urls!
    "http://slashdot.org/",
    "http://cnn.com/",
    "http://bbc.co.uk/",
    "http://www.foxnews.com/",
    "http://reddit.com/",
    "http://yahoo.com/",
    "http://planetpython.org/",
    "http://www.slate.com/",
    "http://anarchaia.org/index.html",
    "http://www.ensembl.org/index.html",
    ):
    print "Processing", url
    f = urllib2.urlopen(url)
    soup = BeautifulSoup.BeautifulSoup(f)

    rel_url_list = []
    for a in soup.findAll("a", href=True):
        rel_url_list.append(a["href"])
    for img in soup.findAll("img", src=True):
        rel_url_list.append(img["src"])

    for rel_url in rel_url_list:
        rel_url = rel_url.strip()
        url_joined = urlparse.urljoin(url, rel_url)
        uri_joined = uriparse.urljoin(url, rel_url)
        if url_joined != uri_joined:
            # urijoin can add an extra '/'
##            if url_joined == uri_joined+"/":
##                continue
##            if url_joined.replace("//", "/") == uri_joined.replace("//", "/"):
##                continue
##            # 'http://cnn.com/' u'/cnnsi/scorecard/?cnn=yes'
##            # url_joined == u'http://cnn.com/cnnsi/scorecard/?cnn=yes'
##            # uri_joined == u'http://cnn.com/cnnsi/scorecard?cnn=yes'
##            if url_joined.replace("/?", "?") == uri_joined:
##                continue
            
            print repr(url), repr(rel_url)
            print "  ", repr(url_joined), "!=", repr(uri_joined)
msg56909 - (view) Author: Isaac Morland (ijmorlan) Date: 2007-10-29 16:38
This is probably overkill, but I've created a Python script (attached)
that runs all the tests given in Section 5.4 of RFC 3986.  It reports
the following:

baseurl=http://a/b/c/d;p?q
failed for ?y: got http://a/b/c/?y, expected http://a/b/c/d;p?y
failed for ../../../g: got http://a/../g, expected http://a/g
failed for ../../../../g: got http://a/../../g, expected http://a/g
failed for /./g: got http://a/./g, expected http://a/g
failed for /../g: got http://a/../g, expected http://a/g
failed for http:g: got http://a/b/c/g, expected http:g

The last of these is sanctioned by the RFC as acceptable for backward
compatibility, so I'll ignore that.  The remainder suggest that in
addition to the query-relative bug, there is a problem with not reducing
"/./" to just "/", and with dropping excess occurrences of ".." that
would go above the root.  On the other hand, these additional issues are
listed in the RFC as "abnormal" so I'm not sure if people are going to
want to put in the time to address them.
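
The "/./" and excess ".." failures above correspond to the remove_dot_segments step of RFC 3986 section 5.2.4. A minimal sketch of that step, written for Python 3 and not taken from either urlparse or uriparse, might look like the following. The function name follows the RFC; the rest is illustrative.

```python
def remove_dot_segments(path):
    """Sketch of RFC 3986 section 5.2.4: drop '.' and '..' segments."""
    out = []  # output buffer; each entry is one path segment
    while path:
        if path.startswith("../"):
            path = path[3:]
        elif path.startswith("./"):
            path = path[2:]
        elif path.startswith("/./"):
            path = path[2:]          # "/./" collapses to "/"
        elif path == "/.":
            path = "/"
        elif path.startswith("/../"):
            path = path[3:]          # "/../" collapses to "/"
            if out:
                out.pop()            # and removes the previous segment, if any
        elif path == "/..":
            path = "/"
            if out:
                out.pop()
        elif path in (".", ".."):
            path = ""
        else:
            # Move the first segment (with its leading "/", if present)
            # from the input to the output buffer.
            start = 1 if path.startswith("/") else 0
            end = path.find("/", start)
            if end == -1:
                end = len(path)
            out.append(path[:end])
            path = path[end:]
    return "".join(out)


# "../../../g" merged with the base path "/b/c/d;p" gives "/b/c/../../../g".
print(remove_dot_segments("/b/c/../../../g"))
print(remove_dot_segments("/./g"))
print(remove_dot_segments("/../g"))
```

Because a ".." with nothing left to pop is simply discarded, references that climb above the root resolve to the root, which is the result the RFC's "abnormal" examples expect.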
msg57705 - (view) Author: vincent kraeutler (vincentk) Date: 2007-11-20 18:57
Much like urlparse, uriparse does not fail on input which does not
represent a valid URI, at least not early or reliably enough.
Specifically, I noticed that urisplit does not fail on input strings
with a missing scheme, such as "foo.com/bar". 

I see no (straightforward) solution to this problem, short of using a
proper parser library such as Haskell's Parsec (I unfortunately know of
no Python equivalent), but I thought I might want to report this issue
nevertheless. 

The following might work as a quick-fix: Replace
regex.match(foo,bar).groups()

with something like:

    mm = re.match(regex, uri)
    # re.match returns None when nothing matches; also reject partial matches
    if mm is None or mm.span() != (0, len(uri)):
        raise ValueError("uri regex did not match complete input")

    p = mm.groups()
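
A sketch of a stricter split along these lines, for Python 3, is shown below. It uses the regular expression from RFC 3986 Appendix B. Since every component in that expression is optional, it matches almost any string, so a full-span check by itself would probably not reject "foo.com/bar"; the sketch therefore also requires a syntactically valid scheme. The name strict_urisplit is hypothetical and is not part of uriparse or urlparse.

```python
import re

# Regular expression from RFC 3986, Appendix B.
URI_RE = re.compile(
    r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?"
)

# scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )   (RFC 3986, 3.1)
SCHEME_RE = re.compile(r"[A-Za-z][A-Za-z0-9+.\-]*")


def strict_urisplit(uri):
    """Split uri into (scheme, authority, path, query, fragment).

    Raises ValueError if the expression does not cover the whole input
    or if no valid scheme is present.
    """
    m = URI_RE.fullmatch(uri)
    if m is None:
        raise ValueError("uri regex did not match complete input")
    scheme, authority, path, query, fragment = m.group(2, 4, 5, 7, 9)
    if scheme is None or SCHEME_RE.fullmatch(scheme) is None:
        raise ValueError("missing or invalid scheme in %r" % (uri,))
    return scheme, authority, path, query, fragment


print(strict_urisplit("http://foo.com/bar?x=1#f"))
```

This still falls well short of a validating parser: it checks only the scheme and leaves the other components unvalidated.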
msg57736 - (view) Author: vincent kraeutler (vincentk) Date: 2007-11-21 10:29
Some more notes. 
a) RFC3986 explicitly states that the presented regex (which you use)
   """ is the regular expression for breaking-down a *well-formed* URI
reference into its components. """ (Emphasis added). I am not sure this
is a particularly good starting point for parsing potentially
security-critical data.

b) The parser fails on URIs containing numeric IPv6 addresses (e.g.
"http://[::1]:88/path"). Specifically, the following code in
split_authority is broken:

    if hostport and ':' in hostport:
        host, port = hostport.split(':', 1)

Clearly, if the authority may contain a ":" in the host's IP field, you
cannot simply split() off the port part.

Again, I am afraid I have no simple solution. Hate to sound so negative.

Kind regards,
v.
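
One possible way to handle the bracketed IPv6 case described in b) is to look for the closing "]" before looking for the port separator. The following Python 3 sketch illustrates the idea. The helper name split_hostport is hypothetical, it is not the split_authority code from the patch, and it assumes any userinfo has already been stripped from the authority.

```python
def split_hostport(hostport):
    """Split 'host[:port]' into (host, port); port is an int or None.

    A bracketed IP literal such as '[::1]:88' is returned without its
    brackets, so colons inside the literal are never taken for the port
    separator.
    """
    if hostport.startswith("["):
        end = hostport.find("]")
        if end == -1:
            raise ValueError("unterminated IP literal in %r" % (hostport,))
        host = hostport[1:end]
        rest = hostport[end + 1:]
        if rest == "":
            return host, None
        if not rest.startswith(":"):
            raise ValueError("unexpected text after ']' in %r" % (hostport,))
        port = rest[1:]
    else:
        host, sep, port = hostport.rpartition(":")
        if not sep:
            # No colon at all: the whole string is the host.
            return hostport, None
    # An empty port (e.g. 'example.com:') is treated as absent.
    return host, (int(port) if port else None)


print(split_hostport("[::1]:88"))
print(split_hostport("example.com:8080"))
```

A non-numeric port raises ValueError from int(), which seems preferable to silently returning a malformed port.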
msg57756 - (view) Author: vincent kraeutler (vincentk) Date: 2007-11-22 11:21
In the meantime, I have found a very nice parser combinator library for
Python (pyparse) and have implemented a validating parser for RFC 3986
URIs by more or less simply converting the complete ABNF grammar found
in the RFC. Obviously, this will never make it into the stdlib, due to a
dependency on an external library (pyparse), but it might be useful to
other people as well.

It's available here:
http://www.kraeutler.net/vincent/pub/netaddress/netaddress-0.1.tar.gz
msg69227 - (view) Author: Facundo Batista (facundobatista) * (Python committer) Date: 2008-07-03 19:10
Senthil, we should incorporate the tests from RFC 3986 into the test
suite, what do you think?

Could we integrate the effort from Paul Jimenez with the current urlparse
and achieve an RFC-compliant library? Should we handle this compliance in
this bug?
msg71919 - (view) Author: Senthil Kumaran (orsenthil) * (Python committer) Date: 2008-08-25 13:14
Hello Paul, 
Have you been keeping track of the urlparse changes in Python 2.6? I 
analyzed your patch and read through RFC 3986, and have the 
following comments:

1) The usage of this module is very different from the current 
urlparse module's usage. It may be that this module was designed to 
co-exist with urlparse, providing certain additional functionality. 
But as a replacement for urlparse, I find this module is "backward 
incompatible with the code base".

Some comments on the extra features provided by, and claims made for, this module:

2) The module provides a URI handling framework that includes default 
URI parsers for common URI schemes.
    - RFC 3986 specifies that scheme-specific handling is left to the 
separate RFCs describing the schemes.
    - The uriparse library attempts that by providing a default port and 
default hostname for certain schemes, but that can be made available 
as a patch to urlparse rather than as a new library. The need for such a 
change in urlparse needs to be analyzed, as no requirement has been 
raised so far for providing a default port or default host for 
schemes where one is applicable.

3) urlsplit, urlunsplit, and splitting the authority into sub-components are
available in the current urlparse library itself and are RFC 3986 
conformant.

4) urljoin in the current urlparse (patched with fixes) is currently 
RFC 3986 conformant.

What urlparse further requires, and this patch also lacks, is (as 
commented by John J Lee):
1) Handling of IRIs.
2) Python Unicode strings.
3) Percent-encoding for IRIs and Python Unicode strings.
(There is a discussion going on about quote and unquote of Unicode, and 
that would basically be extended to the above points as well.)

- If required, we can adopt the default host and port provision 
mechanisms mentioned in this patch into the current urlparse.

Other than that, I see that urlparse currently has all the changes 
mentioned in this patch, which makes the attached patch obsolete.

Please let me know your comments/ thoughts.

Thanks.
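
As a rough illustration of points 2) and 3) above, a per-scheme default port can be layered on top of the existing split functions without a new parsing framework. The sketch below assumes Python 3, where urlparse.urlsplit lives in urllib.parse. The DEFAULT_PORTS table and the effective_port helper are hypothetical names and are not part of urlparse or of the uriparse patch.

```python
from urllib.parse import urlsplit

# Illustrative table only; a real one would follow each scheme's own RFC.
DEFAULT_PORTS = {"http": 80, "https": 443, "ftp": 21}


def effective_port(url):
    """Return the explicit port of url, or the scheme's default, or None."""
    parts = urlsplit(url)
    if parts.port is not None:
        return parts.port
    return DEFAULT_PORTS.get(parts.scheme)


parts = urlsplit("http://user@example.com:8080/path?q=1")
print(parts.username, parts.hostname, parts.port)  # authority sub-components
print(effective_port("http://example.com/"))
print(effective_port("foo://example.com/"))
```

Keeping the defaults in a plain dictionary leaves scheme-specific knowledge outside the generic parser, in line with RFC 3986 leaving scheme handling to the scheme's own specification.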
msg93512 - (view) Author: Paul Jimenez (paulj) Date: 2009-10-03 22:37
Senthil wrote:
> > Senthil <orsenthil@gmail.com> added the comment:
> >
> > Hello Paul, 
> > Have you been keeping track of the urlparse changes in Python 2.6? 

No - do you have pointers to the particular changes you're
referring to?  I've done a bit of trying to catch up by searching
the mailing list, but want to make sure I know what you're
referring to in particular.

> > I 
> > analyzed your patch and read through RFC 3986, and have the 
> > following comments:
> >
> > 1) The usage of this module is very different from the current 
> > urlparse module's usage. It may be that this module was designed to 
> > co-exist with urlparse, providing certain additional functionality. 
> > But as a replacement for urlparse, I find this module is "backward 
> > incompatible with the code base".
> >
> > Some comments on the extra features provided by, and claims made for, this module:
> >   

Yes, there was no design goal of backward compatibility.

> > 2) The module provides a URI handling framework that includes default 
> > URI parsers for common URI schemes.
> >     - RFC 3986 specifies that scheme-specific handling is left to the 
> > separate RFCs describing the schemes.
> >     - The uriparse library attempts that by providing a default port and 
> > default hostname for certain schemes, but that can be made available 
> > as a patch to urlparse rather than as a new library. The need for such a 
> > change in urlparse needs to be analyzed, as no requirement has been 
> > raised so far for providing a default port or default host for 
> > schemes where one is applicable.
> >   

Okay; It just seemed completist to provide said defaults.

> > 3) urlsplit, urlunsplit, and splitting the authority into sub-components are
> > available in the current urlparse library itself and are RFC 3986 
> > conformant.
> >   

Ah... it used to not do this for unknown schemes, which was my
original impetus for this.

> > 4) urljoin in the current urlparse (patched with fixes) is currently 
> > RFC 3986 conformant.
> >
> > What urlparse further requires, and this patch also lacks, is (as 
> > commented by John J Lee):
> > 1) Handling of IRIs.
> > 2) Python Unicode strings.
> > 3) Percent-encoding for IRIs and Python Unicode strings.
> > (There is a discussion going on about quote and unquote of Unicode, and 
> > that would basically be extended to the above points as well.)
> >
> > - If required, we can adopt the default host and port provision 
> > mechanisms mentioned in this patch into the current urlparse.
> >
> > Other than that, I see that urlparse currently has all the changes 
> > mentioned in this patch, which makes the attached patch obsolete.
> >
> > Please let me know your comments/ thoughts.
> >   

It seems that urlparse now works for the case that caused me to rewrite
this (see the first comment on this bug for a link to the python-dev
archives where I posted about the 'itch' this code 'scratched'), so it's
fine with me if it just goes away now.
msg104298 - (view) Author: Paul Jimenez (paulj) Date: 2010-04-27 06:48
Since no one else has commented on this in over a year, and the new (2.6+) code works fine, I'll just close this to help clean things up.
msg104477 - (view) Author: Senthil Kumaran (orsenthil) * (Python committer) Date: 2010-04-29 02:16
Should we close this as out-of-date? I was inclined to see it as fixed, as urlparse has undergone changes in the direction suggested by this issue.

Sorry Paul, for no response.

Regarding this issue, I plan to use the test cases provided in the patch in the stdlib test suite, and either fix things there or comment/document in the code where a parsing conflict arises. This will help us keep track too.

I ran the test cases in the patch against the current trunk and I see 4 tests failing (borderline parsing scenarios). I shall take it up, commit the test cases to the trunk and fix it.
msg104688 - (view) Author: Paul Jimenez (paulj) Date: 2010-05-01 02:19
That sounds great - at least something useful will come out of this, even if it's just more tests for urlparse :)
msg105183 - (view) Author: Senthil Kumaran (orsenthil) * (Python committer) Date: 2010-05-07 04:31
Committed the tests in r80908, r80909, r80910 and r80911.
The backward-incompatible test cases, that is, those whose parsing requirements have changed since the previous RFC, have been commented out. There were 4 abnormal scenarios and one strict parsing requirement (overridden by the relaxed-parsing use case).

With respect to the commented-out tests, if we come across applications relying on that parsing behavior, we can address it then.

Thanks Paul and others for tracking this issue.
History
Date                 User            Action  Args
2010-07-11 05:18:40  orsenthil       unlink  issue1500504 dependencies
2010-05-07 04:31:54  orsenthil       set     status: open -> closed; type: enhancement -> behavior; messages: + msg105183; resolution: accepted -> fixed; stage: test needed -> resolved
2010-05-01 02:19:52  paulj           set     messages: + msg104688
2010-04-29 02:16:17  orsenthil       set     status: closed -> open; assignee: facundobatista -> orsenthil; resolution: out of date -> accepted; messages: + msg104477
2010-04-28 17:03:16  r.david.murray  set     resolution: out of date
2010-04-27 06:48:04  paulj           set     status: open -> closed; messages: + msg104298
2009-10-03 22:37:33  paulj           set     messages: + msg93512
2009-04-22 17:24:49  ajaksu2         link    issue1500504 dependencies
2009-02-13 01:40:20  ajaksu2         set     stage: test needed; type: enhancement; versions: + Python 2.7
2009-02-13 01:36:07  ajaksu2         link    issue1591035 dependencies
2008-08-25 13:14:36  orsenthil       set     messages: + msg71919
2008-07-03 19:10:38  facundobatista  set     assignee: facundobatista; messages: + msg69227; nosy: + facundobatista, orsenthil
2008-01-05 11:56:20  vila            set     nosy: + vila
2007-11-22 11:21:43  vincentk        set     messages: + msg57756
2007-11-21 10:29:37  vincentk        set     messages: + msg57736
2007-11-20 18:57:05  vincentk        set     nosy: + vincentk; messages: + msg57705
2007-10-29 16:38:16  ijmorlan        set     files: + testurlparse.py; nosy: + ijmorlan; messages: + msg56909
2007-08-30 22:26:17  skip.montanaro  set     nosy: + skip.montanaro
2006-04-01 03:30:42  paulj           create