Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RFC2732 support for urlparse (IPv6 addresses) #47236

Closed
ndim mannequin opened this issue May 27, 2008 · 25 comments
Closed

RFC2732 support for urlparse (IPv6 addresses) #47236

ndim mannequin opened this issue May 27, 2008 · 25 comments
Assignees
Labels
easy stdlib Python modules in the Lib dir type-feature A feature request or enhancement

Comments

@ndim
Copy link
Mannequin

ndim mannequin commented May 27, 2008

BPO 2987
Nosy @orsenthil, @pitrou, @benjaminp, @merwok, @bitdancer, @anacrolix
Files
  • python-urlparse-rfc2732-fix.patch: preliminary urlparse fix, requires more thought
  • python-urlparse-rfc2732-rfc-list.patch: update RFC list on top of urlparse.py
  • python-urlparse-rfc2732-test.patch: test cases with RFC2732 urls to parse
  • parse.py.patch: Patch to parse.py and test_urlparse.py
  • urlparse-module-header.diff
  • issue2987-final.patch
  • issue2987-bad_url_checks.diff
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = 'https://github.com/orsenthil'
    closed_at = <Date 2010-04-16.03:07:58.918>
    created_at = <Date 2008-05-27.20:40:36.606>
    labels = ['easy', 'type-feature', 'library']
    title = 'RFC2732 support for urlparse (IPv6 addresses)'
    updated_at = <Date 2011-01-20.09:49:49.325>
    user = 'https://bugs.python.org/ndim'

    bugs.python.org fields:

    activity = <Date 2011-01-20.09:49:49.325>
    actor = 'anacrolix'
    assignee = 'orsenthil'
    closed = True
    closed_date = <Date 2010-04-16.03:07:58.918>
    closer = 'orsenthil'
    components = ['Library (Lib)']
    creation = <Date 2008-05-27.20:40:36.606>
    creator = 'ndim'
    dependencies = []
    files = ['10451', '10452', '10453', '16888', '16892', '16931', '16962']
    hgrepos = []
    issue_num = 2987
    keywords = ['patch', 'easy']
    message_count = 25.0
    messages = ['67430', '67431', '98314', '98315', '102874', '102876', '102881', '102882', '102884', '102886', '102911', '102915', '102920', '103065', '103066', '103067', '103226', '103255', '103285', '103288', '103312', '103410', '103419', '103430', '103753']
    nosy_count = 11.0
    nosy_names = ['jjlee', 'orsenthil', 'pitrou', 'benjamin.peterson', 'ndim', 'eric.araujo', 'r.david.murray', 'sergiomb2', 'anacrolix', 'tlocke', 'Keegan.Carruthers-Smith']
    pr_nums = []
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'enhancement'
    url = 'https://bugs.python.org/issue2987'
    versions = ['Python 2.7', 'Python 3.2']

    @ndim
    Copy link
    Mannequin Author

    ndim mannequin commented May 27, 2008

    The urlparse module's ways of splitting the location into hostname and
    port breaks with RFC2732 style URIs with IPv6 addresses in them:

    >>> import urlparse
    >>> urlparse.urlparse('http://[::1]:80/').hostname
    '['
    >>> urlparse.urlparse('http://[::1]:80/').port
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/usr/lib/python2.5/urlparse.py", line 116, in port
        return int(port, 10)
    ValueError: invalid literal for int() with base 10: ':1]:80'
    >>> 

    A simple fix is attached, but probably requires a little more thought.

    @ndim ndim mannequin added stdlib Python modules in the Lib dir type-feature A feature request or enhancement labels May 27, 2008
    @ndim
    Copy link
    Mannequin Author

    ndim mannequin commented May 27, 2008

    I have written this patch because urlparse could not retrieve the
    hostname or port components of URIs such as
    http://[::ffff:192.168.13.37]/ or http://[dead:beef::1]:8888/

    This problem happens with Python 2.5.1 in Fedora 9, and I have also
    found it in Python's SVN trunk/ and release25-maint/ source code.

    It still needs some polishing and thinking: See the places marked
    FIXME, but probably also others. One would not want an inconsistent
    API feel with respect to IPv6 address handling.

    Might require some more comprehensive thought about how Python wants
    to handle networks other-than-IPv4, exceeding the scope of a simple
    patch to urlparse.py.

    On a not-totally-unrelated note, someone should examine whether IRIs[1]
    can fit into urlparse.py, or whether they need e.g. a separate
    iriparse.py with an adapted API.

    [1] RFC 3987 - Internationalized Resource Identifiers (IRIs)
    M. Duerst, M. Suignard, January 2005

    @devdanzin devdanzin mannequin added the easy label Apr 22, 2009
    @sergiomb2
    Copy link
    Mannequin

    sergiomb2 mannequin commented Jan 26, 2010

    Hi, with python-2.6.2-2.fc12.i686

    In: x ="http://www.somesite.com/images/rubricas/"
    In: urlparse.urljoin(x, '07.11.2009-9:54:12-1.jpg')
    Out: '07.11.2009-9:54:12-1.jpg' !?

    In: urlparse.urljoin(x, './07.11.2009-9:54:12-1.jpg')
    Out: 'http://www.somesite.com/images/rubricas/07.11.2009-9:54:12-1.jpg'

    urlparse.urlparse('07.11.2009-9:54:12-1.jpg')
    is wrong
    but 
    urlparse.urlparse('./07.11.2009-9:54:12-1.jpg')
    isn't. 

    think about that please

    @orsenthil
    Copy link
    Member

    okay, this should be easy to address. But the more important part is RFC compliance so that this simple change does not break many other things in the wild.

    @orsenthil orsenthil self-assigned this Jan 26, 2010
    @tlocke
    Copy link
    Mannequin

    tlocke mannequin commented Apr 11, 2010

    I've created a patch for parse.py against the py3k branch, and I've also included ndim's test cases in that patch file.

    When returning the host name of an IPv6 literal, I don't include the surrounding '[' and ']'. For example, parsing http://[::1]:5432/foo/ gives the host name '::1'.

    @merwok
    Copy link
    Member

    merwok commented Apr 11, 2010

    Seems sensible: Delimiters are not part of components.

    @pitrou
    Copy link
    Member

    pitrou commented Apr 11, 2010

    I think parsing should be a bit more careful. For example, what happens when you give 'http://dead:beef::]/foo/' as input (note the missing opening bracket)?

    @pitrou
    Copy link
    Member

    pitrou commented Apr 11, 2010

    By the way, updating the RFC list as done in python-urlparse-rfc2732-rfc-list.patch is also a good idea.

    @merwok
    Copy link
    Member

    merwok commented Apr 11, 2010

    Isn’t “http://dead:beef::]/foo/“ and invalid URI?

    Regarding doc, see also bpo-5650.

    @merwok merwok changed the title RFC2732 support for urlparse (e.g. http://[::1]:80/) RFC2732 support for urlparse (e.g. http:// Apr 11, 2010
    @pitrou
    Copy link
    Member

    pitrou commented Apr 11, 2010

    Isn’t “http://dead:beef::]/foo/“ and invalid URI?

    That's the point, it shouldn't parse as a valid one IMO.

    @pitrou pitrou changed the title RFC2732 support for urlparse (e.g. http:// RFC2732 support for urlparse (IPv6 addresses) Apr 11, 2010
    @tlocke
    Copy link
    Mannequin

    tlocke mannequin commented Apr 11, 2010

    Regarding the RFC list issue, I've posted a new patch with a new RFC list that combines ndim's list and the comments from bpo-5650.

    Pitrou argues that http://dead:beef::]/foo/ should fail because it's a malformed URL. My response would be that the parse() function has historically assumed that a URL is well formed, and so this change to accommodate IPv6 should continue to assume the URL is well formed.

    I'd say that a separate bug should be raised if it's thought that parse() should be changed to check that any URL is well-formed.

    @orsenthil
    Copy link
    Member

    With respect to msg98314 (http://bugs.python.org/msg98314) referenced in this bug, which I thought is easy to handle, does not appear so. It is bit tricky.

    The problem is the relative url is given of the format '07.11.2009-9:54:12-1.jpg' and urlparse wrongly assumes that it is VALID url with the scheme as 07.11.2009-9 ( Surprisingly, this falls under valid characters for a URL Scheme, but we know that there no url scheme like that).

    But when you give ./07.11.2009-9, ./ is identified a relative path and urljoin happens properly.

    My inclination for this specific msg9814, is the allow the user to give the proper path like ./07.11.2009-9 or use urljoin from different directory, images/07.11.2009-9 and this should handle it.

    This date-time relative url is not a typical scenario, but for typical scnerios, urlparse behaves as expected.

    >>> x = 'http://a.b.c'
    >>> urlparse.urljoin(x,'foo')
    'http://a.b.c/foo'
    >>> urlparse.urljoin(x,'./foo')
    'http://a.b.c/foo'
    >>> 

    I shall provide my comments on the IPv6 parse in next msg.

    @orsenthil
    Copy link
    Member

    After spending a sufficient amount of time looking at patches and the RFC 2732, I tend to agree with the patch provided by tlocke. It does cover the behavior for parsing IPv6 URL with '[' hostname ']'. RFC 2732 is very short and just says that hostname in the IPv6 should not have '[' and ']' characters. The patch does just that, which is fine.

    If hard pressed on detecting invalid IPv6 , I would add and extra

    + if "[" in netloc and "]" in netloc:
    + return netloc.split("]")[0][1:].lower()
    + elif "[" in netloc or "]" in netloc:
    + raise ValueError("Invalid IPv6 URL")

    Which should take care of Invalid IPv6 urls as discussed in this bug.

    • Any comments on this?

    Also regarding the urlparse header docs, (it was long pending on me and sorry), here is a patch for current one for review. When we address this bug, I shall include RFC 2732 as well in the list.

    @KeeganCarruthers-Smith
    Copy link
    Mannequin

    KeeganCarruthers-Smith mannequin commented Apr 13, 2010

    Just thought I'd point out that RFC2732 was obsoleted by RFC3986 http://www.rfc-editor.org/rfc/rfc3986.txt

    @merwok
    Copy link
    Member

    merwok commented Apr 13, 2010

    Hello

    Thanks for the precision. This particular topic is discussed on bpo-5650, feel free to help there!

    Better update the code before the doc, though.

    Regards

    @orsenthil
    Copy link
    Member

    Actually, this bug is just for parsing the IPv6 url. We are having the
    right set of patches in the bug. I shall commit it soon.
    The RFC part is separate and we will slowly achieve a good compliance
    with STD 66.

    @orsenthil
    Copy link
    Member

    Final patch with inclusion of detecting invalid urls at netloc and hostname level, tests and NEWS entry.

    @benjaminp
    Copy link
    Contributor

    This is ok with me.

    @orsenthil
    Copy link
    Member

    Committed into trunk in revision 80101

    @orsenthil
    Copy link
    Member

    merged into py3k in revision 80102 and release31-maint in revision 80103.

    Thanks for the patches, Tony and Hans. I have acknowledged it in NEWS file too.

    @orsenthil
    Copy link
    Member

    Reverted the check-in made to 3.1 maint (in r80104). Features should not go in there.

    @bitdancer
    Copy link
    Member

    I posted this to the checkins list, but for reference, the following invalid URL should be added to the test cases:

    http://[::1/foo/bar]/bad
    

    @orsenthil
    Copy link
    Member

    Moving the Bad URL check to a higher level can be detect the bad urls much better. Once I the netloc is parsed and obtained, invalid URL can be checked. I am attaching an update with the new test included.
    If you have any comments, please let me know.

    @bitdancer
    Copy link
    Member

    I don't know how deep you want to get into detecting invalid URIs, but with the new patch this one causes a parsing error that is probably worth dealing with:

    http://abc[xyz]jkl

    Maybe a reasonable set of checks would be (in hostname) that if the part of the netloc after the @ contains a ']' or a '[', then it must start with a [ and either end with a ] or contain a ']:'.

    I can also mess up your new checks with something like this:

    http://foo[bar@baz]

    or even:

    http://foo[bar@baz:33]

    although those don't fail, they just faithfully produce the nonsensical results implicit in the invalid urls. I think the above check logic in hostname would catch them, but it wouldn't catch this one:

    http://foo[bar@[bar]:33]

    That may be OK, though, since as you noted earlier we aren't doing full URI validation.

    Oh, and I notice that your test only covers the 'fast' path code, it doesn't exercise the general URI logic.

    (Sorry I didn't review this issue earlier.)

    @orsenthil
    Copy link
    Member

    I added an additional invalid test which David pointed out and made changes to invalid url checking code. I moved it more higher level.

    • The reason for doing this is, invalid url test code (which is very specific for '[' enclosed ']' ipv6 url is concentrated at a single place). We can deal with parsing separately from check.

    Now, other forms of Invalid URLs are possible as David points out (and possibly more too), but leaving it is better as it would unnecessarily add syntax-checks at various different places (instead of a single place), without much of value add. Dealing with Valid URLs and a parse logic checking should be fine.

    commits: trunk - r80277 and py3k - r80278

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    easy stdlib Python modules in the Lib dir type-feature A feature request or enhancement
    Projects
    None yet
    Development

    No branches or pull requests

    5 participants