Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

urlparse.urlparse misparses URLs with query but no path #45978

Closed
nagle mannequin opened this issue Dec 16, 2007 · 6 comments
Closed

urlparse.urlparse misparses URLs with query but no path #45978

nagle mannequin opened this issue Dec 16, 2007 · 6 comments
Labels
stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error

Comments

@nagle
Copy link
Mannequin

nagle mannequin commented Dec 16, 2007

BPO 1637
Nosy @gvanrossum

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

Show more details

GitHub fields:

assignee = None
closed_at = <Date 2008-01-05.22:19:22.287>
created_at = <Date 2007-12-16.17:29:43.556>
labels = ['type-bug', 'library']
title = 'urlparse.urlparse misparses URLs with query but no path'
updated_at = <Date 2008-01-05.23:35:12.431>
user = 'https://bugs.python.org/nagle'

bugs.python.org fields:

activity = <Date 2008-01-05.23:35:12.431>
actor = 'gvanrossum'
assignee = 'none'
closed = True
closed_date = <Date 2008-01-05.22:19:22.287>
closer = 'gvanrossum'
components = ['Library (Lib)']
creation = <Date 2007-12-16.17:29:43.556>
creator = 'nagle'
dependencies = []
files = []
hgrepos = []
issue_num = 1637
keywords = []
message_count = 6.0
messages = ['58673', '58761', '58842', '58846', '59337', '59339']
nosy_count = 3.0
nosy_names = ['gvanrossum', 'nagle', 'vila']
pr_nums = []
priority = 'normal'
resolution = 'accepted'
stage = None
status = 'closed'
superseder = None
type = 'behavior'
url = 'https://bugs.python.org/issue1637'
versions = ['Python 2.5', 'Python 2.4']

@nagle nagle mannequin added stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error labels Dec 16, 2007
@nagle
Copy link
Mannequin Author

nagle mannequin commented Dec 16, 2007

urlparse.urlparse will mis-parse URLs which have a "/" after a "?".

>
> sa1 = 'http://example.com?blahblah=/foo'
> sa2 = 'http://example.com?blahblah=foo'
> print urlparse.urlparse(sa1)
> ('http', 'example.com?blahblah=', '/foo', '', '', '') # WRONG
> print urlparse.urlparse(sa2)
> ('http', 'example.com', '', '', 'blahblah=foo', '') # RIGHT

That's wrong. RFC3896 ("Uniform Resource Identifier (URI): Generic
Syntax"), page 23 says

"The characters slash ("/") and question mark ("?") may represent data
within the query component.  Beware that some older, erroneous
implementations may not handle such data correctly when it is used as
the base URI for relative references (Section 5.1), apparently
because they fail to distinguish query data from path data when
looking for hierarchical separators."

So "urlparse" is an "older, erroneous implementation". Looking
at the code for "urlparse", it references RFC1808 (1995), which
was a long time ago, three revisions back.

>
> Here's the bad code:
>
> def _splitnetloc(url, start=0):
> for c in '/?#': # the order is important!
> delim = url.find(c, start)
> if delim >= 0:
> break
> else:
> delim = len(url)
> return url[start:delim], url[delim:]
>
> That's just wrong. The domain ends at the first appearance of
> any character in '/?#', but that code returns the text before the
> first '/' even if there's an earlier '?'. A URL/URI doesn't
> have to have a path, even when it has query parameters.

OK, here's a fix to "urlparse", replacing _splitnetloc. I didn't use
a regular expression because "urlparse" doesn't import "re", and I
didn't want to change that.

def _splitnetloc(url, start=0):
    delim = len(url)# position of end of domain part of url, default is end
    for c in '/?#':    # look for delimiters; the order is NOT important   
        wdelim = url.find(c, start)    # find first of this delim
        if wdelim >= 0:            # if found
            delim = min(delim, wdelim)# use earliest delim position
    return url[start:delim], url[delim:]    # return (domain, rest)

@gvanrossum
Copy link
Member

Would you mind submitting a proper patch for Python 2.5 and/or 2.6
generated by "svn diff" relative to an (anonymous) checkout, and adding
a unit test? Then I'd be happy to accept this and if it makes it in
time for the 2.5.2 release we'll fix it there.

@nagle
Copy link
Mannequin Author

nagle mannequin commented Dec 19, 2007

I tried downloading the latest rev of urlparse.py (59480) and it flunked
its own unit test, "urlparse.test()" Two test cases fail. So I don't
want to try to fix the module until the last people to change it fix
their unit test problems.
The fix I provided should fix the problem I reported, but I'm not sure
if there's anything else wrong, since it flunks its unit test.

@gvanrossum
Copy link
Member

I tried downloading the latest rev of urlparse.py (59480) and it flunked
its own unit test, "urlparse.test()" Two test cases fail.

That's not the official test -- that code should probably be deleted.
The real test is in Lib/test/test_urlparse.py. Please ignore that test.

@gvanrossum
Copy link
Member

I checked in your change and made up a test.

Committed revision 59758.

Thanks!

@gvanrossum
Copy link
Member

Backport to 2.5.2:
Committed revision 59760.

@ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error
Projects
None yet
Development

No branches or pull requests

1 participant