
urlparse normalize URL path #48441

Closed
monkeboy mannequin opened this issue Oct 24, 2008 · 5 comments
Labels
type-bug An unexpected behavior, bug, or error

Comments

monkeboy mannequin commented Oct 24, 2008

BPO 4191
Nosy @orsenthil, @devdanzin

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


GitHub fields:

assignee = None
closed_at = <Date 2009-02-18.14:38:01.605>
created_at = <Date 2008-10-24.07:50:32.630>
labels = ['type-bug']
title = 'urlparse normalize URL path'
updated_at = <Date 2009-02-18.14:38:01.603>
user = 'https://bugs.python.org/monkeboy'

bugs.python.org fields:

activity = <Date 2009-02-18.14:38:01.603>
actor = 'ajaksu2'
assignee = 'none'
closed = True
closed_date = <Date 2009-02-18.14:38:01.605>
closer = 'ajaksu2'
components = []
creation = <Date 2008-10-24.07:50:32.630>
creator = 'monk.e.boy'
dependencies = []
files = []
hgrepos = []
issue_num = 4191
keywords = []
message_count = 5.0
messages = ['75154', '75851', '81846', '82401', '82416']
nosy_count = 4.0
nosy_names = ['jjlee', 'orsenthil', 'ajaksu2', 'monk.e.boy']
pr_nums = []
priority = 'low'
resolution = None
stage = None
status = 'closed'
superseder = None
type = 'behavior'
url = 'https://bugs.python.org/issue4191'
versions = ['Python 2.6']

monkeboy mannequin commented Oct 24, 2008

Hi,

The way urljoin works is a bit funky: equivalent paths are not cleaned
in a consistent way:

import urlparse
import posixpath

print urlparse.urljoin('http://www.example.com', '///')
print urlparse.urljoin('http://www.example.com/', '///')
print urlparse.urljoin('http://www.example.com///', '///')
print urlparse.urljoin('http://www.example.com///', '//')
print urlparse.urljoin('http://www.example.com///', '/')
print urlparse.urljoin('http://www.example.com///', '')
print
# the above should reduce down to:
print posixpath.normpath('///')
print
print urlparse.urljoin('http://www.example.com///', '.')
print urlparse.urljoin('http://www.example.com///', '/.')
print urlparse.urljoin('http://www.example.com///', './')
print urlparse.urljoin('http://www.example.com///', '/.')
print
print posixpath.normpath('/.')
print
print urlparse.urljoin('http://www.example.com///', '..')
print urlparse.urljoin('http://www.example.com', '/a/../a/')
print urlparse.urljoin('http://www.example.com', '../')
print urlparse.urljoin('http://www.example.com', 'a/../a/')
print urlparse.urljoin('http://www.example.com', 'a/../a/./')
print urlparse.urljoin('http://www.example.com/a/../a/', '../a/./../a/')
print urlparse.urljoin('http://www.example.com/a/../a/', '/../a/./../a/')

The results of the above code are:

http://www.example.com/
http://www.example.com/
http://www.example.com/
http://www.example.com///
http://www.example.com/
http://www.example.com///

/

http://www.example.com///
http://www.example.com/.
http://www.example.com///
http://www.example.com/.

/

http://www.example.com
http://www.example.com/.
http://www.example.com
http://www.example.com/.

http://www.example.com//
http://www.example.com/a/../a/
http://www.example.com/../
http://www.example.com/a/
http://www.example.com/a/
http://www.example.com/a/
http://www.example.com/../a/./../a/

Sometimes the path is cleaned, sometimes it is not. When it is cleaned,
the cleaning process is not perfect.

The bit of code that is causing problems is commented:

# XXX The stuff below is bogus in various ways...

If I may be so bold, I would like to see this URL cleaning code stripped
from urljoin.

A new method/function could be added that cleans a URL. It could have a
'mimic browser' option, because a browser *will* follow URLs like
http://example.com/../../../ (see this non-bug:
http://bugs.python.org/issue2583).

The URL cleaner could use some of the code from "posixpath". Shorter
URLs would be preferred over longer ones (e.g. http://example.com
preferred to http://example.com/).
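A rough sketch of what such a cleaner might look like, written in modern Python 3 terms (where the urlparse module became urllib.parse). The name clean_url and its exact rules are illustrative assumptions, not anything settled in this thread:

```python
from urllib.parse import urlsplit, urlunsplit
import posixpath

def clean_url(url):
    """Hypothetical cleaner: normalize only the path component of a URL
    with posixpath-style rules, as the report above suggests."""
    scheme, netloc, path, query, fragment = urlsplit(url)
    if path:
        normalized = posixpath.normpath(path)
        # posixpath.normpath drops a trailing '/', which can be
        # significant in a URL, so restore it.
        if path.endswith('/') and not normalized.endswith('/'):
            normalized += '/'
        # Caveat: this also collapses empty '//' segments, which some
        # schemes treat as significant -- a real API would need a flag.
        path = normalized
    return urlunsplit((scheme, netloc, path, query, fragment))
```

Because posixpath.normpath discards leading '..' segments in an absolute path, this sketch clamps http://example.com/../../../ at the root to http://example.com/, which is the 'mimic browser' behavior proposed above.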

Thanks,

monk.e.boy

@orsenthil
Member

This report almost looks like a bug in urlparse, but it is not. We
have to consider a few cases here.

  1. First of all, we cannot equate urlparse, urlsplit, or urljoin with
    the path normalization provided by posixpath.normpath. URL syntax is
    defined strictly by the RFCs, which differ from an operating
    system's file and directory naming rules, so the expectation that
    urljoin() should return the same result as posixpath.normpath() is
    wrong. The most we can ask is whether urlparse follows the
    guidelines of RFC 1808 to start with, and of RFC 3986 (current).

  2. Secondly, in a generic sense, it is better to follow the
    RFC-defined parsing rules for URLs than to implement browser
    behavior, because urlparse also needs to parse URLs of other
    schemes, say svn+ssh, where svn+ssh://localhost///// is a valid URL
    and '////' is the name of a directory holding my source code. Quite
    possible, right? So it should not be collapsed to '/', which would
    be wrong.

  3. And coming down to the more specific issues with the examples
    presented in this report: urlsplit treats a leading '//' as
    introducing the netloc, while both a single '/' and '///' parse to
    the path '/':

>>> urlparse.urlsplit('//')
SplitResult(scheme='', netloc='', path='', query='', fragment='')
>>> urlparse.urlsplit('/')
SplitResult(scheme='', netloc='', path='/', query='', fragment='')
>>> urlparse.urlsplit('///')
SplitResult(scheme='', netloc='', path='/', query='', fragment='')
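These splits are unchanged in modern Python 3, where the urlparse module became urllib.parse; the same behavior can be checked directly:

```python
from urllib.parse import urlsplit

# A leading '//' introduces a (here empty) netloc and leaves no path,
# while '/' and '///' both parse to the path '/'.
assert urlsplit('//').netloc == '' and urlsplit('//').path == ''
assert urlsplit('/').path == '/'
assert urlsplit('///').path == '/'
```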

With this in mind, look again at the examples you provided:

print urlparse.urljoin('http://www.example.com///', '//')
print urlparse.urljoin('http://www.example.com///', '/')
print urlparse.urljoin('http://www.example.com///', '')

You will find that they behave according to the parsing and joining
rules defined in RFC 1808 (http://www.faqs.org/rfcs/rfc1808.html).

The same goes for the other examples, monk.e.boy.

If you find that a urlparse method has a problem, please point me to
the section of RFC 1808/RFC 3986 that it fails to conform to, and I
shall work on a patch to fix it.

This report is not a valid bug and can be closed.
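A hedged historical footnote: in modern Python 3, urllib.parse.urljoin resolves dot segments per RFC 3986's remove_dot_segments algorithm, so several of the joins from the report now come out normalized, while empty '//' segments are still preserved (consistent with the svn+ssh argument above):

```python
from urllib.parse import urljoin

# Dot segments ('.' and '..') are resolved per RFC 3986...
assert urljoin('http://www.example.com', 'a/../a/') == 'http://www.example.com/a/'
assert urljoin('http://www.example.com', '../') == 'http://www.example.com/'
assert urljoin('http://www.example.com///', '/') == 'http://www.example.com/'
# ...but empty path segments are not collapsed:
assert urljoin('http://www.example.com///', '') == 'http://www.example.com///'
```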

@orsenthil orsenthil added the type-bug An unexpected behavior, bug, or error label Nov 14, 2008

devdanzin mannequin commented Feb 13, 2009

Will close soon if nobody is against it.

@orsenthil
Member

Please close this, Daniel.


devdanzin mannequin commented Feb 18, 2009

Thanks Senthil!

@devdanzin devdanzin mannequin closed this as completed Feb 18, 2009
@ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022