urlparse fails at parsing "www.python.org:80/" #61136

Closed
sandrotosi opened this issue Jan 11, 2013 · 11 comments
Labels
stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error

Comments

@sandrotosi
Contributor

BPO 16932
Nosy @birkenfeld, @orsenthil, @asvetlov, @sandrotosi, @miss-islington
PRs
  • bpo-27657: Fix urlparse() with numeric paths #661
  • [3.7] bpo-27657: Fix urlparse() with numeric paths (GH-661) #16837
  • [3.8] bpo-27657: Fix urlparse() with numeric paths (GH-661) #16839
Files
  • urlparse.diff

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    GitHub fields:

    assignee = 'https://github.com/orsenthil'
    closed_at = <Date 2013-02-26.09:03:53.168>
    created_at = <Date 2013-01-11.11:29:28.972>
    labels = ['type-bug', 'library']
    title = 'urlparse fails at parsing "www.python.org:80/"'
    updated_at = <Date 2019-10-18.15:23:21.811>
    user = 'https://github.com/sandrotosi'

    bugs.python.org fields:

    activity = <Date 2019-10-18.15:23:21.811>
    actor = 'orsenthil'
    assignee = 'orsenthil'
    closed = True
    closed_date = <Date 2013-02-26.09:03:53.168>
    closer = 'orsenthil'
    components = ['Library (Lib)']
    creation = <Date 2013-01-11.11:29:28.972>
    creator = 'sandro.tosi'
    dependencies = []
    files = ['28692']
    hgrepos = []
    issue_num = 16932
    keywords = []
    message_count = 11.0
    messages = ['179670', '179672', '179702', '179709', '179712', '183029', '183035', '183036', '354891', '354896', '354905']
    nosy_count = 6.0
    nosy_names = ['georg.brandl', 'orsenthil', 'asvetlov', 'sandro.tosi', 'python-dev', 'miss-islington']
    pr_nums = ['661', '16837', '16839']
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue16932'
    versions = ['Python 2.7', 'Python 3.2', 'Python 3.3', 'Python 3.4']

    @sandrotosi
    Contributor Author

    Hello,
    as reported at http://mail.python.org/pipermail/docs/2013-January/012375.html, urlparse fails to parse URLs that have no scheme but do have a URL path, contrary to what is documented at http://docs.python.org/2/library/urlparse.html?highlight=urlparse#urlparse :

    ./python -c "from urlparse import urlparse ; print(urlparse('python.org:80/'))"
    ParseResult(scheme='python.org', netloc='', path='80/', params='', query='', fragment='')

    (That is for 2.7, but the same happens on all active 3.x branches.)

    I'm attaching a test to expose this failure.

    @sandrotosi sandrotosi added the stdlib Python modules in the Lib dir label Jan 11, 2013
    @sandrotosi
    Contributor Author

    Adding Senthil as per expert list

    @birkenfeld
    Member

    This is not a bug: urlparse is there to parse URLs, and URLs start with a URL scheme such as "http:".

    There is no way for a generic URL parser to know that "python.org:80/" is supposed to be "http://python.org:80/".
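A caller who does know the intended scheme can supply it before parsing. The following is a minimal sketch, assuming Python 3's `urllib.parse` (the Python 2 module discussed in this report is `urlparse`). The helper name `parse_with_default_scheme` and the `"http"` default are hypothetical and not part of the standard library:

```python
from urllib.parse import urlparse


def parse_with_default_scheme(text, default_scheme="http"):
    """Parse text as a URL, prepending default_scheme when no '://' is present.

    Hypothetical helper: only the caller knows that the input is meant to be
    a host with an optional port, so the caller has to say so explicitly.
    """
    if "://" not in text:
        text = default_scheme + "://" + text
    return urlparse(text)


result = parse_with_default_scheme("python.org:80/")
print(result.scheme, result.hostname, result.port, result.path)
```

The `"://"` check is deliberately naive: an input such as `mailto:someone@example.org` would also get the default scheme prepended, so this only suits inputs already known to be host-like.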

    @sandrotosi
    Contributor Author

    The documentation reports this example:

    >>> urlparse('www.cwi.nl:80/%7Eguido/Python.html')
    ParseResult(scheme='', netloc='', path='www.cwi.nl:80/%7Eguido/Python.html',
               params='', query='', fragment='')

    but when executed, it returns:

    $ ./python -V
    Python 2.7.3+
    $ ./python -c "from urlparse import urlparse ; print urlparse('www.cwi.nl:80/%7Eguido/Python.html')"
    ParseResult(scheme='www.cwi.nl', netloc='', path='80/%7Eguido/Python.html', params='', query='', fragment='')

    which doesn't match.

    @birkenfeld
    Member

    Hmm, you're right. The behavior has been like this at least since Python 2.5:

    Python 2.5.4 (r254:67916, Dec 16 2012, 20:33:12) 
    [GCC 4.6.3] on linux3
    Type "help", "copyright", "credits" or "license" for more information.
    >>> from urlparse import urlparse
    >>> urlparse('www.cwi.nl:80/%7Eguido/Python.html')
    ('www.cwi.nl', '', '80/%7Eguido/Python.html', '', '', '')

    The docs refer to RFC 1808. From a quick glance at the BNF in section 2.2, RFC 1808 allows dots in the scheme, but also allows ":" in the path. So there seems to be a parsing ambiguity, but see section 2.4.2:

    If the parse string contains a colon ":" after the first character
    and before any characters not allowed as part of a scheme name (i.e.,
    any not an alphanumeric, plus "+", period ".", or hyphen "-"), the
    <scheme> of the URL is the substring of characters up to but not
    including the first colon. These characters and the colon are then
    removed from the parse string before continuing.

    That would indicate that the implementation is correct and the documentation should be fixed. Senthil?
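The quoted rule from section 2.4.2 can be sketched in a few lines. This is an illustration of the rule as quoted above, not the actual `urlparse` implementation; the names `SCHEME_CHARS` and `split_scheme` are hypothetical:

```python
import string

# Characters the quoted rule allows in a scheme name:
# alphanumerics, plus "+", period ".", and hyphen "-".
SCHEME_CHARS = frozenset(string.ascii_letters + string.digits + "+.-")


def split_scheme(url):
    """Return (scheme, rest) following the quoted RFC 1808 rule.

    A scheme is recognized only if the first colon comes after the first
    character and every character before it is allowed in a scheme name.
    """
    i = url.find(":")
    if i > 0 and all(c in SCHEME_CHARS for c in url[:i]):
        return url[:i], url[i + 1:]
    return "", url


print(split_scheme("www.cwi.nl:80/%7Eguido/Python.html"))
print(split_scheme("www.cwi.nl/%7Eguido/Python.html"))
```

Under this rule the first input yields `www.cwi.nl` as the scheme, because dots are legal scheme characters, while the second input has no colon and is left whole.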

    @birkenfeld birkenfeld reopened this Jan 11, 2013
    @orsenthil
    Member

    I am noticing this one late; sorry for that.
    I agree that this is a docs issue, and I would like to fix it this way.

    Give the doc example as:

    >>> urlparse('www.cwi.nl/%7Eguido/Python.html')
    ParseResult(scheme='', netloc='', path='www.cwi.nl/%7Eguido/Python.html', params='', query='', fragment='')

    instead of

    >>> urlparse('www.cwi.nl:80/%7Eguido/Python.html')

    which introduces the tricky ":80" case and invokes the rule that Georg pointed out in his message. If I recall correctly, the point of the example was to show that URLs (following RFC 1808) must start with "//" for their netloc to be identified; otherwise the text is treated as a path.

    A ":" before a port, without a preceding "scheme:", is really tricky for any application, so it is right for the parser to treat anything before the ":" as the scheme, and the implementation here is correct.

    So, instead of changing the documented output to show the scheme as "www.cwi.nl", which is quite meaningless, the better fix is to change the example to urlparse('www.cwi.nl/%7Eguido/Python.html'), so that the documented result stays the same.

    I am going ahead with the fix. Thanks.
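The three cases discussed in this comment can be compared side by side. This is a small sketch assuming Python 3, where the function lives in `urllib.parse` rather than the Python 2 `urlparse` module; the variable names `examples` and `results` are illustrative only:

```python
from urllib.parse import urlparse

examples = (
    "www.cwi.nl:80/%7Eguido/Python.html",    # colon, no "//": text before ":" becomes the scheme
    "www.cwi.nl/%7Eguido/Python.html",       # no colon, no "//": everything is path
    "//www.cwi.nl:80/%7Eguido/Python.html",  # leading "//": netloc is recognized
)

results = {text: urlparse(text) for text in examples}

for text, parts in results.items():
    print(text)
    print("  scheme=%r netloc=%r path=%r" % (parts.scheme, parts.netloc, parts.path))
```

Only the third form exposes the host and port through `netloc`, which is why an input without a scheme needs the leading `//` if the caller wants `hostname` and `port` populated.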

    @orsenthil orsenthil removed the invalid label Feb 26, 2013
    @python-dev
    Mannequin

    python-dev mannequin commented Feb 26, 2013

    New changeset 33895c474b4d by Senthil Kumaran in branch '2.7':
    Fix bpo-16932: Fix the urlparse example. Remote :port when scheme is not specified to demonstrate correct behavior
    http://hg.python.org/cpython/rev/33895c474b4d

    New changeset 5442a77b925c by Senthil Kumaran in branch '3.2':
    Fix bpo-16932: Fix the urlparse example. Remote :port when scheme is not specified to demonstrate correct behavior
    http://hg.python.org/cpython/rev/5442a77b925c

    New changeset 8928205f57f6 by Senthil Kumaran in branch '3.3':
    Fix bpo-16932: Fix the urlparse example. Remote :port when scheme is not specified to demonstrate correct behavior
    http://hg.python.org/cpython/rev/8928205f57f6

    New changeset 9caad461936e by Senthil Kumaran in branch 'default':
    Fix bpo-16932: Fix the urlparse example. Remote :port when scheme is not specified to demonstrate correct behavior
    http://hg.python.org/cpython/rev/9caad461936e

    @orsenthil
    Member

    I have fixed the docs issue. Thanks for the report and following up.

    @orsenthil orsenthil self-assigned this Feb 26, 2013
    @orsenthil orsenthil added the type-bug An unexpected behavior, bug, or error label Feb 26, 2013
    @orsenthil
    Member

    New changeset 5a88d50 by Senthil Kumaran (Tim Graham) in branch 'master':
    bpo-27657: Fix urlparse() with numeric paths (#661)
    5a88d50

    @miss-islington
    Contributor

    New changeset 82b5f6b by Miss Islington (bot) in branch '3.7':
    bpo-27657: Fix urlparse() with numeric paths (GH-661)
    82b5f6b

    @orsenthil
    Member

    New changeset 0f3187c by Senthil Kumaran in branch '3.8':
    [3.8] bpo-27657: Fix urlparse() with numeric paths (GH-661) (#16839)
    0f3187c

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022