Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

urllib.parse.urlparse() accepts wrong URLs #64470

Open
serhiy-storchaka opened this issue Jan 15, 2014 · 14 comments
Open

urllib.parse.urlparse() accepts wrong URLs #64470

serhiy-storchaka opened this issue Jan 15, 2014 · 14 comments
Labels
3.9 only security fixes 3.10 only security fixes 3.11 only security fixes stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error

Comments

@serhiy-storchaka
Copy link
Member

BPO 20271
Nosy @gvanrossum, @orsenthil, @vadmium, @serhiy-storchaka, @1st1, @demianbrecht, @tirkarthi
Dependencies
  • bpo-20270: urllib.parse doesn't work with empty port
  • Files
  • urlparse.patch
  • urlparse_2.patch
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = None
    closed_at = None
    created_at = <Date 2014-01-15.12:32:09.899>
    labels = ['type-bug', 'library', '3.9', '3.10', '3.11']
    title = 'urllib.parse.urlparse() accepts wrong URLs'
    updated_at = <Date 2022-01-17.12:57:05.760>
    user = 'https://github.com/serhiy-storchaka'

    bugs.python.org fields:

    activity = <Date 2022-01-17.12:57:05.760>
    actor = 'vstinner'
    assignee = 'none'
    closed = False
    closed_date = None
    closer = None
    components = ['Library (Lib)']
    creation = <Date 2014-01-15.12:32:09.899>
    creator = 'serhiy.storchaka'
    dependencies = ['20270']
    files = ['34300', '38322']
    hgrepos = []
    issue_num = 20271
    keywords = ['patch']
    message_count = 14.0
    messages = ['208156', '208675', '209852', '212888', '212889', '212890', '236990', '237001', '237157', '237369', '237371', '238982', '338974', '355315']
    nosy_count = 7.0
    nosy_names = ['gvanrossum', 'orsenthil', 'martin.panter', 'serhiy.storchaka', 'yselivanov', 'demian.brecht', 'xtreak']
    pr_nums = []
    priority = 'normal'
    resolution = None
    stage = 'patch review'
    status = 'open'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue20271'
    versions = ['Python 3.9', 'Python 3.10', 'Python 3.11']

    @serhiy-storchaka
    Copy link
    Member Author

    >>> import urllib.parse
    >>> p = urllib.parse.urlparse('http://[::1]spam:80')
    >>> p
    ParseResult(scheme='http', netloc='[::1]spam:80', path='', params='', query='', fragment='')
    >>> p.hostname
    '::1'
    >>> p.port
    80

    'http://[::1]spam:80' is invalid URL, but urllib.parse.urlparse() accepts it and just ignore the spam part.

    @serhiy-storchaka serhiy-storchaka added stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error labels Jan 15, 2014
    @serhiy-storchaka
    Copy link
    Member Author

    What should be correct behavior? Raise an exception, return '[::1]spam' as hostname, or left all as is?

    @1st1
    Copy link
    Member

    1st1 commented Feb 1, 2014

    I think we should start raising an exception in 3.5 (backwards incompatible change to back-port it)

    @vstinner
    Copy link
    Member

    vstinner commented Mar 7, 2014

    Here is a patch for Python 3.5 which breaks backward compatibility: urlparse functions now raise a ValueError if the IPv6 address, port or host is invalid.

    Examples of invalid URLs:

    According to unit tests, Python 3.4 is more tolerant: it only raises an error when the port number is read (obj.port) from an URL with an invalid port. There error is not raised when the whole URL is parsed.

    Is it ok to break backward compatibility?

    @vstinner
    Copy link
    Member

    vstinner commented Mar 7, 2014

    Oh, by the way my patch fixes also bpo-18191.

    @vstinner
    Copy link
    Member

    vstinner commented Mar 7, 2014

    My patch "urlparse.patch" may be modified to fix also bpo-18191 in Python 2.7, 3.3 and 3.4: splitport() should handle IPv6 ("[::1]:80") and auth ("user:passowrd@host") but not raises an exception.

    @serhiy-storchaka
    Copy link
    Member Author

    splitport() shouldn't handle auth, it should be called after auth is dropped with splituser().

    The patch makes URL parsing slower, especially URLs with IPv6 addresses.

    $ ./python -m timeit -s "from urllib.parse import urlparse, clear_cache" -- "urlparse('http://python.org:80'); clear_cache()"
    Unpatched: 10000 loops, best of 3: 70.3 usec per loop
    Patched:   10000 loops, best of 3: 105 usec per loop
    
    $ ./python -m timeit -s "from urllib.parse import urlparse, clear_cache" -- "urlparse('http://[2001:4802:7901:0:e60a:1375:0:5]:80'); clear_cache()"
    Unpatched: 10000 loops, best of 3: 71.1 usec per loop
    Patched:   1000 loops, best of 3: 239 usec per loop

    @vadmium
    Copy link
    Member

    vadmium commented Mar 2, 2015

    I suspect the proposed patch here would make some of the special error handling for the “port” property redundant, and would compete with my patch for bpo-20059, which raises ValueError in more cases when “port” is accessed, rather than returning None.

    There should also be test cases (and maybe documentation or fixes?) to illustrate the behaviour of parsing:

    @serhiy-storchaka
    Copy link
    Member Author

    With optimizations in bpo-23563 and weaken IPv6 check the implementation of urlsplit() can be faster.

    $ ./python -m timeit -s "from urllib.parse import urlparse, clear_cache" -- "urlparse('http://python.org:80'); clear_cache()"
    10000 loops, best of 3: 86.3 usec per loop
    $ ./python -m timeit -s "from urllib.parse import urlparse, clear_cache" -- "urlparse('http://[2001:4802:7901:0:e60a:1375:0:5]:80'); clear_cache()"
    10000 loops, best of 3: 88.6 usec per loop

    @demianbrecht
    Copy link
    Mannequin

    demianbrecht mannequin commented Mar 6, 2015

    I think some further consideration around this change is worthwhile:

    Currently, urllib.parse.split* methods seem to do inconsistent validation around the data they're splitting. For example:

    (None for an invalid port)
    >>> parse.splitnport('example.com:foo')
    ('example.com', None)

    Whereas other split* methods do not:

    (Auth part should be URL-encoded)
    >>> parse.splituser('u@ser:p@ssword@example.com:80')
    ('u@ser:p@ssword', 'example.com:80')

    And others are just plain incorrect:

    (:bad should be the port as defined by the ABNF 'authority = [ userinfo "@" ] host [ ":" port ]')
    >>> parse.splitport('example.com:bad')
    ('example.com:bad', None)

    However, none of these cases (currently) raise exceptions.

    Looking at urllib.parse, two large motivations behind it are splitting and parsing. In my mind, splitting should solely be responsible for splitting input based on RFC-defined delimiters. Parsing on the other hand, should be responsible for both splitting as necessary as well as input validation. It may also make sense to add simple validation functions to the module to comply with the "batteries included" philosophy, but that's a topic for another issue.

    My concern with the proposed patch is that it adds further inconsistency to split* methods:

    Before patch:

    >>> parse.urlsplit('http://[::1]spam:80')
    SplitResult(scheme='http', netloc='[::1]spam:80', path='', query='', fragment='')

    After patch:

    >>> parse.urlsplit('http://[::1]spam:80')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/Volumes/src/p/cpython/Lib/urllib/parse.py", line 350, in urlsplit
        netloc, url = _splitnetloc(url, 2)
      File "/Volumes/src/p/cpython/Lib/urllib/parse.py", line 324, in _splitnetloc
        raise ValueError('Invalid IPv6 URL')

    (While the above examples still yield the same results and don't raise exceptions)

    I do think that the validation should be done and I agree that an exception should be raised, but it should be done at the urlparse level, not on split (in this case, due to the change to _splitnecloc).

    @demianbrecht
    Copy link
    Mannequin

    demianbrecht mannequin commented Mar 6, 2015

    splitport() shouldn't handle auth, it should be called after auth is dropped with splituser().

    Why shouldn't splitport() handle auth? Are you suggesting that in the eyes of splitport() that "user:password@host:port" should be invalid?

    >>> parse.splitport('user:password@host:80')
    ('user:password@host', '80')

    Seems to be reasonable behaviour to me. Why add the artificial constraint of calling splituser() first?

    @vadmium
    Copy link
    Member

    vadmium commented Mar 23, 2015

    To me, the only difference between urlsplit() and urlparse() is that urlsplit() does not treat parameters following a semicolon specially. The documentation tends to agree with this view. So whatever exceptions are raised by urlparse() should also be raised by urlsplit() for consistency.

    Also, I doubt it is worth worrying about the finer details of splitport() or whatever, because they are meant to be deprecated and not used externally, except for backwards compatibility.

    @tirkarthi
    Copy link
    Member

    See also bpo-36338 for a possible security issue for host of value "benign.com[attacker.com]" (spam[::1] format) where attacker.com is parsed as the host name assuming presence of [ and ] to be a IPV6 address without validation of the value attacker.com inside [] to be a valid IPV6 address.

    As a datapoint input "http://[::1]spam" raises exception in Java, golang and Ruby. Browser's JS console returns invalid URL. I too would like exception being raised but not sure at which level.

    Ruby seems to use a regex : https://github.com/ruby/ruby/blob/trunk/lib/uri/rfc3986_parser.rb#L6
    Java parseurl : http://hg.openjdk.java.net/jdk/jdk/file/c4c225b49c5f/src/java.base/share/classes/java/net/URLStreamHandler.java#l124
    golang : https://github.com/golang/go/blob/50bd1c4d4eb4fac8ddeb5f063c099daccfb71b26/src/net/url/url.go#L587

    See also https://url.spec.whatwg.org/#host-parsing

    If input starts with U+005B ([), then:

    If input does not end with U+005D (]), validation error, return failure.
    
    Return the result of IPv6 parsing input with its leading U+005B ([) and trailing U+005D (]) removed.
    

    @tirkarthi tirkarthi added 3.7 (EOL) end of life 3.8 only security fixes labels Mar 27, 2019
    @tirkarthi
    Copy link
    Member

    With Victor's PR 16780, "http://[::1]spam:80" would raise a ValueError.

    ValueError: Invalid network location: '[::1]spam:80'

    @iritkatriel iritkatriel added 3.9 only security fixes 3.10 only security fixes 3.11 only security fixes and removed 3.7 (EOL) end of life 3.8 only security fixes labels Jan 16, 2022
    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    3.9 only security fixes 3.10 only security fixes 3.11 only security fixes stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error
    Projects
    None yet
    Development

    No branches or pull requests

    6 participants