Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

urlparse, urlsplit confused when password includes fragment (#), query (?) #62340

Closed
anhle mannequin opened this issue Jun 5, 2013 · 9 comments
Closed

urlparse, urlsplit confused when password includes fragment (#), query (?) #62340

anhle mannequin opened this issue Jun 5, 2013 · 9 comments
Assignees
Labels
3.10 only security fixes stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error

Comments

@anhle
Copy link
Mannequin

anhle mannequin commented Jun 5, 2013

BPO 18140
Nosy @orsenthil, @vadmium
Dependencies
  • bpo-30500: [security] urllib connects to a wrong host
  • Files
  • password_delimiters.patch
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = 'https://github.com/orsenthil'
    closed_at = <Date 2020-12-23.19:55:46.947>
    created_at = <Date 2013-06-05.09:45:03.336>
    labels = ['invalid', 'type-bug', 'library', '3.10']
    title = 'urlparse, urlsplit confused when password includes fragment (#), query (?)'
    updated_at = <Date 2020-12-23.19:55:46.946>
    user = 'https://bugs.python.org/anhle'

    bugs.python.org fields:

    activity = <Date 2020-12-23.19:55:46.946>
    actor = 'orsenthil'
    assignee = 'orsenthil'
    closed = True
    closed_date = <Date 2020-12-23.19:55:46.947>
    closer = 'orsenthil'
    components = ['Library (Lib)']
    creation = <Date 2013-06-05.09:45:03.336>
    creator = 'anh.le'
    dependencies = ['30500']
    files = ['30652']
    hgrepos = []
    issue_num = 18140
    keywords = ['patch']
    message_count = 9.0
    messages = ['190646', '190677', '190690', '190692', '191497', '235586', '334041', '375109', '383658']
    nosy_count = 6.0
    nosy_names = ['orsenthil', 'martin.panter', 'dmi.baranov', 'madison.may', 'anh.le', 'david.six']
    pr_nums = []
    priority = 'normal'
    resolution = 'not a bug'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue18140'
    versions = ['Python 3.10']

    @anhle
    Copy link
    Mannequin Author

    anhle mannequin commented Jun 5, 2013

    >>> u = 'http://auser:secr#et@192.168.0.1:8080/a/b/c.html'
    >>> urlparse.urlsplit(u)
    SplitResult(scheme='http', netloc='auser:secr', path='', query='', fragment='et@192.168.0.1:8080/a/b/c.html')

    @anhle anhle mannequin added stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error labels Jun 5, 2013
    @orsenthil orsenthil self-assigned this Jun 5, 2013
    @MadisonMay
    Copy link
    Mannequin

    MadisonMay mannequin commented Jun 5, 2013

    urllib.parse.urlsplit() in Python3.3 behaves the same way. Since urlsplit takes an optional param "allow_fragments", I don't think it should be a high priority issue.

    The relevant code from Python3.3 is below, however:

        if allow_fragments and '#' in url:
            url, fragment = url.split('#', 1)
        if '?' in url:
            url, query = url.split('?', 1)

    Note that passwords containing '?' would produce a similar result, which is perhaps mildly more concerning, as there is no flag to ignore the query portion of the url.

    That being said, I'm against making any changes to urlsplit at this point, since that would also require modifying urlunsplit and may in fact make it much more difficult (or impossible) to rejoin a url. The strength of the very simple implementation we have currently is that it's always reversible.

    @dmibaranov
    Copy link
    Mannequin

    dmibaranov mannequin commented Jun 5, 2013

    anh, nice catch.

    Madison, I'm not see any terrible consequences, only urlsplit is affected now:
    >>> urlsplit('http://user:password@domain.com:80/path?query#fragment')
    SplitResult(scheme='http', netloc='user:password@domain.com:80', path='/path', query='query', fragment='fragment')
    >>> urlunsplit(_)
    'http://user:password@domain.com:80/path?query#fragment'
    >>> urlunsplit(('http', 'user:pass#ord@domain.com:80', '/path', 'query', 'fragment'))
    'http://user:pass#ord@domain.com:80/path?query#fragment'

    @MadisonMay
    Copy link
    Mannequin

    MadisonMay mannequin commented Jun 5, 2013

    My apologies -- that was an oversight on my part. Now that I look at the issue again, it's plain that it most likely won't be an issue. Problems only arise when you allow '#' to occur before '?' and then treat portion of the url between the '#' and '?' as a fragment. In this scenario, when the url is rejoined, the order of the fragment and query is switched by urlunsplit().

    However, since the portion of the url with a hashtag would be treated as a the netloc, this isn't any cause for concern. We only need to ensure that the order of the components is maintained.

    Thanks for the heads up, Dmi.

    @MadisonMay
    Copy link
    Mannequin

    MadisonMay mannequin commented Jun 19, 2013

    Here's a potential patch for the issue, should we decide it's worth fixing.

    All current tests pass with the update version of _splitnetloc(), and I've added a test to test_urlparse to check that urls with '#' or '?' in the password field are treated properly.

    @vadmium
    Copy link
    Member

    vadmium commented Feb 9, 2015

    Possibly related: bpo-23328. I would have thought any special characters in the password etc field should be encoded. Where do such unencoded URLs come from?

    It seems that this would change the behaviour of parsing a URL such as

    http://host/path/with/unencoded/@/character

    Is one of these scenarios more valid than the other?

    @vadmium vadmium changed the title urlparse.urlsplit confused to fragment when password include # urlparse, urlsplit confused when password includes fragment (#), query (?) Mar 16, 2016
    @vadmium
    Copy link
    Member

    vadmium commented Jan 19, 2019

    Today I read RFC 3986, and I think the URLs in the bug reports are valid, and are already parsed correctly. The path is allowed to have a literal “at” symbol:

    path-abempty = *( "/" segment )
    segment = *pchar
    pchar = unreserved / pct-encoded / sub-delims / ":" / "@"

    The query and fragment are allowed to have “at” and question marks:

    query = *( pchar / "/" / "?" )
    fragment = *( pchar / "/" / "?" )

    So I think this could be closed because the parsing is working correctly.

    @davidsix
    Copy link
    Mannequin

    davidsix mannequin commented Aug 10, 2020

    tl;dr: '#', '?' and a few other characters should be URL-encoded/%-encoded when they appear in userinfo which will already parse correctly.

    ---

    Following up on what Martin said, RFC 3986 has the specifications for how these examples should be parsed.

    userinfo = *( unreserved / pct-encoded / sub-delims / ":" )

    unreserved    = ALPHA / DIGIT / "-" / "." / "_" / "~"
    pct-encoded   = "%" HEXDIG HEXDIG
    sub-delims    = "!" / "$" / "&" / "'" / "(" / ")"
                     / "*" / "+" / "," / ";" / "="

    Notably, gen-delims are _not_ included in the allowed characters, nor are non-ASCII characters.

    gen-delims = ":" / "/" / "?" / "#" / "[" / "]" / "@"

    These and other characters not mentioned should be URL-encoded/%-encoded if they appear in the password.

    Taking the first example:

    >>> from urllib.parse import urlparse
    >>> u = 'http://auser:secr%23et@192.168.0.1:8080/a/b/c.html'
    >>> urlparse(u)
    ParseResult(scheme='http', netloc='auser:secr%23et@192.168.0.1:8080', path='/a/b/c.html', params='', query='', fragment='')
    >>> unquote(urlparse(u).password)
    'secr#et'

    @orsenthil orsenthil added the 3.10 only security fixes label Dec 19, 2020
    @orsenthil
    Copy link
    Member

    Not a bug. The message #msg375109 explains how to quote and unquote the '#' in the password field, and demonstrates how urllib parses it correctly.

    I guess, it was set to open as a mistake. Closing it again.

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    3.10 only security fixes stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error
    Projects
    None yet
    Development

    No branches or pull requests

    2 participants