Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

urllib and urllib2 decode userinfo multiple times #46497

Closed
carljm opened this issue Mar 6, 2008 · 10 comments
Closed

urllib and urllib2 decode userinfo multiple times #46497

carljm opened this issue Mar 6, 2008 · 10 comments
Assignees
Labels
easy stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error

Comments

@carljm
Copy link
Member

carljm commented Mar 6, 2008

BPO 2244
Nosy @freddrake, @orsenthil, @carljm
Files
  • urllib-issue.patch: patch
  • urllib_issue_updated.patch: patch against the trunk.
  • urllib_issue_updated.diff: patch against py3k
  • urllib_ftptests_doubleencode.patch: patch against py3k to add test cases exhibiting original bug.
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = 'https://github.com/orsenthil'
    closed_at = <Date 2010-11-20.11:24:35.102>
    created_at = <Date 2008-03-06.16:10:03.554>
    labels = ['easy', 'type-bug', 'library']
    title = 'urllib and urllib2 decode userinfo multiple times'
    updated_at = <Date 2010-11-20.11:24:35.101>
    user = 'https://github.com/carljm'

    bugs.python.org fields:

    activity = <Date 2010-11-20.11:24:35.101>
    actor = 'orsenthil'
    assignee = 'orsenthil'
    closed = True
    closed_date = <Date 2010-11-20.11:24:35.102>
    closer = 'orsenthil'
    components = ['Library (Lib)']
    creation = <Date 2008-03-06.16:10:03.554>
    creator = 'carljm'
    dependencies = []
    files = ['9621', '18083', '18084', '18295']
    hgrepos = []
    issue_num = 2244
    keywords = ['patch', 'easy']
    message_count = 10.0
    messages = ['63324', '64201', '110858', '110872', '110875', '110876', '112248', '121257', '121476', '121611']
    nosy_count = 9.0
    nosy_names = ['fdrake', 'jafo', 'jjlee', 'orsenthil', 'dstanek', 'carljm', 'jesstess', 'ysj.ray', 'ted.turocy']
    pr_nums = []
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue2244'
    versions = ['Python 3.1', 'Python 2.7', 'Python 3.2']

    @carljm
    Copy link
    Member Author

    carljm commented Mar 6, 2008

    Both urllib and urllib2 call urllib.unquote() multiple times on data in
    the userinfo section of an FTP URL. One call occurs at the end of the
    urllib.splituser() function. In urllib, the other call appears in
    URLOpener.open_ftp(). In urllib2, the other two occur in
    FTPHandler.ftp_open() and Request.get_host().

    The effect of this is that if the userinfo section of an FTP url should
    need to contain a literal % sign followed by two digits, the % sign must
    be double-encoded as %2525 (for urllib) or triple-encoded as %252525
    (for urllib2) in order for the URL to be accessed.

    The proper behavior would be to only ever unquote a given data segment
    once. The W3's URI: Generic Syntax RFC
    (http://gbiv.com/protocols/uri/rfc/rfc3986.html) addresses this very
    issue in section 2.4 (When to Encode or Decode): "Implementations must
    not percent-encode or decode the same string more than once, as decoding
    an already decoded string might lead to misinterpreting a percent data
    octet as the beginning of a percent-encoding, or vice versa in the case
    of percent-encoding an already percent-encoded string."

    The solution would be to standardize where in urllib and urllib2 the
    unquoting happens, and then make sure it happens nowhere else. I'm not
    familiar enough with the libraries to know where it should be removed
    without possibly breaking other behavior. It seems that just removing
    the map/unquote call in urllib.splituser() would fix the problem in
    urllib. I would guess the call in urllib2 Request.get_host() should
    also be removed, as the RFC referenced above says clearly that only
    individual data segments of the URL should be decoded, not larger
    portions that might contain delimiters (: and @).

    I've attached a patchset for these suggested changes. Very superficial
    testing suggests that the patch doesn't break anything obvious, but I
    make no guarantees.

    @carljm carljm added type-bug An unexpected behavior, bug, or error stdlib Python modules in the Lib dir labels Mar 6, 2008
    @jafo
    Copy link
    Mannequin

    jafo mannequin commented Mar 20, 2008

    Fred: You most recently touched the code impacted by this test, does
    this sound reasonable?

    @jafo jafo mannequin assigned freddrake Mar 20, 2008
    @jafo jafo mannequin added performance Performance or resource usage and removed type-bug An unexpected behavior, bug, or error labels Mar 20, 2008
    @devdanzin devdanzin mannequin added the easy label Apr 22, 2009
    @BreamoreBoy
    Copy link
    Mannequin

    BreamoreBoy mannequin commented Jul 20, 2010

    Re-assigned as per maintainers list.

    @BreamoreBoy BreamoreBoy mannequin assigned orsenthil and unassigned freddrake Jul 20, 2010
    @ysjray
    Copy link
    Mannequin

    ysjray mannequin commented Jul 20, 2010

    By confirming the "%2525" case, I guess this is really a bug. But the patch cause test_urllib2.py failed. I modified the patch to fix it.

    @ysjray
    Copy link
    Mannequin

    ysjray mannequin commented Jul 20, 2010

    This bug exists on py3k also, so I worked out a patch for py3k, too.

    @orsenthil
    Copy link
    Member

    Thanks Ray Allen. Usually the patch against the py3k branch is enough, it will be ported to other branches.

    @tedturocy
    Copy link
    Mannequin

    tedturocy mannequin commented Aug 1, 2010

    I am attaching a patch against py3k which adds some tests to demonstrate the original bug when issuing a Request() to a ftp:// URL.

    To do this, the tests add checks for user and passwd. The previous version of checks asserted that the .user and .passwd of the returned request should be "". Checking .user is necessary to verify the original bug.

    I was confused by a comment in the fixture, "ftp authentication not yet implemented by FTPHandler", which appeared to justify the assumption that .user and .passwd must be "". This may be true, but .user and .passwd are being set by the Request. The test includes the minimal augmentation to extend on the original behavior of the test. Someone with greater knowledge than I might look at that comment to see if it is out-of-date, or simply too vague.

    @bitdancer bitdancer added type-bug An unexpected behavior, bug, or error and removed performance Performance or resource usage labels Aug 1, 2010
    @jesstess
    Copy link
    Member

    I can confirm that the combination of urllib_issue_updated.diff and urllib_ftptests_doubleencode.patch apply cleanly against py3k, that the added tests exercise the described bug, and that the full test suite passes after applying the patches.

    The patches look clean and conforming to PEP-8.

    @orsenthil
    Copy link
    Member

    Fixed in r86520 (py3k) and r86522 (release31-maint).
    unquote happens at the first level inside Request class, _parse method.

    Shall port to py2.7.

    @orsenthil
    Copy link
    Member

    back-ported to release27-maint in r86554.

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    easy stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error
    Projects
    None yet
    Development

    No branches or pull requests

    5 participants