Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

urllib doesn't correct server returned urls #40042

Closed
robzed mannequin opened this issue Mar 17, 2004 · 17 comments
Closed

urllib doesn't correct server returned urls #40042

robzed mannequin opened this issue Mar 17, 2004 · 17 comments
Assignees
Labels
stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error

Comments

@robzed
Copy link
Mannequin

robzed mannequin commented Mar 17, 2004

BPO 918368
Nosy @orsenthil, @vstinner, @devdanzin
Dependencies
  • bpo-1153027: http_error_302() crashes with 'HTTP/1.1 400 Bad Request
  • Files
  • redirect_failure.py: Program demonstrating urllib non-correction behaviour and client/server transaction text.
  • issue918368-py27.diff
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = 'https://github.com/orsenthil'
    closed_at = <Date 2009-05-05.18:47:05.398>
    created_at = <Date 2004-03-17.22:41:36.000>
    labels = ['type-bug', 'library']
    title = "urllib doesn't correct server returned urls"
    updated_at = <Date 2019-04-10.09:11:07.613>
    user = 'https://bugs.python.org/robzed'

    bugs.python.org fields:

    activity = <Date 2019-04-10.09:11:07.613>
    actor = 'vstinner'
    assignee = 'orsenthil'
    closed = True
    closed_date = <Date 2009-05-05.18:47:05.398>
    closer = 'orsenthil'
    components = ['Library (Lib)']
    creation = <Date 2004-03-17.22:41:36.000>
    creator = 'robzed'
    dependencies = ['1153027']
    files = ['1226', '13720']
    hgrepos = []
    issue_num = 918368
    keywords = ['patch']
    message_count = 17.0
    messages = ['20248', '20249', '20250', '20251', '81431', '81472', '81490', '81696', '81816', '81817', '81818', '86160', '86220', '93658', '93671', '93697', '339837']
    nosy_count = 8.0
    nosy_names = ['jhylton', 'mike_j_brown', 'jjlee', 'robzed', 'orsenthil', 'vstinner', 'ajaksu2', 'AdamN']
    pr_nums = []
    priority = 'normal'
    resolution = 'fixed'
    stage = 'test needed'
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue918368'
    versions = ['Python 2.7']

    @robzed
    Copy link
    Mannequin Author

    robzed mannequin commented Mar 17, 2004

    On a URL request where the server returns a URL with spaces in,
    urllib doesn't correct it before requesting the new page.

    I think this is technically a server error, however, it does work
    from web browsers (Mozilla, Safari) but not from Python urllib.

    I would suggest that when urllib is following "moved temporarily"
    links (or similar) from a server it translates spaces to %20.

    See example program file for more including detailed server/client
    transactions text.

    @robzed robzed mannequin added stdlib Python modules in the Lib dir labels Mar 17, 2004
    @robzed
    Copy link
    Mannequin Author

    robzed mannequin commented Mar 18, 2004

    Logged In: YES
    user_id=1000470

    I've tested a change to "redirect_internal(self, url, fp, errcode, errmsg,
    headers, data)" in "urllib.py" that adds a single line newurl =
    newurl.replace(" ","%20") after the basejoin() function call that appears
    to fix the problem.

    This information is placed in the public domain.

    @mikejbrown
    Copy link
    Mannequin

    mikejbrown mannequin commented Mar 31, 2004

    Logged In: YES
    user_id=371366

    I agree that it is a server error to put something that doesn't
    meet the syntactic definition of a URI in the Location header
    of a response. I don't see any harm in correcting obvious
    errors, though, in the interest of usability.

    As for your proposed fix, instead of just correcting spaces, I
    would do

        newurl = quote(newurl, safe="/:=&?#+!$,;'@()*[]")

    quote() is a urllib function that does percent-encoding. It is
    way out of date and does poorly with Unicode strings, but if
    called with the above arguments, it should safely clean up
    most mistakes. The set of additional "safe" characters I am
    passing in is the complete set of "reserved" characters
    according to the latest draft of RFC 2396bis; these are
    characters that definitely do or might possibly have special
    meaning in a URL and thus should not be percent-encoded
    blindly.

    I am currently working on a urllib.quote() replacement for
    4Suite's Ft.Lib.Uri library, and will then see about getting the
    improvements folded back into urllib.

    @mikejbrown
    Copy link
    Mannequin

    mikejbrown mannequin commented Mar 31, 2004

    Logged In: YES
    user_id=371366

    In my previous example, I forgot to add a tilde to the "safe"
    characters. Tilde is an "unreserved" character, a class of
    characters allowed in a URI but for which percent-encoding is
    discouraged.

        newurl = quote(newurl, safe="/:=&?~#+!$,;'@()*[]")

    @devdanzin
    Copy link
    Mannequin

    devdanzin mannequin commented Feb 9, 2009

    Superseded by bpo-1153027 ?

    @robzed
    Copy link
    Mannequin Author

    robzed mannequin commented Feb 9, 2009

    I agree - this appears to be the same as bpo-1153027 ?

    @jjlee
    Copy link
    Mannequin

    jjlee mannequin commented Feb 9, 2009

    This bug refers to urllib. bpo-1153027 refers to urllib2. It's the
    same problem in each case, though.

    @devdanzin
    Copy link
    Mannequin

    devdanzin mannequin commented Feb 12, 2009

    Any idea about how to test this one reliably?

    @devdanzin devdanzin mannequin added type-bug An unexpected behavior, bug, or error labels Feb 12, 2009
    @jjlee
    Copy link
    Mannequin

    jjlee mannequin commented Feb 12, 2009

    Mike's list is missing one more character, "%" itself. So, the
    replacement should be:

    quote(newurl, safe="%/:=&?~#+!$,;'@()*[]")

    The replacement should be done in the .open() method, not in the code
    that handles redirects.

    @jjlee
    Copy link
    Mannequin

    jjlee mannequin commented Feb 12, 2009

    A suitable test would be to derive a urllib.URLOpener subclass that has
    an open_spam(url) method which records the URL. Then call
    .open("spam://") in the test and verify that the URL has been quoted.

    @jjlee
    Copy link
    Mannequin

    jjlee mannequin commented Feb 12, 2009

    Uh, that should have beeen something like
    .open("spam://example.com/%2E/") of course.

    @orsenthil
    Copy link
    Member

    The attached patch fixes this issue and adds the test too. Followed John
    J Lee's suggestion for the patch.Can someone review this?

    For the other referenced bpo-1153027, bug is not reproducible in the
    trunk and I see that fix has been made in revision 43132. It follows the
    approach of percent encoding space at redirect_request. I shall verify
    if we need to change that to a different fix or present one would be
    sufficient.

    @orsenthil orsenthil self-assigned this Apr 19, 2009
    @orsenthil
    Copy link
    Member

    Fixed this in the revision 71780 for Python 2.7.

    @AdamN
    Copy link
    Mannequin

    AdamN mannequin commented Oct 6, 2009

    This seems a bit serious for inclusion in 2.7 IMHO. urllib is used in all
    sorts of hackish ways in the wild and I really wonder if this is going to
    cause more problems for people than it's worth. The 3.x series alone
    seems like the best place for this which is addressed by bpo-1153027.

    @orsenthil
    Copy link
    Member

    AdamN, did you specifically come across a scenario which broke due to
    this change? I can understand your concern, in general. The
    'non-breaking' existing tests is the one of confidence factor we have in
    introducing the changes.

    @AdamN
    Copy link
    Mannequin

    AdamN mannequin commented Oct 7, 2009

    I can't think of too many specific scenarios. It just seems like a non-
    trivial behavior change (or rather, it is trivial but with possibly far
    reaching ramifications).

    One issue I see is that the ticket morphed from just dealing with space
    characters to many special characters. quote_plus() probably would have
    handled the space issue cleanly.

    The other issue I'm concerned about is that there is no testing of non-
    ascii URLs (or at least I don't see any).

    Finally, a netloc like 'www.g oogle.com' would be converted invalidly to
    'www.g%20oogle.com' which would then be forwarded to an actual HTTP
    request.

    I can't point to anything too specific - more of just a feeling that
    this ticket is already 5 1/2 years old and maybe things should be left
    as-is and enhancement made only to Python 3.x. Of course, I'm really
    just a new Python convert so I don't have massive experience here by any
    means.

    @vstinner
    Copy link
    Member

    For the other referenced bpo-1153027, bug is not reproducible in the
    trunk and I see that fix has been made in revision 43132.

    This change was a fix for bpo-1353433:

    commit ddb84d7
    Author: Georg Brandl <georg@python.org>
    Date: Sat Mar 18 11:35:18 2006 +0000

    Bug bpo-1353433: be conciliant with spaces in redirect URLs
    

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 9, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error
    Projects
    None yet
    Development

    No branches or pull requests

    2 participants