Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

urljoin fails with messy relative URLs #66316

Closed
mlissner mannequin opened this issue Aug 1, 2014 · 21 comments
Closed

urljoin fails with messy relative URLs #66316

mlissner mannequin opened this issue Aug 1, 2014 · 21 comments
Labels
easy stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error

Comments

@mlissner
Copy link
Mannequin

mlissner mannequin commented Aug 1, 2014

BPO 22118
Nosy @ncoghlan, @orsenthil, @pitrou, @scoder, @ezio-melotti, @demianbrecht, @mlissner
Files
  • issue22118.patch
  • issue22118_1.patch
  • issue22118_2.patch
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = None
    closed_at = <Date 2014-08-21.23:17:52.022>
    created_at = <Date 2014-08-01.13:38:49.756>
    labels = ['easy', 'type-bug', 'library']
    title = 'urljoin fails with messy relative URLs'
    updated_at = <Date 2014-09-22.07:49:30.042>
    user = 'https://github.com/mlissner'

    bugs.python.org fields:

    activity = <Date 2014-09-22.07:49:30.042>
    actor = 'python-dev'
    assignee = 'none'
    closed = True
    closed_date = <Date 2014-08-21.23:17:52.022>
    closer = 'pitrou'
    components = ['Library (Lib)']
    creation = <Date 2014-08-01.13:38:49.756>
    creator = 'Mike.Lissner'
    dependencies = []
    files = ['36278', '36296', '36347']
    hgrepos = []
    issue_num = 22118
    keywords = ['patch', 'easy']
    message_count = 21.0
    messages = ['224500', '224503', '224864', '224870', '224872', '224876', '224877', '224893', '224896', '224984', '225144', '225156', '225173', '225186', '225187', '225191', '225619', '225620', '225661', '225662', '227253']
    nosy_count = 8.0
    nosy_names = ['ncoghlan', 'orsenthil', 'pitrou', 'scoder', 'ezio.melotti', 'python-dev', 'demian.brecht', 'Mike.Lissner']
    pr_nums = []
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue22118'
    versions = ['Python 3.5']

    @mlissner
    Copy link
    Mannequin Author

    mlissner mannequin commented Aug 1, 2014

    Not sure if this is desired behavior, but it's making my code break, so I figured I'd get it filed.

    I'm trying to crawl this website: https://www.appeals2.az.gov/ODSPlus/recentDecisions2.cfm

    Unfortunately, most of the URLs in the HTML are relative, taking the form:

    '../../some/path/to/some/pdf.pdf'

    I'm using lxml's make_links_absolute() function, which calls urljoin creating invalid urls like:

    https://www.appeals2.az.gov/../Decisions/CR20130096OPN.pdf

    If you put that into Firefox or wget or whatever, it works, despite being invalid and making no sense.

    **It works because those clients fix the problem,** joining the invalid path and the URL into:

    https://www.appeals2.az.gov/Decisions/CR20130096OPN.pdf

    I know this will mean making urljoin have a workaround to fix bad HTML, but this seems to be what wget, Chrome, Firefox, etc. all do.

    I've never filed a Python bugs before, but is this something we could consider?

    @mlissner mlissner mannequin added stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error labels Aug 1, 2014
    @mlissner
    Copy link
    Mannequin Author

    mlissner mannequin commented Aug 1, 2014

    FWIW, the workaround that I've just created for this problem is this:

    u = 'https://www.appeals2.az.gov/../Decisions/CR20130096OPN.pdf'
    # Split the url and rejoin it, nuking any '/..' patterns at the
    # beginning of the path.
    s = urlsplit(u)
    urlunsplit(s[:2] + (re.sub('^(/\.\.)+', '', s.path),) + s[3:])

    @pitrou
    Copy link
    Member

    pitrou commented Aug 5, 2014

    The thing is, urljoin() isn't HTTP-specific, and such URLs *may* make sense for other protocols.

    @mlissner
    Copy link
    Mannequin Author

    mlissner mannequin commented Aug 5, 2014

    @pitrou, I haven't delved into URLs in a long while, but the general idea is:

    scheme://domain:port/path?query_string#fragment_id

    When would it ever make sense to have something up a level from the root of the domain?

    @pitrou
    Copy link
    Member

    pitrou commented Aug 5, 2014

    Actually, according to RFC 3986, it seems you are right and we should remove extraneous leading "/../" and "/./" components in the path.
    http://tools.ietf.org/html/rfc3986#section-5.4

    @demianbrecht
    Copy link
    Mannequin

    demianbrecht mannequin commented Aug 5, 2014

    I've only had a couple minutes to look into this so far, but the bug does indeed seem to be valid across all versions. In fact, the line "# XXX The stuff below is bogus in various ways..." in urljoin tipped me off to something potentially being askew ;)

    Unless the OP wants to provide a patch, I was going to take a look at possibly updating urljoin (at least the part below that line) to follow http://tools.ietf.org/html/rfc3986#section-5.2 a little more closely.

    @mlissner
    Copy link
    Mannequin Author

    mlissner mannequin commented Aug 5, 2014

    @demian.brecht, that'd make me very pleased if you took this over. I won't have time to devote to it, unfortunately.

    @demianbrecht
    Copy link
    Mannequin

    demianbrecht mannequin commented Aug 6, 2014

    Here's an initial patch with a fix that passes all the test cases other than the ones that are incorrect based on examples and pseudocode in RFC 3986. I haven't checked obsoleted RFCs yet to ensure consistency, but will do so when I get a chance (likely tomorrow morning). Posting this now to see if anyone's opposed to the change, or at least the direction of.

    Note: It /does/ break backwards compatibility, but it seems that previous logic was incorrect (based on my upcoming checking for consistency between RFCs). As such, I'm not sure that it should be fixed < 3.5. Thoughts?

    @pitrou
    Copy link
    Member

    pitrou commented Aug 6, 2014

    It /does/ break backwards compatibility, but it seems that previous
    logic was incorrect (based on my upcoming checking for consistency
    between RFCs). As such, I'm not sure that it should be fixed < 3.5.
    Thoughts?

    Actually, the logic seems to be correct according to RFC 1808 (which the variable names in the tests seem to hint at, as well). I think it's fine to upgrade the semantics to newer RFCs, but we should only do it in 3.5 indeed.

    @demianbrecht
    Copy link
    Mannequin

    demianbrecht mannequin commented Aug 7, 2014

    Update based on review comments, added legacy support, fixed the tests up.

    @pitrou
    Copy link
    Member

    pitrou commented Aug 10, 2014

    Demian, thanks for the update! I'm not sure the rfc1808 flag is necessary. I would be fine with switching wholesale to the new semantics. Nick, since you've done work in the past on URIs, what do you think?

    @demianbrecht
    Copy link
    Mannequin

    demianbrecht mannequin commented Aug 10, 2014

    FWIW, I think that it would be ever so slightly preferable to maintain support for the old behaviour but perhaps add a deprecation note in the docs. It's a little difficult to tell how much of an impact this change will have to existing code. Maintaining support for the old behaviour for 3.5 would make the change a little more easy to deal with.

    That said, it's nothing that I'd argue heatedly over.

    @ncoghlan
    Copy link
    Contributor

    Unfortunately, I haven't looked at the RFC compliance side of things in
    this space in years. At the time, we were considering a new module, leaving
    the old one alone.

    These days, I'd be more inclined to just fix it for 3.5, and suggest anyone
    really needing the old behaviour fork an older version of the module.

    Extracting the test case inputs from that old "uriparse" rewrite might
    still be a good idea, though.

    @demianbrecht
    Copy link
    Mannequin

    demianbrecht mannequin commented Aug 11, 2014

    Uploaded new patch. Removed support for RFC1808-specific behaviour. Extracted non-compliant tests into comment blocks indicating the behaviour is no longer supported.

    @mlissner
    Copy link
    Mannequin Author

    mlissner mannequin commented Aug 11, 2014

    Just hopping in here to say that the work going down here is beautiful. I've filed a lot of bugs. This one's not particularly difficult, but damn, I appreciate the speed and quality going into fixing it.

    Glad to see the Python language is a happy place with fast, quality fixes.

    @demianbrecht
    Copy link
    Mannequin

    demianbrecht mannequin commented Aug 11, 2014

    Thanks Mike, it's always nice to get positive feedback :)

    @python-dev
    Copy link
    Mannequin

    python-dev mannequin commented Aug 21, 2014

    New changeset b116489d31ff by Antoine Pitrou in branch 'default':
    Issue bpo-22118: Switch urllib.parse to use RFC 3986 semantics for the resolution of relative URLs, rather than RFCs 1808 and 2396.
    http://hg.python.org/cpython/rev/b116489d31ff

    @pitrou
    Copy link
    Member

    pitrou commented Aug 21, 2014

    The patch is now committed to the future Python 3.5. Thank you very much for this contribution!

    @pitrou pitrou closed this as completed Aug 21, 2014
    @scoder
    Copy link
    Contributor

    scoder commented Aug 22, 2014

    I'm now getting duplicated slashes in URLs, e.g.:

    https://new//foo.html
    http://my.little.server/url//logo.gif

    In both cases, the base URL that gets joined with the postfix had a trailing slash, e.g.

    "http://my.little.server/url/" + "logo.gif" -> "http://my.little.server/url//logo.gif"

    @ncoghlan
    Copy link
    Contributor

    Issue bpo-1500504 (the "urischemes" proposal that never got turned into a PyPI package) has several additional test cases. This particular attachment is one AMK cleaned up to run on Py3k:

    http://bugs.python.org/file32591/urischemes.py

    The module itself likely isn't worth salvaging, but the additional examples in the test cases were very useful in flushing out various RFC compliance issues.

    @python-dev
    Copy link
    Mannequin

    python-dev mannequin commented Sep 22, 2014

    New changeset 901e4e52b20a by Senthil Kumaran in branch 'default':
    Issue bpo-22278: Fix urljoin problem with relative urls, a regression observed
    https://hg.python.org/cpython/rev/901e4e52b20a

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    easy stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error
    Projects
    None yet
    Development

    No branches or pull requests

    4 participants