classification
Title: urljoin fails with messy relative URLs
Type: behavior Stage: resolved
Components: Library (Lib) Versions: Python 3.5
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: Mike.Lissner, demian.brecht, ezio.melotti, ncoghlan, orsenthil, pitrou, python-dev, scoder
Priority: normal Keywords: easy, patch

Created on 2014-08-01 13:38 by Mike.Lissner, last changed 2014-09-22 07:49 by python-dev. This issue is now closed.

Files
File name Uploaded Description Edit
issue22118.patch demian.brecht, 2014-08-06 00:45 review
issue22118_1.patch demian.brecht, 2014-08-07 03:59 review
issue22118_2.patch demian.brecht, 2014-08-11 14:09 review
Messages (21)
msg224500 - (view) Author: Mike Lissner (Mike.Lissner) Date: 2014-08-01 13:38
Not sure if this is desired behavior, but it's making my code break, so I figured I'd get it filed.

I'm trying to crawl this website: https://www.appeals2.az.gov/ODSPlus/recentDecisions2.cfm

Unfortunately, most of the URLs in the HTML are relative, taking the form:

'../../some/path/to/some/pdf.pdf'

I'm using lxml's make_links_absolute() function, which calls urljoin creating invalid urls like:

https://www.appeals2.az.gov/../Decisions/CR20130096OPN.pdf

If you put that into Firefox or wget or whatever, it works, despite being invalid and making no sense. 

**It works because those clients fix the problem,** joining the invalid path and the URL into:

https://www.appeals2.az.gov/Decisions/CR20130096OPN.pdf

I know this will mean making urljoin have a workaround to fix bad HTML, but this seems to be what wget, Chrome, Firefox, etc. all do. 

I've never filed a Python bugs before, but is this something we could consider?
msg224503 - (view) Author: Mike Lissner (Mike.Lissner) Date: 2014-08-01 13:59
FWIW, the workaround that I've just created for this problem is this:

u = 'https://www.appeals2.az.gov/../Decisions/CR20130096OPN.pdf'
# Split the url and rejoin it, nuking any '/..' patterns at the
# beginning of the path.
s = urlsplit(u)
urlunsplit(s[:2] + (re.sub('^(/\.\.)+', '', s.path),) + s[3:])
msg224864 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2014-08-05 17:52
The thing is, urljoin() isn't HTTP-specific, and such URLs *may* make sense for other protocols.
msg224870 - (view) Author: Mike Lissner (Mike.Lissner) Date: 2014-08-05 18:31
@pitrou, I haven't delved into URLs in a long while, but the general idea is:

scheme://domain:port/path?query_string#fragment_id

When would it ever make sense to have something up a level from the root of the domain?
msg224872 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2014-08-05 18:38
Actually, according to RFC 3986, it seems you are right and we should remove extraneous leading "/../" and "/./" components in the path.
http://tools.ietf.org/html/rfc3986#section-5.4
msg224876 - (view) Author: Demian Brecht (demian.brecht) * Date: 2014-08-05 18:58
I've only had a couple minutes to look into this so far, but the bug does indeed seem to be valid across all versions. In fact, the line "# XXX The stuff below is bogus in various ways..." in urljoin tipped me off to something potentially being askew ;)

Unless the OP wants to provide a patch, I was going to take a look at possibly updating urljoin (at least the part below that line) to follow http://tools.ietf.org/html/rfc3986#section-5.2 a little more closely.
msg224877 - (view) Author: Mike Lissner (Mike.Lissner) Date: 2014-08-05 18:59
@demian.brecht, that'd make me very pleased if you took this over. I won't have time to devote to it, unfortunately.
msg224893 - (view) Author: Demian Brecht (demian.brecht) * Date: 2014-08-06 00:45
Here's an initial patch with a fix that passes all the test cases other than the ones that are incorrect based on examples and pseudocode in RFC 3986. I haven't checked obsoleted RFCs yet to ensure consistency, but will do so when I get a chance (likely tomorrow morning). Posting this now to see if anyone's opposed to the change, or at least the direction of.

Note: It /does/ break backwards compatibility, but it seems that previous logic was incorrect (based on my upcoming checking for consistency between RFCs). As such, I'm not sure that it should be fixed < 3.5. Thoughts?
msg224896 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2014-08-06 02:51
> It /does/ break backwards compatibility, but it seems that previous
> logic was incorrect (based on my upcoming checking for consistency
> between RFCs). As such, I'm not sure that it should be fixed < 3.5.
> Thoughts?

Actually, the logic seems to be correct according to RFC 1808 (which the variable names in the tests seem to hint at, as well). I think it's fine to upgrade the semantics to newer RFCs, but we should only do it in 3.5 indeed.
msg224984 - (view) Author: Demian Brecht (demian.brecht) * Date: 2014-08-07 03:59
Update based on review comments, added legacy support, fixed the tests up.
msg225144 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2014-08-10 15:50
Demian, thanks for the update! I'm not sure the rfc1808 flag is necessary. I would be fine with switching wholesale to the new semantics. Nick, since you've done work in the past on URIs, what do you think?
msg225156 - (view) Author: Demian Brecht (demian.brecht) * Date: 2014-08-10 18:12
FWIW, I think that it would be ever so slightly preferable to maintain support for the old behaviour but perhaps add a deprecation note in the docs. It's a little difficult to tell how much of an impact this change will have to existing code. Maintaining support for the old behaviour for 3.5 would make the change a little more easy to deal with.

That said, it's nothing that I'd argue heatedly over.
msg225173 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2014-08-10 23:38
Unfortunately, I haven't looked at the RFC compliance side of things in
this space in years. At the time, we were considering a new module, leaving
the old one alone.

These days, I'd be more inclined to just fix it for 3.5, and suggest anyone
really needing the old behaviour fork an older version of the module.

Extracting the test case inputs from that old "uriparse" rewrite might
still be a good idea, though.
msg225186 - (view) Author: Demian Brecht (demian.brecht) * Date: 2014-08-11 14:09
Uploaded new patch. Removed support for RFC1808-specific behaviour. Extracted non-compliant tests into comment blocks indicating the behaviour is no longer supported.
msg225187 - (view) Author: Mike Lissner (Mike.Lissner) Date: 2014-08-11 14:14
Just hopping in here to say that the work going down here is beautiful. I've filed a lot of bugs. This one's not particularly difficult, but damn, I appreciate the speed and quality going into fixing it. 

Glad to see the Python language is a happy place with fast, quality fixes.
msg225191 - (view) Author: Demian Brecht (demian.brecht) * Date: 2014-08-11 14:48
Thanks Mike, it's always nice to get positive feedback :)
msg225619 - (view) Author: Roundup Robot (python-dev) Date: 2014-08-21 23:16
New changeset b116489d31ff by Antoine Pitrou in branch 'default':
Issue #22118: Switch urllib.parse to use RFC 3986 semantics for the resolution of relative URLs, rather than RFCs 1808 and 2396.
http://hg.python.org/cpython/rev/b116489d31ff
msg225620 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2014-08-21 23:17
The patch is now committed to the future Python 3.5. Thank you very much for this contribution!
msg225661 - (view) Author: Stefan Behnel (scoder) * Date: 2014-08-22 11:43
I'm now getting duplicated slashes in URLs, e.g.:

https://new//foo.html
http://my.little.server/url//logo.gif

In both cases, the base URL that gets joined with the postfix had a trailing slash, e.g.

"http://my.little.server/url/" + "logo.gif" -> "http://my.little.server/url//logo.gif"
msg225662 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2014-08-22 12:15
Issue #1500504 (the "urischemes" proposal that never got turned into a PyPI package) has several additional test cases. This particular attachment is one AMK cleaned up to run on Py3k:

http://bugs.python.org/file32591/urischemes.py

The module itself likely isn't worth salvaging, but the additional examples in the test cases were very useful in flushing out various RFC compliance issues.
msg227253 - (view) Author: Roundup Robot (python-dev) Date: 2014-09-22 07:49
New changeset 901e4e52b20a by Senthil Kumaran in branch 'default':
Issue #22278: Fix urljoin problem with relative urls, a regression observed
https://hg.python.org/cpython/rev/901e4e52b20a
History
Date User Action Args
2014-09-22 07:49:30python-devsetmessages: + msg227253
2014-08-22 12:15:47ncoghlansetmessages: + msg225662
2014-08-22 11:43:50scodersetnosy: + scoder
messages: + msg225661
2014-08-21 23:17:52pitrousetstatus: open -> closed
resolution: fixed
messages: + msg225620

stage: patch review -> resolved
2014-08-21 23:16:47python-devsetnosy: + python-dev
messages: + msg225619
2014-08-11 14:49:00demian.brechtsetmessages: + msg225191
2014-08-11 14:14:34Mike.Lissnersetmessages: + msg225187
2014-08-11 14:10:00demian.brechtsetfiles: + issue22118_2.patch

messages: + msg225186
2014-08-10 23:38:13ncoghlansetmessages: + msg225173
2014-08-10 18:12:39demian.brechtsetmessages: + msg225156
2014-08-10 15:50:56pitrousetmessages: + msg225144
2014-08-07 03:59:31demian.brechtsetfiles: + issue22118_1.patch

messages: + msg224984
2014-08-06 02:51:24pitrousetstage: needs patch -> patch review
messages: + msg224896
versions: - Python 2.7, Python 3.4
2014-08-06 00:45:10demian.brechtsetfiles: + issue22118.patch
keywords: + patch
messages: + msg224893
2014-08-05 18:59:45Mike.Lissnersetmessages: + msg224877
2014-08-05 18:58:23demian.brechtsetmessages: + msg224876
2014-08-05 18:40:24pitrousetversions: + Python 3.4, Python 3.5
2014-08-05 18:38:20pitrousetnosy: + ncoghlan
messages: + msg224872
2014-08-05 18:31:26Mike.Lissnersetmessages: + msg224870
2014-08-05 17:52:50pitrousetnosy: + pitrou
messages: + msg224864
2014-08-05 17:49:34demian.brechtsetnosy: + demian.brecht
2014-08-05 05:41:46ezio.melottisetkeywords: + easy
nosy: + orsenthil, ezio.melotti

stage: needs patch
2014-08-01 13:59:11Mike.Lissnersetmessages: + msg224503
2014-08-01 13:38:49Mike.Lissnercreate