msg224500 - (view) |
Author: Mike Lissner (Mike.Lissner) |
Date: 2014-08-01 13:38 |
Not sure if this is desired behavior, but it's making my code break, so I figured I'd get it filed.
I'm trying to crawl this website: https://www.appeals2.az.gov/ODSPlus/recentDecisions2.cfm
Unfortunately, most of the URLs in the HTML are relative, taking the form:
'../../some/path/to/some/pdf.pdf'
I'm using lxml's make_links_absolute() function, which calls urljoin creating invalid urls like:
https://www.appeals2.az.gov/../Decisions/CR20130096OPN.pdf
If you put that into Firefox or wget or whatever, it works, despite being invalid and making no sense.
**It works because those clients fix the problem,** joining the invalid path and the URL into:
https://www.appeals2.az.gov/Decisions/CR20130096OPN.pdf
I know this will mean making urljoin have a workaround to fix bad HTML, but this seems to be what wget, Chrome, Firefox, etc. all do.
I've never filed a Python bugs before, but is this something we could consider?
|
msg224503 - (view) |
Author: Mike Lissner (Mike.Lissner) |
Date: 2014-08-01 13:59 |
FWIW, the workaround that I've just created for this problem is this:
u = 'https://www.appeals2.az.gov/../Decisions/CR20130096OPN.pdf'
# Split the url and rejoin it, nuking any '/..' patterns at the
# beginning of the path.
s = urlsplit(u)
urlunsplit(s[:2] + (re.sub('^(/\.\.)+', '', s.path),) + s[3:])
|
msg224864 - (view) |
Author: Antoine Pitrou (pitrou) * |
Date: 2014-08-05 17:52 |
The thing is, urljoin() isn't HTTP-specific, and such URLs *may* make sense for other protocols.
|
msg224870 - (view) |
Author: Mike Lissner (Mike.Lissner) |
Date: 2014-08-05 18:31 |
@pitrou, I haven't delved into URLs in a long while, but the general idea is:
scheme://domain:port/path?query_string#fragment_id
When would it ever make sense to have something up a level from the root of the domain?
|
msg224872 - (view) |
Author: Antoine Pitrou (pitrou) * |
Date: 2014-08-05 18:38 |
Actually, according to RFC 3986, it seems you are right and we should remove extraneous leading "/../" and "/./" components in the path.
http://tools.ietf.org/html/rfc3986#section-5.4
|
msg224876 - (view) |
Author: Demian Brecht (demian.brecht) * |
Date: 2014-08-05 18:58 |
I've only had a couple minutes to look into this so far, but the bug does indeed seem to be valid across all versions. In fact, the line "# XXX The stuff below is bogus in various ways..." in urljoin tipped me off to something potentially being askew ;)
Unless the OP wants to provide a patch, I was going to take a look at possibly updating urljoin (at least the part below that line) to follow http://tools.ietf.org/html/rfc3986#section-5.2 a little more closely.
|
msg224877 - (view) |
Author: Mike Lissner (Mike.Lissner) |
Date: 2014-08-05 18:59 |
@demian.brecht, that'd make me very pleased if you took this over. I won't have time to devote to it, unfortunately.
|
msg224893 - (view) |
Author: Demian Brecht (demian.brecht) * |
Date: 2014-08-06 00:45 |
Here's an initial patch with a fix that passes all the test cases other than the ones that are incorrect based on examples and pseudocode in RFC 3986. I haven't checked obsoleted RFCs yet to ensure consistency, but will do so when I get a chance (likely tomorrow morning). Posting this now to see if anyone's opposed to the change, or at least the direction of.
Note: It /does/ break backwards compatibility, but it seems that previous logic was incorrect (based on my upcoming checking for consistency between RFCs). As such, I'm not sure that it should be fixed < 3.5. Thoughts?
|
msg224896 - (view) |
Author: Antoine Pitrou (pitrou) * |
Date: 2014-08-06 02:51 |
> It /does/ break backwards compatibility, but it seems that previous
> logic was incorrect (based on my upcoming checking for consistency
> between RFCs). As such, I'm not sure that it should be fixed < 3.5.
> Thoughts?
Actually, the logic seems to be correct according to RFC 1808 (which the variable names in the tests seem to hint at, as well). I think it's fine to upgrade the semantics to newer RFCs, but we should only do it in 3.5 indeed.
|
msg224984 - (view) |
Author: Demian Brecht (demian.brecht) * |
Date: 2014-08-07 03:59 |
Update based on review comments, added legacy support, fixed the tests up.
|
msg225144 - (view) |
Author: Antoine Pitrou (pitrou) * |
Date: 2014-08-10 15:50 |
Demian, thanks for the update! I'm not sure the rfc1808 flag is necessary. I would be fine with switching wholesale to the new semantics. Nick, since you've done work in the past on URIs, what do you think?
|
msg225156 - (view) |
Author: Demian Brecht (demian.brecht) * |
Date: 2014-08-10 18:12 |
FWIW, I think that it would be ever so slightly preferable to maintain support for the old behaviour but perhaps add a deprecation note in the docs. It's a little difficult to tell how much of an impact this change will have to existing code. Maintaining support for the old behaviour for 3.5 would make the change a little more easy to deal with.
That said, it's nothing that I'd argue heatedly over.
|
msg225173 - (view) |
Author: Alyssa Coghlan (ncoghlan) * |
Date: 2014-08-10 23:38 |
Unfortunately, I haven't looked at the RFC compliance side of things in
this space in years. At the time, we were considering a new module, leaving
the old one alone.
These days, I'd be more inclined to just fix it for 3.5, and suggest anyone
really needing the old behaviour fork an older version of the module.
Extracting the test case inputs from that old "uriparse" rewrite might
still be a good idea, though.
|
msg225186 - (view) |
Author: Demian Brecht (demian.brecht) * |
Date: 2014-08-11 14:09 |
Uploaded new patch. Removed support for RFC1808-specific behaviour. Extracted non-compliant tests into comment blocks indicating the behaviour is no longer supported.
|
msg225187 - (view) |
Author: Mike Lissner (Mike.Lissner) |
Date: 2014-08-11 14:14 |
Just hopping in here to say that the work going down here is beautiful. I've filed a lot of bugs. This one's not particularly difficult, but damn, I appreciate the speed and quality going into fixing it.
Glad to see the Python language is a happy place with fast, quality fixes.
|
msg225191 - (view) |
Author: Demian Brecht (demian.brecht) * |
Date: 2014-08-11 14:48 |
Thanks Mike, it's always nice to get positive feedback :)
|
msg225619 - (view) |
Author: Roundup Robot (python-dev) |
Date: 2014-08-21 23:16 |
New changeset b116489d31ff by Antoine Pitrou in branch 'default':
Issue #22118: Switch urllib.parse to use RFC 3986 semantics for the resolution of relative URLs, rather than RFCs 1808 and 2396.
http://hg.python.org/cpython/rev/b116489d31ff
|
msg225620 - (view) |
Author: Antoine Pitrou (pitrou) * |
Date: 2014-08-21 23:17 |
The patch is now committed to the future Python 3.5. Thank you very much for this contribution!
|
msg225661 - (view) |
Author: Stefan Behnel (scoder) * |
Date: 2014-08-22 11:43 |
I'm now getting duplicated slashes in URLs, e.g.:
https://new//foo.html
http://my.little.server/url//logo.gif
In both cases, the base URL that gets joined with the postfix had a trailing slash, e.g.
"http://my.little.server/url/" + "logo.gif" -> "http://my.little.server/url//logo.gif"
|
msg225662 - (view) |
Author: Alyssa Coghlan (ncoghlan) * |
Date: 2014-08-22 12:15 |
Issue #1500504 (the "urischemes" proposal that never got turned into a PyPI package) has several additional test cases. This particular attachment is one AMK cleaned up to run on Py3k:
http://bugs.python.org/file32591/urischemes.py
The module itself likely isn't worth salvaging, but the additional examples in the test cases were very useful in flushing out various RFC compliance issues.
|
msg227253 - (view) |
Author: Roundup Robot (python-dev) |
Date: 2014-09-22 07:49 |
New changeset 901e4e52b20a by Senthil Kumaran in branch 'default':
Issue #22278: Fix urljoin problem with relative urls, a regression observed
https://hg.python.org/cpython/rev/901e4e52b20a
|
|
Date |
User |
Action |
Args |
2022-04-11 14:58:06 | admin | set | github: 66316 |
2014-09-22 07:49:30 | python-dev | set | messages:
+ msg227253 |
2014-08-22 12:15:47 | ncoghlan | set | messages:
+ msg225662 |
2014-08-22 11:43:50 | scoder | set | nosy:
+ scoder messages:
+ msg225661
|
2014-08-21 23:17:52 | pitrou | set | status: open -> closed resolution: fixed messages:
+ msg225620
stage: patch review -> resolved |
2014-08-21 23:16:47 | python-dev | set | nosy:
+ python-dev messages:
+ msg225619
|
2014-08-11 14:49:00 | demian.brecht | set | messages:
+ msg225191 |
2014-08-11 14:14:34 | Mike.Lissner | set | messages:
+ msg225187 |
2014-08-11 14:10:00 | demian.brecht | set | files:
+ issue22118_2.patch
messages:
+ msg225186 |
2014-08-10 23:38:13 | ncoghlan | set | messages:
+ msg225173 |
2014-08-10 18:12:39 | demian.brecht | set | messages:
+ msg225156 |
2014-08-10 15:50:56 | pitrou | set | messages:
+ msg225144 |
2014-08-07 03:59:31 | demian.brecht | set | files:
+ issue22118_1.patch
messages:
+ msg224984 |
2014-08-06 02:51:24 | pitrou | set | stage: needs patch -> patch review messages:
+ msg224896 versions:
- Python 2.7, Python 3.4 |
2014-08-06 00:45:10 | demian.brecht | set | files:
+ issue22118.patch keywords:
+ patch messages:
+ msg224893
|
2014-08-05 18:59:45 | Mike.Lissner | set | messages:
+ msg224877 |
2014-08-05 18:58:23 | demian.brecht | set | messages:
+ msg224876 |
2014-08-05 18:40:24 | pitrou | set | versions:
+ Python 3.4, Python 3.5 |
2014-08-05 18:38:20 | pitrou | set | nosy:
+ ncoghlan messages:
+ msg224872
|
2014-08-05 18:31:26 | Mike.Lissner | set | messages:
+ msg224870 |
2014-08-05 17:52:50 | pitrou | set | nosy:
+ pitrou messages:
+ msg224864
|
2014-08-05 17:49:34 | demian.brecht | set | nosy:
+ demian.brecht
|
2014-08-05 05:41:46 | ezio.melotti | set | keywords:
+ easy nosy:
+ orsenthil, ezio.melotti
stage: needs patch |
2014-08-01 13:59:11 | Mike.Lissner | set | messages:
+ msg224503 |
2014-08-01 13:38:49 | Mike.Lissner | create | |