Message 224500 - Python tracker


Author	Mike.Lissner
Recipients	Mike.Lissner
Date	2014-08-01.13:38:49
Message-id	<1406900329.8.0.528379735401.issue22118@psf.upfronthosting.co.za>
In-reply-to

Content

Not sure if this is desired behavior, but it's making my code break, so I figured I'd get it filed.

I'm trying to crawl this website: https://www.appeals2.az.gov/ODSPlus/recentDecisions2.cfm

Unfortunately, most of the URLs in the HTML are relative, taking the form:

'../../some/path/to/some/pdf.pdf'

I'm using lxml's make_links_absolute() function, which calls urljoin, creating invalid URLs like:

https://www.appeals2.az.gov/../Decisions/CR20130096OPN.pdf

If you put that into Firefox or wget or whatever, it works, despite being invalid and making no sense. 

It works because those clients fix the problem, joining the invalid path and the URL into:

https://www.appeals2.az.gov/Decisions/CR20130096OPN.pdf

I know this will mean making urljoin have a workaround to fix bad HTML, but this seems to be what wget, Chrome, Firefox, etc. all do. 

I've never filed a Python bug before, but is this something we could consider?
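For illustration, here is a minimal sketch of the kind of cleanup the browsers appear to be doing: join the URL as usual, then collapse "." and ".." segments in the path and discard any ".." that would climb above the root (roughly the "remove dot segments" step of RFC 3986). The helper names remove_dot_segments and join_and_normalize are hypothetical, not part of urllib or lxml, and the relative link below is an assumed example in the '../../...' form described above. Depending on the Python version, urljoin may already return the normalized form, in which case the helper changes nothing.

```python
from urllib.parse import urljoin, urlsplit, urlunsplit


def remove_dot_segments(path):
    """Collapse "." and ".." in an absolute URL path (hypothetical helper).

    Any ".." that would climb above the root is silently dropped, which is
    what the browsers described above seem to do.
    """
    segments = path.split("/")
    output = []
    for segment in segments:
        if segment == "..":
            # Step up one level, but never remove the leading empty
            # segment that stands for the root "/".
            if len(output) > 1:
                output.pop()
        elif segment != ".":
            output.append(segment)
    # A path ending in "." or ".." refers to a directory: keep the slash.
    if segments[-1] in (".", ".."):
        output.append("")
    return "/".join(output)


def join_and_normalize(base, link):
    """urljoin, then clean up the path of the result (hypothetical helper)."""
    parts = urlsplit(urljoin(base, link))
    return urlunsplit(parts._replace(path=remove_dot_segments(parts.path)))


base = "https://www.appeals2.az.gov/ODSPlus/recentDecisions2.cfm"
link = "../../Decisions/CR20130096OPN.pdf"  # assumed form of the relative link
print(join_and_normalize(base, link))
# https://www.appeals2.az.gov/Decisions/CR20130096OPN.pdf
```

The sketch only touches the path and assumes it is absolute, which holds whenever the base URL is absolute; the query string and fragment are passed through unchanged.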

History
Date	User	Action	Args
2014-08-01 13:38:49	Mike.Lissner	set	recipients: + Mike.Lissner
2014-08-01 13:38:49	Mike.Lissner	set	messageid: <1406900329.8.0.528379735401.issue22118@psf.upfronthosting.co.za>
2014-08-01 13:38:49	Mike.Lissner	link	issue22118 messages
2014-08-01 13:38:49	Mike.Lissner	create