Author Mike.Lissner
Recipients Mike.Lissner
Date 2014-08-01.13:38:49
SpamBayes Score -1.0
Marked as misclassified Yes
Message-id <>
Not sure if this is desired behavior, but it's making my code break, so I figured I'd get it filed.

I'm trying to crawl this website:

Unfortunately, most of the URLs in the HTML are relative, taking the form:


I'm using lxml's make_links_absolute() function, which calls urljoin, creating invalid URLs like:

If you put that into Firefox or wget or whatever, it works, despite being invalid and making no sense.

It works because those clients fix the problem, joining the invalid path and the URL into:

I know this would mean adding a workaround to urljoin to cope with bad HTML, but this seems to be what wget, Chrome, Firefox, etc. all do.
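To make the mismatch concrete, here is a minimal sketch of the behavior described above, using urllib.parse.urljoin with hypothetical URLs (the real site and paths are omitted above). When a relative link contains more ".." segments than the base path has directories, browsers and wget discard the excess segments, while urljoin (at the time of this report) left them in place:

```python
from urllib.parse import urljoin

# Hypothetical base URL standing in for the page being crawled.
base = "http://example.com/docs/index.html"

# Normal case: a relative link resolves against the base's directory.
print(urljoin(base, "page2.html"))
# -> http://example.com/docs/page2.html

# Problem case: more ".." segments than the path is deep.
# Browsers and wget resolve this to http://example.com/page2.html;
# urljoin here kept the excess segments, yielding something like
# http://example.com/../page2.html instead.
print(urljoin(base, "../../page2.html"))
```

Depending on the Python version, the second call returns either the cleaned-up URL or one still containing "../"; the bug report asks for the cleaned-up behavior.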

I've never filed a Python bug before, but is this something we could consider?
Date User Action Args
2014-08-01 13:38:49	Mike.Lissner	set	recipients: + Mike.Lissner
2014-08-01 13:38:49	Mike.Lissner	set	messageid: <>
2014-08-01 13:38:49	Mike.Lissner	link	issue22118	messages
2014-08-01 13:38:49	Mike.Lissner	create