Author ncoghlan
Recipients eric.araujo, eric.smith, ncoghlan, orsenthil, pitrou, r.david.murray, vstinner
Date 2010-10-08.11:10:39
SpamBayes Score 5.22498e-08
Marked as misclassified No
Message-id <>
I've been pondering the idea of adopting a more conservative approach here, since there are actually two issues:

1. Properly quoted URLs are transferred as pure 7-bit ASCII (due to percent-encoding of everything else). However, most of the manipulation functions in urllib.parse can't handle bytes at all, even data that is 7-bit clean.

2. In the real world, just like email, URLs will often contain unescaped (or incorrectly escaped) characters. So assuming the input is actually pure ASCII isn't necessarily a valid assumption.

I'm wondering, since encoding (aside from quoting) isn't urllib.parse's problem, maybe what I should be looking at doing is just handling bytes input via an implicit ascii conversion in strict mode (and then conversion back when the processing is complete).

Then bytes manipulation of properly quoted URLs will "just work", while improperly quoted URLs will fail noisily. This isn't like email or http where the protocol contains encoding information that the library should be trying to interpret - we're just being given raw bytes without any context information.

If any application wants to be more permissive than that, it can do its own conversion to a string and then use the text-based processing. I'll add "encode" methods to the result objects to make it easy to convert their contents from str to bytes and vice-versa.

I'll factor out the implicit encoding/decoding such that if we decide to change the model later (ASCII-strict, ASCII-escape, latin-1) it shouldn't be too difficult.
Date User Action Args
2010-10-08 11:10:42ncoghlansetrecipients: + ncoghlan, orsenthil, pitrou, vstinner, eric.smith, eric.araujo, r.david.murray
2010-10-08 11:10:42ncoghlansetmessageid: <>
2010-10-08 11:10:41ncoghlanlinkissue9873 messages
2010-10-08 11:10:39ncoghlancreate