Message 118180 - Python tracker



Author	ncoghlan
Recipients	eric.araujo, eric.smith, ncoghlan, orsenthil, pitrou, r.david.murray, vstinner
Date	2010-10-08.11:10:39
Message-id	<1286536242.79.0.27722855036.issue9873@psf.upfronthosting.co.za>
In-reply-to

Content

I've been pondering the idea of adopting a more conservative approach here, since there are actually two issues:

1. Properly quoted URLs are transferred as pure 7-bit ASCII (due to percent-encoding of everything else). However, most of the manipulation functions in urllib.parse can't handle bytes at all, even data that is 7-bit clean.

2. In the real world, just like email, URLs will often contain unescaped (or incorrectly escaped) characters. So assuming the input is actually pure ASCII isn't necessarily a valid assumption.

Since encoding (aside from quoting) isn't urllib.parse's problem, I'm wondering whether I should just handle bytes input via an implicit ASCII decode in strict mode, converting back to bytes once the processing is complete.

Then bytes manipulation of properly quoted URLs will "just work", while improperly quoted URLs will fail noisily. This isn't like email or HTTP, where the protocol carries encoding information that the library should be trying to interpret - we're just being given raw bytes without any context information.
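
As a rough illustration of the idea (not the actual patch), a strict-ASCII round trip around the existing text-based urlsplit might look like the following. The helper name urlsplit_bytes and the example URLs are hypothetical.

```python
from urllib.parse import urlsplit


def urlsplit_bytes(url):
    """Hypothetical helper: split a bytes URL via a strict ASCII round trip."""
    # Strict decoding raises UnicodeDecodeError for any byte >= 0x80,
    # so an improperly quoted URL fails noisily instead of being guessed at.
    text = url.decode("ascii", "strict")
    parts = urlsplit(text)  # the existing text-based processing
    # Every component came from ASCII text, so encoding back cannot fail.
    return tuple(part.encode("ascii") for part in parts)


# A properly quoted URL just works.
good = urlsplit_bytes(b"http://example.com/a%20b?x=1#frag")

# An unescaped non-ASCII byte is rejected.
try:
    urlsplit_bytes(b"http://example.com/caf\xe9")
    rejected = False
except UnicodeDecodeError:
    rejected = True
```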

If any application wants to be more permissive than that, it can do its own conversion to a string and then use the text-based processing. I'll add "encode" methods to the result objects to make it easy to convert their contents from str to bytes and vice-versa.
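
For instance, an application that prefers to be permissive could pick its own codec before calling the text-based API. The sketch below assumes latin-1 purely as one possible choice (it maps every byte to a code point, so decoding never fails); the URL is made up.

```python
from urllib.parse import urlsplit

# Assumption: the application has decided latin-1 is acceptable for its data.
raw = b"http://example.com/caf\xe9?q=1"

text = raw.decode("latin-1")   # the application's own conversion to str
parts = urlsplit(text)         # ordinary text-based processing

# Convert whichever components are needed back to bytes with the same codec.
path_bytes = parts.path.encode("latin-1")
query_bytes = parts.query.encode("latin-1")
```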

I'll factor out the implicit encoding/decoding such that if we decide to change the model later (ASCII-strict, ASCII-escape, latin-1) it shouldn't be too difficult.
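
One way this factoring could look, as a sketch only: the codec and error handler live in two module-level names, and a single helper handles the conversion in both directions. The names IMPLICIT_ENCODING, IMPLICIT_ERRORS, coerce_args and split_any are hypothetical and not taken from any patch.

```python
from urllib.parse import urlsplit

# Hypothetical policy knobs: changing the model later means changing these.
IMPLICIT_ENCODING = "ascii"
IMPLICIT_ERRORS = "strict"


def coerce_args(*args):
    """Return (text_args, restore); restore converts a str result back."""
    if all(isinstance(a, str) for a in args):
        return args, lambda s: s  # str in, str out: nothing to convert
    if not all(isinstance(a, (bytes, bytearray)) for a in args):
        raise TypeError("cannot mix str and bytes arguments")
    decoded = tuple(a.decode(IMPLICIT_ENCODING, IMPLICIT_ERRORS) for a in args)
    return decoded, lambda s: s.encode(IMPLICIT_ENCODING, IMPLICIT_ERRORS)


def split_any(url):
    """Split a str or bytes URL, returning components of the same type."""
    (text,), restore = coerce_args(url)
    return tuple(restore(part) for part in urlsplit(text))


str_parts = split_any("http://a/b")
bytes_parts = split_any(b"http://a/b")
```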

History
Date	User	Action	Args
2010-10-08 11:10:42	ncoghlan	set	recipients: + ncoghlan, orsenthil, pitrou, vstinner, eric.smith, eric.araujo, r.david.murray
2010-10-08 11:10:42	ncoghlan	set	messageid: <1286536242.79.0.27722855036.issue9873@psf.upfronthosting.co.za>
2010-10-08 11:10:41	ncoghlan	link	issue9873 messages
2010-10-08 11:10:39	ncoghlan	create