Message 70955 - Python tracker


Author	jimjjewett
Recipients	gvanrossum, janssen, jimjjewett, lemburg, loewis, mgiuca, orsenthil, pitrou, thomaspinckney3
Date	2008-08-10.00:35:08
Message-id	<1218328512.04.0.38331607788.issue3300@psf.upfronthosting.co.za>
In-reply-to

Content

Matt,

Bill's main concern is with a policy decision; I doubt he would object to 
using your code once that is resolved.

The purpose of the quoting functions is to turn a string (representing the 
human-readable version) into bytes (that go over the wire).  If everything 
is ASCII, there isn't any disagreement -- but it also isn't obvious that 
the results are bytes rather than characters.  So people started (well, 
continued, since it dates to pre-Unicode C) treating them as though they 
were strings.
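
(As a rough illustration of why the ASCII case hides the distinction: the 
sketch below uses quote and quote_from_bytes as they exist in current 
Python 3 urllib.parse, which is an assumption on my part and not 
necessarily the code under discussion here.  The sample value is made up.)

```python
from urllib.parse import quote, quote_from_bytes

# Hypothetical sample value; any pure-ASCII text behaves the same way.
text = "a b/c"

# The same content, once as characters and once as wire bytes.
wire = text.encode("ascii")

# safe="" so that "/" is percent-encoded as well.
quoted_from_text = quote(text, safe="")
quoted_from_bytes = quote_from_bytes(wire, safe="")

# Both calls give the same ASCII result, so nothing in the output
# tells you whether the input was characters or bytes.
print(quoted_from_text)
print(quoted_from_bytes)
```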

The fact that ASCII (and therefore most wire protocols) looks the same as 
bytes or as characters was one of the strongest arguments against splitting 
the bytes and string types.  Now that this has been done, Bill feels we 
should be consistent.  (You feel wire-protocol bytes should be treated as 
strings, if only as bytestrings, because the libraries use them that way -- 
but this is a policy decision.)

To quote the final paragraph of RFC 3986, section 1.2.1:
"""
 In local or regional contexts and with improving technology, users
   might benefit from being able to use a wider range of characters;
   such use is not defined by this specification.  Percent-encoded
   octets (Section 2.1) may be used within a URI to represent characters
   outside the range of the US-ASCII coded character set if this
   representation is allowed by the scheme or by the protocol element in
   which the URI is referenced.  Such a definition should specify the
   character encoding used to map those characters to octets prior to
   being percent-encoded for the URI.
"""

So the mapping to bytes (or "octets") for non-ASCII isn't defined (here), 
and if you want to use it, you need to specify a charset.  But in practice, 
people do use it without specifying a charset.  Which charset should be 
assumed?  The old code (and test cases) assumed Latin-1.  You want to 
assume UTF-8 (though you took the document charset when available -- which 
might also make sense).
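
(To make the ambiguity concrete, here is a small sketch of what the two 
assumptions produce for the same string.  It relies on the encoding 
parameter of quote and unquote in current Python 3 urllib.parse, which is 
an assumption about the eventual API rather than the patch being reviewed; 
the sample string is made up.)

```python
from urllib.parse import quote, unquote

# Hypothetical sample: "cafe" with an acute accent on the e (U+00E9).
name = "caf\u00e9"

# Latin-1 maps U+00E9 to the single octet 0xE9.
as_latin1 = quote(name, encoding="latin-1")

# UTF-8 maps U+00E9 to the two octets 0xC3 0xA9.
as_utf8 = quote(name, encoding="utf-8")

# A receiver that guesses the wrong charset gets different characters back.
misread = unquote(as_utf8, encoding="latin-1")

print(as_latin1)  # caf%E9
print(as_utf8)    # caf%C3%A9
print(misread == name)  # False
```

The two percent-encoded forms are both valid URIs, which is why the choice 
of default is a policy question and not something the receiver can detect 
from the URI alone.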

History
Date	User	Action	Args
2008-08-10 00:35:12	jimjjewett	set	recipients: + jimjjewett, lemburg, gvanrossum, loewis, janssen, orsenthil, pitrou, thomaspinckney3, mgiuca
2008-08-10 00:35:12	jimjjewett	set	messageid: <1218328512.04.0.38331607788.issue3300@psf.upfronthosting.co.za>
2008-08-10 00:35:10	jimjjewett	link	issue3300 messages
2008-08-10 00:35:08	jimjjewett	create