Author jimjjewett
Recipients gvanrossum, janssen, jimjjewett, lemburg, loewis, mgiuca, orsenthil, pitrou, thomaspinckney3
Date 2008-08-10.00:35:08
Message-id <1218328512.04.0.38331607788.issue3300@psf.upfronthosting.co.za>
In-reply-to
Content
Matt,

Bill's main concern is with a policy decision; I doubt he would object to 
using your code once that is resolved.

The purpose of the quoting functions is to turn a string (representing the
human-readable version) into bytes (that go over the wire).  If everything
is ASCII, there isn't any disagreement about the result -- but it also isn't
obvious that the output is bytes rather than characters.  So people started
(well, continued, since the habit dates to pre-Unicode C) treating the
quoted output as though it were a string.

The fact that ASCII (and therefore most wire protocols) looks the same
whether it is read as bytes or as characters was one of the strongest
arguments against splitting the bytes and string types.  Now that the split
has been made, Bill feels we should be consistent.  (You feel wire-protocol
bytes should be treated as strings, if only as bytestrings, because the
libraries use them that way -- but this is a policy decision.)
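
As a rough illustration of why the ASCII case hides the distinction, here is
a minimal sketch.  It assumes the urllib.parse API as found in current
Python 3 (quote and quote_from_bytes), which is not necessarily the code
under review in this issue; the sample value is made up.

```python
from urllib.parse import quote, quote_from_bytes

# Hypothetical sample value; any pure-ASCII text behaves the same way.
text = "a b/c"
as_bytes = text.encode("ascii")

# Quoting the str and quoting the equivalent bytes give identical output,
# so nothing in the result reveals which type the caller started from.
quoted_from_str = quote(text)
quoted_from_bytes = quote_from_bytes(as_bytes)

print(quoted_from_str, quoted_from_bytes)
```

The two results only start to differ in meaning once non-ASCII characters
are involved, because then an encoding has to be chosen.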

To quote the final paragraph of RFC 3986, section 1.2.1:
"""
 In local or regional contexts and with improving technology, users
   might benefit from being able to use a wider range of characters;
   such use is not defined by this specification.  Percent-encoded
   octets (Section 2.1) may be used within a URI to represent characters
   outside the range of the US-ASCII coded character set if this
   representation is allowed by the scheme or by the protocol element in
   which the URI is referenced.  Such a definition should specify the
   character encoding used to map those characters to octets prior to
   being percent-encoded for the URI.
"""

So the mapping to bytes (or "octets") for non-ASCII characters isn't defined
(here), and if you want to use them, you need to specify a charset.  But in
practice, people do use them without specifying a charset.  Which charset
should be assumed?  The old code (and test cases) assumed Latin-1.  You want
to assume UTF-8 (though you took the document charset when available --
which might also make sense).
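
To make the charset question concrete, here is a small sketch.  It assumes
the encoding parameter of urllib.parse.quote and unquote as they exist in
current Python 3, which may differ from the patch being discussed; the
sample word is made up.

```python
from urllib.parse import quote, unquote

# Hypothetical sample value containing one non-ASCII character.
name = "caf\u00e9"  # "café"

# The same character maps to different octets under each assumption.
latin1_form = quote(name, encoding="latin-1")  # one octet for the é
utf8_form = quote(name, encoding="utf-8")      # two octets for the é

# Decoding with the charset that was used to encode round-trips cleanly.
round_trip = unquote(latin1_form, encoding="latin-1")

# Decoding UTF-8 octets as Latin-1 silently yields the wrong characters.
mismatch = unquote(utf8_form, encoding="latin-1")

print(latin1_form, utf8_form, round_trip, mismatch)
```

Neither result is wrong by the RFC; the receiver simply has to guess the
same charset the sender used, which is the policy question here.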