Author janssen
Recipients gvanrossum, janssen, jimjjewett, lemburg, loewis, mgiuca, orsenthil, pitrou, thomaspinckney3
Date 2008-08-08.01:34:03
SpamBayes Score 1.8547e-07
Marked as misclassified No
Message-id <1218159245.24.0.0467802626348.issue3300@psf.upfronthosting.co.za>
In-reply-to
Content
Now I'm looking at the failing test_http_cookiejar test, which fails
because it encodes a non-UTF-8 byte, 0xE5, in a path segment of a URI.
The question is, does the "http" URI scheme allow non-ASCII (say,
Latin-1) octets in path segments?  IANA says that the "http" scheme
is defined in RFC 2616, and that says:

   This specification adopts the
   definitions of "URI-reference", "absoluteURI", "relativeURI", "port",
   "host","abs_path", "rel_path", and "authority" from [RFC 2396].

But RFC 2396 says:

    An individual URI scheme may require a single charset, define a
    default charset, or provide a way to indicate the charset used.

And doesn't say anything about the "http" scheme.  Nor does it indicate
any default encoding or character set for URIs.  The update, 3986,
doesn't say anything new about this, though it does implore URI scheme
designers to represent characters in a textual segment with ASCII codes
where they exist, and to use UTF-8 when designing *new* URI schemes.

Barring any other information, I think that the "segments" in the path
of an "http" URL must also be assumed to be binary; that is, any octet
is allowed, and no character set can be presumed.
History
Date User Action Args
2008-08-08 01:34:05janssensetrecipients: + janssen, lemburg, gvanrossum, loewis, jimjjewett, orsenthil, pitrou, thomaspinckney3, mgiuca
2008-08-08 01:34:05janssensetmessageid: <1218159245.24.0.0467802626348.issue3300@psf.upfronthosting.co.za>
2008-08-08 01:34:04janssenlinkissue3300 messages
2008-08-08 01:34:03janssencreate