Message 70878 - Python tracker

Author	janssen
Recipients	gvanrossum, janssen, jimjjewett, lemburg, loewis, mgiuca, orsenthil, pitrou, thomaspinckney3
Date	2008-08-08.01:34:03
Message-id	<1218159245.24.0.0467802626348.issue3300@psf.upfronthosting.co.za>

Content
Now I'm looking at the failing test_http_cookiejar test, which fails
because it encodes a non-UTF-8 byte, 0xE5, in a path segment of a URI.
The question is, does the "http" URI scheme allow non-ASCII (say,
Latin-1) octets in path segments?  IANA says that the "http" scheme
is defined in RFC 2616, and that says:

   This specification adopts the
   definitions of "URI-reference", "absoluteURI", "relativeURI", "port",
   "host","abs_path", "rel_path", and "authority" from [RFC 2396].

But RFC 2396 says:

    An individual URI scheme may require a single charset, define a
    default charset, or provide a way to indicate the charset used.

RFC 2396 doesn't say anything about the "http" scheme, nor does it
indicate any default encoding or character set for URIs.  The update,
RFC 3986, doesn't say anything new about this, though it does implore
URI scheme designers to represent characters in a textual segment with
ASCII codes where they exist, and to use UTF-8 when designing *new* URI
schemes.

Barring any other information, I think that the "segments" in the path
of an "http" URL must also be assumed to be binary; that is, any octet
is allowed, and no character set can be presumed.
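
To make the "binary segment" reading concrete, here is a minimal sketch.
It assumes the urllib.parse functions as they exist in current Python 3
(quote_from_bytes, unquote_to_bytes, unquote), which may differ from the
API under discussion in this issue; the one-octet segment is a
hypothetical stand-in for the path in the failing test.

```python
from urllib.parse import quote_from_bytes, unquote, unquote_to_bytes

# Hypothetical path segment: the single octet 0xE5 from the failing test.
raw_segment = b"\xe5"

# Percent-encoding operates on octets, so no charset is involved.
encoded = quote_from_bytes(raw_segment)

# Decoding back to octets is lossless and equally charset-free.
octets = unquote_to_bytes(encoded)

# Turning the octets into text requires picking a charset.
# Latin-1 maps every octet to a character, so this always succeeds.
as_latin1 = octets.decode("latin-1")

# A lone 0xE5 is not valid UTF-8, so a strict decode fails.
try:
    octets.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# unquote() defaults to UTF-8 with errors="replace", so the original
# octet is silently lost and replaced with U+FFFD.
lossy = unquote(encoded)

print(encoded, octets, repr(as_latin1), utf8_ok, repr(lossy))
```

Only the octet-level round trip preserves the segment without guessing;
each text-level decode has to presume a character set that the URI
itself does not state.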
