Message 70958 - Python tracker

Author	mgiuca
Recipients	gvanrossum, janssen, jimjjewett, lemburg, loewis, mgiuca, orsenthil, pitrou, thomaspinckney3
Date	2008-08-10.03:09:50
Message-id	<1218337795.42.0.906599600861.issue3300@psf.upfronthosting.co.za>
In-reply-to

Content

> Bill's main concern is with a policy decision; I doubt he would
> object to using your code once that is resolved.

But his patch does the same basic operations as mine, just implemented
differently and with the heap of issues I outlined above. So it doesn't
have anything to do with the policy decision.

> The purpose of the quoting functions is to turn a string
> (representing the human-readable version) into bytes (that go
> over the wire).

Ah hang on, that's a misunderstanding. There is a two-step process involved.

Step 1. Translate <character/byte> string into an ASCII character string
by percent-encoding the <characters/bytes>. (If percent-encoding
characters, use an unspecified encoding).
Step 2. Serialize the ASCII character string into an octet sequence to
send it over the wire, using some unspecified encoding.
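
As a rough illustration of the two steps, here is a minimal sketch using
urllib.parse.quote as it exists in current Python 3. That is an assumption:
it is not necessarily the API of either patch under discussion, and the
variable names are only illustrative.

```python
from urllib.parse import quote

# Human-readable text containing a non-ASCII character ("brückner").
text = "br\u00fcckner"

# Step 1: percent-encode. The result is still a character string (str),
# made up only of ASCII characters. UTF-8 is chosen here explicitly;
# the RFC itself does not fix this encoding.
percent_encoded = quote(text, encoding="utf-8")

# Step 1 can also start from bytes instead of characters.
percent_encoded_from_bytes = quote(b"br\xc3\xbcckner")

# Step 2: serialize the ASCII character string into octets for the wire.
# This is a separate step, defined by the protocol rather than the URI spec.
wire_octets = percent_encoded.encode("ascii")

print(percent_encoded)   # a str
print(wire_octets)       # a bytes object
```

The point of the sketch is that the output of Step 1 is a str in both cases;
bytes only appear once Step 2 is applied.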

Step 1 is explained in detail throughout the RFC (RFC 3986), particularly
in Section 1.2.1 Transcription ("Percent-encoded octets may be used within
a URI to represent characters outside the range of the US-ASCII coded
character set") and Section 2.1 Percent-Encoding.

Step 2 is not actually part of the spec (because the spec outlines URIs
as character sequences, not how to send them over a network). It is
briefly described in Section 2 ("This specification does not mandate any
particular character encoding for mapping between URI characters and the
octets used to store or transmit those characters.  When a URI appears
in a protocol element, the character encoding is defined by that protocol").

Section 1.2.1:

> A URI may be represented in a variety of ways; e.g., ink on
> paper, pixels on a screen, or a sequence of character
> encoding octets.  The interpretation of a URI depends only on
> the characters used and not on how those characters are
> represented in a network protocol.

The RFC then goes on to describe a scenario of writing a URI down on a
napkin, before stating:

> A URI is a sequence of characters that is not always represented
> as a sequence of octets.

Right, so there is no debate that a URI (after percent-encoding) is a
character string, not a byte string. The debate is only whether it's a
character or byte string before percent-encoding.

Therefore, the concept of "quote_as_bytes" is flawed.

> You feel wire-protocol bytes should be treated as
> strings, if only as bytestrings, because the libraries use them
> that way.

No, I do not. URIs post-encoding are character strings, in the Unicode
sense of the term "character". This entire topic has nothing to do with
the wire.

Note that the "charset" parameter in Bill's patch and the "encoding"
parameter in mine are not the mapping from URI strings to octets (that's
trivially ASCII). They name the charset used to encode character
information into octets, which then get percent-encoded.
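
To make the role of that parameter concrete, here is a small sketch,
assuming the encoding keyword of urllib.parse.quote in current Python 3
(the parameter names in the 2008 patches may have differed).

```python
from urllib.parse import quote

char = "\u00fc"  # the single character "ü"

# The encoding decides which octets represent the character
# *before* those octets are percent-encoded.
as_utf8 = quote(char, encoding="utf-8")      # two octets: C3 BC
as_latin1 = quote(char, encoding="latin-1")  # one octet: FC

# Either way, the percent-encoded result is a plain ASCII character string.
print(as_utf8, as_utf8.isascii())
print(as_latin1, as_latin1.isascii())
```

The choice of encoding changes which octets get percent-encoded, but never
the type or character set of the resulting URI string.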

> The old code (and test cases) assumed Latin-1.

No, the old code and test cases were written for Python 2.x. They
assumed a byte string was being emitted (back when a byte string was a
string, so that was an acceptable output type). So they weren't assuming
an encoding. In fact the *ONLY* test case for Unicode in test_urllib
used a UTF-8-encoded string.

> r = urllib.parse.unquote('br%C3%BCckner_sapporo_20050930.doc')
> self.assertEqual(r, 'br\xc3\xbcckner_sapporo_20050930.doc')

In Python 2.x, this test case says "unquote('%C3%BC') should give me the
byte sequence '\xc3\xbc'", which is a valid case. In Python 3.0, the
code didn't change but the meaning subtly did. Now it says
"unquote('%C3%BC') should give the string 'Ã¼'". The name is clearly
supposed to be "brückner", not "brÃ¼ckner", which means in Python 3.0 we
should EITHER be expecting the BYTE string b'\xc3\xbc' or the character
string 'ü'.

So the old code and test cases didn't assume any encoding; they were
accidentally made to assume Latin-1 when the language changed underneath
them.
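
The three readings of that test case can be reproduced with a short sketch.
It assumes the unquote and unquote_to_bytes functions of urllib.parse in
current Python 3, which postdate this message; the file name is the one
from the test case.

```python
from urllib.parse import unquote, unquote_to_bytes

quoted = "br%C3%BCckner_sapporo_20050930.doc"

# Reading 1: the accidental Latin-1 interpretation. Each percent-encoded
# octet becomes one character, which produces mojibake.
as_latin1 = unquote(quoted, encoding="latin-1")

# Reading 2: decode the octets as UTF-8 to recover the intended character.
as_utf8 = unquote(quoted, encoding="utf-8")

# Reading 3: do not pick an encoding at all and return the raw octets,
# which is what the Python 2.x test effectively asserted.
as_bytes = unquote_to_bytes(quoted)

print(ascii(as_latin1))  # contains the two characters U+00C3 and U+00BC
print(ascii(as_utf8))    # contains the single character U+00FC
print(as_bytes)          # contains the two octets 0xC3 0xBC
```

Only the second and third readings match the intended name; the first is
what the unchanged Python 2.x assertion ends up meaning in Python 3.0.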
