Message 72316 - Python tracker


Author	Rhamphoryncus
Recipients	Rhamphoryncus, benjamin.peterson, ezio.melotti, lemburg, terry.reedy
Date	2008-09-02.06:51:53
Message-id	<1220338314.81.0.154303338973.issue3297@psf.upfronthosting.co.za>

Content
Marc, I don't understand what you're saying.  UTF-16's surrogates are
not optional.  Unicode 2.0 and later require them, and Python is
supposed to support it.
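
As a rough illustration of why surrogates are a required part of UTF-16, the
sketch below computes the surrogate pair for a code point above U+FFFF and
compares it with Python's own codec. The helper name to_surrogate_pair is
hypothetical, and the snippet assumes Python 3.

```python
def to_surrogate_pair(code_point):
    """Split a supplementary code point into UTF-16 (high, low) surrogates.

    Hypothetical helper for illustration only; not part of Python.
    """
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only U+10000..U+10FFFF need a surrogate pair")
    offset = code_point - 0x10000        # 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits
    return high, low


# U+1D11E (MUSICAL SYMBOL G CLEF) lies outside the BMP.
high, low = to_surrogate_pair(0x1D11E)

# The built-in UTF-16 codec emits the same two 16-bit code units.
encoded = "\U0001D11E".encode("utf-16-be")
expected = high.to_bytes(2, "big") + low.to_bytes(2, "big")
print(hex(high), hex(low), encoded == expected)
```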

Likewise, UCS-4 originally allowed a much larger range of code points,
but it no longer does; allowing them would mean supporting only old,
archaic versions of the standards (which is clearly not desirable).
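
A small sketch of the current upper limit, assuming Python 3.3 or later
(where sys.maxunicode is always 0x10FFFF). The variable names are
illustrative only.

```python
import sys

# The highest code point the current standards allow.
max_code_point = sys.maxunicode

# chr() refuses anything above that limit.
try:
    chr(0x110000)
    chr_rejected = False
except ValueError:
    chr_rejected = True

# The UTF-32 decoder likewise treats an out-of-range value as an error.
out_of_range = (0x110000).to_bytes(4, "little")
try:
    out_of_range.decode("utf-32-le")
    decode_rejected = False
except UnicodeDecodeError:
    decode_rejected = True

print(hex(max_code_point), chr_rejected, decode_rejected)
```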

You are right in that I shouldn't have said "a pair of ill-formed code
units".  I should have said "a pair of unassigned code points", which is
how UCS-2 always has and always will classify them.

Although Python may allow ill-formed sequences to be created internally
(primarily lone surrogates on UTF-16 builds), it cannot encode or decode
them.  The standard is clear that these are to be treated as errors,
which .decode()'s "errors" argument controls.  You could add a new
value for "errors" to pass through the garbage, but I fail to see a use
case for it.
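
The behaviour described above can be sketched as follows. This assumes
Python 3.4 or later, which is not the version under discussion here; the
"surrogatepass" handler shown at the end was added in later Python 3
releases and is included only as an example of a pass-through value for
"errors".

```python
# A lone high surrogate (U+D800) followed by "A", as UTF-16 little-endian.
ill_formed = b"\x00\xd8" + "A".encode("utf-16-le")

# Default errors="strict": the ill-formed sequence is rejected.
try:
    ill_formed.decode("utf-16-le")
    strict_raised = False
except UnicodeDecodeError:
    strict_raised = True

# errors="replace" substitutes U+FFFD; errors="ignore" drops the bad unit.
replaced = ill_formed.decode("utf-16-le", errors="replace")
ignored = ill_formed.decode("utf-16-le", errors="ignore")

# Encoding a lone surrogate is rejected in the same way.
try:
    "\ud800".encode("utf-16-le")
    encode_raised = False
except UnicodeEncodeError:
    encode_raised = True

# "surrogatepass" (later Python 3 releases) passes the surrogate through.
passed = "\ud800".encode("utf-16-le", errors="surrogatepass")

print(strict_raised, repr(replaced), repr(ignored), encode_raised, passed)
```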
