Message 72161 - Python tracker

Author	terry.reedy
Recipients	Rhamphoryncus, benjamin.peterson, ezio.melotti, lemburg, terry.reedy
Date	2008-08-29.21:33:08
SpamBayes Score	1.0506762e-10
Marked as misclassified	No
Message-id	<1220045591.76.0.520722098503.issue3297@psf.upfronthosting.co.za>
In-reply-to

Content

"Just to clarify: Python can be built as UCS2 or UCS4 build (not UTF-16
vs. UTF-32)"

I recently read most of the Unicode 5 standard and as near as I could
tell it no longer uses the term UCS, if it ever did.  Chapter 3 has only
the following 3 hits.

1. "D79 A Unicode encoding form assigns each Unicode scalar value to a
unique code unit sequence.
• For historical reasons, the Unicode encoding forms are also referred
to as Unicode (or UCS) transformation formats (UTF). That term is
actually ambiguous between its usage for encoding forms and encoding
schemes."

2. "For a discussion of the relationship between UTF-32 and UCS-4
encoding form defined in ISO/IEC 10646, see Section C.2, Encoding Forms
in ISO/IEC 10646."

Section C.2 says "UCS-4 can now be taken effectively as an alias for the
Unicode encoding form UTF-32" and mentions the restriction of UCS-2 to
the BMP.

3. "ISO/IEC 10646 specifies an equivalent UTF-16 encoding form.
For details, see Section C.3, UCS Transformation Formats."

Unicode 5 has 3 encoding forms, which it names UTF-8, UTF-16, and UTF-32,
and 7 encoding schemes (serialization formats): the same three names, plus
the latter two with 'BE' or 'LE' appended.  So, to me, use of 'UCS' is
either confusing or misleading.
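
As a rough illustration of the forms-versus-schemes distinction (a sketch for a current Python 3, not something taken from the standard; the variable names are only for illustration), the character discussed below can be encoded under each of the seven scheme names that Python's codecs accept:

```python
# Sketch for a current Python 3; the character is the one from this issue.
ch = '\U00010123'

# The three encoding forms differ in code-unit width (8, 16, or 32 bits).
utf8_units = len(ch.encode('utf-8'))             # one-byte code units
utf16_units = len(ch.encode('utf-16-le')) // 2   # two-byte code units
utf32_units = len(ch.encode('utf-32-le')) // 4   # four-byte code units
print(utf8_units, utf16_units, utf32_units)

# The seven encoding schemes also fix the byte order (or prepend a BOM).
schemes = ['utf-8', 'utf-16', 'utf-16-be', 'utf-16-le',
           'utf-32', 'utf-32-be', 'utf-32-le']
encoded = {name: ch.encode(name) for name in schemes}
for name, data in encoded.items():
    print(f'{name:10} {data.hex()}')
```

The plain 'utf-16' and 'utf-32' codecs prepend a byte order mark in the platform's native order, so their output is longer than the explicit 'BE' and 'LE' variants.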

----------------------
"If it really was UCS-2, the repr wouldn't be u'\U00010123' on windows. 
It'd be a pair of ill-formed code units instead."

On WinXP, IDLE 3.0b2:
>>> repr('\U00010123') # u prefix no longer needed or valid
"'𐄣'"
>>> repr('\ud800\udd23')
"'𐄣'"
# Interesting: what I cut from IDLE has 2 empty boxes instead of the one
# larger square with 010 and 123 that I see in Firefox.
# len(repr('\U00010123')) is 4, not 3, so Firefox recognizes the surrogate
# pair and displays one symbol.
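
The two escapes above name the same character because '\ud800\udd23' is the UTF-16 surrogate pair for U+10123. A minimal sketch of the arithmetic follows; to_surrogates is a hypothetical helper written for this illustration, not a Python API. Note that on Python 3.3 and later len('\U00010123') is 1, so the length of 2 reported here is specific to narrow builds such as 3.0b2 on Windows.

```python
import struct

def to_surrogates(cp):
    """Split a supplementary code point into a UTF-16 (high, low) pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError('not a supplementary code point: %#x' % cp)
    offset = cp - 0x10000                 # 20 bits remain
    high = 0xD800 + (offset >> 10)        # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)       # bottom 10 bits
    return high, low

high, low = to_surrogates(0x10123)
print(hex(high), hex(low))

# Cross-check against Python's own UTF-16 encoder.
units = struct.unpack('>2H', '\U00010123'.encode('utf-16-be'))
print(units == (high, low))
```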

Entering either directly into the interpreter gives:
Python 3.0b2 (r30b2:65106, Jul 18 2008, 18:44:17) [MSC v.1500 32 bit (Intel)] on win32
>>> c='\U00010123'
>>> len(c)
2
>>> repr(c)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Program Files\Python30\lib\io.py", line 1428, in write
    b = encoder.encode(s)
  File "C:\Program Files\Python30\lib\encodings\cp437.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode characters in position 2-3: character maps to <undefined>

Python 2.5 instead gives "u'\\U00010123'", as reported, so I added 3.0 to
the list of versions with a problem.

I do wonder how repr() can work in IDLE but not in the underlying
interpreter.  Could IDLE be changing self.errors so that an <undefined>
character is passed through instead of raising an exception, with the
display then replacing such characters with empty boxes?
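
To show what a different errors setting would do, here is a sketch that encodes the same text to cp437 (the console code page named in the traceback) under three error handlers. It illustrates the codec behavior only and is not a claim about what IDLE actually does; the variable names are made up for this example.

```python
# Sketch: how the codec error handler changes the outcome for cp437.
text = "'" + '\U00010123' + "'"    # the quoted character, as repr would show it

# 'replace' substitutes '?', 'backslashreplace' writes an escape sequence.
replaced = text.encode('cp437', errors='replace')
escaped = text.encode('cp437', errors='backslashreplace')
print(replaced)
print(escaped)

# The default handler, 'strict', raises as in the traceback above.
try:
    text.encode('cp437')
    strict_failed = False
except UnicodeEncodeError as exc:
    strict_failed = True
    print('strict:', exc.reason)
```

An output stream opened with a non-strict handler would therefore print something for the character rather than raising.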

History
Date	User	Action	Args
2008-08-29 21:33:12	terry.reedy	set	recipients: + terry.reedy, lemburg, Rhamphoryncus, benjamin.peterson, ezio.melotti
2008-08-29 21:33:11	terry.reedy	set	messageid: <1220045591.76.0.520722098503.issue3297@psf.upfronthosting.co.za>
2008-08-29 21:33:10	terry.reedy	link	issue3297 messages
2008-08-29 21:33:08	terry.reedy	create