Title: UTF-16-LE and UTF-16-BE support non-BMP characters
Type:
Stage: commit review
Components: Documentation, Unicode
Versions: Python 3.1, Python 3.2
Status: closed
Resolution: fixed
Dependencies:
Superseder:
Assigned To: docs@python
Nosy List: belopolsky, cgw, docs@python, lemburg, terry.reedy, vstinner
Priority: normal
Keywords: patch

Created on 2010-11-26 21:08 by vstinner, last changed 2010-12-08 22:26 by vstinner. This issue is now closed.

Files
utf_16_bmp.patch (uploaded by vstinner, 2010-11-26 21:08)
Messages (5)
msg122479 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-11-26 21:08
The Python 3 documentation says that UTF-16-LE and UTF-16-BE only support BMP characters. I think that is wrong.

That may have been the case with Python 2 on a narrow build (unichr() only supports BMP characters there), but it is no longer true in Python 3.
msg123650 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2010-12-08 20:41
Marc or Alexander, can you confirm that the patch is correct?
msg123651 - (view) Author: Alexander Belopolsky (belopolsky) * (Python committer) Date: 2010-12-08 21:48
If Victor says so ...

Someone needs to check that it works on a UCS4 build, but on a narrow build I don't think UTF-16-XX encodings need to do anything special - they just encode the surrogates as ordinary code units.

>>> '\U00010000'.encode('UTF-16-BE').decode('UTF-16-BE') == '\U00010000'
>>> '\U00010000'.encode('UTF-16-LE').decode('UTF-16-LE') == '\U00010000'
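
As an illustration of the two checks above, the following minimal sketch also prints the code units that the codecs produce. It assumes a current Python 3 (bytes.hex() needs 3.5 or later), and the variable names are made up for the example:

```python
# U+10000 is the first code point outside the BMP.
ch = '\U00010000'

be = ch.encode('utf-16-be')
le = ch.encode('utf-16-le')

# In UTF-16, U+10000 is the surrogate pair D800 DC00,
# i.e. two 16-bit code units (4 bytes) in either byte order.
print(be.hex())  # d800dc00
print(le.hex())  # 00d800dc

# Both codecs round-trip the character.
assert be.decode('utf-16-be') == ch
assert le.decode('utf-16-le') == ch
```

The byte order is the only difference between the two outputs; neither codec writes a BOM.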
msg123654 - (view) Author: Alexander Belopolsky (belopolsky) * (Python committer) Date: 2010-12-08 22:04
I have verified that UTF-16-XX encodings work on wide build.  The doc change LGTM.  Bonus points for checking that we have unit tests for these encodings that include non-BMP characters.
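
A unit test of the kind mentioned above might look roughly like the sketch below. It is a hypothetical example, not taken from CPython's test suite; the class name and sample strings are assumptions chosen for illustration, and subTest() requires Python 3.4 or later:

```python
import unittest


class NonBMPRoundTripTest(unittest.TestCase):
    # Hypothetical samples: two non-BMP characters on their own and
    # one non-BMP character surrounded by BMP characters.
    SAMPLES = ['\U00010000', '\U0001F600', 'a\U0010FFFFb']

    def test_round_trip(self):
        for codec in ('utf-16-le', 'utf-16-be'):
            for text in self.SAMPLES:
                with self.subTest(codec=codec, text=ascii(text)):
                    data = text.encode(codec)
                    # Decoding must give back the original string.
                    self.assertEqual(data.decode(codec), text)
                    # A non-BMP character takes 4 bytes (a surrogate
                    # pair), a BMP character takes 2 bytes.
                    expected = sum(4 if ord(c) > 0xFFFF else 2
                                   for c in text)
                    self.assertEqual(len(data), expected)
```

Saved in a module, it can be run with python -m unittest.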
msg123657 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-12-08 22:26
Fixed by r87135.
Date                 User         Action  Args
2010-12-08 22:26:33  vstinner     set     status: open -> closed
                                          resolution: fixed
                                          messages: + msg123657
2010-12-08 22:05:50  belopolsky   set     nosy: lemburg, terry.reedy, cgw, belopolsky, vstinner, docs@python
                                          components: + Unicode
2010-12-08 22:04:12  belopolsky   set     messages: + msg123654
2010-12-08 21:48:09  belopolsky   set     messages: + msg123651
2010-12-08 20:57:30  terry.reedy  set     assignee: docs@python
2010-12-08 20:57:07  terry.reedy  set     assignee: cgw -> (no value)
2010-12-08 20:41:04  terry.reedy  set     nosy: + terry.reedy, belopolsky, cgw, lemburg
                                          messages: + msg123650
                                          assignee: docs@python -> cgw
                                          stage: commit review
2010-11-26 21:08:30  vstinner     create