classification
Title: Add support for CESU-8 encoding
Type: enhancement Stage: resolved
Components: Library (Lib), Unicode Versions: Python 3.3
process
Status: closed Resolution: rejected
Dependencies: Superseder:
Assigned To: Nosy List: ezio.melotti, lemburg, moese
Priority: normal Keywords:

Created on 2011-08-12 14:01 by moese, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Messages (4)
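msg141958 - Author: Moese (moese) Date: 2011-08-12 14:01
CESU-8 is identical to UTF-8 except that it encodes supplementary characters (those outside the BMP) differently: each one is represented as a UTF-16 surrogate pair, and each surrogate is encoded as its own 3-byte sequence.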

http://en.wikipedia.org/wiki/CESU-8

It is used by some web APIs.
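
To illustrate the difference described above, here is a minimal sketch of a CESU-8 encoder. The function name `cesu8_encode` is hypothetical, not an existing Python API; the sketch relies only on the standard `surrogatepass` error handler and passes lone surrogates through unchanged, which a stricter implementation might reject.

```python
def cesu8_encode(text):
    """Encode a str to CESU-8 bytes (illustrative sketch, not a stdlib API)."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp < 0x10000:
            # BMP characters use the same bytes as UTF-8
            # (lone surrogates are passed through as 3-byte sequences).
            out += ch.encode("utf-8", "surrogatepass")
        else:
            # Split the supplementary character into a UTF-16 surrogate
            # pair, then encode each surrogate as its own 3-byte sequence.
            cp -= 0x10000
            high = 0xD800 + (cp >> 10)
            low = 0xDC00 + (cp & 0x3FF)
            out += chr(high).encode("utf-8", "surrogatepass")
            out += chr(low).encode("utf-8", "surrogatepass")
    return bytes(out)


print("\U00010400".encode("utf-8"))  # 4 bytes in UTF-8
print(cesu8_encode("\U00010400"))    # 6 bytes in CESU-8
```

For text that stays inside the BMP the two encodings produce identical bytes, so the difference only shows up for characters such as U+10400.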
msg143020 - Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2011-08-26 16:47
Can you provide some examples?
The page you linked says "It should be used exclusively for internal processing and never for external data exchange.", so I'm not sure why these APIs would want to use it.
msg143138 - Author: Moese (moese) Date: 2011-08-29 11:50
It's an internal web API at the place where I work.

To be able to use it from Python in some form, I wrote a workaround in which I just stripped everything outside the BMP:

# Replace characters outside the BMP (4-byte UTF-8 sequences) with
# 'REPLACEMENT CHARACTER' (U+FFFD). Expects and returns a byte string.
def cesu8_to_utf8(text):
    result = b""
    index = 0
    length = len(text)
    while index < length:
        # Slice rather than index so this yields a byte string on Python 3 too.
        byte = text[index:index + 1]
        if byte < b"\xf0":
            result += byte
            index += 1
        else:
            # Lead byte of a 4-byte sequence: skip it and emit U+FFFD.
            result += b"\xef\xbf\xbd"  # u"\ufffd".encode("utf8")
            index += 4
    return result

Now that I look at the workaround again, I'm not even sure it's about CESU-8 (it strips Unicode characters encoded as 4 bytes, not as a pair of 3-byte surrogates).

However I can see why there would be little interest in adding this encoding.
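
The distinction noted above (4-byte UTF-8 sequences versus pairs of 3-byte surrogate sequences) can be made concrete with a small sketch that converts real CESU-8 input to standard UTF-8 without losing characters. The helper name `cesu8_bytes_to_utf8_bytes` is hypothetical, and the sketch assumes Python 3.4 or later, where the `surrogatepass` error handler also works with the UTF-16 codecs. It is lenient: it also accepts ordinary 4-byte UTF-8 sequences, which strict CESU-8 would not allow.

```python
def cesu8_bytes_to_utf8_bytes(data):
    """Convert CESU-8 bytes to standard UTF-8 bytes (hypothetical helper)."""
    # 'surrogatepass' lets the UTF-8 decoder accept the 3-byte surrogate
    # sequences that CESU-8 uses, yielding lone surrogate code points.
    text = data.decode("utf-8", "surrogatepass")
    # Round-tripping through UTF-16 merges each high/low surrogate pair
    # into the single supplementary character it represents.
    text = text.encode("utf-16-le", "surrogatepass").decode("utf-16-le")
    return text.encode("utf-8")


cesu8 = b"\xed\xa0\x81\xed\xb0\x80"  # U+10400 as two 3-byte surrogates
utf8 = cesu8_bytes_to_utf8_bytes(cesu8)
print(len(cesu8), len(utf8))
```

An unpaired surrogate in the input makes the final UTF-16 decode raise `UnicodeDecodeError`, which seems a reasonable default for malformed data.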
msg143139 - Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2011-08-29 12:22
I'm going to reject this.  If people need it, they can always implement it using the codecs module.
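
As a rough sketch of what implementing it with the codecs module could look like: a search function registered through `codecs.register` returns a `codecs.CodecInfo` for the name "cesu-8". The helper names `_cesu8_encode`, `_cesu8_decode` and `_cesu8_search` are hypothetical. This stateless version ignores the `errors` argument, provides no incremental or stream classes, and assumes Python 3.4 or later for `surrogatepass` support in the UTF-16 codecs.

```python
import codecs


def _cesu8_encode(text, errors="strict"):
    # Expand the text to UTF-16 code units, then encode every unit
    # (including each surrogate) as a separate UTF-8 sequence.
    units = text.encode("utf-16-le", "surrogatepass")
    out = b"".join(
        chr(int.from_bytes(units[i:i + 2], "little")).encode(
            "utf-8", "surrogatepass")
        for i in range(0, len(units), 2)
    )
    return out, len(text)


def _cesu8_decode(data, errors="strict"):
    data = bytes(data)
    # Decode surrogate sequences to lone surrogates, then merge pairs
    # by round-tripping through UTF-16.
    text = data.decode("utf-8", "surrogatepass")
    text = text.encode("utf-16-le", "surrogatepass").decode("utf-16-le")
    return text, len(data)


def _cesu8_search(name):
    # The lookup machinery normalizes the name before calling us;
    # accept both hyphen and underscore spellings.
    if name.replace("-", "_") in ("cesu8", "cesu_8"):
        return codecs.CodecInfo(
            name="cesu-8", encode=_cesu8_encode, decode=_cesu8_decode)
    return None


codecs.register(_cesu8_search)

encoded = "a\U00010400".encode("cesu-8")
decoded = encoded.decode("cesu-8")
print(encoded, decoded == "a\U00010400")
```

Once the search function is registered, the ordinary `str.encode` and `bytes.decode` calls pick up the codec by name, so calling code does not need to know it is not built in.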
History
Date User Action Args
2022-04-11 14:57:20  admin  set  github: 56951
2011-08-29 12:22:30  ezio.melotti  set  status: open -> closed; resolution: rejected; messages: + msg143139; stage: resolved
2011-08-29 11:50:11  moese  set  messages: + msg143138
2011-08-26 16:47:45  ezio.melotti  set  nosy: + ezio.melotti; messages: + msg143020
2011-08-12 17:32:14  eric.araujo  set  nosy: + lemburg; components: + Library (Lib); versions: + Python 3.3, - Python 3.4
2011-08-12 14:01:38  moese  create