classification
Title: Add Big5-ETen codec: Python big5 codec cannot decode \xf9\xd8 bytes (U+7881 expected)
Type: behavior
Stage:
Components: Unicode
Versions: Python 3.2, Python 3.3, Python 2.7

process
Status: open
Resolution:
Dependencies:
Superseder:
Assigned To: hyeshik.chang
Nosy List: Xuefer.x, batterseapower, hyeshik.chang, inndy, kennyluck, loewis, rpetrov, vstinner
Priority: normal
Keywords:

Created on 2010-02-05 05:07 by Xuefer.x, last changed 2022-04-11 14:56 by admin.

Messages (11)
msg98865 - Author: Xuefer x (Xuefer.x) Date: 2010-02-05 05:07
using iconv:
$ printf "\xf9\xd8" | iconv -f big5 -t utf-8 | xxd
0000000: e8a3 8f                                  ...
$ printf "\xe8\xa3\x8f" | iconv -f utf-8 -t big5 | xxd
0000000: f9d8                                     ..

using python
>>> print "\xf9\xd8".decode("big5")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'big5' codec can't decode bytes in position 0-1: illegal multibyte sequence
>>> print "\xe8\xa3\x8f".decode("utf-8").encode("big5")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'big5' codec can't encode character u'\u88cf' in position 0: illegal multibyte sequence
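For readers on Python 3, where byte strings are written as `b"..."` literals, the same check can be sketched as follows. The helper name `try_decode` is made up for this illustration. The sketch reports the outcome rather than assuming it, since the result depends on the codec tables shipped with the interpreter.

```python
def try_decode(data, encoding):
    """Return the decoded text, or None if the codec rejects the bytes."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError as exc:
        print(f"{encoding}: {exc.reason}")
        return None


# UTF-8 decodes the three bytes from the iconv output to U+88CF.
print(repr(try_decode(b"\xe8\xa3\x8f", "utf-8")))

# The Big5 bytes from the report; None here means the codec rejected them.
print(repr(try_decode(b"\xf9\xd8", "big5")))
```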
msg98866 - Author: Martin v. Löwis (loewis) (Python committer) Date: 2010-02-05 05:16
That iconv supports it is not convincing, IMO. Do you have other sources (like tables on the web somewhere) that support your request?
msg98867 - Author: Martin v. Löwis (loewis) (Python committer) Date: 2010-02-05 05:41
In particular, the Unicode consortium mapping table, now at

http://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/OTHER/BIG5.TXT

doesn't map f9d8 to anything; the current version of that table (in unihan.zip) has these mappings for U+88CF:

U+88CF	kCCCII	232E61
U+88CF	kCNS1986	E-444E
U+88CF	kCNS1992	3-444E
U+88CF	kEACC	215763
U+88CF	kGB1	3279
U+88CF	kHKSCS	F9D8
U+88CF	kJis0	4602
U+88CF	kKPS0	D9E0
U+88CF	kKSC0	5574
U+88CF	kTaiwanTelegraph	5937
U+88CF	kXerox	241:102

As you can see, it isn't supported in big5.
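The lookup above can be reproduced with a few lines of Python. This is only a sketch: it parses the tab-separated lines quoted in this message rather than the real Unihan_OtherMappings.txt file, and the function name `parse_unihan` is invented for the example.

```python
# The Unihan lines quoted above, tab-separated: code point, field, value.
UNIHAN_LINES = """\
U+88CF\tkCCCII\t232E61
U+88CF\tkCNS1986\tE-444E
U+88CF\tkCNS1992\t3-444E
U+88CF\tkEACC\t215763
U+88CF\tkGB1\t3279
U+88CF\tkHKSCS\tF9D8
U+88CF\tkJis0\t4602
U+88CF\tkKPS0\tD9E0
U+88CF\tkKSC0\t5574
U+88CF\tkTaiwanTelegraph\t5937
U+88CF\tkXerox\t241:102
"""


def parse_unihan(text):
    """Group Unihan-style lines into {code_point: {field: value}}."""
    mappings = {}
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        code_point, field, value = line.split("\t", 2)
        mappings.setdefault(code_point, {})[field] = value
    return mappings


fields = parse_unihan(UNIHAN_LINES)["U+88CF"]
print("kBigFive" in fields)   # is there a plain Big5 mapping in the quoted data?
print(fields.get("kHKSCS"))   # the HKSCS mapping, if any
```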
msg98868 - Author: Xuefer x (Xuefer.x) Date: 2010-02-05 06:05
Sure. Your URL pointed me in the right direction, but note that it is marked OBSOLETE;
see: http://www.unicode.org/Public/MAPPINGS/EASTASIA/ReadMe.txt
I found http://unicode.org/charts/unihan.html
then http://www.unicode.org/Public/UNIDATA/
then http://www.unicode.org/Public/UNIDATA/Unihan.zip
Inside the zip, open Unihan_OtherMappings.txt.
The Big5-related fields listed in Unihan_OtherMappings.txt include
#	kBigFive
#	kHKSCS
HKSCS is one of the Big5 encodings,
and searching for F9D8 gives
U+88CF	kHKSCS	F9D8

You may also want to update the other encoding map tables to catch up with Unihan_OtherMappings.txt.

Thanks for your quick reply, by the way.
msg98869 - Author: Martin v. Löwis (loewis) (Python committer) Date: 2010-02-05 06:23
perky, what do you think?
msg98911 - Author: Roumen Petrov (rpetrov) Date: 2010-02-05 21:54
> That iconv supports it is not convincing, ...

GNU libc is not convincing? What are you talking about?
msg218790 - Author: Inndy (inndy) Date: 2014-05-19 12:53
I'm Taiwanese. F9D8 in big5 should be mapped to E8A38F in UTF-8.
msg218801 - Author: Martin v. Löwis (loewis) (Python committer) Date: 2014-05-19 17:21
I'm still looking for an official source of that.

>>> u"\u88cf".encode("big5hkscs")
'\xf9\xd8'

works fine (and always has been working fine), and the character clearly is in big5hkscs. According to 

http://en.wikipedia.org/wiki/Big5

F9D8 is "Reserved for user-defined characters", so this suggests that the character does *not* have a fixed meaning in BIG-5. However, it is part of the Hong Kong Supplementary Character Set.
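The transcript above is Python 2. A Python 3 equivalent, assuming the same big5hkscs tables, might look like this:

```python
# Encode U+88CF with the big5hkscs codec and decode it back again.
encoded = "\u88cf".encode("big5hkscs")
print(encoded)

round_tripped = encoded.decode("big5hkscs")
print(round_tripped == "\u88cf")
```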
msg218804 - Author: Martin v. Löwis (loewis) (Python committer) Date: 2014-05-19 17:50
Inndy, you might also be talking about big5-2003, from

http://www.csie.ntu.edu.tw/~r92030/project/big5/

Python currently does not support big5-2003, but a contribution of such an encoding would surely be welcome.
msg388365 - Author: Max Bolingbroke (batterseapower) Date: 2021-03-09 15:25
As of Python 3.7.9 this also affects \xf9\xd6 which should be \u7881 in Unicode. This character is the second character of 宏碁 which is the name of the Taiwanese electronics manufacturer Acer.

You can work around the issue using big5hkscs just like with the original \xf9\xd8 problem.

It looks like the F9D6–F9FE characters all come from the Big5-ETen extension (https://en.wikipedia.org/wiki/Big5#ETEN_extensions, https://moztw.org/docs/big5/table/eten.txt) which is so popular that it is a de facto standard. Big5-2003 (mentioned in a comment above) seems to be an extension of Big5-ETen. For what it's worth, WHATWG includes these mappings in its own big5 reference tables: https://encoding.spec.whatwg.org/big5.html.

Unfortunately Big5 is still in common use in Taiwan. It's pretty funny that Python fails to decode Big5 documents containing the name of one of Taiwan's largest multinationals :-)
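The big5hkscs workaround mentioned above could be wrapped in a small helper along these lines. This is a sketch, not a recommendation: `decode_big5_lenient` is a made-up name, and big5hkscs is a different extension from Big5-ETen, so the two may disagree on byte sequences other than the ones discussed in this issue.

```python
def decode_big5_lenient(data):
    """Decode as big5 first; if that fails, retry the input as big5hkscs."""
    try:
        return data.decode("big5")
    except UnicodeDecodeError:
        # Assumption: big5hkscs agrees with Big5-ETen for the bytes at hand.
        return data.decode("big5hkscs")


# The first character is plain Big5; the second uses the \xf9\xd6 bytes
# reported in this message.
raw = "宏".encode("big5") + b"\xf9\xd6"
print(decode_big5_lenient(raw))
```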
msg388380 - Author: STINNER Victor (vstinner) (Python committer) Date: 2021-03-09 20:26
> It looks like the F9D6–F9FE characters all come from the Big5-ETen extension

One option would be to add a new big5eten encoding to Python. Someone has to implement the code.
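Until such a codec exists, the idea can be approximated in pure Python with a custom error handler layered on the existing big5 codec. The sketch below is illustrative only and is not how a built-in codec would be implemented: the handler name `eten_fallback_sketch`, the function `decode_big5_eten_sketch`, and the table `ETEN_SUPPLEMENT` are all invented here, and the table holds just the two byte pairs discussed in this issue rather than the full ETEN range.

```python
import codecs

# Hypothetical and deliberately incomplete: only the two byte pairs from
# this issue. A real codec would need the whole ETEN extension table.
ETEN_SUPPLEMENT = {
    b"\xf9\xd6": "\u7881",
    b"\xf9\xd8": "\u88cf",
}


def _eten_fallback(exc):
    """Map known ETEN byte pairs; re-raise every other error unchanged."""
    if isinstance(exc, UnicodeDecodeError):
        pair = bytes(exc.object[exc.start:exc.start + 2])
        char = ETEN_SUPPLEMENT.get(pair)
        if char is not None:
            # Resume decoding after both bytes of the pair.
            return char, exc.start + 2
    raise exc


codecs.register_error("eten_fallback_sketch", _eten_fallback)


def decode_big5_eten_sketch(data):
    """One-shot decode: big5, with the supplement applied on errors."""
    return data.decode("big5", errors="eten_fallback_sketch")


print(decode_big5_eten_sketch("宏".encode("big5") + b"\xf9\xd6"))
```

The handler is only consulted for bytes the big5 codec rejects, so ordinary Big5 text is decoded exactly as before. It is written for one-shot decoding; incremental decoders and the encoding direction are not covered.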
History
Date User Action Args
2022-04-11 14:56:57 admin set github: 52104
2021-03-09 20:27:35 vstinner set title: cannot decode from or encode to big5 \xf9\xd8 -> Add Big5-ETen codec: Python big5 codec cannot decode \xf9\xd8 bytes (U+7881 expected)
2021-03-09 20:26:41 vstinner set messages: + msg388380
2021-03-09 15:25:26 batterseapower set nosy: + batterseapower; messages: + msg388365
2014-05-19 17:50:57 loewis set messages: + msg218804
2014-05-19 17:21:25 loewis set messages: + msg218801
2014-05-19 12:53:14 inndy set nosy: + inndy; messages: + msg218790
2012-02-01 14:10:22 pitrou set nosy: + vstinner; versions: + Python 2.7, Python 3.2, Python 3.3, - Python 2.6
2012-01-31 16:41:14 kennyluck set nosy: + kennyluck
2010-02-05 21:54:39 rpetrov set nosy: + rpetrov; messages: + msg98911
2010-02-05 06:23:02 loewis set assignee: hyeshik.chang; messages: + msg98869; nosy: + hyeshik.chang
2010-02-05 06:05:10 Xuefer.x set messages: + msg98868
2010-02-05 05:41:48 loewis set messages: + msg98867
2010-02-05 05:16:15 loewis set nosy: + loewis; messages: + msg98866
2010-02-05 05:07:48 Xuefer.x create