Author Artoria2e5
Recipients Artoria2e5, ezio.melotti, vstinner
Date 2016-11-14.20:09:08
Message-id <1479154148.6.0.43343383532.issue28693@psf.upfronthosting.co.za>
In-reply-to
Content
Python's cp950 implementation lacks support for HKSCS ('big5hkscs'). This support, which maps HKSCS Big5-EUDC code points to Unicode PUA code points algorithmically, is found in Windows Vista and later, as well as in an update for XP.

An experiment session is shown below. I will use '2>>>' to denote a Win32 build of Python 2.7.10 running in a console window set to cp950 (via chcp), and '3>>>' to denote a Python 3.4.3 build running under Cygwin's UTF-8 mintty. The HKSCS-2008 table at http://www.ogcio.gov.hk/en/business/tech_promotion/ccli/terms/doc/hkscs-2008-big5-iso.txt is used as the list of HKSCS characters; note, though, that its non-PUA mappings are not found in Windows.

Let's start with the first character in that list.

3>>> u'\u43F0'
'䏰'
3>>> print(u'\uF266') # provisional PUA

3>>> u'\u43F0'.encode('cp950') # FAIL
3>>> u'\uF266'.encode('cp950') # FAIL
3>>> u'\u43F0'.encode('hkscs')
b'\x87@'
3>>> u'\uF266'.encode('hkscs') # FAIL
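
For anyone who wants to replay the Python 3 session above as a script, here is a minimal sketch. `try_encode` is a hypothetical helper introduced only for this illustration, and 'big5hkscs' is used as the codec name instead of the 'hkscs' alias.

```python
# Replays the Python 3 session as a script.
# try_encode is a hypothetical helper, not part of any codec API.
def try_encode(text, codec):
    """Return the encoded bytes, or None if the codec cannot encode the text."""
    try:
        return text.encode(codec)
    except UnicodeEncodeError:
        return None


for char in ("\u43f0", "\uf266"):
    for codec in ("cp950", "big5hkscs"):
        result = try_encode(char, codec)
        shown = result if result is not None else "FAIL"
        print("U+%04X" % ord(char), codec, "->", shown)
```

Only ASCII is printed, so the script should behave the same in a cp950 console and in a UTF-8 terminal.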

The experiments above show how Python 3 handles HKSCS characters, and how U+43F0 should normally be encoded. Now let's switch to the Windows console, which uses Windows' own decode-to-Unicode routine for cp950.

2>>> print b'\x87@'


Let's try to identify this character:

3>>> u''
'\uf266'

So indeed there is some sort of HKSCS support going on. But note that what Windows has does not correspond to any recent version of HKSCS:

> Big5       ucs93                  ucs00                   ucs03 + 1-6
> 876B       9734                   9734                    9734
> 876C       F292                   F292                   27BEF
> 876D       5BDB                   5BDB                    5BDB

2>>> print b'\x87\x6b,\x87\x6c,\x87\x6d'
,,
3>>> u',,'
'\uf291,\uf292,\uf293'

As with all other code pages, Microsoft's mapping can always be found at ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WindowsBestFit/bestfit950.txt. If you are uncomfortable with adding a whole new table and wasting space (which is what is done for hkscs, by the way), use the algorithmic mapping described at https://en.wikipedia.org/wiki/Code_page_950.
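
To give a rough idea of what the algorithmic mapping could look like, here is a minimal sketch. The block boundaries and PUA bases are assumptions based on my reading of the Wikipedia page, not on Microsoft's table; only the 0x81-0x8D block is checked against the sessions above, and the 0xC6A1-0xC8FE block is left out. `eudc_to_pua` and `trail_index` are hypothetical names, not existing codec functions.

```python
# Sketch only: block bases are assumptions taken from the Wikipedia
# description of code page 950, not from Microsoft's own table.
CELLS_PER_ROW = 157  # trail bytes 0x40-0x7E (63) plus 0xA1-0xFE (94)

# (first lead byte, last lead byte, PUA code point of the block's first cell)
EUDC_BLOCKS = (
    (0x81, 0x8D, 0xEEB8),
    (0x8E, 0xA0, 0xE311),
    (0xFA, 0xFE, 0xE000),
)


def trail_index(trail):
    """Return the 0-based position of a Big5 trail byte within its row."""
    if 0x40 <= trail <= 0x7E:
        return trail - 0x40
    if 0xA1 <= trail <= 0xFE:
        return trail - 0xA1 + 63
    raise ValueError("not a Big5 trail byte: 0x%02X" % trail)


def eudc_to_pua(pair):
    """Map a two-byte Big5 EUDC sequence to a PUA character (hypothetical helper)."""
    lead, trail = bytearray(pair)
    for first, last, base in EUDC_BLOCKS:
        if first <= lead <= last:
            offset = (lead - first) * CELLS_PER_ROW + trail_index(trail)
            return chr(base + offset)
    raise ValueError("lead byte 0x%02X is outside the EUDC blocks" % lead)


# The four byte sequences tried in the sessions above.
for raw in (b"\x87@", b"\x87\x6b", b"\x87\x6c", b"\x87\x6d"):
    print(raw, "->", "U+%04X" % ord(eudc_to_pua(raw)))
```

Under these assumptions the four sequences come out as U+F266, U+F291, U+F292 and U+F293, matching what the Windows console produced.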