classification
Title: Emoji Unicode failing in standard release of Python 3.8.3 / tkinter 8.6.8
Type: behavior Stage: resolved
Components: macOS, Tkinter Versions: Python 3.8
process
Status: closed Resolution: third party
Dependencies: Superseder:
Assigned To: ned.deily Nosy List: Ben Griffin, Jim.Jewett, epaine, ned.deily, ronaldoussoren, terry.reedy
Priority: normal Keywords:

Created on 2020-07-05 08:41 by Ben Griffin, last changed 2020-07-11 18:15 by Jim.Jewett. This issue is now closed.

Files
File name Uploaded Description
Emoji.py.txt Ben Griffin, 2020-07-05 08:41 File that shows issue
Messages (7)
msg373019 - Author: Ben Griffin (Ben Griffin) Date: 2020-07-05 08:41
https://stackoverflow.com/questions/62713741/tkinter-and-32-bit-unicode-duplicating-any-fix

Emoji are doubling up when using canvas.create_text()
This is reported to work on tcl/tk 8.6.10, but there’s no way to upgrade tcl/tk using the standard installers from the python.org site.
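For anyone comparing reports across machines, the following is a minimal sketch of how to check which Tcl/Tk patch level a given Python install is actually linked against. The helper name `tcl_patchlevel` is made up for illustration; `tkinter.Tcl()` and the Tcl command `info patchlevel` are standard.

```python
def tcl_patchlevel():
    """Return the Tcl patch level tkinter is linked against, or None."""
    try:
        import tkinter
    except ImportError:
        # This Python build has no tkinter available.
        return None
    try:
        # Tcl() creates an interpreter without opening a Tk window,
        # so no display is needed.
        return tkinter.Tcl().eval("info patchlevel")
    except tkinter.TclError:
        return None


if __name__ == "__main__":
    print("Tcl patch level:", tcl_patchlevel())
```

Including this value in a report makes it easier to tell whether two people are really testing the same Tcl/Tk build.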
msg373041 - Author: E. Paine (epaine) Date: 2020-07-05 20:06
This is a Tcl issue, as Tcl is designed for characters up to 16 bits. The fact that Chip is showing at all is very surprising, though any character outside of this 16-bit range should be considered unpredictable.

"The majority of characters used in the human languages of the world have character codes between 0 and 65535, and are known as the Basic Multilingual Plane  (BMP). Currently a default build of Tcl is only capable of handling these characters, but work is underway to change that, and workarounds requiring non-default build-time configuration options exist." [https://wiki.tcl-lang.org/page/Unicode]
msg373051 - Author: Ben Griffin (Ben Griffin) Date: 2020-07-05 22:24
Erm, I don’t rightly know how to parse epaine’s comment, as it seems to relate to a version of Unicode from over a decade ago, and a wiki page that was written 12 years ago.

IIRC Python 3 was (IMO rightly) developed to default to UTF-8, and according to a much more recently edited article (https://en.m.wikipedia.org/wiki/UTF-8), a normative UTF-8 parser can handle any of the million+ Unicode characters, including emoji.

As I pointed out in the bug report, and as mentioned by contributors on SO, TCL seems to have fixed these issues by 8.6.10.

If epaine is correct and TCL CANTFIX/WONTFIX normative utf-8 - then maybe it’s time to drop the strong relationship that Python has with tkinter. However, I’m pretty sure that there is no need for such a drastic measure: the UTF-8 algorithm isn’t that complex.
msg373079 - Author: E. Paine (epaine) Date: 2020-07-06 09:04
Sorry, the point I was trying to make was that, unlike UTF-8, Tcl doesn't support variable length characters and they are instead fixed at 16 bits (by default). So, while Python and UTF-8 are perfectly happy with the emoji, unless Tcl is compiled with a particular build flag it will not process the character correctly (hence why I said it was surprising that Chip showed at all). I have tested on Tcl 8.6.10 and encountered the same problem described.

A further quote (granted, also old, but I cannot find anything to suggest this behaviour has been changed):
"Tcl can (currently) only represent characters within the Basic Multilingual Plane of Unicode, so there's no way that you can even feed an U+10000 into encoding convertto :-(. Fixing that is non-trivial, since some parts of Tcl (the C library) require a representation of strings where all characters take up the same number of bytes. It is possible to compile Tcl with that "number of bytes" set to 4 (meaning 32 bits per character), but it's rather wasteful, and has been reported not entirely compatible with Tk." [https://wiki.tcl-lang.org/page/utf-8]

If I can find the build flag mentioned, I will post it here for future reference.
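To make the 16-bit limit concrete, here is a small sketch (not Tcl's own code) of how a code point above U+FFFF has to be split into two 16-bit units under UTF-16. The function name `utf16_units` is an illustrative choice; the arithmetic is the standard surrogate-pair rule.

```python
def utf16_units(ch):
    """Return the 16-bit code units UTF-16 uses for a single code point."""
    cp = ord(ch)
    if cp <= 0xFFFF:
        return [cp]                  # BMP: fits in one 16-bit unit
    cp -= 0x10000                    # leaves a 20-bit value
    high = 0xD800 + (cp >> 10)       # top 10 bits -> high surrogate
    low = 0xDC00 + (cp & 0x3FF)      # bottom 10 bits -> low surrogate
    return [high, low]


print([hex(u) for u in utf16_units("\u00e9")])      # ['0xe9']
print([hex(u) for u in utf16_units("\U0001f4bb")])  # ['0xd83d', '0xdcbb']
```

Software that assumes every character is exactly one 16-bit unit sees the second case as two separate "characters", which is consistent with the unpredictable behaviour described above.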
msg373089 - Author: Ben Griffin (Ben Griffin) Date: 2020-07-06 10:17
Wow, well if you are right, then TCL/TK is a showstopper for us, and we will have to consider an alternative to tkinter. 

Frankly, I am aghast that any active software would be limited to fixed width characters.

We moved our languages over to multiwidth (utf-8) back in 2003: most of the changes were restricted to a handful of string functions (strcut, strlen, etc.). Compiling TCL to use 4 byte chars isn’t really a solution either.

What confuses me is that there are several people on SO who are saying ‘works for me’.
msg373495 - Author: Terry J. Reedy (terry.reedy) (Python committer) Date: 2020-07-11 01:44
Python internally uses an encoding system that represents all unicode chars efficiently, including O(1) indexing.  It is not utf-8, which does not do O(1) indexing.
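As a quick illustration of that point, the sketch below shows that a Python `str` is indexed by code point, while the UTF-8 encoding of the same text has a different length in bytes. The sample string is arbitrary.

```python
s = "a\U0001f4bbb"             # 'a', one astral character, 'b'

print(len(s))                  # 3 code points
print(s[1] == "\U0001f4bb")    # True: index 1 is the whole astral character
print(s[2])                    # b

encoded = s.encode("utf-8")
print(len(encoded))            # 6 bytes: 1 + 4 + 1
```

Byte offsets into `encoded` do not line up with indexes into `s`, which is why a UTF-8 buffer cannot offer the same direct indexing.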

There is already an issue about upgrading (separately) the Python Windows and macOS installers to install tcl/tk 8.6.9.

With the current 8.6.9 and probably earlier, and since an important tkinter patch last fall for #13153, a tkinter/tk text widget will display astral characters that the font in use can produce.  For example, in 3.9.0, I see the TV set printed in IDLE
>>> '\U0001f4bb'
'💻'
but not in the Windows Console Python REPL, which shows 'box space box box'.

However, astral characters discombobulate editing (#39126); at least on Windows, they are counted as 2 or 4 chars.  The difference between behavior before and after Serhiy's patch and between display and editing likely explains different reports on SO.
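One plausible reading of the "2 or 4 chars" count, offered here as an assumption rather than a description of Tk internals, is that the same character is being measured in UTF-16 code units or in UTF-8 bytes instead of code points. A small sketch with a hypothetical helper `lengths`:

```python
def lengths(s):
    """Measure a string in three different units."""
    return {
        "code_points": len(s),
        "utf16_units": len(s.encode("utf-16-le")) // 2,
        "utf8_bytes": len(s.encode("utf-8")),
    }


print(lengths("\U0001f4bb"))
# {'code_points': 1, 'utf16_units': 2, 'utf8_bytes': 4}
```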
msg373531 - Author: Jim Jewett (Jim.Jewett) (Python triager) Date: 2020-07-11 18:15
@Ben Griffin -- Unicode has defined astral characters for a while, but they were explicitly intended for rare characters, with any living languages intended for the basic plane.  It is only the most recent releases of Unicode that have broken the "most people won't need this" expectation, so it wasn't unreasonable for languages targeting memory-constrained devices to make astral support at best a compile-time option.

I've seen a draft for an upcoming spec update of an old but still-supported language (extended Gerber, for photoplotting machines) that "handles" this simply by clarifying that their Unicode support is limited to characters < 65K.  Given that their use of Unicode is essentially limited to comments, and there is plenty of hardware that can't be updated ... this may well be correct.

Python itself does the right thing, and tcl can't do the right thing anyhow without font support ... so this may be fixed in less time than it would take to replace Tk/Tcl.  If you need a faster workaround, consider a private-use-area and private font.
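A rough sketch of the private-use-area idea follows, on the assumption that a private font supplies glyphs at the chosen code points. The names `build_pua_map` and `to_pua` are hypothetical, and the assignment of characters to code points is arbitrary; only the BMP private-use range U+E000 to U+F8FF is standard.

```python
PUA_START, PUA_END = 0xE000, 0xF8FF   # BMP Private Use Area


def build_pua_map(astral_chars):
    """Assign each distinct astral character a BMP private-use code point."""
    mapping = {}
    for offset, ch in enumerate(sorted(set(astral_chars))):
        if ord(ch) <= 0xFFFF:
            raise ValueError(f"{ch!r} is already in the BMP")
        target = PUA_START + offset
        if target > PUA_END:
            raise ValueError("ran out of private-use code points")
        mapping[ord(ch)] = chr(target)
    return mapping


def to_pua(text, mapping):
    """Replace mapped astral characters before handing text to Tk."""
    return text.translate(mapping)


mapping = build_pua_map("\U0001f4bb\U0001f600")
safe = to_pua("hi \U0001f600", mapping)
print([hex(ord(c)) for c in safe])   # ['0x68', '0x69', '0x20', '0xe001']
```

The text passed to Tk then contains only 16-bit characters; the private font is what makes U+E001 render as the intended emoji, so this only works where the application controls the font.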
History
Date User Action Args
2020-07-11 18:15:59 Jim.Jewett set nosy: + Jim.Jewett
messages: + msg373531
2020-07-11 01:44:57 terry.reedy set status: open -> closed
nosy: + terry.reedy
messages: + msg373495
resolution: third party
stage: resolved
2020-07-06 10:17:16 Ben Griffin set messages: + msg373089
2020-07-06 09:04:08 epaine set messages: + msg373079
2020-07-05 22:24:56 Ben Griffin set messages: + msg373051
2020-07-05 20:06:52 epaine set nosy: + epaine
messages: + msg373041
2020-07-05 10:41:30 ned.deily set assignee: ned.deily
2020-07-05 09:03:05 SilentGhost set nosy: + ronaldoussoren, ned.deily
components: + macOS
2020-07-05 08:41:53 Ben Griffin create