classification
Title: IDLE can't deal with characters above the range (U+0000-U+FFFF)
Type: behavior
Stage: needs patch
Components: IDLE, Tkinter, Unicode
Versions: Python 3.6, Python 3.5
process
Status: open
Resolution:
Dependencies:
Superseder:
Assigned To: serhiy.storchaka
Nosy List: Ma Lin, ezio.melotti, haypo, irdb, serhiy.storchaka, terry.reedy
Priority: high
Keywords: patch

Created on 2014-03-28 12:01 by Ma Lin, last changed 2016-05-28 21:13 by BreamoreBoy.

Files
File name  Uploaded
idle_fix_non_bmp.patch  serhiy.storchaka, 2014-07-12 11:20
nonbmp_except_check.patch  Ma Lin, 2014-07-25 00:59
nonbmp_except_check_v2.patch  Ma Lin, 2014-07-25 03:10
Messages (12)
msg215038 - (view) Author: Ma Lin (Ma Lin) * Date: 2014-03-28 12:01
When opening a file that contains characters above the range (U+0000-U+FFFF), IDLE quits without any report. For example, open the file \Lib\test\test_re.py.

The traceback is below; its last line gives the reason. I just hope IDLE can say something before quitting, so that we know what happened.

I have checked Python 3.3.5 and 3.4.0, and both have the same problem. I didn't find a 3.5 build, so I couldn't test this under 3.5.

=============================================
Exception in Tkinter callback
Traceback (most recent call last):
  File "C:\Python33\lib\tkinter\__init__.py", line 1489, in __call__
    return self.func(*args)
  File "C:\Python33\lib\idlelib\IOBinding.py", line 186, in open
    flist.open(filename)
  File "C:\Python33\lib\idlelib\FileList.py", line 36, in open
    edit = self.EditorWindow(self, filename, key)
  File "C:\Python33\lib\idlelib\PyShell.py", line 126, in __init__
    EditorWindow.__init__(self, *args)
  File "C:\Python33\lib\idlelib\EditorWindow.py", line 288, in __init__
    if io.loadfile(filename):
  File "C:\Python33\lib\idlelib\IOBinding.py", line 236, in loadfile
    self.text.insert("1.0", chars)
  File "C:\Python33\lib\idlelib\Percolator.py", line 25, in insert
    self.top.insert(index, chars, tags)
  File "C:\Python33\lib\idlelib\UndoDelegator.py", line 81, in insert
    self.addcmd(InsertCommand(index, chars, tags))
  File "C:\Python33\lib\idlelib\UndoDelegator.py", line 116, in addcmd
    cmd.do(self.delegate)
  File "C:\Python33\lib\idlelib\UndoDelegator.py", line 219, in do
    text.insert(self.index1, self.chars, self.tags)
  File "C:\Python33\lib\idlelib\ColorDelegator.py", line 85, in insert
    self.delegate.insert(index, chars, tags)
  File "C:\Python33\lib\idlelib\WidgetRedirector.py", line 104, in __call__
    return self.tk_call(self.orig_and_operation + args)
_tkinter.TclError: character U+1d518 is above the range (U+0000-U+FFFF) allowed by Tcl
msg215039 - (view) Author: Ma Lin (Ma Lin) * Date: 2014-03-28 12:02
When opening a file that contains characters above the range (U+0000-U+FFFF), IDLE quits without any report. For example, open the file C:\Python33\lib\test\test_re.py.

The traceback is below; its last line gives the reason. I just hope IDLE can say something before quitting, so that we know what happened.

I have checked Python 3.3.5 and 3.4.0, and both have the same problem. I didn't find a 3.5 build, so I couldn't test this under 3.5.

=============================================
Exception in Tkinter callback
Traceback (most recent call last):
  File "C:\Python33\lib\tkinter\__init__.py", line 1489, in __call__
    return self.func(*args)
  File "C:\Python33\lib\idlelib\IOBinding.py", line 186, in open
    flist.open(filename)
  File "C:\Python33\lib\idlelib\FileList.py", line 36, in open
    edit = self.EditorWindow(self, filename, key)
  File "C:\Python33\lib\idlelib\PyShell.py", line 126, in __init__
    EditorWindow.__init__(self, *args)
  File "C:\Python33\lib\idlelib\EditorWindow.py", line 288, in __init__
    if io.loadfile(filename):
  File "C:\Python33\lib\idlelib\IOBinding.py", line 236, in loadfile
    self.text.insert("1.0", chars)
  File "C:\Python33\lib\idlelib\Percolator.py", line 25, in insert
    self.top.insert(index, chars, tags)
  File "C:\Python33\lib\idlelib\UndoDelegator.py", line 81, in insert
    self.addcmd(InsertCommand(index, chars, tags))
  File "C:\Python33\lib\idlelib\UndoDelegator.py", line 116, in addcmd
    cmd.do(self.delegate)
  File "C:\Python33\lib\idlelib\UndoDelegator.py", line 219, in do
    text.insert(self.index1, self.chars, self.tags)
  File "C:\Python33\lib\idlelib\ColorDelegator.py", line 85, in insert
    self.delegate.insert(index, chars, tags)
  File "C:\Python33\lib\idlelib\WidgetRedirector.py", line 104, in __call__
    return self.tk_call(self.orig_and_operation + args)
_tkinter.TclError: character U+1d518 is above the range (U+0000-U+FFFF) allowed by Tcl
msg215040 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2014-03-28 12:31
See #13153.
msg222817 - (view) Author: Mark Lawrence (BreamoreBoy) * Date: 2014-07-12 01:03
Accidentally set to pending I take it.
msg222834 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2014-07-12 11:20
Yes, this is very similar to issue13153. The two issues may share a solution or may need different ones. This issue concerns a more realistic situation and is therefore more important.

Here is a simple and almost working solution for this issue. Unfortunately, it works incorrectly when astral characters are encountered in raw string literals. A more mature solution would parse the sources and convert raw string literals containing astral characters to non-raw string literals. But this will not work with invalid Python files or non-Python files.

I'm afraid this issue has no perfect solution. The question is which imperfect solution and compromise we will consider acceptable.
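To make the raw string problem concrete, here is a minimal sketch. It assumes the simple approach rewrites each astral character in the source text as a backslash-U escape; the helper name escape_astral is hypothetical and is not taken from the attached patch.

```python
import ast


def escape_astral(source):
    """Replace each non-BMP character with an 8-digit backslash-U escape.

    Hypothetical helper for illustration only; not the code in the patch.
    """
    return ''.join(
        '\\U%08x' % ord(ch) if ord(ch) > 0xFFFF else ch
        for ch in source
    )


# Source text of a normal string literal and of a raw string literal,
# each containing the single astral character U+1D518.
normal_src = "'\U0001d518'"
raw_src = "r'\U0001d518'"

# Evaluate the literals after the rewrite.
normal_value = ast.literal_eval(escape_astral(normal_src))
raw_value = ast.literal_eval(escape_astral(raw_src))

# In the normal literal the escape is interpreted, so the value is unchanged.
print(len(normal_value))
# In the raw literal the escape is kept as ten literal characters,
# so the value of the program's string has changed.
print(len(raw_value))
```

The rewritten raw literal no longer evaluates to the same string as the original, which is why a purely textual replacement is not safe for such files.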
msg223007 - (view) Author: Ma Lin (Ma Lin) * Date: 2014-07-14 09:15
I suggest not changing the content of the file, and just showing a message such as:

IDLE can't display non-BMP characters (codepoint above 0xFFFF).
A non-BMP character was found at line 23, position 8 of aaaa.py. Please open this file with another editor.
msg223843 - (view) Author: Ma Lin (Ma Lin) * Date: 2014-07-24 14:47
I wrote this code, but I don't know how to make a patch.

Insert this code in C:\Python34\Lib\idlelib\IOBinding.py
Around line 234, before this line:
self.text.delete("1.0", "end")


        # check non-bmp characters
        line_count = 1
        position_count = 1
        for char in chars:
            if char == '\n':
                line_count += 1
                position_count = 1
            if ord(char) > 0xFFFF:
                nonbmp_msg = ("IDLE can't display non-BMP characters "
                              "(codepoint above 0xFFFF).\n"
                              "A non-BMP character found at line %d, "
                              "position %d of file %s, codepoint 0x%X.\n"
                              "Please open this file with another editor.")
                tkMessageBox.showerror("non-BMP character",
                                        nonbmp_msg %
                                       (line_count, position_count,
                                        filename, ord(char)),
                                       parent=self.text)
                return False
            position_count += 1
msg223846 - (view) Author: Ma Lin (Ma Lin) * Date: 2014-07-24 14:56
Changing the second "if" to "elif" is better.

I'm sorry, I have never submitted a patch.
If somebody can give a hand, feel free to modify this code.
msg223848 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2014-07-24 15:26
See https://docs.python.org/devguide/patch.html
msg223912 - (view) Author: Ma Lin (Ma Lin) * Date: 2014-07-25 00:59
Feel free to modify this patch.
msg223915 - (view) Author: Ma Lin (Ma Lin) * Date: 2014-07-25 03:10
nonbmp_except_check_v2.patch changes character numbers to 0-based, same as IDLE.

Quote from www.tkdocs.com:
"for historical conventions related to how programmers normally refer to lines and characters, line numbers are 1-based, and character numbers are 0-based."
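The check discussed in the messages above, with the "elif" correction and 0-based character positions, could be sketched as a standalone function. This is an illustration under those assumptions, not the content of the attached patches; the function name find_first_non_bmp is hypothetical.

```python
def find_first_non_bmp(chars):
    """Return (line, column, codepoint) of the first non-BMP character.

    Lines are 1-based and columns are 0-based, following the Tk text
    index convention quoted above. Returns None if every character
    is within the BMP.
    """
    line = 1
    column = 0
    for char in chars:
        if char == '\n':
            # A newline starts the next line at column 0.
            line += 1
            column = 0
        elif ord(char) > 0xFFFF:
            return line, column, ord(char)
        else:
            column += 1
    return None


result = find_first_non_bmp('ab\ncd\U0001d518ef')
if result is not None:
    print("line %d, position %d, codepoint 0x%X" % result)
```

A caller such as a file loader could show an error dialog built from the returned tuple and refuse to load the file, as the proposed patch does.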
msg266443 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2016-05-26 16:07
Tk Text (and other widgets, but Text is the main issue) has two display problems: astral chars and long lines (over a thousand chars, say).  These problems can manifest in various places: file names, shell input (keyboard or clipboard), shell output, editor input (keyboard, clipboard, or file).  IDLE needs to take more control over what is displayed to work around both problems.

Tk Text also has a display feature: substring tagging.  I have been hesitant to simply replace astral chars with their \U000hhhhh expansion because of the aliasing problem: in shell output, for instance, the user would not know whether the program wrote 1 char or 10.  It would also be impossible to know whether a reverse transformation might be needed.  Tagging astral expansions would solve both problems.

import re

astral = re.compile(r'([^\x00-\uffff])')
s = 'X\U00011111Y\U00011112\U00011113Z'
for i, ss in enumerate(re.split(astral, s)):
    if not i%2:
        print(ss, end='')
    else:
        print(r'\\U%08x' % ord(ss), end='')
# prints
X\\U00011111Y\\U00011112\\U00011113Z

Now replace print with text.insert, adding an 'astral' tag for the second case.  tk will not double the '\'s.  The astral tag could switch, for instance, to an underlined version of the current font.  This should work with any color scheme.
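As a rough sketch of the tagging idea, the split can be turned into (chunk, tags) pairs that are ready to pass to a Text widget. The function name astral_segments and the tag name 'astral' are assumptions for illustration; no widget is created here, so the snippet runs without a display.

```python
import re

# Same pattern as in the message above: one non-BMP character per match.
ASTRAL = re.compile(r'([^\x00-\uffff])')


def astral_segments(s):
    """Yield (chunk, tags) pairs for a string.

    BMP text is yielded unchanged with no tags. Each astral character is
    yielded as its 8-digit escape text carrying the 'astral' tag.
    """
    for i, chunk in enumerate(ASTRAL.split(s)):
        if i % 2:
            # Odd items are the captured astral characters.
            yield '\\U%08x' % ord(chunk), ('astral',)
        elif chunk:
            # Even items are ordinary text; skip empty pieces.
            yield chunk, ()


segments = list(astral_segments('X\U00011111Y\U00011112\U00011113Z'))
for chunk, tags in segments:
    print(repr(chunk), tags)
```

With a tkinter Text widget named text (an assumption), each pair could then be inserted with text.insert('end', chunk, tags), after something like text.tag_configure('astral', underline=True) to make the expansions visually distinct.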

[Separate but related issue: augment the Format or context menu with functions to convert between the literal char, escape string, and name representations (using unicodedata).]
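For the conversions mentioned in the bracketed note, a minimal sketch with the standard unicodedata module might look like this. The function name representations and the dictionary layout are assumptions, not an existing IDLE feature.

```python
import unicodedata


def representations(ch):
    """Return the three spellings of a single character as text.

    unicodedata.name raises ValueError for characters without a name,
    so this sketch only handles named characters.
    """
    name = unicodedata.name(ch)
    return {
        'literal': ch,
        'escape': '\\U%08x' % ord(ch),
        'name': '\\N{%s}' % name,
    }


reps = representations('\U0001d518')
print(reps['escape'])
print(reps['name'])

# Going back from the name to the character.
recovered = unicodedata.lookup(unicodedata.name('\U0001d518'))
```

The escape and name forms contain only ASCII, so they can be shown in a Tk widget that cannot display the literal character.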
History
Date User Action Args
2016-05-28 21:13:39  BreamoreBoy  set  nosy: - BreamoreBoy
2016-05-26 16:07:45  terry.reedy  set  versions: + Python 3.6, - Python 2.7, Python 3.4
    nosy: + terry.reedy
    messages: + msg266443
    resolution: duplicate ->
2015-12-06 12:59:11  irdb  set  nosy: + irdb
2014-07-25 03:10:31  Ma Lin  set  files: + nonbmp_except_check_v2.patch
    messages: + msg223915
2014-07-25 00:59:10  Ma Lin  set  files: + nonbmp_except_check.patch
    messages: + msg223912
2014-07-24 15:26:46  ezio.melotti  set  messages: + msg223848
2014-07-24 14:56:43  Ma Lin  set  messages: + msg223846
2014-07-24 14:47:04  Ma Lin  set  messages: + msg223843
2014-07-15 18:56:17  serhiy.storchaka  set  stage: needs patch
2014-07-14 10:15:09  rhettinger  set  priority: normal -> high
2014-07-14 09:15:13  Ma Lin  set  messages: + msg223007
2014-07-12 11:20:02  serhiy.storchaka  set  files: + idle_fix_non_bmp.patch
    assignee: serhiy.storchaka
    components: + Tkinter, Unicode
    versions: + Python 2.7, Python 3.5, - Python 3.3
    keywords: + patch
    nosy: + haypo
    messages: + msg222834
2014-07-12 01:03:50  BreamoreBoy  set  status: pending -> open
    nosy: + BreamoreBoy
    messages: + msg222817
2014-03-28 12:31:15  ezio.melotti  set  status: open -> pending
    nosy: + ezio.melotti, serhiy.storchaka
    messages: + msg215040
    type: crash -> behavior
    resolution: duplicate
2014-03-28 12:02:51  Ma Lin  set  messages: + msg215039
2014-03-28 12:01:05  Ma Lin  create