IDLE can't deal with characters above the range (U+0000-U+FFFF) #65283

Closed
animalize mannequin opened this issue Mar 28, 2014 · 17 comments
Labels: topic-IDLE, topic-tkinter, topic-unicode, type-bug (An unexpected behavior, bug, or error)

@animalize
Mannequin

animalize mannequin commented Mar 28, 2014

BPO 21084
Nosy @terryjreedy, @vstinner, @ezio-melotti, @serhiy-storchaka, @animalize, @Codeberg-AsGithubAlternative-buhtz
Files
  • idle_fix_non_bmp.patch
  • nonbmp_except_check.patch
  • nonbmp_except_check_v2.patch
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = 'https://github.com/serhiy-storchaka'
    closed_at = <Date 2019-10-04.12:03:40.486>
    created_at = <Date 2014-03-28.12:01:05.782>
    labels = ['expert-IDLE', 'type-bug', 'expert-tkinter', 'expert-unicode']
    title = "IDLE can't deal with characters above the range (U+0000-U+FFFF)"
    updated_at = <Date 2019-10-04.18:50:19.345>
    user = 'https://github.com/animalize'

    bugs.python.org fields:

    activity = <Date 2019-10-04.18:50:19.345>
    actor = 'terry.reedy'
    assignee = 'serhiy.storchaka'
    closed = True
    closed_date = <Date 2019-10-04.12:03:40.486>
    closer = 'serhiy.storchaka'
    components = ['IDLE', 'Tkinter', 'Unicode']
    creation = <Date 2014-03-28.12:01:05.782>
    creator = 'malin'
    dependencies = []
    files = ['35929', '36080', '36082']
    hgrepos = []
    issue_num = 21084
    keywords = ['patch']
    message_count = 14.0
    messages = ['215038', '215039', '215040', '222817', '222834', '223007', '223843', '223846', '223848', '223912', '223915', '266443', '353933', '353969']
    nosy_count = 7.0
    nosy_names = ['terry.reedy', 'vstinner', 'ezio.melotti', 'THRlWiTi', 'serhiy.storchaka', 'malin', 'buhtz']
    pr_nums = []
    priority = 'high'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue21084'
    versions = ['Python 3.5', 'Python 3.6']

    @animalize
    Mannequin Author

    animalize mannequin commented Mar 28, 2014

    When opening a file with characters above the range (U+0000-U+FFFF), IDLE quits without any report. For example, open the file \Lib\test\test_re.py

    Below is the traceback; the last line gives the reason. I just hope IDLE can say something before quitting, so we know what happened.

    I have checked Python 3.3.5 and 3.4.0, and they have the same problem. I couldn't find a 3.5 build, so I can't test this problem under 3.5.

    =============================================

    Exception in Tkinter callback
    Traceback (most recent call last):
      File "C:\Python33\lib\tkinter\__init__.py", line 1489, in __call__
        return self.func(*args)
      File "C:\Python33\lib\idlelib\IOBinding.py", line 186, in open
        flist.open(filename)
      File "C:\Python33\lib\idlelib\FileList.py", line 36, in open
        edit = self.EditorWindow(self, filename, key)
      File "C:\Python33\lib\idlelib\PyShell.py", line 126, in __init__
        EditorWindow.__init__(self, *args)
      File "C:\Python33\lib\idlelib\EditorWindow.py", line 288, in __init__
        if io.loadfile(filename):
      File "C:\Python33\lib\idlelib\IOBinding.py", line 236, in loadfile
        self.text.insert("1.0", chars)
      File "C:\Python33\lib\idlelib\Percolator.py", line 25, in insert
        self.top.insert(index, chars, tags)
      File "C:\Python33\lib\idlelib\UndoDelegator.py", line 81, in insert
        self.addcmd(InsertCommand(index, chars, tags))
      File "C:\Python33\lib\idlelib\UndoDelegator.py", line 116, in addcmd
        cmd.do(self.delegate)
      File "C:\Python33\lib\idlelib\UndoDelegator.py", line 219, in do
        text.insert(self.index1, self.chars, self.tags)
      File "C:\Python33\lib\idlelib\ColorDelegator.py", line 85, in insert
        self.delegate.insert(index, chars, tags)
      File "C:\Python33\lib\idlelib\WidgetRedirector.py", line 104, in __call__
        return self.tk_call(self.orig_and_operation + args)
    _tkinter.TclError: character U+1d518 is above the range (U+0000-U+FFFF) allowed by Tcl
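
    For readers unfamiliar with the range named in the error, the following is a small illustrative sketch (not part of the original report). It only shows that the character from the traceback lies above U+FFFF and needs two 16-bit code units in UTF-16; the variable names are made up for the example.

```python
# The character named in the last line of the traceback.
ch = "\U0001d518"

codepoint = ord(ch)
print(hex(codepoint))        # 0x1d518
print(codepoint > 0xFFFF)    # True: outside the Basic Multilingual Plane

# In UTF-16 such a character is stored as a surrogate pair.
utf16 = ch.encode("utf-16-be")
units = len(utf16) // 2
print(units)                 # 2
print(utf16.hex())           # d835dd18
```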

    @animalize animalize mannequin added type-crash A hard crash of the interpreter, possibly with a core dump topic-IDLE labels Mar 28, 2014

    @ezio-melotti
    Member

    See bpo-13153.

    @ezio-melotti ezio-melotti added type-bug An unexpected behavior, bug, or error and removed type-crash A hard crash of the interpreter, possibly with a core dump labels Mar 28, 2014
    @BreamoreBoy
    Mannequin

    BreamoreBoy mannequin commented Jul 12, 2014

    Accidentally set to pending, I take it.

    @serhiy-storchaka
    Member

    Yes, this is very similar to bpo-13153. The two issues may share a solution or may need different ones. This issue concerns a more realistic situation and is therefore more important.

    Here is a simple and almost working solution for this issue. Unfortunately, it works incorrectly when astral characters occur in raw string literals. A more mature solution would parse the source and convert raw string literals containing astral characters to non-raw string literals, but that would not work with invalid Python files or non-Python files.

    I'm afraid this issue has no perfect solution. The question is which imperfect solution and compromise we will judge acceptable.
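
    To illustrate the raw-string caveat (an editorial sketch, not the attached patch): if an astral character in the source text is rewritten as a \U escape, a normal literal still evaluates to the same one-character string, but a raw literal evaluates to the ten characters of the escape itself. The variable names below are hypothetical.

```python
import ast

# Text of the escape sequence: backslash, "U", eight hex digits.
escaped = "\\U0001d518"

# Source text of a normal and a raw string literal containing that escape.
normal_literal = "'" + escaped + "'"
raw_literal = "r'" + escaped + "'"

normal_value = ast.literal_eval(normal_literal)
raw_value = ast.literal_eval(raw_literal)

print(len(normal_value))  # 1: the escape is decoded to the astral character
print(len(raw_value))     # 10: the escape is kept as literal text
```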

    @animalize
    Mannequin Author

    animalize mannequin commented Jul 14, 2014

    I suggest not changing the content of the file, and just showing a message such as:

    IDLE can't display non-BMP character (codepoint above 0xFFFF).
    A non-BMP character found in Line 23, position 8 of aaaa.py, please open this file with other editor.

    @animalize
    Mannequin Author

    animalize mannequin commented Jul 24, 2014

    I wrote this code, but I don't know how to make a patch.

    Insert this code in C:\Python34\Lib\idlelib\IOBinding.py
    around line 234, before this line:
    self.text.delete("1.0", "end")

            # check non-bmp characters
            line_count = 1
            position_count = 1
            for char in chars:
                if char == '\n':
                    line_count += 1
                    position_count = 1
                if ord(char) > 0xFFFF:
                    nonbmp_msg = ("IDLE can't display non-BMP characters "
                                  "(codepoint above 0xFFFF).\n"
                                  "A non-BMP character found at line %d, "
                                  "position %d of file %s, codepoint 0x%X.\n"
                                  "Please open this file with another editor.")
                    tkMessageBox.showerror("non-BMP character",
                                            nonbmp_msg %
                                           (line_count, position_count,
                                            filename, ord(char)),
                                           parent=self.text)
                    return False
                position_count += 1

    @animalize
    Mannequin Author

    animalize mannequin commented Jul 24, 2014

    Changing the second "if" to "elif" is better.

    I'm sorry, I have never submitted a patch.
    If somebody can lend a hand, feel free to modify the code.
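
    As an editorial sketch of the check discussed above, here is a standalone version that can be run without IDLE or Tk. The function name `first_non_bmp` is hypothetical, and it assumes lines are counted from 1 and columns from 0, as Tk text indices do. A newline resets the column and is not itself counted.

```python
def first_non_bmp(chars):
    """Return (line, column, codepoint) of the first non-BMP character.

    Lines are 1-based and columns 0-based. Returns None when every
    character is within U+0000-U+FFFF.
    """
    line, column = 1, 0
    for char in chars:
        if char == "\n":
            # Start of a new line: reset the column counter.
            line += 1
            column = 0
        else:
            if ord(char) > 0xFFFF:
                return line, column, ord(char)
            column += 1
    return None


sample = "ab\ncd\U0001d518"
found = first_non_bmp(sample)
print(found)
```

    A caller could format the result into a message like the one proposed earlier in the thread, or skip the message entirely when the function returns None.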

    @ezio-melotti
    Member

    @animalize
    Mannequin Author

    animalize mannequin commented Jul 25, 2014

    Feel free to modify this patch.

    @animalize
    Mannequin Author

    animalize mannequin commented Jul 25, 2014

    nonbmp_except_check_v2.patch changes character numbers to 0-based, same as IDLE.

    Quote from www.tkdocs.com :
    "for historical conventions related to how programmers normally refer to lines and characters, line numbers are 1-based, and character numbers are 0-based."

    @terryjreedy
    Member

    Tk Text (and other widgets, but Text is the main issue) has two display problems: astral chars and long lines (over a thousand chars, say). These problems can manifest in various places: file names, shell input (keyboard or clipboard), shell output, editor input (keyboard, clipboard, or file). IDLE needs to take more control over what is displayed to work around both problems.

    Tk Text also has a display feature: substring tagging. I have been hesitant to simply replace astral chars with their \U000hhhhh expansion because of the aliasing problem: in shell output, for instance, the user would not know if the program wrote 1 char or 10. It would also be impossible to know if a reverse transformation might be needed. Tagging astral expansions would solve both problems.

    import re
    
    astral = re.compile(r'([^\x00-\uffff])')
    s = 'X\U00011111Y\U00011112\U00011113Z'
    for i, ss in enumerate(re.split(astral, s)):
        if not i%2:
            print(ss, end='')
        else:
            print(r'\\U%08x' % ord(ss), end='')
    # prints
    X\\U00011111Y\\U00011112\\U00011113Z

    Now replace print with text.insert, with an 'astral' tag for the second case. tk will not double '\'s. The astral tag could switch, for instance, to an underlined version of the current font. This should work with any color scheme.
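
    A minimal sketch of how the tagging idea might be prepared without a display (editorial illustration, not IDLE's implementation): the hypothetical helper `astral_segments` splits a string with the same regular expression and pairs each chunk with the tags it should be inserted with. The tag name "astral" follows the comment above; everything else is an assumption.

```python
import re

ASTRAL = re.compile(r"([^\x00-\uffff])")


def astral_segments(s, tag="astral"):
    """Split s into (chunk, tags) pairs.

    Plain chunks get an empty tag tuple. Each astral character is
    replaced by its \\U escape text and paired with the given tag.
    """
    segments = []
    for i, chunk in enumerate(ASTRAL.split(s)):
        if i % 2 == 0:
            # Even positions hold ordinary text (possibly empty).
            if chunk:
                segments.append((chunk, ()))
        else:
            # Odd positions hold exactly one astral character.
            segments.append(("\\U%08x" % ord(chunk), (tag,)))
    return segments


s = "X\U00011111Y\U00011112\U00011113Z"
for chunk, tags in astral_segments(s):
    print(repr(chunk), tags)
```

    With a tkinter Text widget named `text` (assumed here), each pair could then be passed to `text.insert("end", chunk, tags)`, and the tag styled once with something like `text.tag_configure("astral", underline=True)`.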

    [Separate but related issue: augment Format or context menu with functions to convert between literal char, escape string, and name representation (using unicodedata).]
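
    The three representations mentioned in the bracketed note can be converted with the standard `unicodedata` module. The following is only a sketch of the conversions themselves; the helper names are hypothetical and nothing here reflects an existing IDLE menu function.

```python
import unicodedata


def to_escape(ch):
    """Return the \\u or \\U escape text for a single character."""
    if ord(ch) > 0xFFFF:
        return "\\U%08x" % ord(ch)
    return "\\u%04x" % ord(ch)


def to_name_escape(ch):
    """Return the \\N{...} escape text for a single named character."""
    return "\\N{%s}" % unicodedata.name(ch)


def from_name(name):
    """Return the character with the given Unicode name."""
    return unicodedata.lookup(name)


ch = "\U0001d518"
print(to_escape(ch))
print(to_name_escape(ch))
print(from_name(unicodedata.name(ch)) == ch)
```

    Note that `unicodedata.name` raises ValueError for characters without a name, so a real menu command would need to handle that case.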

    @serhiy-storchaka
    Member

    Fixed by PR 16545 (see bpo-13153).

    @terryjreedy
    Member

    As noted on bpo-13153, files with astral chars can now be read without an exception, but the presence of astral chars messes up editing text that follows at least on the same line by misplacing the cursor. I will open a new issue about replacing such with \U escapes.

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    @pippim

    pippim commented Jun 17, 2023

    Tk Text (and other widgets, but Text is the main issue) has two display problems: astral chars and long lines (over a thousand chars, say). These problems can manifest in various places: file names, shell input (keyboard or clipboard), shell output, editor input (keyboard, clipboard, or file). IDLE needs to take more control over what is displayed to work around both problems.

    Tk Text also has a display feature: substring tagging. I have been hesitant to simply replace astral chars with their \U000hhhhh expansion because of the aliasing problem: in shell output, for instance, the user would not know if the program wrote 1 char or 10. It would also be impossible to know if a reverse transformation might be needed. Tagging astral expansions would solve both problems.

    import re
    
    astral = re.compile(r'([^\x00-\uffff])')
    s = 'X\U00011111Y\U00011112\U00011113Z'
    for i, ss in enumerate(re.split(astral, s)):
        if not i%2:
            print(ss, end='')
        else:
            print(r'\\U%08x' % ord(ss), end='')
    # prints
    X\\U00011111Y\\U00011112\\U00011113Z

    Now replace print with text.insert, with an 'astral' tag for the second case. tk will not double '\'s. The astral tag could switch, for instance, to an underlined version of the current font. This should work with any color scheme.

    [Separate but related issue: augment Format or context menu with functions to convert between literal char, escape string, and name representation (using unicodedatabase).]

    I've been using a version of this patch for years. Today a tk.Entry field could not be inserted into a ttk.Treeview line and the trusted old patch inserted "?" instead of "y". After more testing the characters "vwxyz" were all illegal. So I changed the patch. See the code below for the patch plus results when patch isn't used:

    def normalize_tcl(s):
        """
    
            Used by bserve.py and maybe mserve.py in future.
    
            Fixes error:
    
              File "/usr/lib/python2.7/lib-tk/ttk.py", line 1339, in insert
                res = self.tk.call(self._w, "insert", parent, index, *opts)
            _tkinter.TclError: character U+1f3d2 is above the 
                range (U+0000-U+FFFF) allowed by Tcl
    
            From: https://bugs.python.org/issue21084
    
        """
        astral = re.compile(r'([^\x00-\uffff])')
        new_s = ""
        for i, ss in enumerate(re.split(astral, s)):
            if not i % 2:
                new_s += ss
            # Patch June 17, 2023 for test results published below
            elif ss == "v":
                new_s += u"v"
            elif ss == "w":
                new_s += u"w"
            elif ss == "x":
                new_s += u"x"
            elif ss == "y":
                new_s += u"y"
            elif ss == "z":
                new_s += u"z"
            # end of June 17, 2023 patch
            else:
                new_s += '?'
    
        return new_s
    
    
    '''
    TclError: character U+1f3b5 is above the range (U+0000-U+FFFF) allowed by Tcl
    Results prior to patch made June 17, 2023. Note sometimes you can avoid using
    normalize_tcl() function by using .encode('utf-8'). See "Rainy Days" playlist
    name handling in 'mserve.py build_lib_top_playlist_name()' function.
    '''
    test = "abcdefghijklmnopqrstuvwxyz"
    result = normalize_tcl(test)
    print("test  :", test)      # test  : abcdefghijklmnopqrstuvwxyz
    print("result:", result)    # result: abcdefghijklmnopqrstu?????

    On a side note, sometimes normalize_tcl() isn't needed. All you need is tk_entry_var.encode('utf-8') and the letters v, w, x, y and z are automatically normalized for tcl. Kind of mind-boggling how tcl can create illegal characters using tk.Entry(...). I believe tk_entry_var.set() and tk_entry_var.get() are used correctly.

    @vstinner
    Member

    Please don't comment on closed issues, but open a new issue.

    @terryjreedy
    Member

    @pippim To refer to a previous comment, click the ... on its title bar and then on 'Copy link'. Your claim in #65283 (comment) that some ascii letters are illegal is in general wrong. If you are having a real problem, ask for help here. Do NOT open an issue here on this tracker.
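
    One possible explanation for the reported "?????" result, offered as an editorial guess that nobody in this thread confirmed: the traceback in the reported function points at Python 2.7, and a regex engine that does not understand the `\u` escape may read `\u` as the plain letter "u". The character class would then cover only `\x00` through "u", so every character after "u" would be treated as out of range. The sketch below reproduces both behaviours under Python 3 by writing the degenerate class explicitly.

```python
import re

letters = "abcdefghijklmnopqrstuvwxyz"

# Python 3's re module understands \uffff, so no ASCII letter is
# treated as an astral character.
py3_result = re.sub(r"[^\x00-\uffff]", "?", letters)

# Assumption: an engine without \u support reads "\u" as the letter "u",
# turning the intended range into \x00-u. Written out explicitly here.
degenerate_result = re.sub(r"[^\x00-u]", "?", letters)

print(py3_result)
print(degenerate_result)
```

    If this guess is right, the ASCII letters themselves are fine, and the fix would be to use a unicode pattern (or Python 3) rather than special-casing v through z.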
