Implement utf-8-bmp codec #58512

Closed
asvetlov opened this issue Mar 14, 2012 · 30 comments
Assignees
Labels
3.7 (EOL) end of life topic-IDLE topic-tkinter type-bug An unexpected behavior, bug, or error

Comments

@asvetlov
Contributor

BPO 14304
Nosy @loewis, @terryjreedy, @abalkin, @pitrou, @ezio-melotti, @serwy, @asvetlov, @serhiy-storchaka
Files
  • idle_escape_nonbmp.patch
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = 'https://github.com/asvetlov'
    closed_at = <Date 2020-06-07.22:58:51.599>
    created_at = <Date 2012-03-14.21:01:10.188>
    labels = ['3.7', 'expert-IDLE', 'type-bug', 'expert-tkinter']
    title = 'Implement utf-8-bmp codec'
    updated_at = <Date 2020-06-07.23:13:21.632>
    user = 'https://github.com/asvetlov'

    bugs.python.org fields:

    activity = <Date 2020-06-07.23:13:21.632>
    actor = 'vstinner'
    assignee = 'asvetlov'
    closed = True
    closed_date = <Date 2020-06-07.22:58:51.599>
    closer = 'terry.reedy'
    components = ['IDLE', 'Tkinter']
    creation = <Date 2012-03-14.21:01:10.188>
    creator = 'asvetlov'
    dependencies = []
    files = ['25244']
    hgrepos = []
    issue_num = 14304
    keywords = ['patch']
    message_count = 30.0
    messages = ['155793', '157235', '157248', '157263', '158372', '158424', '158426', '158460', '158467', '158470', '158486', '158487', '159497', '159530', '159531', '159538', '159541', '159543', '159544', '159545', '159546', '159547', '159582', '163745', '228168', '228175', '228182', '228183', '296687', '370920']
    nosy_count = 9.0
    nosy_names = ['loewis', 'terry.reedy', 'belopolsky', 'pitrou', 'ezio.melotti', 'roger.serwy', 'Arfrever', 'asvetlov', 'serhiy.storchaka']
    pr_nums = []
    priority = 'normal'
    resolution = 'out of date'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue14304'
    versions = ['Python 3.6', 'Python 3.7']

    @asvetlov
    Contributor Author

    Tkinter (and IDLE specially) can use only UCS-2 characters.
    In PyShell, IDLE tries to escape non-ASCII characters.
    For a better result, we should escape only non-BMP characters, leaving BMP characters untouched.

    @asvetlov asvetlov self-assigned this Mar 14, 2012
    @pitrou
    Member

    pitrou commented Mar 31, 2012

    The solution outlined in the issue title ("utf-8-bmp codec") sounds like a rather dubious idea.

    @loewis
    Mannequin

    loewis mannequin commented Apr 1, 2012

    pitrou: can you elaborate?

    @serhiy-storchaka
    Member

    ''.join(c if ord(c) < 0x10000 else escape(c) for c in s)

    @vstinner
    Member

    What is this codec? What do you mean by "escape non-ascii"?

    @loewis
    Mannequin

    loewis mannequin commented Apr 16, 2012

    This codec is equal to UTF-8, but restricted to the BMP. For non-BMP characters, the error handler is called. It will be the stdout codec for the IDLE interactive shell, causing non-BMP results to be ascii() escaped.
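
A minimal sketch of the encoding half of such a codec, to make the description concrete. This is not code from the issue or from any patch: the function name `utf8_bmp_encode` and the codec name `"utf-8-bmp"` are hypothetical, and the sketch only assumes the standard `codecs.lookup_error` error-handler protocol.

```python
import codecs


def utf8_bmp_encode(text, errors="strict"):
    """Encode text as UTF-8, sending each non-BMP character to an error handler.

    Hypothetical helper for illustration; "utf-8-bmp" is not a registered codec.
    """
    handler = codecs.lookup_error(errors)
    chunks = []
    pos = 0
    while pos < len(text):
        ch = text[pos]
        if ord(ch) <= 0xFFFF:
            # BMP character: encode as ordinary UTF-8.
            chunks.append(ch.encode("utf-8"))
            pos += 1
            continue
        # Non-BMP character: let the chosen error handler decide what happens.
        exc = UnicodeEncodeError(
            "utf-8-bmp", text, pos, pos + 1, "non-BMP character not supported"
        )
        replacement, pos = handler(exc)  # "strict" raises here
        if isinstance(replacement, str):
            replacement = replacement.encode("utf-8")
        chunks.append(replacement)
    return b"".join(chunks)
```

With `errors="backslashreplace"` the BMP text passes through unchanged and only the non-BMP character is escaped; with `errors="strict"` the call raises `UnicodeEncodeError`, as writing to a restricted stdout would.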

    @asvetlov
    Contributor Author

    Tkinter (like Tcl itself) has no support for non-BMP characters in any form.
    It looks like UTF-16 support without surrogates.
    I would like to implement a codec for that, one that handles the different error modes (strict, replace, ignore, etc.) just as other codecs do.

    It will allow IDLE to support the BMP well and to control how non-BMP characters are processed.

    About your second question.
    IDLE has an interactive shell. In the REPL, this shell tries to print the expression result. If the result contains a non-BMP character, the whole result is converted to ASCII with escaping. This differs from the standard Python console. From my perspective, the expected behavior is to pass BMP characters through and escape only non-BMP ones.

    @serhiy-storchaka
    Member

    Example:

    >>> '\u0100'
    'Ā'
    >>> '\u0100\U00010000'
    '\u0100\U00010000'
    >>> print('\u0100')
    Ā
    >>> print('\u0100\U00010000')
    Traceback (most recent call last):
      File "<pyshell#33>", line 1, in <module>
        print('\u0100\U00010000')
    UnicodeEncodeError: 'UCS-2' codec can't encode characters in position 1-1: Non-BMP character not supported in Tk

    But I think that it is too specific problem and too specific solution. It would be better if IDLE itself escapes the string in the most appropriate way.

    def utf8bmp_encode(s):
        return ''.join(c if ord(c) <= 0xffff else '\\U%08x' % ord(c) for c in s).encode('utf-8')

    or

    import re

    def utf8bmp_encode(s):
        return re.sub('[^\x00-\uffff]', lambda m: '\\U%08x' % ord(m.group()), s).encode('utf-8')

    @asvetlov
    Contributor Author

    That way of doing it is called a 'codec'.

    @loewis
    Mannequin

    loewis mannequin commented Apr 16, 2012

    > But I think that it is too specific problem and too specific
    > solution. It would be better if IDLE itself escapes the string in the
    > most appropriate way.

    That is not implementable correctly. If you think otherwise, please
    submit a patch. If not, please trust me on that judgment.

    @serhiy-storchaka
    Member

    Maybe I did not understand the problem correctly, but I assume
    that this patch solves it.

    'Агов!\U00010000'

    @serhiy-storchaka
    Member

    Sorry, the mail daemon has eaten a piece of example.

    >>> '\u0410\u0433\u043e\u0432!\U00010000'
    'Агов!\U00010000'

    @serhiy-storchaka
    Member

    Andrew, does the patch solve your issue?

    @loewis
    Mannequin

    loewis mannequin commented Apr 28, 2012

    The patch is incorrect, i.e. it deviates from what the command line interface does. When you try to write to sys.stdout and the characters are not supported, you get UnicodeError. ASCII escaping happens only in interactive mode, when the interpreter tries to represent a result.

    @serhiy-storchaka
    Member

    I don't see how the patch is worse than the current behavior.

    Unpatched:
    >>> ''.join(map(chr, [76, 246, 119, 105, 115]))
    'Löwis'
    >>> ''.join(map(chr, [76, 246, 119, 105, 115, 65536]))
    'L\xf6wis\U00010000'
    
    Patched:
    >>> ''.join(map(chr, [76, 246, 119, 105, 115]))
    'Löwis'
    >>> ''.join(map(chr, [76, 246, 119, 105, 115, 65536]))
    'Löwis\U00010000'

    In the case of the Cyrillic alphabet all text becomes unreadable, if there are some non-bmp characters in it.
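
The difference between the two outputs above can be reproduced outside IDLE. The sketch below is not the attached patch: `escape_non_bmp` is a hypothetical helper that escapes only characters above U+FFFF in a repr, while the built-in `ascii()` escapes every non-ASCII character, which matches the "Unpatched" output.

```python
import re

# Matches any character outside the Basic Multilingual Plane.
_NON_BMP = re.compile("[\U00010000-\U0010ffff]")


def escape_non_bmp(text):
    """Replace each non-BMP character with its \\UXXXXXXXX escape (hypothetical helper)."""
    return _NON_BMP.sub(lambda m: "\\U%08x" % ord(m.group()), text)


value = "".join(map(chr, [76, 246, 119, 105, 115, 65536]))

unpatched_style = ascii(value)               # escapes 'ö' as well
patched_style = escape_non_bmp(repr(value))  # keeps 'ö', escapes only U+10000

cyrillic = escape_non_bmp(repr("\u0410\u0433\u043e\u0432!\U00010000"))
```

Here `cyrillic` keeps the readable Cyrillic letters and escapes only the trailing non-BMP character, whereas `ascii()` on the same string would turn every letter into a `\uXXXX` escape.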

    @loewis
    Mannequin

    loewis mannequin commented Apr 28, 2012

    > In the case of the Cyrillic alphabet all text becomes unreadable, if
    > there are some non-bmp characters in it.

    And indeed, that's the correct, desired behavior, as it models what the
    interactive shell does.

    If you want to change this, you need to also change the interactive console,
    which is an issue independent of this one.

    @loewis
    Mannequin

    loewis mannequin commented Apr 28, 2012

    I take that back; the interactive shell uses the backslashreplace error handler.

    Still, I don't think IDLE should setup a displayhook in the first place. What if an application replaces the displayhook?
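
For reference, the effect of the `backslashreplace` error handler on a text stream can be shown with the standard `io` module. This is only an illustration under the assumption of an ASCII-only stream; `write_text` is a hypothetical helper, not anything from IDLE or the interpreter.

```python
import io


def write_text(text, encoding="ascii", errors="strict"):
    """Write text to an in-memory stream and return the bytes that were produced."""
    buffer = io.BytesIO()
    stream = io.TextIOWrapper(buffer, encoding=encoding, errors=errors)
    stream.write(text)  # raises UnicodeEncodeError under errors="strict"
    stream.flush()
    return buffer.getvalue()


escaped = write_text("L\xf6wis\U00010000", errors="backslashreplace")
```

With `errors="strict"` the same call raises `UnicodeEncodeError`, which corresponds to the `print()` failure discussed earlier in the thread; with `backslashreplace` the unsupported characters are written as escapes instead.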

    @serhiy-storchaka
    Member

    > Still, I don't think IDLE should setup a displayhook in the first place. What if an application replaces the displayhook?

    IDLE *is* the application.

    If another application that uses idlelib replaces the displayhook, it
    must itself take care of the correct encoding and escaping.

    @asvetlov
    Contributor Author

    Serhiy, I would like to fix tkinter itself, not only IDLE.
    There are other problems, such as IDLE crashing if a non-BMP character is pasted from the clipboard.
    Moreover, non-BMP behavior differs from one Tk widget to another.
    I still want to make a codec for it and then try to solve the Tk problems.
    Maybe the solution will require extending the tkinter interface to process codec errors, with a reasonable, well-specified default behavior.
    Sorry for my silence. I hope to make some progress in the next few weeks.

    @loewis
    Mannequin

    loewis mannequin commented Apr 28, 2012

    > IDLE *is* the application.

    No, IDLE is the development environment. The application is
    whatever is being developed with IDLE.

    @serhiy-storchaka
    Member

    I don't understand how the utf-8-bmp codec will help to fix tkinter. To fix tkinter, you need to fix Tcl/Tk, which is outside of Python. As long as Tcl does not support non-BMP characters, correct and unambiguous handling of non-BMP characters is not possible. You would have to choose a method of encoding non-BMP characters, and that method will differ from one application to another.

    @serhiy-storchaka
    Member

    > No, IDLE is the development environment. The application is
    > whatever is being developed with IDLE.

    If the application replaces the displayhook, then it is the development
    environment too.

    @serhiy-storchaka
    Member

    Andrew, imagine that the utf-8-bmp codec is already there (I will write it
    for you, if I see the need for it). How are you going to use it? Show a
    patch that fixes IDLE and tkinter using this codec. It seems to me that
    any result can be achieved without the codec, and at no higher cost. And
    that is not counting the cost of the codec itself.

    @serhiy-storchaka
    Member

    Any chance to commit the patch before final feature freeze?

    @terryjreedy
    Member

    Pending some experiments with the current and patched code, and reading the rpc code, I believe I would like to see the patch applied. I don't care whether the patch defines a 'codec' or what its name would be. What I do want is for the IDLE Shell to display Unicode strings produced by Python code as faithfully as possible, without raising an exception, given the limitations of Tk and the selected font.

    @terryjreedy terryjreedy added the type-bug An unexpected behavior, bug, or error label Oct 2, 2014
    @vstinner
    Member

    vstinner commented Oct 2, 2014

    > Tkinter (and IDLE specially) can use only UCS-2 characters.

    Is it always the case, or does it depend on a compilation flag of Tcl or Tk?

    @serhiy-storchaka
    Member

    In theory Tcl/Tk can be built with 32-bit Tcl_Char. But I doubt that this option is well tested. In any case, on Linux Python depends on the system Tcl/Tk.

    @vstinner
    Member

    vstinner commented Oct 2, 2014

    > In theory Tcl/Tk can be built with 32-bit Tcl_Char.

    Would it make sense to compile Tcl/Tk with 32-bit Tcl_Char on Windows? I think that we embed our own build of Tcl/Tk, right?

    @terryjreedy
    Member

    In 3.6, Python's use of the Windows console was changed to work much better with unicode. As a result, IDLE is now worse rather than better than the console on Windows. I plan to do something before 3.7.0.

    @terryjreedy
    Member

    In October 2019, Serhiy solved the display issue with a _tkinter patch for python/cpython#57362.
    bpo-13153: Use OS native encoding for converting between Python and Tcl. (GH-16545)
    https://github.com/python/cpython/commit/06cb94bc8419b9a24df6b0d724fcd8e40c6971d6
    In Windows IDLE Shell
    >>> ''.join(map(chr, [76, 246, 119, 105, 115, 0x1F40D]))
    'Löwis🐍'
    except that the snake is black and white.  (Many astral chars have no glyph and appear as a box.) In console REPL, the snake shows as box box space box.

    Pasting astral characters into edited code 'works' except that editing following code is messy because the astral char is multiple chars internally and the visible cursor no longer matches the internal index. (But pasting such no longer crashes IDLE.)
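
The cursor mismatch described above is consistent with the text being held as UTF-16 code units, where an astral character takes a surrogate pair. That storage detail is an assumption here rather than something the thread states; the sketch only shows the arithmetic, and both helper names are hypothetical.

```python
def utf16_units(text):
    """Number of UTF-16 code units needed to store text."""
    return len(text.encode("utf-16-le")) // 2


def surrogate_pair(ch):
    """Split one astral character into its (high, low) UTF-16 surrogates."""
    offset = ord(ch) - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


text = "".join(map(chr, [76, 246, 119, 105, 115, 0x1F40D]))

python_length = len(text)         # code points, as Python counts them
utf16_length = utf16_units(text)  # code units, one more than python_length
high, low = surrogate_pair("\U0001F40D")
```

Every index after the snake is therefore off by one between the two views, which is the kind of mismatch that makes editing the following code messy.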

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022