IDLE: checksyntax() doesn't support Unicode? #48258

Closed
vstinner opened this issue Oct 1, 2008 · 16 comments
Labels
release-blocker, topic-IDLE, type-crash (a hard crash of the interpreter, possibly with a core dump)

Comments

@vstinner
Member

vstinner commented Oct 1, 2008

BPO 4008
Nosy @loewis, @terryjreedy, @vstinner
Files
  • idle-3.0rc1-quits-when-run.py
  • idle_encoding-3.patch: Use tokenize.detect_encoding() to detect Python script encoding
  • iso.py: Example of non-utf8 file (coding: ISO-8859-1)
  • idle_encoding_4.patch
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    GitHub fields:

    assignee = None
    closed_at = <Date 2009-01-18.20:18:13.474>
    created_at = <Date 2008-10-01.15:37:54.392>
    labels = ['expert-IDLE', 'type-crash', 'release-blocker']
    title = "IDLE: checksyntax() doesn't support Unicode?"
    updated_at = <Date 2009-01-18.20:18:13.473>
    user = 'https://github.com/vstinner'

    bugs.python.org fields:

    activity = <Date 2009-01-18.20:18:13.473>
    actor = 'loewis'
    assignee = 'none'
    closed = True
    closed_date = <Date 2009-01-18.20:18:13.474>
    closer = 'loewis'
    components = ['IDLE']
    creation = <Date 2008-10-01.15:37:54.392>
    creator = 'vstinner'
    dependencies = []
    files = ['11672', '11682', '11694', '12486']
    hgrepos = []
    issue_num = 4008
    keywords = ['patch', 'needs review']
    message_count = 16.0
    messages = ['74131', '74134', '74160', '74161', '74197', '74202', '74207', '74210', '74280', '74303', '74312', '76052', '76579', '78479', '78933', '80119']
    nosy_count = 4.0
    nosy_names = ['loewis', 'terry.reedy', 'vstinner', 'geon']
    pr_nums = []
    priority = 'release blocker'
    resolution = 'fixed'
    stage = None
    status = 'closed'
    superseder = None
    type = 'crash'
    url = 'https://bugs.python.org/issue4008'
    versions = ['Python 3.0']

    @vstinner
    Member Author

    vstinner commented Oct 1, 2008

    IDLE's checksyntax() function doesn't support Unicode. Example with
    idle-3.0rc1-quits-when-run.py in an ASCII terminal:

    $ ./python Tools/scripts/idle
    Exception in Tkinter callback
    Traceback (most recent call last):
      File "/home/haypo/prog/py3k/Lib/tkinter/__init__.py", line 1405, in __call__
        return self.func(*args)
      File "/home/haypo/prog/py3k/Lib/idlelib/ScriptBinding.py", line 124, in run_module_event
        code = self.checksyntax(filename)
      File "/home/haypo/prog/py3k/Lib/idlelib/ScriptBinding.py", line 86, in checksyntax
        source = f.read()
      File "/home/haypo/prog/py3k/Lib/io.py", line 1719, in read
        decoder.decode(self.buffer.read(), final=True))
      File "/home/haypo/prog/py3k/Lib/io.py", line 1294, in decode
        output = self.decoder.decode(input, final=final)
      File "/home/haypo/prog/py3k/Lib/encodings/ascii.py", line 26, in decode
        return codecs.ascii_decode(input, self.errors)[0]
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 87: ordinal not in range(128)

    To open an ASCII terminal on Linux, you can for example use xterm with
    an empty environment (except DISPLAY and HOME variables): "env -i
    DISPLAY=$DISPLAY HOME=$HOME xterm".

    @vstinner
    Member Author

    vstinner commented Oct 1, 2008

    Hmm, the problem is that IDLE relies on io.open() to choose the charset,
    but open() knows nothing about the #coding: header. So if your locale is
    ASCII, CP1252 or anything other than UTF-8, reading the file will fail.

    I wrote a patch to detect the encoding. The Python code (a
    detect_encoding() function) is based on PyTokenizer_FindEncoding() and
    get_coding_spec() (from Parser/tokenizer.c). Is there no existing public
    Python function to detect the encoding of a Python script, one that is
    callable from a Python script?

    @vstinner
    Member Author

    vstinner commented Oct 2, 2008

    Ah! tokenize already has a detect_encoding() function. My new patch uses
    it to avoid code duplication.
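
As a rough illustration of what tokenize.detect_encoding() reports, here is a minimal sketch. It assumes a current Python 3, where the function takes a readline callable over bytes and returns the encoding name plus the list of raw lines it consumed. The helper name sniff_encoding and the sample source are hypothetical and are not taken from the patch.

```python
import io
import tokenize


def sniff_encoding(source_bytes):
    """Return the encoding announced by a BOM or coding cookie.

    Falls back to tokenize's own default ('utf-8') when nothing is declared.
    """
    readline = io.BytesIO(source_bytes).readline
    encoding, _lines_read = tokenize.detect_encoding(readline)
    return encoding


# Hypothetical sample: a Latin-1 script that declares its encoding.
latin1_source = "# -*- coding: iso-8859-1 -*-\nname = 'caf\xe9'\n".encode("iso-8859-1")

encoding = sniff_encoding(latin1_source)
text = latin1_source.decode(encoding)  # decodes cleanly with the declared codec
```

Only the first two lines are inspected, so the call stays cheap even for large files.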

    @loewis
    Mannequin

    loewis mannequin commented Oct 2, 2008

    Notice that there is also IOBinding.coding_spec. Not sure whether this
    or the one in tokenize is more correct.

    @vstinner
    Member Author

    vstinner commented Oct 2, 2008

    loewis wrote:

    > Notice that there is also IOBinding.coding_spec.
    > Not sure whether this or the one in tokenize is more correct.

    Oh! IOBinding reimplements many features now available in Python, such
    as universal newlines or a function to write Unicode strings to a file.
    But I don't want to rewrite IDLE, I just want to fix the initial problem:
    IDLE is unable to open a non-ASCII file that uses a "#coding:" header.

    So IDLE implements coding detection twice: once in IOBinding and
    once in ScriptBinding. I wrote a new version of my patch that removes
    all of that code and reuses tokenize.detect_encoding() instead.

    I changed IDLE's behaviour: IOBinding._decode() used to fall back to the
    locale encoding when it could not detect the encoding from a UTF-8 BOM
    or a #coding: header. Since I also read "Finally, try the locale's
    encoding. This is deprecated", I prefer to remove that fallback. If you
    want to keep the current behaviour, use:
    -------------------------

    def detect_encoding(filename, default=None):
        with open(filename, 'rb') as f:
            encoding, line = tokenize.detect_encoding(f.readline)
        if (not line) and default:
            return default
        return encoding
    ...
                encoding = detect_encoding(filename, locale_encoding)
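
A hedged variant of the same fallback idea for a current Python 3. There, detect_encoding() returns 'utf-8' both for an explicit utf-8 cookie and for no declaration at all, so this sketch inspects the consumed lines itself to tell the two apart. COOKIE_RE and detect_encoding_or_default are illustrative names, and the regular expression is only an approximation of the PEP 263 pattern.

```python
import io
import re
import tokenize

# Approximation of the PEP 263 coding-cookie pattern (an assumption, not IDLE's code).
COOKIE_RE = re.compile(rb"^[ \t\f]*#.*?coding[:=][ \t]*([-\w.]+)")


def detect_encoding_or_default(readline, default=None):
    """Return the declared encoding, or `default` when nothing is declared."""
    encoding, lines = tokenize.detect_encoding(readline)
    # 'utf-8-sig' means a BOM was found; otherwise look for a cookie in the
    # (at most two) lines that detect_encoding() consumed.
    declared = encoding == "utf-8-sig" or any(COOKIE_RE.match(line) for line in lines)
    if not declared and default:
        return default
    return encoding


plain = detect_encoding_or_default(io.BytesIO(b"x = 1\n").readline, "latin-1")
explicit = detect_encoding_or_default(io.BytesIO(b"# coding: utf-8\nx = 1\n").readline, "latin-1")
```

Passing the locale encoding as default would keep the old behaviour; omitting it gives the stricter behaviour proposed in the comment.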

    Please review and test my patch (which becomes longer and longer) :-)

    @loewis
    Mannequin

    loewis mannequin commented Oct 2, 2008

    > Oh! IOBinding reimplements many features now available in Python, such
    > as universal newlines or a function to write Unicode strings to a file.

    It did not *re*implement. The implementation in IOBinding predates all
    other implementations out there.

    @vstinner
    Member Author

    vstinner commented Oct 2, 2008

    @loewis: Ok, I didn't know. I think that it's better to reuse existing
    code.

    I also compared the implementations of encoding detection. The
    code looks the same in IDLE and tokenize, but I prefer tokenize:
    tokenize.detect_encoding() has longer documentation, returns the line
    (decoded as Unicode) matching the encoding cookie, and looks
    more robust.

    I saw an interesting test in IDLE code: it checks the charset. So I
    wrote a patch raising a SyntaxError for tokenize: bpo-4021.

    @loewis
    Mannequin

    loewis mannequin commented Oct 2, 2008

    I can't reproduce the problem. It works fine for me, displaying the box
    drawing character. In case it matters, locale.getpreferredencoding() returns
    'ANSI_X3.4-1968'; this is on Linux, with IDLE started from an xterm, at r66761.

    @vstinner
    Member Author

    vstinner commented Oct 3, 2008

    @loewis: I guess that your locale is still UTF-8.

    On Linux (Ubuntu Gutsy) using "env -i DISPLAY=$DISPLAY HOME=$HOME
    xterm" to get a new empty environment, I get:

    $ locale
    LANG=
    LC_ALL=
    LC_CTYPE="POSIX"
    LC_NUMERIC="POSIX"
    LC_TIME="POSIX"
    LC_COLLATE="POSIX"
    ...
    $ python3.0
    >>> from idlelib.IOBinding import encoding
    >>> encoding 
    'ansi_x3.4-1968'
    >>> import locale
    >>> locale.getdefaultlocale()
    (None, None)
    >>> locale.nl_langinfo(locale.CODESET)
    'ANSI_X3.4-1968'

    In this environment, IDLE is unable to detect
    idle-3.0rc1-quits-when-run.py encoding.

    IDLE uses open(filename, 'r'): it doesn't specify the charset. In this
    case, TextIOWrapper uses locale.getpreferredencoding() as encoding (or
    ASCII on failure).
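
The failure can be reproduced without changing the locale. This is a minimal sketch that assumes encoding="ascii" is a fair stand-in for what TextIOWrapper picks under a POSIX locale. It also shows tokenize.open(), which exists in later Python 3 releases (not in 3.0), as one way to honour the coding header. The sample script is hypothetical.

```python
import os
import tempfile
import tokenize

# U+2500 (a box drawing character) encodes to the bytes e2 94 80 in UTF-8,
# which matches the 0xe2 byte reported in the traceback.
source = '# -*- coding: utf-8 -*-\nprint("\u2500")\n'

with tempfile.NamedTemporaryFile("wb", suffix=".py", delete=False) as tmp:
    tmp.write(source.encode("utf-8"))
path = tmp.name

try:
    try:
        # Stand-in for open(filename, 'r') under an ASCII locale.
        with open(path, "r", encoding="ascii") as f:
            f.read()
        locale_read_failed = False
    except UnicodeDecodeError:
        locale_read_failed = True

    # tokenize.open() detects the encoding from the BOM or coding cookie.
    with tokenize.open(path) as f:
        text = f.read()
finally:
    os.remove(path)
```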

    To sum up IDLE's behaviour: you can open a file only if its encoding
    matches your locale. For example, if your locale is UTF-8, you can open
    a UTF-8 file but not an ISO-8859-1 file. Let's try iso.py: IDLE displays
    the error "Failed to decode" and quits, even though I specified the
    encoding :-/

    @loewis
    Mannequin

    loewis mannequin commented Oct 4, 2008

    > @loewis: I guess that your locale is still UTF-8.

    To refute this claim, I reported that locale.getpreferredencoding
    reports 'ANSI_X3.4-1968'. I was following your instructions exactly
    (on Debian 4.0), and still, it opens successfully (when loaded through
    File/Open). Should I do something else with it to trigger the error,
    other than opening it?

    When opening iso.py, I get a pop-up window titled "Decoding error",
    with the message "Failed to Decode". This seems to be correct as well.

    So I still can't reproduce the problem.

    I don't understand why you say that IDLE uses open(filename, 'r').
    In IOBinding.IOBinding.loadfile, I see

                # open the file in binary mode so that we can handle
                # end-of-line convention ourselves.
                f = open(filename,'rb')

    @vstinner
    Member Author

    vstinner commented Oct 4, 2008

    IDLE opens the script more than once. There are two cases:
    (1) the first open, when IDLE reads the file content to display it
    (2) the second open, on pressing the F5 key (Run Module), to check the syntax

    (1) uses IOBinding and fails to open an ISO-8859-1 file with a UTF-8
    locale.

    (2) uses ScriptBinding and fails to open a UTF-8 file with an ASCII locale.

    About the initial problem (idle-3.0rc1-quits-when-run.py): yes, I
    forgot to say that you have to run the module, sorry :-/

    @loewis
    Mannequin

    loewis mannequin commented Nov 19, 2008

    This patch has two problems:

    1. Saving files fails, since there is still a call left to the function
      coding_spec, but that function has been removed.
    2. Even if saving worked, it wouldn't preserve the line endings of the
      original file when writing it back. If you open files with DOS line
      endings on Unix, upon saving, they should still have DOS line endings.
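
The second point can be sketched as follows. This is a minimal illustration of remembering a file's newline convention on load and restoring it on save, not the code IDLE uses. The names detect_eol, load and save are hypothetical, and files with mixed line endings are normalized to a single convention here.

```python
def detect_eol(data):
    """Guess the newline convention of raw file bytes; LF is the default."""
    if b"\r\n" in data:
        return "\r\n"
    if b"\r" in data:
        return "\r"
    return "\n"


def load(data, encoding="utf-8"):
    """Decode bytes and normalize newlines to LF, remembering the original."""
    eol = detect_eol(data)
    text = data.decode(encoding).replace("\r\n", "\n").replace("\r", "\n")
    return text, eol


def save(text, eol, encoding="utf-8"):
    """Re-apply the remembered newline convention before encoding."""
    return text.replace("\n", eol).encode(encoding)


dos_bytes = b"x = 1\r\ny = 2\r\n"
text, eol = load(dos_bytes)
round_tripped = save(text, eol)
```

Working on bytes (a file opened with 'rb') is what makes this possible: text mode with universal newlines would hide the original convention.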

    @terryjreedy
    Member

    This is still a problem on my WinXP 3.0rc3 with
    # -*- coding: utf-8 -*-
    in a file, but not with the same code pasted directly into the shell window.

    @terryjreedy added the type-crash label Nov 29, 2008

    @loewis
    Mannequin

    loewis mannequin commented Dec 29, 2008

    Here is a new patch that fixes this issue and its duplicates
    (bpo-4410 and bpo-4623).

    It doesn't try to eliminate code duplication, but fixes coding_spec by
    always decoding as Latin-1 first, until the coding is known. It fixes
    checksyntax by opening the source file in binary mode. It should have fixed
    tabnanny the same way, except that tabnanny cannot properly process byte
    tokens.
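
Both ideas can be illustrated with a short sketch. It is not the committed patch: COOKIE_RE is an approximation of the PEP 263 pattern, and the coding_spec and check_syntax functions below are simplified stand-ins that only borrow the names used in the discussion. The sketch relies on two facts: Latin-1 can decode any byte sequence, and compile() accepts bytes and applies the coding cookie itself.

```python
import re

# Approximation of the PEP 263 cookie pattern (illustrative, not IDLE's own).
COOKIE_RE = re.compile(r"^[ \t\f]*#.*?coding[:=][ \t]*([-\w.]+)", re.ASCII)


def coding_spec(data):
    """Return the encoding named in the first two lines of data, or None."""
    for raw_line in data.split(b"\n", 2)[:2]:
        # Latin-1 maps every byte to a code point, so this decode never fails.
        match = COOKIE_RE.match(raw_line.decode("latin-1"))
        if match:
            return match.group(1)
    return None


def check_syntax(source_bytes, filename="<example>"):
    """Compile from bytes so the compiler handles the coding cookie itself."""
    try:
        return compile(source_bytes, filename, "exec")
    except SyntaxError:
        return None


# Hypothetical sample: a Latin-1 script that declares its encoding.
latin1_source = "# -*- coding: iso-8859-1 -*-\nname = 'caf\xe9'\n".encode("iso-8859-1")

declared = coding_spec(latin1_source)
namespace = {}
exec(check_syntax(latin1_source), namespace)
```

Reading the file as bytes keeps the locale out of the picture entirely, which is why neither step depends on the terminal's environment.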

    @loewis loewis mannequin added the release-blocker label Dec 29, 2008
    @geon
    Mannequin

    geon mannequin commented Jan 3, 2009

    I vote for fixing this too. This might be a simplified or additional
    example of the above-mentioned issues:

    # -*- coding: utf-8 -*-
    print ("ěščřžýáíé")

    in IDLE prints this:
    >>> 
    ěščřžýáíé

    When I run this script with the python command line from another editor,
    the output is readable, as expected.

    @loewis
    Mannequin

    loewis mannequin commented Jan 18, 2009

    Committed as r68730 and r68731.

    @loewis loewis mannequin closed this as completed Jan 18, 2009
    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022