Idle shell crash on printing non-BMP unicode character #58408
Comments
Hi, Python 3.3.0a1 (default, Mar 4 2012, 17:27:59) [MSC v.1500 32 bit (Intel)] on win32 ... >>> got_ahsa = "\N{GOTHIC LETTER AHSA}"
>>> len(got_ahsa)
1
>>> got_ahsa.encode("unicode-escape")
b'\\U00010330'
>>> got_ahsa [crash - idle shell window closes immediately without any visible error message or traceback] I realised later, that tkinter probably won't be able to print wide-unicode characters anyway (according to Regards |
Hi Vlastimil, Can you repeat your test case while running IDLE from the command prompt and report the error you see?
IDLE closes suddenly on Windows because IDLE uses pythonw.exe which has no stdout or stderr. When Tkinter encounters an error and tries to write to stderr, an error is raised in the Tkinter eventloop and the eventloop terminates. |
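The failure mode described above can be guarded against in a GUI script. The following is only a sketch, not what IDLE does: it assumes the missing streams show up as `None` (as documented for current Python 3 under pythonw.exe), and `ensure_std_streams` is a hypothetical helper name.

```python
import io
import sys


def ensure_std_streams():
    """Give sys.stdout/sys.stderr a harmless in-memory sink when they are missing.

    Hypothetical helper: assumes a missing stream is represented as None.
    Returns the names of the streams that were replaced.
    """
    replaced = []
    for name in ("stdout", "stderr"):
        if getattr(sys, name) is None:
            # Writes now go to an in-memory buffer instead of failing.
            setattr(sys, name, io.StringIO())
            replaced.append(name)
    return replaced
```

Calling this once at startup, before the Tk event loop runs, would keep an error report from turning into a second exception; the report is still lost unless the buffer is inspected or a log file is used instead.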
Hi, Python 3.3.0a1 (default, Mar 4 2012, 17:27:59) [MSC v.1500 32 bit (Intel)] on win32
Type "copyright", "credits" or "license()" for more information.
>>> got_ahsa = "\N{GOTHIC LETTER AHSA}"
>>> len(got_ahsa)
1
>>> got_ahsa.encode("unicode-escape")
b'\\U00010330'
>>> got_ahsa
I just get empty line as "answer" but no crash. The console indeed contains the traceback with the error I expected vbr ============ Microsoft Windows XP [Verze 5.1.2600] C:\Python33>python.exe -m idlelib.idle Object: stdout Traceback (most recent call last):
File "C:\Python33\lib\idlelib\rpc.py", line 188, in localcall
ret = method(*args, **kwargs)
File "C:\Python33\lib\idlelib\PyShell.py", line 1244, in write
self.shell.write(s, self.tags)
File "C:\Python33\lib\idlelib\PyShell.py", line 1226, in write
OutputWindow.write(self, s, tags, "iomark")
File "C:\Python33\lib\idlelib\OutputWindow.py", line 40, in write
self.text.insert(mark, s, tags)
File "C:\Python33\lib\idlelib\Percolator.py", line 25, in insert
self.top.insert(index, chars, tags)
File "C:\Python33\lib\idlelib\ColorDelegator.py", line 80, in insert
self.delegate.insert(index, chars, tags)
File "C:\Python33\lib\idlelib\PyShell.py", line 322, in insert
UndoDelegator.insert(self, index, chars, tags)
File "C:\Python33\lib\idlelib\UndoDelegator.py", line 81, in insert
self.addcmd(InsertCommand(index, chars, tags))
File "C:\Python33\lib\idlelib\UndoDelegator.py", line 116, in addcmd
cmd.do(self.delegate)
File "C:\Python33\lib\idlelib\UndoDelegator.py", line 219, in do
text.insert(self.index1, self.chars, self.tags)
File "C:\Python33\lib\idlelib\ColorDelegator.py", line 80, in insert
self.delegate.insert(index, chars, tags)
File "C:\Python33\lib\idlelib\WidgetRedirector.py", line 104, in __call__
return self.tk_call(self.orig_and_operation + args)
ValueError: character U+10330 is above the range (U+0000-U+FFFF) allowed by Tcl |
On 3.2.2, Win7, the length is 2 and printing in Idle prints a square, as it usually does for chars it cannot print. I presume Tk recognizes surrogate pairs. Printing to the screen should not raise an exception, so the square would be better. Even better would be to do what the 3.2 and 3.3 Command Prompt Interpreters do, which is to print an evaluable representation: >>> c
'\U00010330' I assume that this string is produced by python.exe rather than Windows. If so, neither of the two pythonw processes is currently doing the same conversion. My understanding is that the user pythonw process uses idlelib.rpc.RPCproxy objects to ship i/o calls to the idle pythonw process. I presume we could find the idle process window .write methods and change lines like --- When I paste the same into the 3.2 or 3.2 interpreter, it is converted to ascii '?'. I presume this is done by Windows Command Prompt before sending anything to python. |
I'd like to add some further observations to the mentioned issue; I also tried to explicitly display surrogate pair, which were used automatically until python 3.2; these can be used in tkinter in 3.3, but there are limitations and discrepancies: >>>
>>> got_ahsa = "\N{GOTHIC LETTER AHSA}"
>>> def wide_char_to_surrog_pair(char):
code_point = ord(char)
if code_point <= 0xFFFF:
return char
else:
high_surr = (code_point - 0x10000) // 0x400 + 0xD800
low_surr = (code_point - 0x10000) % 0x400 + 0xDC00
return chr(high_surr)+chr(low_surr)
>>> ahsa_surrog = wide_char_to_surrog_pair(got_ahsa)
>>> print(ahsa_surrog)
𐌰
>>> repr(ahsa_surrog)
"'_ud800\x00udf30'"
>>> ahsa_surrog
'Pud800 udf30' [the space in the middle of the last item might be \x00, as it terminates the clipboard content; the rest is copied separately] The printed square corresponds to the given character and can be used in other programs etc. (Whereas in py 3.2 the same value was used for repr and a direct "display" of the string in the interpreter, there are three different formats in py 3.3.) I also noticed that a surrogate pair is no longer supported as input for unicodedata.name(...):
>>> import unicodedata
>>> unicodedata.name(ahsa_surrog)
Traceback (most recent call last):
File "<pyshell#60>", line 1, in <module>
unicodedata.name(ahsa_surrog)
TypeError: need a single Unicode character as parameter
>>> (in 3.2 and probably others it returns the expected 'GOTHIC LETTER AHSA') For my part, I would think that keeping somewhat liberal (but still unambiguous) input possibilities for unicodedata wouldn't hurt. Also, if tkinter is not going to support wide unicode natively any time soon, output conversion using surrogates, which other programs also understand, seems the most usable option in this regard. Hopefully this is somehow relevant to the original issue. Regards, |
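Since unicodedata.name() in 3.3 wants a single character, a surrogate pair has to be combined first. A minimal sketch of the inverse of wide_char_to_surrog_pair above; `surrogate_pair_to_char` is a hypothetical helper, not part of any library:

```python
import unicodedata


def surrogate_pair_to_char(pair):
    """Combine a UTF-16 surrogate pair into the single character it encodes.

    Hypothetical helper; anything that is not a valid high+low surrogate
    pair is returned unchanged.
    """
    if len(pair) != 2:
        return pair
    high, low = ord(pair[0]), ord(pair[1])
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        return pair
    # Each surrogate carries 10 bits of the offset above U+10000.
    return chr(0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00))


# The pair from the session above maps back to the original character.
name = unicodedata.name(surrogate_pair_to_char("\ud800\udf30"))
```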
Vlastimil: you are mixing issues. Some of your observations are actually correct behaviour; please don't clutter the report with that, but report each separate behavior in a separate report. In Python 3.3, surrogate pairs do *not* substitute for the actual character, since the internal representation is not UTF-16 anymore. Also, when you run a Tkinter app in IDLE: while you get a "proper" traceback output, your conclusion that python.exe does not "crash" is incorrect: it crashes in the very same way that IDLE crashes. Except when run inside IDLE, it is a subprocess that "crashes" (i.e. terminates with a traceback output), not IDLE itself. |
Sorry for mixing the different problems, these were somehow things I noticed "at once" in the new python version, but I should have noticed the different domains myself. Now I see, that with pythonw a tk app terminates with the first exception (in general) in py 3.3 and also 3.2 (as opposed to py 2.7, where it just swallows the exception and stays alive, as one would probably expect). Should this be reported in a separate issue, or is this what remains relevant in *this* report? (Sorry for the confusion.) vbr |
That pythonw suddenly closes is a separate issue: if pythonw attempts to write to stderr, it crashes. To get your example to "run" in pythonw.exe, redirect stderr: pythonw.exe Lib\idlelib\idle.py 2> out.txt I think the behavior of pythonw terminating when it can't write to stderr is actually correct: an exception is raised on attempting to write to stderr, which then can't be printed (because there is no stderr). So the real fault here is the traceback that python.exe reports. To fix this, I think rpc.py should learn to marshal exceptions back to the subprocess. Then the initial sys.stdout.write should raise a UnicodeError (which it currently doesn't, either). This would get into the displayhook, which would then use sys_displayhook_unencodable to backslash-escape the unsupported character. I'll attach a patch that at least makes the exception UnicodeEncodeError. |
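The backslash-escaping fallback mentioned above can be approximated in pure Python. This is a sketch of the idea only, not the C implementation in CPython; `escape_unencodable` is a hypothetical name, and "ascii" stands in for a target encoding that cannot represent the character.

```python
def escape_unencodable(text, encoding):
    """Return text with every character the encoding cannot represent
    replaced by a backslash escape, so that it can be written safely."""
    # "backslashreplace" turns unencodable characters into \xNN, \uNNNN
    # or \UNNNNNNNN sequences, all of which are plain ASCII.
    return text.encode(encoding, "backslashreplace").decode(encoding)


# What a display hook would write for the repr of the Gothic letter
# when the output stream can only take ASCII.
shown = escape_unencodable(repr("\U00010330"), "ascii")
```

The result is the evaluable form `'\U00010330'` that the Command Prompt interpreter prints, rather than an exception or a placeholder.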
Attached is a patch to have the rpc marshal exceptions. When used with Martin's patch, IDLE returns >>> '\U00010330'
Traceback (most recent call last):
File "<pyshell#3>", line 1, in <module>
'\U00010330'
ValueError: character U+10330 is above the range (U+0000-U+FFFF) allowed by Tcl Martin: I disagree with the approach of raising a UnicodeEncodeError if IDLE can't render the output of a user's program, especially when the program would otherwise run without error if run outside of IDLE. Would replacing these characters with "?" and documenting this limitation in IDLE's docs be an acceptable solution? |
I made a mistake in msg155410. The results in the message are WITHOUT "unicodeerror.diff" applied. When it is applied, the IDLE shell gives: >>> '\U00010330'
Traceback (most recent call last):
File "<pyshell#1>", line 1, in <module>
'\U00010330'
UnicodeEncodeError: 'UCS-2' codec can't encode characters in position 1-1: Non-BMP character not supported in Tk
Traceback (most recent call last):
** IDLE Internal Exception:
File "idlelib/run.py", line 98, in main
ret = method(*args, **kwargs)
File "idlelib/run.py", line 305, in runcode
print_exception()
File "idlelib/run.py", line 168, in print_exception
print(line, end='', file=efile)
File "idlelib/rpc.py", line 599, in __call__
value = self.sockio.remotecall(self.oid, self.name, args, kwargs)
File "idlelib/rpc.py", line 214, in remotecall
return self.asyncreturn(seq)
File "idlelib/rpc.py", line 245, in asyncreturn
return self.decoderesponse(response)
File "idlelib/rpc.py", line 265, in decoderesponse
raise what
ValueError: max() arg is an empty sequence I will need to rework the rpc_marshal_exception patch. |
This is really an independent issue, and I'd appreciate if people would So if the RPC exception marshalling works, and can resolve this issue, I don't *quite* understand what you are proposing. If it is that If you are suggesting that IDLE's write function should write |
Having had some time to work on it, the bug is in the unicodeerror.diff patch. If the string is empty then max(s) will raise a ValueError. This is easy to trigger by generating an exception at the python prompt, like "1/0". Attached is a revised version of Martin's patch. |
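The empty-string pitfall is easy to reproduce in isolation. A minimal sketch of a guarded check, assuming the patch tested max(s) against the BMP limit; `check_bmp_only` is a hypothetical name and this is not the code from the attached patch:

```python
def check_bmp_only(s):
    """Raise UnicodeEncodeError if s contains a character above U+FFFF.

    The "s and" guard matters: max("") raises
    "ValueError: max() arg is an empty sequence".
    """
    if s and max(s) > "\uffff":
        raise UnicodeEncodeError(
            "UCS-2", s, 0, len(s), "Non-BMP character not supported in Tk")
    return s
```

An empty write, such as the one that occurs while a traceback for "1/0" is being printed, now passes through untouched.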
Martin, I got your message after I submitted the last one. This issue does involve IDLE crashing, but it's not crashing due to non-BMP characters. That is a side-effect of a bigger issue with pythonw.exe. See bpo-13582 for more information. IDLE's shell output has a gross deficiency due to Tkinter's inability to handle Unicode properly. Why penalize a program for running in IDLE just because IDLE can't write something to the text widget? This is precisely what your approach is doing - making IDLE an even more restricted environment than it needs to be. |
New changeset c06b94c5c609 by Andrew Svetlov in branch 'default': |
The patch escapes every non-ASCII character, while it would be better to escape only non-BMP ones. This will be done after bpo-14304 |
Andrew, please reopen this issue. Your committed patch does not work if IDLE is not using the subprocess. >>> got_ahsa = "\N{GOTHIC LETTER AHSA}"
>>> got_ahsa
Traceback (most recent call last):
File "<pyshell#1>", line 1, in <module>
got_ahsa
File "idlelib/PyShell.py", line 1255, in write
return self.shell.write(s, self.tags)
File "idlelib/PyShell.py", line 1233, in write
'Non-BMP character not supported in Tk')
UnicodeEncodeError: 'UCS-2' codec can't encode characters in position 1-1: Non-BMP character not supported in Tk However, it does work when IDLE uses a subprocess. |
Attached is a patch that undoes Andrew's and fixes the issue in a simple manner. The tcl_unicode_range.patch from bpo-12342 has already been applied, so catching ValueError within IDLE is all that is now needed. |
Attached is a better implementation of the patch. The Percolator which ultimately handles writing to the Text widget should intercept the ValueError due to non-BMP characters. The issue14200_rev1.patch fixes this issue and bpo-13153. |
Roger, you are missing the difference between calling print() and evaluating an expression in Python interactive mode. For the rest I like your simplification. And IDLE should definitely work in both subprocess and embedded modes; thank you for that point. I'll make the final (I hope) patch a bit later. |
Andrew, I do admit that I have a lot to learn about Unicode support in Python, for instance with its error-handling and its corner cases. On Windows Vista, I do see that print() behaves differently than evaluating the expression. An exception is raised for: On Linux, I see the character print as ? in xterm and as a '?' when evaluated. In gnome-terminal (Ubuntu Mono font) it prints as a box containing the code point in hex. No exception is raised. I do see your point. The patch I provided always substitutes the unsupported character with its full expansion. Returning to a point earlier raised by Martin, using REPLACEMENT CHARACTER instead would be better. It would make the behavior of IDLE more consistent with xterm and gnome-terminal, although it would cause IDLE to hide errors if the program ran from a Windows console instead of IDLE. Given that Windows and Linux (Ubuntu) behave differently, I'd rather let IDLE mimic the behavior of a Linux console than a Windows console. |
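The REPLACEMENT CHARACTER substitution discussed above could look roughly like this. It is a sketch of the idea, not the code from any attached patch; `replace_non_bmp` is a hypothetical name.

```python
def replace_non_bmp(s, replacement="\ufffd"):
    """Replace every character above U+FFFF with U+FFFD (REPLACEMENT
    CHARACTER) so the string can be inserted into a Tk Text widget."""
    return "".join(c if ord(c) <= 0xFFFF else replacement for c in s)


# The Gothic letter is swapped out; BMP characters are left alone.
safe = replace_non_bmp("a\U00010330b")
```

The trade-off is the one raised in the comment: output never fails, but a program that would raise UnicodeEncodeError on a restricted console appears to work inside IDLE.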
I consulted with Martin at the PyCon sprint and he suggested a solution which I'm following: to split Output passed to the print() function is encoded with sys.stdout.encoding. UTF was invented to support any character. xterm (an old enough terminal), which you use, cannot print non-BMP characters and replaces them with question marks. Let's return to non-UTF terminal encodings. andrew@tiktaalik ~/p/cpython> bash -c "LANG=C; ./python"
Python 3.3.0a1+ (qbase qtip tip tk:c3ce8a8e6c9c+, Mar 14 2012, 15:54:55)
[GCC 4.6.1] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> '\U00010340'
'\U00010340'
>>> print('\U00010340')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character '\U00010340' in position 0: ordinal not in range(128)
>>> As you can see I have switched LANG to C (alias for ASCII) locale. Eval printed with unicode escaping but The same for Windows, OS X and any other platform. |
As is for most other characters not supported in your OEM code
That's because your terminal output encoding is UTF-8. If you change
That's not a given, see above. |
I stand corrected. Thank you for the information. The behavior of the console depends on its locale. IDLE has no facility for changing the locale of the PyShell window. Should this option be included somewhere? |
I think that doesn't make sense. |
The Tkinter Text widget is the output for the IDLE shell and it has the limitation imposed by Tcl/Tk of not handling non-BMP unicode characters. Is the following reasonable: The IDLE shell console has a locale of "non-BMP utf8"? If so, would it be reasonable to add a menu item to switch locales for the shell? This amounts to adding some extra code to OutputWindow's write() to raise encoding errors if the string contains unsupported characters, and possibly replacing characters to work around Tcl/Tk's non-BMP limitation. |
It may be remotely desirable to be able to set the terminal encoding |
[BMP utf8]
Please open a separate issue for this. |
Martin, you are right. I created a separate issue bpo-14326. Let me know what I can do to help. |
New changeset 89878808f4ce by Andrew Svetlov in branch 'default': |
After experiments with non-BMP characters I figured out: obviously Tk does not support non-BMP chars by spec, but it does not strictly reject them either. Details are implementation specific and depend not only on the Tcl/Tk version but also on the concrete widget class. After that my position is: I experimented on an Ubuntu box but am pretty sure the same result can be reproduced on OS X and Windows as well. Also, we need Tk to behave cross-platform, so replacing non-BMP characters is not bad; it is a good solution until Tcl/Tk handles non-BMP characters natively. |
Closing again. Now IDLE works fine in both subprocess and in-process mode. Further work on non-BMP support can continue after a codec for that is implemented (bpo-14304). For now I'd like to close this as «good enough for now». |