
Author ezio.melotti
Recipients JBernardo, ezio.melotti, ned.deily, terry.reedy
Date 2011-10-15.20:25:20
Message-id <1318710321.15.0.780031092829.issue13153@psf.upfronthosting.co.za>
In-reply-to
Content
> Ezio, do you know anything about these speculations?

Assuming that the non-BMP character is represented with two surrogates (\ud801\udca2) and that _tkinter tries to decode them independently, the error message ("invalid continuation byte") would be correct.
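
To make the two surrogates concrete, here is a minimal Python 3 sketch of the standard UTF-16 arithmetic that turns U+104A2 into \ud801\udca2. The helper name to_surrogate_pair is made up for illustration; it is not something _tkinter or Tcl exposes.

```python
def to_surrogate_pair(code_point):
    """Split a non-BMP code point into its UTF-16 (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a non-BMP code point")
    offset = code_point - 0x10000      # 20-bit value
    high = 0xD800 + (offset >> 10)     # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits
    return high, low


high, low = to_surrogate_pair(0x104A2)
print(hex(high), hex(low))  # 0xd801 0xdca2
```

The result agrees with what the UTF-16 codec produces: '\U000104a2'.encode('utf-16-be') gives b'\xd8\x01\xdc\xa2'.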

The Python 2 UTF-8 codec is more permissive and allows encoding/decoding of lone surrogates (this might also explain why it works on Python 2):
>>> u'\ud801'.encode('utf-8')
'\xed\xa0\x81'
>>> '\xed\xa0\x81'.decode('utf-8')
u'\ud801'

But on Python 3, trying to decode that results in an error:
>>> b'\xed\xa0\x81'.decode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 0-1: invalid continuation byte

> But then the problem should be the initial byte, not the continuation
> bytes, which are the same for all chars and which all have 10 for
> their two high order bits.

While it's true that all continuation bytes have the first two bits equal to '10', the converse is not always true: a byte of that form is not necessarily valid after every start byte.  Some start bytes place additional restrictions on the byte that follows them.  For example, even though the first two bits of 0xA0 (0b10100000) are '10', the valid second bytes for a sequence starting with 0xED are restricted to the range 80..9F.
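
As a rough illustration of that rule, the following Python 3 sketch encodes the allowed range of the second byte for each start byte, following the well-formed UTF-8 table in the Unicode Standard. The names SPECIAL_RANGES, second_byte_range and is_valid_second_byte are hypothetical and only serve this example; this is not how the codec itself is implemented.

```python
# Start bytes whose second byte is restricted more tightly than 80..BF.
SPECIAL_RANGES = {
    0xE0: (0xA0, 0xBF),  # excludes overlong 3-byte forms
    0xED: (0x80, 0x9F),  # excludes the surrogates U+D800..U+DFFF
    0xF0: (0x90, 0xBF),  # excludes overlong 4-byte forms
    0xF4: (0x80, 0x8F),  # excludes code points above U+10FFFF
}


def second_byte_range(start):
    """Return the (low, high) range allowed for the byte after `start`."""
    if start in SPECIAL_RANGES:
        return SPECIAL_RANGES[start]
    if 0xC2 <= start <= 0xF4:
        return (0x80, 0xBF)
    raise ValueError("not a valid UTF-8 start byte: %#x" % start)


def is_valid_second_byte(start, second):
    low, high = second_byte_range(start)
    return low <= second <= high


print(is_valid_second_byte(0xED, 0xA0))  # False
print(is_valid_second_byte(0xED, 0x9F))  # True
print(is_valid_second_byte(0xE1, 0xA0))  # True
```

So 0xA0 is a fine second byte after 0xE1 but not after 0xED, which is why the decoder reports an invalid continuation byte for b'\xed\xa0\x81'.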

The reason that
>>> '\U000104a2'
'𐒢'
works is that the input is all ASCII (the character is written as an escape sequence), so the decoding doesn't fail.


> [...]
> This should catch any miscellaneous crashes which are not otherwise
> caught and maybe turn the crash issues into bug reports -- the same
> way that running from the command line did.

Having some "safety net" to catch all the unhandled exceptions seems like a good idea.  This won't work in case of segfaults, but it's still better than nothing.  I'm not sure what you mean by "turn them into bug reports" though.
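
One possible shape for such a safety net, sketched here with sys.excepthook and a log stream; this is an assumption for illustration only, not what IDLE does, and make_safety_net and install are hypothetical names. As noted above, a hook like this cannot help with segfaults.

```python
import sys
import traceback


def make_safety_net(log_stream):
    """Build an excepthook that records unhandled exceptions to log_stream."""
    def hook(exc_type, exc_value, exc_tb):
        log_stream.write("Unhandled exception:\n")
        traceback.print_exception(exc_type, exc_value, exc_tb, file=log_stream)
    return hook


def install(log_stream=sys.stderr):
    """Route every otherwise-unhandled exception through the safety net."""
    sys.excepthook = make_safety_net(log_stream)
```

Calling install() once at startup would be enough for exceptions that reach the top level of the main thread; exceptions raised inside Tk callbacks are reported through a separate path and would need their own handling.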