Title: UnicodeDecodeError that cannot be caught in narrow unicode builds
Components: Unicode
Versions: Python 2.5
Created on 2007-11-20 21:17 by sbp, last changed 2022-04-11 14:56 by admin.

msg57710 - (view) Author: Sean B. Palmer (sbp) Date: 2007-11-20 21:17
The following error is uncatchable:

>>> try: ur'\U0010FFFF'
... except UnicodeDecodeError: pass
UnicodeDecodeError: 'rawunicodeescape' codec can't decode byte 0x5c 
in position 0: \Uxxxxxxxx out of range

This is in a narrow unicode build:

>>> sys.version_info, hex(sys.maxunicode)
((2, 5, 1, 'final', 0), '0xffff')

Of course the r in ur'...' is redundant in the test case above, but
there are cases in which it isn't...

>>> ur'\U0010FFFF\test'
- from a wide unicode build

>>> ur'\U0010FFFF\test'
UnicodeDecodeError: 'rawunicodeescape' codec can't decode byte 0x5c 
in position 0: \Uxxxxxxxx out of range
- from the narrow unicode build

The problem occurs with .decode('raw-unicode-escape') too.

>>> '\U0010FFFF\test'.decode('raw-unicode-escape')
Traceback (most recent call last):

Most surprisingly of all, however, this problem doesn't occur when you
don't use a raw string:

>>> u'\U0010ffff\\test'

So there is at least a workaround for all cases, which is why this bug
is marked as Severity: minor. It did take a while to work out that what
manifests with ur mightn't apply to u, however; one's first thought is
usually that the bug is in one's own code, not in Python.
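
For readers on a current interpreter, the equivalence between the raw and
non-raw spellings can be checked directly. This is only a minimal sketch in
Python 3 syntax, where the ur prefix no longer exists and narrow builds are
gone; the variable names are illustrative and not part of the report.

```python
# Python 3 sketch; assumes an interpreter that already has the fix.
# The bytes below spell the characters \U0010FFFF followed by \test.
raw_bytes = b"\\U0010FFFF\\test"

# raw-unicode-escape interprets only \uXXXX and \UXXXXXXXX; "\t" stays as-is.
via_codec = raw_bytes.decode("raw-unicode-escape")

# The non-raw literal needs the backslash before "test" doubled.
via_literal = "\U0010ffff\\test"

print(via_codec == via_literal)
print(len(via_codec), hex(ord(via_codec[0])))
```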
msg63730 - (view) Author: Sean Reifschneider (jafo) * (Python committer) Date: 2008-03-17 19:37
Can someone comment on this, or bring it up on python-dev if it needs
more discussion?
msg63840 - (view) Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) Date: 2008-03-18 01:46
The error is not uncatchable; but it is generated while compiling, like
a SyntaxError. No bytecode is generated for the input, and the "except"
opcode is not run at all.
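
The compile-time nature of the error can be illustrated with compile().
This is a sketch in Python 3, where the closest analogue is an out-of-range
\U escape in an ordinary string literal, reported as a SyntaxError; the
helper name compile_error is hypothetical.

```python
# Python 3 sketch: an out-of-range \U escape in a literal is rejected while
# the source is compiled, so the surrounding try/except never gets to run.
source = (
    "try:\n"
    "    '\\U11111111'\n"
    "except Exception:\n"
    "    pass\n"
)

def compile_error(src):
    """Return the exception raised while compiling src, or None."""
    try:
        compile(src, "<example>", "exec")
    except SyntaxError as exc:  # raised by the compiler, not by running src
        return exc
    return None

err = compile_error(source)
print(type(err).__name__)
```

Wrapping the compile() call (or an exec/eval of the source text) is what
makes the error catchable; a handler inside the offending source cannot be.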

OTOH, there is a bug in PyUnicode_DecodeRawUnicodeEscape(): it should
accept code points > 0xffff. It has another problem:

>>> ur'\U00010000'

I attach a patch to make raw-unicode-escape similar to unicode-escape:
characters outside the Basic Plane are encoded into a utf-16 surrogate
pair; on decoding, utf-16 surrogates are decoded into \U00xxxxxx.
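
For reference, the surrogate-pair arithmetic can be sketched as follows.
This is plain Python written for illustration, not the C code from the
patch; the function names are hypothetical.

```python
# Sketch of the standard UTF-16 surrogate-pair arithmetic.
def to_surrogates(cp):
    """Split a code point above the Basic Plane into (high, low) surrogates."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("code point outside the supplementary planes")
    v = cp - 0x10000                      # 20-bit offset
    return 0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)

def from_surrogates(high, low):
    """Recombine a (high, low) surrogate pair into one code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

high, low = to_surrogates(0x10FFFF)
print(hex(high), hex(low))                # 0xdbff 0xdfff
print(hex(from_surrogates(high, low)))    # 0x10ffff
```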
msg64191 - (view) Author: Walter Dörwald (doerwalter) * (Python committer) Date: 2008-03-20 18:57
For a wide build, the code
        if (x <= 0xffff)
                *p++ = (Py_UNICODE) x;
        else {
                *p++ = (Py_UNICODE) x;

looks strange.

Furthermore with the patch applied Python no longer complains about
illegal code points:

>>> ur'\U11111111'
msg64222 - (view) Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) Date: 2008-03-20 23:04
The "strange" code is a copy of PyUnicode_DecodeUnicodeEscape. I find it
easier to read. And the duplicate lines are likely to be optimized by
the compiler.

Here is a new version of the patch which:
- correctly forbids illegal code points
- computes the byte positions; this is important for error handlers

In Python 2.5, the end position was completely bogus:
>>> try: '\U11111111'.decode("raw-unicode-escape")
... except Exception, e: print repr(e)
UnicodeDecodeError('rawunicodeescape', '\\U11111111', 0, 504955452,
'\\Uxxxxxxxx out of range')
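
A way to check the reported positions on a current interpreter is sketched
below. It uses Python 3 syntax (bytes input, "except ... as") and assumes
the fix is present; it only verifies that start and end fall inside the
input, not any particular values.

```python
# Python 3 sketch: inspect the positions carried by the decode error.
data = b"\\U11111111"   # the characters \U11111111; 0x11111111 > 0x10FFFF

err = None
try:
    data.decode("raw-unicode-escape")
except UnicodeDecodeError as exc:
    err = exc           # keep a reference for inspection after the handler

# With sane positions, the failing span lies within the input bytes.
in_range = err is not None and 0 <= err.start < err.end <= len(data)
print(in_range)
```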
msg64322 - (view) Author: Walter Dörwald (doerwalter) * (Python committer) Date: 2008-03-22 14:34
The patch looks goog to me now. Go ahead and check it in.
msg64323 - (view) Author: Walter Dörwald (doerwalter) * (Python committer) Date: 2008-03-22 14:35
s/goog/good/g ;)
msg64353 - (view) Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) Date: 2008-03-23 09:57
Committed r61793. Will backport.
msg64442 - (view) Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) Date: 2008-03-24 21:29
backported to 2.5 branch as r61854
