classification
Title: UnicodeDecodeError that cannot be caught in narrow unicode builds
Type: behavior Stage:
Components: Unicode Versions: Python 2.5
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: amaury.forgeotdarc Nosy List: amaury.forgeotdarc, doerwalter, ggenellina, jafo, sbp
Priority: low Keywords: patch

Created on 2007-11-20 21:17 by sbp, last changed 2008-03-24 21:29 by amaury.forgeotdarc. This issue is now closed.

Files
File name Uploaded Description
raw-unicode-escape.patch amaury.forgeotdarc, 2008-03-18 01:46
raw-unicode-escape2.patch amaury.forgeotdarc, 2008-03-20 23:04 2nd version
Messages (9)
msg57710 - (view) Author: Sean B. Palmer (sbp) Date: 2007-11-20 21:17
The following error is uncatchable:

>>> try: ur'\U0010FFFF'
... except UnicodeDecodeError: pass
... 
UnicodeDecodeError: 'rawunicodeescape' codec can't decode byte 0x5c 
in position 0: \Uxxxxxxxx out of range

This is in a narrow unicode build:

>>> sys.version_info, hex(sys.maxunicode)
((2, 5, 1, 'final', 0), '0xffff')

Of course the r in ur'...' is redundant in the test case above, but
there are cases in which it isn't...

>>> ur'\U0010FFFF\test'
u'\U0010ffff\\test'
- from a wide unicode build

>>> ur'\U0010FFFF\test'
UnicodeDecodeError: 'rawunicodeescape' codec can't decode byte 0x5c 
in position 0: \Uxxxxxxxx out of range
- from the narrow unicode build

The problem occurs with .decode('raw-unicode-escape') too.

>>> '\U0010FFFF\test'.decode('raw-unicode-escape')
Traceback (most recent call last):
[&c.]
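
For comparison, here is a minimal sketch of the same decode on a Python 3 interpreter, where the narrow/wide build distinction no longer exists. Python 3 is an assumption about the reader's environment; the report itself concerns Python 2.5. The bytes literal uses the `rb` prefix so that `\t` stays a backslash followed by `t` instead of becoming a tab.

```python
# Python 3 sketch (assumed environment; the original report is about Python 2.5).
# raw-unicode-escape only interprets \uXXXX and \UXXXXXXXX; any other
# backslash sequence is passed through unchanged.
raw = rb'\U0010FFFF\test'
decoded = raw.decode('raw-unicode-escape')

# One non-BMP character, then a literal backslash and the text "test".
print(ascii(decoded))
print(len(decoded))
```

On such an interpreter the result should match what the wide build printed above.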

Most surprisingly of all, however, this problem doesn't occur when you
don't use a raw string:

>>> u'\U0010ffff\\test'
u'\U0010ffff\\test'

So there is at least a workaround for all cases, which is why this bug
is marked as Severity: minor. It did take a while to work out that what
manifests with ur might not apply to u, however; one's first instinct is
usually to assume the bug is in one's own code, not in Python.
msg63730 - (view) Author: Sean Reifschneider (jafo) * (Python committer) Date: 2008-03-17 19:37
Can someone comment on this, or bring it up on python-dev if it needs
more discussion?
msg63840 - (view) Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) Date: 2008-03-18 01:46
The error is not uncatchable, but it is raised while compiling, like
a SyntaxError. No bytecode is generated for the input, so the "except"
opcode is not run at all.
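
The compile-time behaviour can be sketched as follows. This uses Python 3 (an assumption; the thread is about Python 2.5), where the equivalent failure for a plain string literal surfaces as a SyntaxError. The point illustrated is the same: the handler has to live outside the source being compiled.

```python
# Python 3 sketch: an invalid escape in a string literal is rejected when the
# source is compiled, so a try/except *inside* that source never gets to run.
source = (
    "try:\n"
    "    s = '\\U11111111'\n"   # code point above 0x10FFFF: illegal
    "except Exception:\n"
    "    s = None\n"
)

try:
    compile(source, "<demo>", "exec")
    caught = None
except SyntaxError as exc:
    # This handler works because it is outside the unit being compiled.
    caught = exc

print(type(caught).__name__)
```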

OTOH, there is a bug in PyUnicode_DecodeRawUnicodeEscape(): it should
accept code points > 0xffff. It has another problem:

>>> ur'\U00010000'
u'\x00'

I am attaching a patch to make raw-unicode-escape similar to unicode-escape:
characters outside the Basic Plane are encoded into a utf-16 surrogate
pair; on decoding, utf-16 surrogates are decoded into \U00xxxxxx.
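
The surrogate-pair arithmetic referred to here is the standard UTF-16 scheme. A minimal Python sketch follows; the function names are hypothetical and are not taken from the patch.

```python
def to_surrogate_pair(code_point):
    """Split a code point above the BMP into UTF-16 (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary-plane code point: %#x" % code_point)
    offset = code_point - 0x10000            # 20 significant bits
    high = 0xD800 + (offset >> 10)           # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)          # bottom 10 bits
    return high, low


def from_surrogate_pair(high, low):
    """Recombine a (high, low) surrogate pair into a single code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


print([hex(unit) for unit in to_surrogate_pair(0x10FFFF)])
```

The range check in `to_surrogate_pair` matters: without it, values above 0x10FFFF would silently produce code units outside the surrogate ranges.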
msg64191 - (view) Author: Walter Dörwald (doerwalter) * (Python committer) Date: 2008-03-20 18:57
For a wide build, the code
        if (x <= 0xffff)
                *p++ = (Py_UNICODE) x;
        else {
                *p++ = (Py_UNICODE) x;

looks strange.

Furthermore with the patch applied Python no longer complains about
illegal code points:

>>> ur'\U11111111'
u'\u1c04\udd11'
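
One plausible reading of that output, assuming (this is not confirmed in the thread) that the first patch applied the surrogate arithmetic without a range check and stored each result in a 16-bit Py_UNICODE, is sketched below. The helper name `unchecked_surrogates` is hypothetical.

```python
def unchecked_surrogates(code_point):
    """Surrogate arithmetic with no range check, truncated to 16 bits.

    Assumption: this mimics storing each unit in a 16-bit Py_UNICODE on a
    narrow build; it is an illustration, not the patch's actual code.
    """
    offset = code_point - 0x10000
    high = (0xD800 + (offset >> 10)) & 0xFFFF
    low = (0xDC00 + (offset & 0x3FF)) & 0xFFFF
    return high, low


pair = unchecked_surrogates(0x11111111)
print(["%04x" % unit for unit in pair])  # ['1c04', 'dd11']
```

Under that assumption the two code units match the `u'\u1c04\udd11'` shown above, which is why an explicit check against 0x10FFFF is needed before splitting.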
msg64222 - (view) Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) Date: 2008-03-20 23:04
The "strange" code is a copy of PyUnicode_DecodeUnicodeEscape. I find it
easier to read. And the duplicate lines are likely to be optimized by
the compiler.

Here is a new version of the patch which:
- correctly forbids illegal code points
- computes the byte positions; this is important for error handlers

In Python 2.5, the end position was completely bogus:
>>> try: '\U11111111'.decode("raw-unicode-escape")
... except Exception, e: print repr(e)
UnicodeDecodeError('rawunicodeescape', '\\U11111111', 0, 504955452,
'\\Uxxxxxxxx out of range')
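
To see why the positions matter, the exception's `start` and `end` attributes can be inspected directly; error handlers rely on them to decide which bytes to replace and where to resume. The sketch below assumes a Python 3 interpreter rather than the 2.5 build discussed in this thread.

```python
# Python 3 sketch (assumed environment): inspect the position information
# carried by the UnicodeDecodeError for an out-of-range \U escape.
data = b'\\U11111111'   # backslash, 'U', then eight hex digits

try:
    data.decode('raw-unicode-escape')
    error = None
except UnicodeDecodeError as exc:
    error = exc

if error is not None:
    # start/end should delimit the offending bytes within `data`.
    print(error.start, error.end, error.reason)
```

A sane result has `start` and `end` inside the input, unlike the 504955452 reported above.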
msg64322 - (view) Author: Walter Dörwald (doerwalter) * (Python committer) Date: 2008-03-22 14:34
The patch looks goog to me now. Go ahead and check it in.
msg64323 - (view) Author: Walter Dörwald (doerwalter) * (Python committer) Date: 2008-03-22 14:35
s/goog/good/g ;)
msg64353 - (view) Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) Date: 2008-03-23 09:57
Committed r61793. Will backport.
msg64442 - (view) Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) Date: 2008-03-24 21:29
Backported to the 2.5 branch as r61854.
History
Date User Action Args
2008-03-24 21:29:24 amaury.forgeotdarc set status: pending -> closed
messages: + msg64442
2008-03-23 09:57:48 amaury.forgeotdarc set status: open -> pending
resolution: fixed
messages: + msg64353
2008-03-22 14:35:26 doerwalter set messages: + msg64323
2008-03-22 14:34:51 doerwalter set assignee: doerwalter -> amaury.forgeotdarc
messages: + msg64322
2008-03-20 23:04:36 amaury.forgeotdarc set files: + raw-unicode-escape2.patch
messages: + msg64222
2008-03-20 18:57:42 doerwalter set messages: + msg64191
2008-03-18 01:46:20 amaury.forgeotdarc set files: + raw-unicode-escape.patch
nosy: + amaury.forgeotdarc
messages: + msg63840
keywords: + patch
2008-03-17 19:37:04 jafo set priority: low
assignee: doerwalter
messages: + msg63730
nosy: + jafo, doerwalter
2007-11-22 06:45:23 ggenellina set nosy: + ggenellina
2007-11-20 21:17:39 sbp create