Issue1477
Created on 2007-11-20 21:17 by sbp, last changed 2008-03-24 21:29 by amaury.forgeotdarc.
|
msg57710 - (view) |
Author: Sean B. Palmer (sbp) |
Date: 2007-11-20 21:17 |
|
The following error is uncatchable:
>>> try: ur'\U0010FFFF'
... except UnicodeDecodeError: pass
...
UnicodeDecodeError: 'rawunicodeescape' codec can't decode byte 0x5c
in position 0: \Uxxxxxxxx out of range
This is in a narrow unicode build:
>>> sys.version_info, hex(sys.maxunicode)
((2, 5, 1, 'final', 0), '0xffff')
Of course the r in ur'...' is redundant in the test case above, but
there are cases in which it isn't...
>>> ur'\U0010FFFF\test'
u'\U0010ffff\\test'
- from a wide unicode build
>>> ur'\U0010FFFF\test'
UnicodeDecodeError: 'rawunicodeescape' codec can't decode byte 0x5c
in position 0: \Uxxxxxxxx out of range
- from the narrow unicode build
The problem occurs with .decode('raw-unicode-escape') too.
>>> '\U0010FFFF\test'.decode('raw-unicode-escape')
Traceback (most recent call last):
[&c.]
Most surprisingly of all, however, this problem doesn't occur when you
don't use a raw string:
>>> u'\U0010ffff\\test'
u'\U0010ffff\\test'
So there is at least a workaround for all cases, which is why this bug
is marked as Severity: minor. It did take a while to work out that what
manifests with ur might not apply to u, however; one's first thought is
usually that the bug is in one's own code, not in Python.
|
|
msg63730 - (view) |
Author: Sean Reifschneider (jafo) |
Date: 2008-03-17 19:37 |
|
Can someone comment on this, or bring it up on python-dev if it needs
more discussion?
|
|
msg63840 - (view) |
Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) |
Date: 2008-03-18 01:46 |
|
The error is not uncatchable; but it is generated while compiling, like
a SyntaxError. No bytecode is generated for the input, and the "except"
opcode is not run at all.
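To illustrate the compile-time point: a `try` cannot catch an error raised while its own block is being compiled, but the error can be caught if compilation is triggered at run time. The sketch below uses Python 3 syntax, where an out-of-range `\U` escape in a string literal is reported as a SyntaxError; the helper name `compile_error` is hypothetical and only serves the illustration.

```python
# Source text whose string literal holds an out-of-range \U escape.
# The try/except belongs to the same block that fails to compile,
# so it never gets a chance to run.
SOURCE = "try:\n    s = '\\U11111111'\nexcept Exception:\n    s = None\n"


def compile_error(source):
    """Return the exception raised while compiling `source`, or None."""
    try:
        compile(source, "<example>", "exec")
    except SyntaxError as exc:
        return exc
    return None


err = compile_error(SOURCE)
print(type(err).__name__)
```

The inner `except Exception` in `SOURCE` never fires; only the outer handler, which wraps the call to `compile()`, sees the error.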
OTOH, there is a bug in PyUnicode_DecodeRawUnicodeEscape(): it should
accept code points > 0xffff. It has another problem:
>>> ur'\U00010000'
u'\x00'
I am attaching a patch to make raw-unicode-escape similar to unicode-escape:
characters outside the Basic Multilingual Plane are encoded into a utf-16
surrogate pair; on decoding, utf-16 surrogates are decoded into \U00xxxxxx.
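For reference, the surrogate-pair arithmetic mentioned above can be sketched as follows. This is the standard UTF-16 mapping written in Python 3, not code taken from the patch; the function names are hypothetical.

```python
def to_surrogate_pair(code_point):
    """Split a supplementary-plane code point into (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary-plane code point")
    offset = code_point - 0x10000          # 20-bit value
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low


def from_surrogate_pair(high, low):
    """Combine a (high, low) surrogate pair back into one code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


print([hex(u) for u in to_surrogate_pair(0x10FFFF)])  # ['0xdbff', '0xdfff']
```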
|
|
msg64191 - (view) |
Author: Walter Dörwald (doerwalter) |
Date: 2008-03-20 18:57 |
|
For a wide build, the code
if (x <= 0xffff)
*p++ = (Py_UNICODE) x;
else {
*p++ = (Py_UNICODE) x;
looks strange.
Furthermore with the patch applied Python no longer complains about
illegal code points:
>>> ur'\U11111111'
u'\u1c04\udd11'
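One plausible explanation for that output, offered as a guess rather than a reading of the patch: if the surrogate arithmetic is applied with no range check and each result is truncated to a 16-bit code unit, 0x11111111 yields exactly the two values shown. A Python 3 sketch, with hypothetical function names:

```python
def unchecked_split(code_point):
    """Surrogate arithmetic with no range check, truncated to 16 bits.

    Assumption: this mimics storing the results in a 16-bit Py_UNICODE.
    """
    offset = code_point - 0x10000
    high = (0xD800 + (offset >> 10)) & 0xFFFF
    low = (0xDC00 + (offset & 0x3FF)) & 0xFFFF
    return high, low


def is_legal_code_point(code_point):
    """Unicode code points run from 0 to 0x10FFFF inclusive."""
    return 0 <= code_point <= 0x10FFFF


print([hex(u) for u in unchecked_split(0x11111111)])  # ['0x1c04', '0xdd11']
print(is_legal_code_point(0x11111111))                # False
```

Checking `is_legal_code_point` before splitting is the kind of guard the decoder needs in order to reject such input.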
|
|
msg64222 - (view) |
Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) |
Date: 2008-03-20 23:04 |
|
The "strange" code is a copy of PyUnicode_DecodeUnicodeEscape. I find it
easier to read. And the duplicate lines are likely to be optimized by
the compiler.
Here is a new version of the patch which:
- correctly forbids illegal code points
- computes the byte positions; this is important for error handlers.
In Python 2.5, the end position was completely bogus:
>>> try: '\U11111111'.decode("raw-unicode-escape")
... except Exception, e: print repr(e)
UnicodeDecodeError('rawunicodeescape', '\\U11111111', 0, 504955452,
'\\Uxxxxxxxx out of range')
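To show why the positions matter: an error handler uses the `start` and `end` attributes of the exception to decide which bytes to skip or replace, so `end` must stay within the input. A small sketch in Python 3 syntax (the transcript above is Python 2.5), using only the standard `UnicodeDecodeError` attributes:

```python
data = b"\\U11111111"  # ten bytes: backslash, 'U', eight hex digits

err = None
try:
    data.decode("raw-unicode-escape")
except UnicodeDecodeError as exc:
    err = exc  # keep a reference; `exc` is unbound after the block
    print(exc.start, exc.end, exc.reason)

# The built-in "replace" handler depends on sane start/end values
# to know where decoding should resume.
replaced = data.decode("raw-unicode-escape", errors="replace")
print(repr(replaced))
```

With a bogus `end` such as 504955452, a handler would be told to resume far past the end of a ten-byte input.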
|
|
msg64322 - (view) |
Author: Walter Dörwald (doerwalter) |
Date: 2008-03-22 14:34 |
|
The patch looks goog to me now. Go ahead and check it in.
|
|
msg64323 - (view) |
Author: Walter Dörwald (doerwalter) |
Date: 2008-03-22 14:35 |
|
s/goog/good/g ;)
|
|
msg64353 - (view) |
Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) |
Date: 2008-03-23 09:57 |
|
Committed r61793. Will backport.
|
|
msg64442 - (view) |
Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) |
Date: 2008-03-24 21:29 |
|
backported to 2.5 branch as r61854
|
|
| Date | User | Action | Args |
| 2008-03-24 21:29:24 | amaury.forgeotdarc | set | status: pending -> closed; messages: + msg64442 |
| 2008-03-23 09:57:48 | amaury.forgeotdarc | set | status: open -> pending; resolution: fixed; messages: + msg64353 |
| 2008-03-22 14:35:26 | doerwalter | set | messages: + msg64323 |
| 2008-03-22 14:34:51 | doerwalter | set | assignee: doerwalter -> amaury.forgeotdarc; messages: + msg64322 |
| 2008-03-20 23:04:36 | amaury.forgeotdarc | set | files: + raw-unicode-escape2.patch; messages: + msg64222 |
| 2008-03-20 18:57:42 | doerwalter | set | messages: + msg64191 |
| 2008-03-18 01:46:20 | amaury.forgeotdarc | set | files: + raw-unicode-escape.patch; nosy: + amaury.forgeotdarc; messages: + msg63840; keywords: + patch |
| 2008-03-17 19:37:04 | jafo | set | priority: low; assignee: doerwalter; messages: + msg63730; nosy: + jafo, doerwalter |
| 2007-11-22 06:45:23 | gagenellina | set | nosy: + gagenellina |
| 2007-11-20 21:17:39 | sbp | create | |
|