classification
Title: Calling .lower() on certain unicode string raises SystemError
Type: behavior Stage: resolved
Components: Interpreter Core, Unicode Versions: Python 3.4, Python 3.3
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: serhiy.storchaka Nosy List: amaury.forgeotdarc, benjamin.peterson, davechallis, ezio.melotti, haypo, python-dev, serhiy.storchaka
Priority: normal Keywords: patch

Created on 2013-06-10 14:20 by davechallis, last changed 2013-06-12 06:30 by serhiy.storchaka. This issue is now closed.

Files
File name Uploaded Description Edit
test_issue18183.patch serhiy.storchaka, 2013-06-10 20:25 review
Messages (11)
msg190907 - (view) Author: Dave Challis (davechallis) Date: 2013-06-10 14:20
This occurred when attempting to decode invalid UTF-8 bytes using "errors='replace'", then attempting to lowercase the produced unicode string.

This was also tested in python 2.7, but it doesn't occur there.

Code to reproduce:

x = b'\xe2\xb3\x99\xb3\xd1\x9f\xe0vjGd|\x12\xf2\x84\xac\xae&$\xa4\xae+\xa4sbtf$&fG\xfb\xe6?.\xe2sbv\x14\xcb\x89\x98\xda\xd9\x99\xda\xb9d9\x1bY\x99\xb7\xb3\x1b9\xa2y*B\xa3\xba\xefj&g\xe2\x92Et\x85~\xbf\x8a\xe3\x919\x8bvc\xfb#$$.\xber6D&b.#4\xa4.\x13RtI\x10\xed\x9c\xd0\x98\xb8\x18\x91\x99\\\nC\x13\x8dV\xccL\xf4\x89\x9c\x90'

x = x.decode('utf-8', errors='replace')

x.lower()


Output:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
SystemError: invalid maximum character passed to PyUnicode_New
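
For reference, the reported byte string contains two well-formed four-byte UTF-8 sequences, `\xf2\x84\xac\xae` and `\xf4\x89\x9c\x90`, which decode to code points outside the Basic Multilingual Plane. The sketch below only decodes those two sequences. The variable names are illustrative, and the final comparison assumes an interpreter that already includes the fix.

```python
# Two four-byte sequences copied from the reported byte string.
seq_a = b'\xf2\x84\xac\xae'
seq_b = b'\xf4\x89\x9c\x90'

ch_a = seq_a.decode('utf-8')
ch_b = seq_b.decode('utf-8')

# Both code points lie above U+FFFF, i.e. outside the BMP.
print(hex(ord(ch_a)), hex(ord(ch_b)))

# On an interpreter with the fix, lower() returns the string unchanged,
# because neither character has a lowercase mapping.
text = ch_a + ch_b
print(text.lower() == text)
```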
msg190916 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-06-10 15:49
Minimal example:

>>> '\U00010000\U00100000'.lower()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
SystemError: invalid maximum character passed to PyUnicode_New
msg190917 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-06-10 15:53
This happens because the fast MAX_MAXCHAR() macro is used, and it can produce a maxchar that is out of range (0x10000 | 0x100000 > MAX_UNICODE).
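
The arithmetic in the parentheses can be checked in plain Python. This is only an illustration of the numbers, not the C macro itself; `MAX_UNICODE` is written out here as 0x10FFFF, the highest valid code point.

```python
MAX_UNICODE = 0x10FFFF  # highest valid Unicode code point

a, b = 0x10000, 0x100000

# Bitwise OR, as in the expression quoted in the message.
combined = a | b
print(hex(combined))           # 0x110000
print(combined > MAX_UNICODE)  # True

# A true maximum stays within range.
print(hex(max(a, b)))          # 0x100000
```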
msg190919 - (view) Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) Date: 2013-06-10 15:57
>>> a = chr(0x84b2e)+chr(0x109710)
>>> a.lower()
SystemError: invalid maximum character passed to PyUnicode_New

The MAX_MAXCHAR() macro only works for 'maxchar' values such as 0xff or 0xffff, but in do_upper_or_lower() it is used with arbitrary UCS4 values.
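
A small Python sketch of why the distinction matters, assuming the shortcut is a bitwise OR as the earlier message suggests. The helper `or_max` is a hypothetical stand-in, not CPython code. For bounds whose low bits are all ones, OR and `max` agree; for arbitrary code points they can differ, and the OR result can exceed the valid range.

```python
MAX_UNICODE = 0x10FFFF

def or_max(a, b):
    """Hypothetical stand-in for an OR-based 'maximum' shortcut."""
    return a | b

# For the usual maxchar bounds, OR gives the same answer as max().
bounds = [0x7F, 0xFF, 0xFFFF, 0x10FFFF]
agree_on_bounds = all(or_max(x, y) == max(x, y)
                      for x in bounds for y in bounds)
print(agree_on_bounds)

# For arbitrary code points (the pair from the message above) it does not.
a, b = 0x84B2E, 0x109710
print(hex(or_max(a, b)), hex(max(a, b)))
print(or_max(a, b) > MAX_UNICODE)
```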
msg190923 - (view) Author: Roundup Robot (python-dev) Date: 2013-06-10 16:24
New changeset 89b106d298a9 by Benjamin Peterson in branch '3.3':
remove MAX_MAXCHAR because it's unsafe for computing maximum codepoitn value (see #18183)
http://hg.python.org/cpython/rev/89b106d298a9

New changeset 668aba845fb2 by Benjamin Peterson in branch 'default':
merge 3.3 (#18183)
http://hg.python.org/cpython/rev/668aba845fb2
msg190924 - (view) Author: Benjamin Peterson (benjamin.peterson) * (Python committer) Date: 2013-06-10 16:26
I simply removed the MAX_MAXCHAR micro-optimization, since it seems fairly unsafe. Interested parties can restore it safely if they wish.
msg190925 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2013-06-10 16:51
Oops, my MAX_MAXCHAR macro was too optimized :-) (the result is incorrect)

It shows us that the test suite does not have enough tests for non-BMP characters.
msg190930 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-06-10 20:25
Here are additional tests for this issue.
msg190932 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2013-06-10 20:43
+        '\U00010000\U00100000'.lower()

Why not check the result of these calls?
msg190936 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-06-10 21:03
The result is trivial. Wouldn't checking the result distract attention from the main issue?
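
For illustration, a test that also checks the result could look like the sketch below. It is not the content of test_issue18183.patch; the class and method names are made up. It relies on the fact that none of these code points has a case mapping, so `lower()` and `upper()` should return the input unchanged.

```python
import unittest

class NonBMPCaseSketch(unittest.TestCase):
    """Hypothetical test case; names are illustrative only."""

    def test_lower_and_upper_non_bmp(self):
        samples = [
            '\U00010000\U00100000',
            chr(0x84b2e) + chr(0x109710),
        ]
        for s in samples:
            # No case mapping exists for these code points,
            # so both conversions return the same string.
            self.assertEqual(s.lower(), s)
            self.assertEqual(s.upper(), s)

# Run the sketch without exiting the interpreter.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(NonBMPCaseSketch)
result = unittest.TextTestRunner().run(suite)
```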
msg191013 - (view) Author: Roundup Robot (python-dev) Date: 2013-06-12 06:29
New changeset b11507395ce4 by Serhiy Storchaka in branch '3.3':
Add tests for issue #18183.
http://hg.python.org/cpython/rev/b11507395ce4

New changeset 17c9f1627baf by Serhiy Storchaka in branch 'default':
Add tests for issue #18183.
http://hg.python.org/cpython/rev/17c9f1627baf
History
Date User Action Args
2013-06-12 06:30:40 serhiy.storchaka set status: open -> closed
    stage: patch review -> resolved
2013-06-12 06:29:04 python-dev set messages: + msg191013
2013-06-10 21:03:52 serhiy.storchaka set messages: + msg190936
2013-06-10 20:43:15 haypo set messages: + msg190932
2013-06-10 20:25:41 serhiy.storchaka set status: closed -> open
    files: + test_issue18183.patch
    messages: + msg190930
    keywords: + patch
    stage: needs patch -> patch review
2013-06-10 16:51:08 haypo set messages: + msg190925
2013-06-10 16:26:55 benjamin.peterson set status: open -> closed
    nosy: + benjamin.peterson
    messages: + msg190924
    resolution: fixed
2013-06-10 16:24:12 python-dev set nosy: + python-dev
    messages: + msg190923
2013-06-10 15:57:23 amaury.forgeotdarc set nosy: + amaury.forgeotdarc, haypo
    messages: + msg190919
2013-06-10 15:53:14 serhiy.storchaka set messages: + msg190917
2013-06-10 15:50:38 serhiy.storchaka set assignee: serhiy.storchaka
2013-06-10 15:49:51 serhiy.storchaka set messages: + msg190916
2013-06-10 14:55:13 serhiy.storchaka set nosy: + serhiy.storchaka
    stage: needs patch
    components: + Interpreter Core
    versions: + Python 3.4
2013-06-10 14:20:36 davechallis create