Issue 17223: Initializing array.array with unicode type code and buffer segfaults

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/61425

classification

Title:	Initializing array.array with unicode type code and buffer segfaults
Type:	crash	Stage:	patch review
Components:	Library (Lib), Unicode	Versions:	Python 3.3, Python 3.4

process

Status:	closed	Resolution:	fixed
Dependencies:		Superseder:
Assigned To:		Nosy List:	ezio.melotti, jcea, mjacob, python-dev, sbt, vstinner
Priority:	normal	Keywords:	patch

Created on 2013-02-17 23:45 by mjacob, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Files
File name	Uploaded	Description	Edit
issue17223.diff	mjacob, 2013-02-18 00:55		review
issue17223_with_test.diff	mjacob, 2013-02-18 01:33		review

Messages (26)
msg182291 - (view)	Author: Manuel Jacob (mjacob) *	Date: 2013-02-17 23:45
>>> from array import array >>> str(array('u', b'asdf')) [1] 19411 segmentation fault (core dumped) python This error occures with Python 3.3 and hg tip but not with Python 3.2.
msg182292 - (view)	Author: Manuel Jacob (mjacob) *	Date: 2013-02-18 00:55
The attached patch fixes the crash. Output: >>> from array import array >>> str(array('u', b'asdf')) Traceback (most recent call last): File "<stdin>", line 1, in <module> ValueError: character U+66647361 is not in range [U+0000; U+10ffff]
msg182293 - (view)	Author: Ezio Melotti (ezio.melotti) *	Date: 2013-02-18 01:04
Thanks for the report and the patch. Could you also include a test for this?
msg182294 - (view)	Author: Manuel Jacob (mjacob) *	Date: 2013-02-18 01:33
I've attached a new patch with a test that segfaults on Python 3.3 and passes on hg tip with the patch applied.
msg182299 - (view)	Author: STINNER Victor (vstinner) *	Date: 2013-02-18 06:15
issue17223_with_test.diff looks good to me (we may just drop {...} around return NULL).
msg182709 - (view)	Author: Ezio Melotti (ezio.melotti) *	Date: 2013-02-23 03:22
Shouldn't this still work on 3.3/3.4? In Modules/arraymodule.c:1562, in the array_tounicode function, there is: return PyUnicode_FromUnicode((Py_UNICODE *) self->ob_item, Py_SIZE(self)); I think this should be updated to work with the PEP 393 implementation, rather than raising an error.
msg182810 - (view)	Author: Manuel Jacob (mjacob) *	Date: 2013-02-23 20:11
http://docs.python.org/3/library/array.html states that the 'u' type code is deprecated together with the rest of the Py_UNICODE API (which includes PyUnicode_FromUnicode), so keeping this using PyUnicode_FromUnicode should be fine.
msg182912 - (view)	Author: Ezio Melotti (ezio.melotti) *	Date: 2013-02-25 03:19
Even if deprecated it should continue to work (if possible).
msg182967 - (view)	Author: STINNER Victor (vstinner) *	Date: 2013-02-25 18:23
If the problem is that PyUnicode_FromUnicode() rejects character outside range [U+0000; U+10ffff], it would be better to use the byte string '\xff' * sizeof_PY_UNICODE. U+66647361 may become valid in a future version of Unicode, I don't thing that U+FFFFFFFF would become valid. sizeof_PY_UNICODE is ctypes.sizeof(ctypes.c_wchar) since Python 3.3. '\xff' * 4 works on any platform.
msg182996 - (view)	Author: Ezio Melotti (ezio.melotti) *	Date: 2013-02-25 22:32
> If the problem is that PyUnicode_FromUnicode() rejects character > outside range [U+0000; U+10ffff], But this used to return two valid characters: >>> str(array('u', b'asdf')) "array('u', '獡晤')" so I think it still should -- unless the operation was already nonsensical and/or there's no way to do the same thing on 3.3+ due to the change introduced by PEP 393. > it would be better to use the byte string '\xff' * sizeof_PY_UNICODE. What for? > U+66647361 may become valid in a future version of Unicode, It won't.
msg182998 - (view)	Author: Ezio Melotti (ezio.melotti) *	Date: 2013-02-25 22:47
We discussed this on IRC, and apparently the seemingly valid result I got on 3.2 was because I had a narrow build. On a wide 3.2 build I get: >>> str(array('u', b'asdf')) "array('u', '\\U66647361')" Since 3.3+ behaves like a wide build and since \U66647361 is not valid, I now agree that raising an error is the right thing to do. If possible, even 3.2 should raise an error, rather than returning an invalid codepoint.
msg182999 - (view)	Author: Roundup Robot (python-dev)	Date: 2013-02-25 23:20
New changeset c354afedb866 by Victor Stinner in branch '3.3': Issue #17223: Fix PyUnicode_FromUnicode() for string of 1 character outside http://hg.python.org/cpython/rev/c354afedb866 New changeset a4295ab52427 by Victor Stinner in branch 'default': (Merge 3.3) Issue #17223: Fix PyUnicode_FromUnicode() for string of 1 character http://hg.python.org/cpython/rev/a4295ab52427
msg183000 - (view)	Author: Roundup Robot (python-dev)	Date: 2013-02-25 23:28
New changeset ebeed44702ec by Victor Stinner in branch '3.3': Issue #17223: array module: Fix a crasher when converting an array containing http://hg.python.org/cpython/rev/ebeed44702ec New changeset 381de621ff6a by Victor Stinner in branch 'default': (Merge 3.3) Issue #17223: array module: Fix a crasher when converting an array http://hg.python.org/cpython/rev/381de621ff6a
msg183001 - (view)	Author: STINNER Victor (vstinner) *	Date: 2013-02-25 23:31
> I think this should be updated to work with the PEP 393 implementation, rather than raising an error. It was discussed to add new formats for UCS1, UCS2 and UCS4 formats to the array module, but nobody implemented the idea. The "u" format is kept unchanged (use Py_UNICODE / wchar_t) for backward compatibility with Python 3.2. -- I found another bug while trying Manuel's patch :-/ It's now fixed. @Manuel: Thanks for your patch!
msg183053 - (view)	Author: Richard Oudkerk (sbt) *	Date: 2013-02-26 13:23
The new test seems to be reliably failing on Windows: ====================================================================== FAIL: test_issue17223 (__main__.UnicodeTest) ---------------------------------------------------------------------- Traceback (most recent call last): File "C:\Repos\cpython-dirty\lib\test\test_array.py", line 1075, in test_issue17223 self.assertRaises(ValueError, a.tounicode) AssertionError: ValueError not raised by tounicode ----------------------------------------------------------------------
msg183061 - (view)	Author: STINNER Victor (vstinner) *	Date: 2013-02-26 14:34
On Windows, test_array can use an invalid surrogate pair to test this issue: b'\xff\xdf\x61\x00' for example. I don't know how to easily check the size of wchar_t. ctypes.sizeof(ctypes.c_wchar) can be used, but ctypes is not always available. sys.unicode is now always 0x10ffff since Python 3.3. PyUnicode_GetMax() is not accessible in Python.
msg183094 - (view)	Author: Roundup Robot (python-dev)	Date: 2013-02-26 21:52
New changeset 66e9d0185b0f by Victor Stinner in branch '3.3': Issue #17223: Fix test_array on Windows (16-bit wchar_t/Py_UNICODE) http://hg.python.org/cpython/rev/66e9d0185b0f New changeset 5aaf6bc1d502 by Victor Stinner in branch 'default': (Merge 3.3) Issue #17223: Fix test_array on Windows (16-bit wchar_t/Py_UNICODE) http://hg.python.org/cpython/rev/5aaf6bc1d502
msg183095 - (view)	Author: STINNER Victor (vstinner) *	Date: 2013-02-26 21:54
"It was discussed to add new formats for UCS1, UCS2 and UCS4 formats to the array module, but nobody implemented the idea. The "u" format is kept unchanged (use Py_UNICODE / wchar_t) for backward compatibility with Python 3.2." See also issue #13072 for this discussion.
msg183546 - (view)	Author: Ezio Melotti (ezio.melotti) *	Date: 2013-03-05 18:54
Tests are still failing on Windows: http://buildbot.python.org/all/builders/AMD64%20Windows7%20SP1%203.x/builds/1558/steps/test/logs/stdio
msg183561 - (view)	Author: STINNER Victor (vstinner) *	Date: 2013-03-05 23:42
> ezio.melotti: Tests are still failing on Windows Oh, I read the PyUnicode_FromUnicode() twice and there is a bug :-( With 16-bit wchar_t type (on Windows), find_maxchar_surrogates() doesn't fail if the wchar_* string contains in invalid surrogate pair.
msg183562 - (view)	Author: Roundup Robot (python-dev)	Date: 2013-03-05 23:42
New changeset 15190138d3f3 by Victor Stinner in branch 'default': Issue #17223: Add another test to check that _PyUnicode_Ready() rejects http://hg.python.org/cpython/rev/15190138d3f3 New changeset b9f7b1bf36aa by Victor Stinner in branch 'default': Issue #17223: Fix PyUnicode_FromUnicode() on Windows (16-bit wchar_t type) http://hg.python.org/cpython/rev/b9f7b1bf36aa
msg183563 - (view)	Author: STINNER Victor (vstinner) *	Date: 2013-03-05 23:44
Changeset b9f7b1bf36aa should fix the test on Windows. My Windows VM is dead, I cannot test myself. If the fix works, it must be backported in Python 3.3.
msg183565 - (view)	Author: STINNER Victor (vstinner) *	Date: 2013-03-06 00:21
> Changeset b9f7b1bf36aa should fix the test on Windows. Oh, many tests are failing because of this change, so I reverted it. ====================================================================== ERROR: test_surrogatepass_handler (test.test_codecs.CP65001Test) ---------------------------------------------------------------------- Traceback (most recent call last): File "C:\buildbot.python.org\3.x.kloth-win64\build\lib\test\test_codecs.py", line 799, in test_surrogatepass_handler "abc\ud800def") File "C:\buildbot.python.org\3.x.kloth-win64\build\lib\unittest\case.py", line 642, in assertEqual assertion_func(first, second, msg=msg) File "C:\buildbot.python.org\3.x.kloth-win64\build\lib\unittest\case.py", line 1007, in assertMultiLineEqual if first != second: ValueError: illegal UTF-16 surrogate ====================================================================== ERROR: test_unicode (test.test_array.ArrayReconstructorTest) ---------------------------------------------------------------------- Traceback (most recent call last): File "C:\buildbot.python.org\3.x.kloth-win64\build\lib\test\test_array.py", line 183, in test_unicode msg="{0!r} != {1!r}; testcase={2!r}".format(a, b, testcase)) File "C:\buildbot.python.org\3.x.kloth-win64\build\lib\unittest\case.py", line 642, in assertEqual assertion_func(first, second, msg=msg) File "C:\buildbot.python.org\3.x.kloth-win64\build\lib\unittest\case.py", line 632, in _baseAssertEqual if not first == second: ValueError: illegal UTF-16 surrogate ====================================================================== ERROR: test_byteswap (test.test_array.UnicodeTest) ---------------------------------------------------------------------- Traceback (most recent call last): File "C:\buildbot.python.org\3.x.kloth-win64\build\lib\test\test_array.py", line 239, in test_byteswap self.assertNotEqual(a, b) File "C:\buildbot.python.org\3.x.kloth-win64\build\lib\unittest\case.py", line 648, in assertNotEqual if not first != second: ValueError: illegal UTF-16 surrogate ====================================================================== FAIL: test_issue17223 (test.test_array.UnicodeTest) ---------------------------------------------------------------------- Traceback (most recent call last): File "C:\buildbot.python.org\3.x.kloth-win64\build\lib\test\test_array.py", line 1082, in test_issue17223 self.assertRaises(ValueError, a.tounicode) AssertionError: ValueError not raised by tounicode
msg183717 - (view)	Author: STINNER Victor (vstinner) *	Date: 2013-03-08 01:27
It looks like PyUnicode_FromUnicode() should accept invalid UTF-16 surrogates because the array module indirectly relies on that: On Windows (16-bit wchar_t/Py_UNICODE), len(array.array('u', '\U0010ffff')) is 2 and array.array('u', '\U0010ffff')[0] is '\udbff' (lone surrogate).
msg183718 - (view)	Author: Roundup Robot (python-dev)	Date: 2013-03-08 01:33
New changeset 1fd165883a65 by Victor Stinner in branch '3.3': Issue #17223: the test is specific to 32-bit wchar_t type http://hg.python.org/cpython/rev/1fd165883a65 New changeset 42970cbfc982 by Victor Stinner in branch 'default': (Merge 3.3) Issue #17223: the test is specific to 32-bit wchar_t type http://hg.python.org/cpython/rev/42970cbfc982
msg183719 - (view)	Author: STINNER Victor (vstinner) *	Date: 2013-03-08 01:34
The test should now pass on Windows.

History
Date	User	Action	Args
2022-04-11 14:57:41	admin	set	github: 61425
2013-03-08 01:34:32	vstinner	set	status: open -> closed messages: + msg183719
2013-03-08 01:33:56	python-dev	set	messages: + msg183718
2013-03-08 01:27:36	vstinner	set	messages: + msg183717
2013-03-06 00:21:36	vstinner	set	messages: + msg183565
2013-03-05 23:44:13	vstinner	set	messages: + msg183563
2013-03-05 23:42:20	python-dev	set	messages: + msg183562
2013-03-05 23:42:06	vstinner	set	messages: + msg183561
2013-03-05 18:54:25	ezio.melotti	set	messages: + msg183546
2013-02-26 21:54:13	vstinner	set	messages: + msg183095
2013-02-26 21:52:42	python-dev	set	messages: + msg183094
2013-02-26 14:34:12	vstinner	set	messages: + msg183061
2013-02-26 13:23:03	sbt	set	status: closed -> open nosy: + sbt messages: + msg183053
2013-02-25 23:31:17	vstinner	set	status: open -> closed resolution: fixed messages: + msg183001
2013-02-25 23:28:52	python-dev	set	messages: + msg183000
2013-02-25 23:20:34	python-dev	set	nosy: + python-dev messages: + msg182999
2013-02-25 22:47:41	ezio.melotti	set	messages: + msg182998
2013-02-25 22:32:35	ezio.melotti	set	messages: + msg182996
2013-02-25 18:23:55	vstinner	set	messages: + msg182967
2013-02-25 03:19:48	ezio.melotti	set	messages: + msg182912
2013-02-23 20:11:25	mjacob	set	messages: + msg182810
2013-02-23 03:22:12	ezio.melotti	set	messages: + msg182709 versions: - Python 2.7, Python 3.2
2013-02-18 08:32:56	jcea	set	nosy: + jcea
2013-02-18 06:15:45	vstinner	set	messages: + msg182299
2013-02-18 06:13:58	vstinner	set	versions: + Python 2.7, Python 3.2
2013-02-18 01:33:16	mjacob	set	files: + issue17223_with_test.diff messages: + msg182294
2013-02-18 01:04:56	ezio.melotti	set	messages: + msg182293
2013-02-18 01:02:43	ezio.melotti	set	nosy: + vstinner stage: patch review
2013-02-18 00:55:22	mjacob	set	files: + issue17223.diff keywords: + patch messages: + msg182292
2013-02-17 23:45:05	mjacob	create