Issue 8941: utf-32be codec failing on UCS-2 python build for 32-bit value

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/53187

classification

Title:	utf-32be codec failing on UCS-2 python build for 32-bit value
Type:	crash	Stage:	test needed
Components:	Unicode	Versions:	Python 3.1, Python 3.2, Python 2.7, Python 2.6

process

Status:	closed	Resolution:	fixed
Dependencies:		Superseder:
Assigned To:		Nosy List:	doerwalter, ezio.melotti, lemburg, opstad, pitrou, vstinner
Priority:	critical	Keywords:	patch

Created on 2010-06-08 16:15 by opstad, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Files
File name	Uploaded	Description	Edit
utf32.patch	pitrou, 2010-06-09 12:03
utf32-2.patch	pitrou, 2010-06-09 15:12

Messages (8)
msg107326 - (view)	Author: Dave Opstad (opstad)	Date: 2010-06-08 16:15
The utf-32 little-endian codec works fine, but the big-endian codec is producing incorrect results: Python 3.1.2 (r312:79360M, Mar 24 2010, 01:33:18) [GCC 4.0.1 (Apple Inc. build 5493)] on darwin Type "help", "copyright", "credits" or "license" for more information. >>> str(b'\x00\x00\x01\x00', 'utf-32le') # works '\U00010000' >>> str(b'\x00\x01\x00\x00', 'utf-32be') # doesn't work '\ud800\x02'
msg107386 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2010-06-09 11:27
Also witnessed on 2.x (UCS-2 build): >>> unicode(b'\x00\x01\x00\x00', 'utf-32be') u'\ud800\u0773' >>> unicode(b'\x00\x00\x01\x00', 'utf-32le') u'\U00010000'
msg107387 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2010-06-09 11:31
The following code at the beginning of PyUnicode_DecodeUTF32Stateful is buggy when codec endianness doesn't match the native endianness (not to mention it could also crash if the underlying CPU arch doesn't support unaligned access to 4-byte integers): #ifndef Py_UNICODE_WIDE for (i = pairs = 0; i < size/4; i++) if (((Py_UCS4 )s)[i] >= 0x10000) pairs++; #endif As a result, the preallocated unicode object isn't long enough and Python writes into memory it shouldn't write into. It can produce hard crashes, such as: >>> l = unicode(b'\x00\x01\x00\x00' 1024, 'utf-32be') Debug memory block at address p=0xf2b310: 2050 bytes originally requested The 8 pad bytes at p-8 are FORBIDDENBYTE, as expected. The 8 pad bytes at tail=0xf2bb12 are not all FORBIDDENBYTE (0xfb): at tail+0: 0x00 * OUCH at tail+1: 0xdc * OUCH at tail+2: 0x00 * OUCH at tail+3: 0xd8 * OUCH at tail+4: 0x00 * OUCH at tail+5: 0xdc * OUCH at tail+6: 0x00 * OUCH at tail+7: 0xd8 * OUCH The block was made by call #61925422603698392 to debug malloc/realloc. Data at p: 00 d8 00 dc 00 d8 00 dc ... 00 dc 00 d8 00 dc 00 d8 Fatal Python error: bad trailing pad byte Abandon
msg107389 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2010-06-09 12:03
Here is a simple patch. A test should be added, though.
msg107391 - (view)	Author: Marc-Andre Lemburg (lemburg) *	Date: 2010-06-09 12:15
Antoine Pitrou wrote: > > Antoine Pitrou <pitrou@free.fr> added the comment: > > The following code at the beginning of PyUnicode_DecodeUTF32Stateful is buggy when codec endianness doesn't match the native endianness (not to mention it could also crash if the underlying CPU arch doesn't support unaligned access to 4-byte integers): > > #ifndef Py_UNICODE_WIDE > for (i = pairs = 0; i < size/4; i++) > if (((Py_UCS4 *)s)[i] >= 0x10000) > pairs++; > #endif Good catch ! I wonder whether it wouldn't be better to preallocate a Unicode object with size of e.g. size/4 + 16 and then resize the object as necessary in case a surrogate pair needs to be created (won't happen that often in practice). The extra scan for pairs can take long depending on how much data you have to decode and likely doesn't go down well with CPU caches.
msg107401 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2010-06-09 15:12
Here is a new patch with tests. > I wonder whether it wouldn't be better to preallocate > a Unicode object with size of e.g. size/4 + 16 and > then resize the object as necessary in case a surrogate > pair needs to be created (won't happen that often in > practice). > > The extra scan for pairs can take long depending on > how much data you have to decode and likely doesn't > go down well with CPU caches. Perhaps, but I think this should measured and be the target of a separate issue. We're in rc phase and we should probably minimize potential disruption.
msg107404 - (view)	Author: Marc-Andre Lemburg (lemburg) *	Date: 2010-06-09 16:36
Antoine Pitrou wrote: > > Antoine Pitrou <pitrou@free.fr> added the comment: > > Here is a new patch with tests. > >> I wonder whether it wouldn't be better to preallocate >> a Unicode object with size of e.g. size/4 + 16 and >> then resize the object as necessary in case a surrogate >> pair needs to be created (won't happen that often in >> practice). >> >> The extra scan for pairs can take long depending on >> how much data you have to decode and likely doesn't >> go down well with CPU caches. > > Perhaps, but I think this should measured and be the target of a separate issue. We're in rc phase and we should probably minimize potential disruption. Fair enough. Here's a little optimization: - if (qq[iorder[3]] != 0 \|\| qq[iorder[2]] != 0) + if (qq[iorder[2]] != 0 \|\| qq[iorder[3]] != 0) For non-BMP code points, it's more likely that byte 2 will be non-zero.
msg107588 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2010-06-11 21:53
Fixed in r81907 (trunk), r81908 (py3k), r81909 (2.6), r81910 (3.1).

History
Date	User	Action	Args
2022-04-11 14:57:02	admin	set	github: 53187
2010-06-11 21:53:05	pitrou	set	status: open -> closed resolution: fixed messages: + msg107588
2010-06-09 16:36:55	lemburg	set	messages: + msg107404
2010-06-09 15:12:40	pitrou	set	files: + utf32-2.patch messages: + msg107401
2010-06-09 12:26:09	ezio.melotti	set	nosy: + ezio.melotti stage: test needed
2010-06-09 12:15:08	lemburg	set	messages: + msg107391 title: utf-32be codec failing on UCS-2 python build for 32-bit value -> utf-32be codec failing on UCS-2 python build for 32-bit value
2010-06-09 12:03:29	pitrou	set	files: + utf32.patch keywords: + patch messages: + msg107389
2010-06-09 11:48:34	pitrou	set	nosy: + doerwalter
2010-06-09 11:31:18	pitrou	set	priority: high -> critical type: behavior -> crash messages: + msg107387
2010-06-09 11:27:07	pitrou	set	priority: normal -> high title: utf-32be codec failing on 16-bit python build for 32-bit value -> utf-32be codec failing on UCS-2 python build for 32-bit value nosy: + pitrou, vstinner, lemburg messages: + msg107386 versions: + Python 2.6, Python 2.7, Python 3.2
2010-06-08 16:15:22	opstad	create