This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

classification
Title: utf-32be codec failing on UCS-2 python build for 32-bit value
Type: crash Stage: test needed
Components: Unicode Versions: Python 3.1, Python 3.2, Python 2.7, Python 2.6
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: doerwalter, ezio.melotti, lemburg, opstad, pitrou, vstinner
Priority: critical Keywords: patch

Created on 2010-06-08 16:15 by opstad, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Files
File name Uploaded Description Edit
utf32.patch pitrou, 2010-06-09 12:03
utf32-2.patch pitrou, 2010-06-09 15:12
Messages (8)
msg107326 - (view) Author: Dave Opstad (opstad) Date: 2010-06-08 16:15
The utf-32 little-endian codec works fine, but the big-endian codec is producing incorrect results:

Python 3.1.2 (r312:79360M, Mar 24 2010, 01:33:18) 
[GCC 4.0.1 (Apple Inc. build 5493)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> str(b'\x00\x00\x01\x00', 'utf-32le') # works
'\U00010000'
>>> str(b'\x00\x01\x00\x00', 'utf-32be') # doesn't work
'\ud800\x02'
msg107386 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2010-06-09 11:27
Also witnessed on 2.x (UCS-2 build):

>>> unicode(b'\x00\x01\x00\x00', 'utf-32be')
u'\ud800\u0773'
>>> unicode(b'\x00\x00\x01\x00', 'utf-32le')
u'\U00010000'
msg107387 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2010-06-09 11:31
The following code at the beginning of PyUnicode_DecodeUTF32Stateful is buggy when codec endianness doesn't match the native endianness (not to mention it could also crash if the underlying CPU arch doesn't support unaligned access to 4-byte integers):

#ifndef Py_UNICODE_WIDE
    for (i = pairs = 0; i < size/4; i++)
        if (((Py_UCS4 *)s)[i] >= 0x10000)
            pairs++;
#endif

As a result, the preallocated unicode object isn't long enough and Python writes into memory it shouldn't write into. It can produce hard crashes, such as:

>>> l = unicode(b'\x00\x01\x00\x00' * 1024, 'utf-32be')
Debug memory block at address p=0xf2b310:
    2050 bytes originally requested
    The 8 pad bytes at p-8 are FORBIDDENBYTE, as expected.
    The 8 pad bytes at tail=0xf2bb12 are not all FORBIDDENBYTE (0xfb):
        at tail+0: 0x00 *** OUCH
        at tail+1: 0xdc *** OUCH
        at tail+2: 0x00 *** OUCH
        at tail+3: 0xd8 *** OUCH
        at tail+4: 0x00 *** OUCH
        at tail+5: 0xdc *** OUCH
        at tail+6: 0x00 *** OUCH
        at tail+7: 0xd8 *** OUCH
    The block was made by call #61925422603698392 to debug malloc/realloc.
    Data at p: 00 d8 00 dc 00 d8 00 dc ... 00 dc 00 d8 00 dc 00 d8
Fatal Python error: bad trailing pad byte
Abandon
msg107389 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2010-06-09 12:03
Here is a simple patch. A test should be added, though.
msg107391 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2010-06-09 12:15
Antoine Pitrou wrote:
> 
> Antoine Pitrou <pitrou@free.fr> added the comment:
> 
> The following code at the beginning of PyUnicode_DecodeUTF32Stateful is buggy when codec endianness doesn't match the native endianness (not to mention it could also crash if the underlying CPU arch doesn't support unaligned access to 4-byte integers):
> 
> #ifndef Py_UNICODE_WIDE
>     for (i = pairs = 0; i < size/4; i++)
>         if (((Py_UCS4 *)s)[i] >= 0x10000)
>             pairs++;
> #endif

Good catch !

I wonder whether it wouldn't be better to preallocate
a Unicode object with size of e.g. size/4 + 16 and
then resize the object as necessary in case a surrogate
pair needs to be created (won't happen that often in
practice).

The extra scan for pairs can take long depending on
how much data you have to decode and likely doesn't
go down well with CPU caches.
msg107401 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2010-06-09 15:12
Here is a new patch with tests.

> I wonder whether it wouldn't be better to preallocate
> a Unicode object with size of e.g. size/4 + 16 and
> then resize the object as necessary in case a surrogate
> pair needs to be created (won't happen that often in
> practice).
> 
> The extra scan for pairs can take long depending on
> how much data you have to decode and likely doesn't
> go down well with CPU caches.

Perhaps, but I think this should measured and be the target of a separate issue. We're in rc phase and we should probably minimize potential disruption.
msg107404 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2010-06-09 16:36
Antoine Pitrou wrote:
> 
> Antoine Pitrou <pitrou@free.fr> added the comment:
> 
> Here is a new patch with tests.
> 
>> I wonder whether it wouldn't be better to preallocate
>> a Unicode object with size of e.g. size/4 + 16 and
>> then resize the object as necessary in case a surrogate
>> pair needs to be created (won't happen that often in
>> practice).
>>
>> The extra scan for pairs can take long depending on
>> how much data you have to decode and likely doesn't
>> go down well with CPU caches.
> 
> Perhaps, but I think this should measured and be the target of a separate issue. We're in rc phase and we should probably minimize potential disruption.

Fair enough.

Here's a little optimization:

-        if (qq[iorder[3]] != 0 || qq[iorder[2]] != 0)
+        if (qq[iorder[2]] != 0 || qq[iorder[3]] != 0)

For non-BMP code points, it's more likely that byte 2
will be non-zero.
msg107588 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2010-06-11 21:53
Fixed in r81907 (trunk), r81908 (py3k), r81909 (2.6), r81910 (3.1).
History
Date User Action Args
2022-04-11 14:57:02adminsetgithub: 53187
2010-06-11 21:53:05pitrousetstatus: open -> closed
resolution: fixed
messages: + msg107588
2010-06-09 16:36:55lemburgsetmessages: + msg107404
2010-06-09 15:12:40pitrousetfiles: + utf32-2.patch

messages: + msg107401
2010-06-09 12:26:09ezio.melottisetnosy: + ezio.melotti

stage: test needed
2010-06-09 12:15:08lemburgsetmessages: + msg107391
title: utf-32be codec failing on UCS-2 python build for 32-bit value -> utf-32be codec failing on UCS-2 python build for 32-bit value
2010-06-09 12:03:29pitrousetfiles: + utf32.patch
keywords: + patch
messages: + msg107389
2010-06-09 11:48:34pitrousetnosy: + doerwalter
2010-06-09 11:31:18pitrousetpriority: high -> critical
type: behavior -> crash
messages: + msg107387
2010-06-09 11:27:07pitrousetpriority: normal -> high
title: utf-32be codec failing on 16-bit python build for 32-bit value -> utf-32be codec failing on UCS-2 python build for 32-bit value
nosy: + pitrou, vstinner, lemburg

messages: + msg107386

versions: + Python 2.6, Python 2.7, Python 3.2
2010-06-08 16:15:22opstadcreate