classification
Title: PyUnicode_AsUTF8AndSize Sometimes Segfaults With Incomplete Surrogate Pair
Type:
Stage: resolved
Components:
Versions:
process
Status: closed
Resolution: not a bug
Dependencies:
Superseder:
Assigned To:
Nosy List: serhiy.storchaka, william.ayd
Priority: normal
Keywords:

Created on 2019-12-21 03:32 by william.ayd, last changed 2019-12-21 16:48 by serhiy.storchaka. This issue is now closed.

Files
File name Uploaded Description
testmodule.c william.ayd, 2019-12-21 03:32 Extension Module For Use in Identifying Segfault
Messages (3)
msg358755 - (view) Author: (william.ayd) * Date: 2019-12-21 03:32
With the attached extension module, if I run the following in the REPL:

>>> import libtest
>>>
>>> libtest.error_if_not_utf8("foo")
'foo'
>>> libtest.error_if_not_utf8("\ud83d")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode character '\ud83d' in position 0: surrogates not allowed
>>> libtest.error_if_not_utf8("foo")
'foo'

Things seem OK. But the next invocation of

>>> libtest.error_if_not_utf8("\ud83d")

then causes a segfault. Note that the order of the inputs seems important; simply repeating the call with the invalid surrogate doesn't cause the segfault.
msg358757 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2019-12-21 05:40
Your function returns a borrowed reference. It could cause a crash even without calling PyUnicode_AsUTF8AndSize. Add Py_INCREF(str).
msg358762 - (view) Author: (william.ayd) * Date: 2019-12-21 07:15
Hmm my mistake - thanks!
History
Date User Action Args
2019-12-21 16:48:44 serhiy.storchaka set status: open -> closed; stage: resolved
2019-12-21 07:15:50 william.ayd set messages: + msg358762
2019-12-21 05:40:56 serhiy.storchaka set resolution: not a bug; messages: + msg358757; nosy: + serhiy.storchaka
2019-12-21 03:32:54 william.ayd create