Title: PyUnicode_AsUTF8AndSize Sometimes Segfaults With Incomplete Surrogate Pair
Created on 2019-12-21 03:32 by william.ayd, last changed 2019-12-21 16:48 by serhiy.storchaka. This issue is now closed.

testmodule.c william.ayd, 2019-12-21 03:32 Extension Module For Use in Identifying Segfault
msg358755 - Author: (william.ayd) * Date: 2019-12-21 03:32
With the attached extension module, if I run the following in the REPL:

>>> import libtest
>>> libtest.error_if_not_utf8("foo")
>>> libtest.error_if_not_utf8("\ud83d")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode character '\ud83d' in position 0: surrogates not allowed
>>> libtest.error_if_not_utf8("foo")

Things seem OK so far, but the next invocation of

>>> libtest.error_if_not_utf8("\ud83d")

then causes a segfault. Note that the order of the inputs seems important; simply repeating the call with the invalid surrogate doesn't cause the segfault.
msg358757 - Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2019-12-21 05:40
Your function returns a borrowed reference. It would cause a crash even without calling PyUnicode_AsUTF8AndSize. Add Py_INCREF(str).
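
One way to check that PyUnicode_AsUTF8AndSize itself is not at fault is to replay the same call order against the C API from Python, with no extension module in between. The following is a minimal sketch, assuming CPython and using ctypes.pythonapi; the names utf8_size and replay are hypothetical stand-ins and are not taken from the attached testmodule.c. Because ctypes handles the reference counts here, nothing is returned as a borrowed reference.

```python
import ctypes

# Bind the C API function straight from the running interpreter.
# C signature:
#   const char *PyUnicode_AsUTF8AndSize(PyObject *unicode, Py_ssize_t *size)
_as_utf8 = ctypes.pythonapi.PyUnicode_AsUTF8AndSize
_as_utf8.argtypes = [ctypes.py_object, ctypes.POINTER(ctypes.c_ssize_t)]
_as_utf8.restype = ctypes.c_char_p


def utf8_size(text):
    """Hypothetical stand-in for libtest.error_if_not_utf8.

    Returns the UTF-8 length of `text`, or raises UnicodeEncodeError.
    """
    size = ctypes.c_ssize_t()
    # ctypes.pythonapi re-raises any exception that the C function sets,
    # so a NULL return surfaces here as a normal Python exception.
    _as_utf8(text, ctypes.byref(size))
    return size.value


def replay():
    """Replay the call order from the report and record each outcome."""
    outcomes = []
    for text in ["foo", "\ud83d", "foo", "\ud83d"]:
        try:
            outcomes.append(utf8_size(text))
        except UnicodeEncodeError:
            outcomes.append("UnicodeEncodeError")
    return outcomes


print(replay())
```

On CPython this is expected to report a length of 3 for "foo" and a UnicodeEncodeError for the lone surrogate on every pass, including the fourth call that crashed in the report. That is consistent with the crash coming from the extension function's reference handling rather than from the encoding call.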
msg358762 - Author: (william.ayd) * Date: 2019-12-21 07:15
Hmm, my mistake. Thanks!
