> (however, a quick test suggests that PyUnicode_DecodeUTF8 is quite slower)

It's surprising that PyUnicode_DecodeUTF8() is that much slower than _PyUnicode_FromUCS1(). _PyUnicode_FromUCS1() calls ucs1lib_find_max_char() and then memcpy(). PyUnicode_DecodeUTF8() first tries ascii_decode(), which is very similar to ucs1lib_find_max_char().

Maybe the difference is that _PyUnicode_FromUCS1() copies all bytes at once with memcpy(), whereas ascii_decode() copies bytes while it checks whether the string is ASCII.
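
To make the two strategies concrete, here is a rough sketch in plain C. It is not the CPython code: scan_then_copy() and copy_while_checking() are hypothetical names, and the real functions work on aligned machine words rather than single bytes. It only shows the difference in shape between "scan, then one memcpy()" and "copy while checking".

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not a CPython function.
   Strategy 1: scan the whole input first, then copy it in one memcpy().
   Returns 1 if every byte is ASCII, 0 otherwise; dst is filled either way. */
static int scan_then_copy(const unsigned char *src, size_t n,
                          unsigned char *dst)
{
    assert(src != NULL && dst != NULL);
    unsigned char seen = 0;
    for (size_t i = 0; i < n; i++)
        seen |= src[i];          /* the high bit ends up set if any byte >= 0x80 */
    memcpy(dst, src, n);         /* single bulk copy */
    return (seen & 0x80) == 0;
}

/* Hypothetical helper, not a CPython function.
   Strategy 2: copy each byte while checking it, and stop at the first
   non-ASCII byte. Returns the number of bytes copied. */
static size_t copy_while_checking(const unsigned char *src, size_t n,
                                  unsigned char *dst)
{
    assert(src != NULL && dst != NULL);
    size_t i = 0;
    while (i < n && src[i] < 0x80) {
        dst[i] = src[i];         /* check and store interleaved in one loop */
        i++;
    }
    return i;
}
```

Whether the interleaved loop is really what makes the difference would have to be measured; this only illustrates the hypothesis.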
