
Author vstinner
Recipients belopolsky, ezio.melotti, georg.brandl, lemburg, mgiuca, pitrou, vstinner
Date 2010-12-30.00:45:27
Message-id <1293669928.94.0.203002875235.issue8821@psf.upfronthosting.co.za>
Content
> Unicode objects are NUL-terminated, but only very external APIs
> rely on this (e.g. code using the Windows Unicode API).

All Py_UNICODE_str*() functions rely on the NUL character. They are useful when porting a function from bytes (char*) to unicode (PyUnicodeObject): the two APIs are very close. It would be possible to avoid them by adding new functions that take the string length instead.

All functions passing a Py_UNICODE* as a wchar_t* to the Windows wide character API (the *W functions) also rely on the NUL character, and Python core uses a lot of these functions. Not writing a NUL character would require creating a temporary string ending with a NUL character before each such call. That is not efficient, especially for long strings.

And there is the problem of all third party modules (written in C) relying on the NUL character.

I think that we have good reasons not to remove the NUL character. So I think that we can continue to accept that the unicode[length] character can be read, e.g. implementing text.startswith("ab") as "p=PyUnicode_AS_UNICODE(text); if (p[0] == 'a' && p[1] == 'b')" without checking the length of text.

Whether the NUL character or the length is used as the loop termination condition doesn't really matter. I just see one advantage for the NUL character: it is faster in some cases.