Issue 2799: Remove _PyUnicode_AsString(), rework _PyUnicode_AsStringAndSize(), add PyUnicode_AsChar()

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/47048

classification

Title:	Remove _PyUnicode_AsString(), rework _PyUnicode_AsStringAndSize(), add PyUnicode_AsChar()
Type:	enhancement	Stage:	resolved
Components:	Interpreter Core, Unicode	Versions:	Python 3.3

process

Status:	closed	Resolution:	fixed
Dependencies:		Superseder:
Assigned To:		Nosy List:	alexandre.vassalotti, bhy, ezio.melotti, jak, jpe, lemburg, loewis, scoder, vstinner
Priority:	normal	Keywords:

Created on 2008-05-09 10:31 by lemburg, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Messages (14)
msg66463 - (view)	Author: Marc-Andre Lemburg (lemburg) *	Date: 2008-05-09 10:31
The API PyUnicode_AsString() is pretty useless by itself: there's no way to access the size information of the returned string without going back to the Unicode object. I'd suggest removing the API altogether rather than merely deprecating it. Furthermore, the API PyUnicode_AsStringAndSize() does not follow the API signature of PyString_AsStringAndSize(), in that it passes back the pointer to the string as an output parameter. That should be changed as well. Note that PyString_AsStringAndSize() already does this for both 8-bit strings and Unicode, so the special Unicode API is not really needed at all, or you may want to rename PyString_AsStringAndSize() to PyUnicode_AsStringAndSize(). Finally, since there are many cases where the string buffer contents are copied to a new buffer, it's probably worthwhile to add a new API which does the copying straight away and also deals with the overflow cases in a central place. I'd suggest PyUnicode_AsChar() (with an API like PyUnicode_AsWideChar()). (This was taken from a comment on #1950.)
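The copying API proposed above can be sketched in plain C. This is only an illustration of handling overflow in one central place: copy_encoded and its parameters are hypothetical names, not part of CPython, and the input is assumed to be an already-encoded byte buffer rather than a Unicode object.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not a CPython API: copy `len` bytes of encoded text
   into `dst` (capacity `dst_size` bytes) and NUL-terminate the result.
   Returns the number of bytes copied, excluding the terminator, or -1 if
   the data plus its terminator would not fit. */
ptrdiff_t
copy_encoded(const char *src, size_t len, char *dst, size_t dst_size)
{
    assert(src != NULL && dst != NULL);
    if (len >= dst_size) {
        return -1;          /* overflow: nothing is written to dst */
    }
    memcpy(dst, src, len);
    dst[len] = '\0';
    return (ptrdiff_t)len;
}
```

A caller would pass a fixed-size buffer, for example copy_encoded(data, data_len, buf, sizeof buf), and treat -1 as the overflow case instead of repeating the size check at every call site.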
msg66498 - (view)	Author: Alexandre Vassalotti (alexandre.vassalotti) *	Date: 2008-05-09 22:45
Honestly, I am not sure if removing PyUnicode_AsString() is a good idea. There are many cases where the size of the returned string is not needed. Furthermore, this would be a rather major backward-incompatible change to be included in a beta release. [copied from duplicate issue #2807]
msg66526 - (view)	Author: Marc-Andre Lemburg (lemburg) *	Date: 2008-05-10 14:11
IMO, it's better to correct API design errors early, rather than going through a deprecation process. Note that PyUnicode_AsString() is also different from its cousin PyString_AsString(). PyString_AsString() is mostly used to access the char* buffer used by the string object in order to change it, e.g. by first constructing a new PyString object and then filling it in by accessing the internal char* buffer directly. Doing the same with PyUnicode_AsString() will not work. What's worse: direct changes would go undetected, since the UTF-8 PyString object is held by the PyUnicode object internally. Even if you just use PyUnicode_AsString() for reading and get the size information from somewhere else, the API doesn't make sure that the PyUnicode object doesn't have embedded 0 code points (which PyString_AsString() does). PyUnicode_AsString() would have to use PyString_AsString() for this instead of the PyString_AS_STRING() macro.
msg67251 - (view)	Author: Martin v. Löwis (loewis) *	Date: 2008-05-23 15:47
I don't agree that PyUnicode_AsString is useless. There are many cases where you don't need the length of the string, e.g. when relying on NUL termination when passing stuff to some C library. I suggest closing this report as "works for me". As for the unrelated issue of PyUnicode_AsStringAndSize: AFAICT, PyString_AsStringAndSize doesn't support Unicode objects (and IMO shouldn't, either). Making PyUnicode_AsStringAndSize and PyString_AsStringAndSize similar is probably a good idea.
msg67721 - (view)	Author: Alexandre Vassalotti (alexandre.vassalotti) *	Date: 2008-06-05 19:14
I now think the proposed changes wouldn't be a bad thing, after all. I have been bitten myself by the confusing naming of the Unicode API. So, there is definitely a potential for errors. The main problem with PyUnicode_AsString(), as Marc-André pointed out, is that it doesn't follow the API signature of the rest of the Unicode API:
    char *PyUnicode_AsString(PyObject *unicode);
    PyObject *PyUnicode_AsUTF8String(PyObject *unicode);
    PyObject *PyUnicode_AsASCIIString(PyObject *unicode);
On the other hand, I do like the simple API of PyUnicode_AsString. Also, I have to admit that the apparent similarity between the PyString and the PyUnicode API helped me to port my code to Py3K when I first started working on Python core. So, pragmatism might beat purity here.
msg67726 - (view)	Author: Marc-Andre Lemburg (lemburg) *	Date: 2008-06-05 20:45
On 2008-06-05 21:14, Alexandre Vassalotti wrote:
> Alexandre Vassalotti <alexandre@peadrop.com> added the comment:
>
> I now think the proposed changes wouldn't be a bad thing, after all. I
> have been bitten myself by the confusing naming of the Unicode API. So,
> there is definitely a potential for errors.
>
> The main problem with PyUnicode_AsString(), as Marc-André pointed out,
> is it doesn't follow the API signature of the rest of the Unicode API:
>
> char *PyUnicode_AsString(PyObject *unicode);
> PyObject *PyUnicode_AsUTF8String(PyObject *unicode);
> PyObject *PyUnicode_AsASCIIString(PyObject *unicode);
>
> On the other hand, I do like the simple API of PyUnicode_AsString. Also,
> I have to admit that the apparent similarity between the PyString and
> the PyUnicode API helped me to port my code to Py3K when I first started
> working on Python core. So, pragmatism might beat purity here.
There are a few cases in the interpreter where it is indeed useful to have direct access to the default-encoded (= UTF-8 in Py3k) char* buffer. However, the naming of the API is poorly chosen, since the other PyUnicode_AsXYZ() APIs either return a PyObject* or copy the data to an output variable. How about PyUnicode_GetUTF8Buffer() or just PyUnicode_UTF8() ?!
Note that the function must check the UTF-8 buffer for embedded NUL bytes and then raise an exception if it finds one. Otherwise, the API would silently cause truncations.
msg67727 - (view)	Author: Martin v. Löwis (loewis) *	Date: 2008-06-05 20:50
> How about PyUnicode_GetUTF8Buffer() or just PyUnicode_UTF8() ?!
-1
> Note that the function must check the UTF-8 buffer for embedded
> NUL bytes and then raise an exception if it finds one. Otherwise,
> the API would silently cause truncations.
PyString_AsString doesn't check for null bytes, either, and will also silently truncate. This has never been a problem, so I fail to see why it is a problem for Unicode strings.
msg67729 - (view)	Author: Marc-Andre Lemburg (lemburg) *	Date: 2008-06-05 21:06
On 2008-06-05 22:50, Martin v. Löwis wrote:
>> Note that the function must check the UTF-8 buffer for embedded
>> NUL bytes and then raise an exception if it finds one. Otherwise,
>> the API would silently cause truncations.
>
> PyString_AsString doesn't check for null bytes, either, and will also
> silently truncate. This has never been a problem, so I fail to see why
> it is a problem for Unicode strings.
Just because a bug hasn't surfaced yet doesn't make it a non-issue. The problem is also somewhat different for Unicode: unlike PyString_AsString(), a Unicode API PyUnicode_UTF8() would not provide easy access to the length of the returned char*. And there is no PyString_GET_SIZE() you could use to quickly verify that there are no embedded NULs. Which is why using PyUnicode_AsStringAndSize() is the overall better and safer solution.
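The embedded-NUL check discussed in this exchange only works when the byte length is known, which is the argument made here for the AndSize variant. A minimal sketch in plain C, assuming the caller already holds the buffer and its length; has_embedded_nul is a hypothetical helper, not a CPython function:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns 1 if any of the first `size` bytes of `buf`
   is a NUL byte, 0 otherwise. Without `size`, this check is impossible,
   because strlen() stops at the first NUL it sees. */
int
has_embedded_nul(const char *buf, size_t size)
{
    assert(buf != NULL);
    return memchr(buf, '\0', size) != NULL;
}
```

For the three-byte buffer "a\0c", strlen() reports a length of 1 while this check reports an embedded NUL. That mismatch is the silent truncation being argued about.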
msg67757 - (view)	Author: Stefan Behnel (scoder) *	Date: 2008-06-06 08:38
While PyUnicode_AsStringAndSize() may be a better solution if the length is required, PyUnicode_AsString() is enough when it is not required. So I don't buy that argument. Since there are dedicated UTF-8 encoding functions, both functions are pure convenience anyway. Embedded \0 bytes can bite you, but that's completely unrelated to the issue discussed here. I wouldn't oppose renaming the function, but I don't see why it should go.
msg102208 - (view)	Author: John Ehresman (jpe) *	Date: 2010-04-02 22:22
I'm trying to port an existing C extension to py3k and find myself wanting something like PyUnicode_AsString so I don't need to introduce other objects to do memory management. PyUnicode_AsString is equivalent to PyArg_Parse w/ a 's' format code, which I find hard to believe will be removed. Another bug proposes changing the name and passing in a default value, which may be a good idea.
msg102243 - (view)	Author: Marc-Andre Lemburg (lemburg) *	Date: 2010-04-03 11:47
Updating the ticket title to what we actually have in SVN (I had renamed the APIs to mark them as private to the interpreter some time ago).
msg123552 - (view)	Author: Julian Andres Klode (jak)	Date: 2010-12-07 14:18
The problem I see here is that there is no public way to simply get a C string from a unicode object, similar to PyBytes_AsString() for bytes. That's bad because we don't want to rewrite the whole code to duplicate strings all the time and free every string we get from a MyPyUnicode_AsString()-like function. I used the following, but this clearly has a memory leak:
    static const char *MyPyUnicode_AsString(PyObject *op) {
        /* The new bytes object is never released. */
        PyObject *bytes = PyUnicode_AsEncodedString(op, 0, 0);
        return bytes ? PyBytes_AS_STRING(bytes) : 0;
    }
I now use the following, which has no memory leak but needs an internal function (I would use _PyUnicode_AsString, but I need Python 2.X compatibility as well):
    static const char *MyPyUnicode_AsString(PyObject *op) {
        PyObject *bytes = _PyUnicode_AsDefaultEncodedString(op, 0);
        return bytes ? PyBytes_AS_STRING(bytes) : 0;
    }
So could something be done about this?
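The difference between the two versions above comes down to who owns the encoded buffer. As msg66526 notes, the UTF-8 form is held internally by the Unicode object, so the caller gets a pointer it never has to free. The following is a plain-C model of that ownership pattern, not CPython's implementation; text_object and both functions are hypothetical names.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for an object that caches its encoded form. */
typedef struct {
    const char *source;   /* stand-in for the object's real contents */
    char *cached;         /* created on first request, owned by the object */
} text_object;

/* Returns a borrowed pointer that stays valid until text_object_clear()
   is called on the same object. Returns NULL if allocation fails. */
const char *
text_object_as_string(text_object *obj)
{
    assert(obj != NULL && obj->source != NULL);
    if (obj->cached == NULL) {
        size_t n = strlen(obj->source) + 1;   /* include the terminator */
        obj->cached = malloc(n);
        if (obj->cached == NULL) {
            return NULL;
        }
        memcpy(obj->cached, obj->source, n);
    }
    return obj->cached;
}

/* Releases the cache; pointers handed out earlier become invalid. */
void
text_object_clear(text_object *obj)
{
    assert(obj != NULL);
    free(obj->cached);
    obj->cached = NULL;
}
```

Repeated calls return the same pointer and allocate only once, so nothing leaks. The cost is that the pointer must not outlive the object, which is the trade-off a caller of such an API has to keep in mind.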
msg144624 - (view)	Author: STINNER Victor (vstinner) *	Date: 2011-09-29 19:49
PEP 393 changed the API:
    #define _PyUnicode_AsString PyUnicode_AsUTF8
    #define _PyUnicode_AsStringAndSize PyUnicode_AsUTF8AndSize
msg204851 - (view)	Author: Alexandre Vassalotti (alexandre.vassalotti) *	Date: 2013-11-30 22:14
With PEP 393 implemented, there doesn't seem to be anything left to be done here. Closing as fixed.

History
Date	User	Action	Args
2022-04-11 14:56:34	admin	set	github: 47048
2013-11-30 22:14:51	alexandre.vassalotti	set	status: open -> closed resolution: fixed messages: + msg204851 stage: needs patch -> resolved
2011-09-29 19:49:54	vstinner	set	messages: + msg144624
2010-12-07 14:18:03	jak	set	nosy: + jak messages: + msg123552
2010-11-16 16:45:18	belopolsky	set	nosy: lemburg, loewis, jpe, scoder, vstinner, alexandre.vassalotti, ezio.melotti, bhy stage: needs patch components: + Interpreter Core versions: + Python 3.3, - Python 3.0, Python 3.1
2010-04-03 11:47:20	lemburg	set	messages: + msg102243 title: Remove PyUnicode_AsString(), rework PyUnicode_AsStringAndSize(), add PyUnicode_AsChar() -> Remove _PyUnicode_AsString(), rework _PyUnicode_AsStringAndSize(), add PyUnicode_AsChar()
2010-04-02 22:22:38	jpe	set	nosy: + jpe messages: + msg102208
2009-04-27 01:12:21	ajaksu2	set	priority: normal nosy: + vstinner, ezio.melotti type: enhancement versions: + Python 3.1
2008-06-06 08:38:09	scoder	set	nosy: + scoder messages: + msg67757
2008-06-05 21:06:58	lemburg	set	messages: + msg67729
2008-06-05 20:50:08	loewis	set	messages: + msg67727 title: Remove PyUnicode_AsString(), rework PyUnicode_AsStringAndSize(), add PyUnicode_AsChar() -> Remove PyUnicode_AsString(), rework PyUnicode_AsStringAndSize(), add PyUnicode_AsChar()
2008-06-05 20:45:34	lemburg	set	messages: + msg67726 title: Remove PyUnicode_AsString(), rework PyUnicode_AsStringAndSize(), add PyUnicode_AsChar() -> Remove PyUnicode_AsString(), rework PyUnicode_AsStringAndSize(), add PyUnicode_AsChar()
2008-06-05 19:14:39	alexandre.vassalotti	set	messages: + msg67721
2008-05-23 15:47:36	loewis	set	nosy: + loewis messages: + msg67251
2008-05-22 17:38:13	bhy	set	nosy: + bhy
2008-05-10 14:11:13	lemburg	set	messages: + msg66526
2008-05-09 22:45:14	alexandre.vassalotti	set	nosy: + alexandre.vassalotti messages: + msg66498
2008-05-09 22:43:17	alexandre.vassalotti	link	issue2807 superseder
2008-05-09 10:31:51	lemburg	create