This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python Developer Guide.

classification
Title: Remove _PyUnicode_AsString(), rework _PyUnicode_AsStringAndSize(), add PyUnicode_AsChar()
Type: enhancement
Stage: resolved
Components: Interpreter Core, Unicode
Versions: Python 3.3
process
Status: closed
Resolution: fixed
Dependencies:
Superseder:
Assigned To:
Nosy List: alexandre.vassalotti, bhy, ezio.melotti, jak, jpe, lemburg, loewis, scoder, vstinner
Priority: normal
Keywords:

Created on 2008-05-09 10:31 by lemburg, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Messages (14)
msg66463 - Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2008-05-09 10:31
The API PyUnicode_AsString() is pretty useless by itself - there's
no way to access the size information of the returned string without
again going to the Unicode object.

I'd suggest removing the API altogether rather than merely deprecating it.

Furthermore, the API PyUnicode_AsStringAndSize() does not follow the
signature of PyString_AsStringAndSize(), which passes back the pointer
to the string as an output parameter. That should be changed as
well. Note that PyString_AsStringAndSize() already does this for both
8-bit strings and Unicode, so the special Unicode API is not really
needed at all or you may want to rename PyString_AsStringAndSize() to
PyUnicode_AsStringAndSize().

Finally, since there are many cases where the string buffer contents are
copied to a new buffer, it's probably worthwhile to add a new API which
does the copying straight away and also deals with the overflow cases in
a central place. I'd suggest PyUnicode_AsChar() (with an API like
PyUnicode_AsWideChar()).
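
As a rough illustration of the "copy straight away and handle overflow in one place" idea, here is a minimal sketch in plain C. It is not a CPython API: the name copy_encoded() and its signature are hypothetical, and it works on an already-encoded (buffer, length) pair rather than on a PyObject so that it compiles without Python.h. Returning -1 on overflow (instead of truncating) is an assumption made for this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not part of CPython): copy src_len bytes of an
 * encoded string into dst and NUL-terminate the result.
 *
 * Returns the number of bytes copied (excluding the terminator), or -1
 * if the data plus terminator does not fit into dst_size bytes. On
 * overflow dst is set to the empty string when there is room for it. */
static long copy_encoded(const char *src, size_t src_len,
                         char *dst, size_t dst_size)
{
    assert(src != NULL && dst != NULL);

    /* Overflow check done once, here, instead of at every call site. */
    if (dst_size == 0 || src_len >= dst_size) {
        if (dst_size > 0)
            dst[0] = '\0';
        return -1;
    }

    memcpy(dst, src, src_len);
    dst[src_len] = '\0';
    return (long)src_len;
}
```

A caller would pass a fixed-size buffer and check for a negative return value before using the copy. A real API would presumably take the Unicode object itself and set a Python exception on failure.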

(this was taken from a comment on #1950)
msg66498 - Author: Alexandre Vassalotti (alexandre.vassalotti) * (Python committer) Date: 2008-05-09 22:45
Honestly, I am not sure if removing PyUnicode_AsString() is a good idea.
There are many cases where the size of the returned string is not needed.
Furthermore, this would be a rather major backward-incompatible change
to be included in a beta release.

[copied from duplicate issue #2807]
msg66526 - Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2008-05-10 14:11
IMO, it's better to correct API design errors early, rather than going
through a deprecation process.

Note that PyUnicode_AsString() is also different from its cousin
PyString_AsString().

PyString_AsString() is mostly used to access the char* buffer used by
the string object in order to change it, e.g. by first constructing a
new PyString object and then filling it in by accessing the internal
char* buffer directly.

Doing the same with PyUnicode_AsString() will not work. What's worse:
direct changes would go undetected, since the UTF8 PyString object is
held by the PyUnicode object internally.

Even if you just use PyUnicode_AsString() for reading and get the size
information from somewhere else, the API doesn't make sure that the
PyUnicode object doesn't have embedded 0 code points (which
PyString_AsString() does). PyUnicode_AsString() would have to use
PyString_AsString() for this instead of the PyString_AS_STRING() macro.
msg67251 - Author: Martin v. Löwis (loewis) * (Python committer) Date: 2008-05-23 15:47
I don't agree that PyUnicode_AsString is useless. There are many cases
where you don't need the length of the string, e.g. when relying on NULL
termination when passing stuff to some C library.

I suggest closing this report as "works for me".

As for the unrelated issue of PyUnicode_AsStringAndSize: AFAICT,
PyString_AsStringAndSize doesn't support Unicode objects (and IMO
shouldn't, either). 

Making PyUnicode_AsStringAndSize and PyString_AsStringAndSize similar is
probably a good idea.
msg67721 - Author: Alexandre Vassalotti (alexandre.vassalotti) * (Python committer) Date: 2008-06-05 19:14
I now think the proposed changes wouldn't be a bad thing, after all. I
have been bitten myself by the confusing naming of the Unicode API. So,
there is definitely a potential for errors. 

The main problem with PyUnicode_AsString(), as Marc-André pointed out,
is that it doesn't follow the API signature of the rest of the Unicode API:

char *PyUnicode_AsString(PyObject *unicode);
PyObject *PyUnicode_AsUTF8String(PyObject *unicode);
PyObject *PyUnicode_AsASCIIString(PyObject *unicode);

On the other hand, I do like the simple API of PyUnicode_AsString. Also,
I have to admit that the apparent similarity between the PyString and
the PyUnicode API helped me to port my code to Py3K when I first started
working on Python core. So, pragmatism might beat purity here.
msg67726 - Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2008-06-05 20:45
On 2008-06-05 21:14, Alexandre Vassalotti wrote:
> Alexandre Vassalotti <alexandre@peadrop.com> added the comment:
> 
> I now think the proposed changes wouldn't be a bad thing, after all. I
> have been bitten myself by the confusing naming of the Unicode API. So,
> there is definitely a potential for errors. 
> 
> The main problem with PyUnicode_AsString(), as Marc-André pointed out,
> is that it doesn't follow the API signature of the rest of the Unicode API:
> 
> char *PyUnicode_AsString(PyObject *unicode);
> PyObject *PyUnicode_AsUTF8String(PyObject *unicode);
> PyObject *PyUnicode_AsASCIIString(PyObject *unicode);
> 
> On the other hand, I do like the simple API of PyUnicode_AsString. Also,
> I have to admit that the apparent similarity between the PyString and
> the PyUnicode API helped me to port my code to Py3K when I first started
> working on Python core. So, pragmatism might beat purity here.

There are a few cases in the interpreter where it is indeed useful
to have direct access to the default-encoded (= UTF-8 in Py3k)
char* buffer.

However, the naming of the API is poorly chosen, since the other
PyUnicode_AsXYZ() APIs either return a PyObject* or copy the
data to an output variable.

How about PyUnicode_GetUTF8Buffer() or just PyUnicode_UTF8() ?!

Note that the function *must* check the UTF-8 buffer for embedded
NUL bytes and then raise an exception if it finds one. Otherwise,
the API would silently cause truncations.
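
The truncation concern can be illustrated in plain C. In UTF-8, the code point U+0000 is encoded as a single 0x00 byte, so any code that relies on NUL termination stops at that position even though the buffer is longer. The sketch below is only an illustration: has_embedded_nul() is a hypothetical helper operating on a (buffer, length) pair, not a CPython function, and a real API would additionally raise a Python exception instead of returning a flag.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not part of CPython): return 1 if the first len
 * bytes of buf contain a NUL byte, 0 otherwise. The length must come
 * from the string object, since strlen() cannot see past the first NUL. */
static int has_embedded_nul(const char *buf, size_t len)
{
    assert(buf != NULL);
    return memchr(buf, '\0', len) != NULL;
}
```

For a three-byte buffer holding 'a', NUL, 'b', strlen() reports a length of 1, while the check above flags the buffer because it is given the real length of 3.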
msg67727 - Author: Martin v. Löwis (loewis) * (Python committer) Date: 2008-06-05 20:50
> How about PyUnicode_GetUTF8Buffer() or just PyUnicode_UTF8() ?!

-1

> Note that the function *must* check the UTF-8 buffer for embedded
> NUL bytes and then raise an exception if it finds one. Otherwise,
> the API would silently cause truncations.

PyString_AsString doesn't check for null bytes, either, and will also
silently truncate. This has never been a problem, so I fail to see why
it is a problem for Unicode strings.
msg67729 - Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2008-06-05 21:06
On 2008-06-05 22:50, Martin v. Löwis wrote:
>> Note that the function *must* check the UTF-8 buffer for embedded
>> NUL bytes and then raise an exception if it finds one. Otherwise,
>> the API would silently cause truncations.
> 
> PyString_AsString doesn't check for null bytes, either, and will also
> silently truncate. This has never been a problem, so I fail to see why
> it is a problem for Unicode strings.

Just because a bug hasn't surfaced yet, doesn't make it a non-issue.

The problem is also somewhat different for Unicode:

Unlike PyString_AsString(), a Unicode API PyUnicode_UTF8() would not
provide easy access to the length of the returned char*.

And there is no PyString_GET_SIZE() you could use to quickly verify that
there are no embedded NULs.

Which is why using PyUnicode_AsStringAndSize() is the overall better
and safer solution.
msg67757 - Author: Stefan Behnel (scoder) * (Python committer) Date: 2008-06-06 08:38
While PyUnicode_AsStringAndSize() may be a better solution if the length
is required, PyUnicode_AsString() is enough when it is not required. So
I don't buy that argument. Since there are dedicated UTF-8 encoding
functions, both functions are pure convenience anyway.

Embedded \0 bytes can bite you, but that's completely unrelated to the
issue discussed here.

I wouldn't oppose renaming the function, but I don't see why it should go.
msg102208 - Author: John Ehresman (jpe) * Date: 2010-04-02 22:22
I'm trying to port an existing C extension to py3k and find myself wanting something like PyUnicode_AsString so I don't need to introduce other objects to do memory management.  PyUnicode_AsString is equivalent to PyArg_Parse with an 's' format code, which I find hard to believe will be removed.  Another bug proposes changing the name and passing in a default value, which may be a good idea.
msg102243 - Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2010-04-03 11:47
Updating the ticket title to what we actually have in SVN (I had renamed the APIs to mark them as private to the interpreter some time ago).
msg123552 - Author: Julian Andres Klode (jak) Date: 2010-12-07 14:18
The problem I see here is that there is no public way to simply get a C string from a unicode object similar to PyBytes_AsString() for bytes. That's bad because we don't want to rewrite the whole code to duplicate strings all the time and free every string we get from a MyPyUnicode_AsString()-like function.

I used the following, but this clearly has a memory leak:


  static const char *MyPyUnicode_AsString(PyObject *op) {
      PyObject *bytes = PyUnicode_AsEncodedString(op,0,0);
      return bytes ? PyBytes_AS_STRING(bytes) : 0;
  }

I now use the following which has no memory leak, but needs an internal function (I would use _PyUnicode_AsString, but I need Python 2.X compatibility as well):

  static const char *MyPyUnicode_AsString(PyObject *op) {
      PyObject *bytes = _PyUnicode_AsDefaultEncodedString(op, 0);
      return bytes ? PyBytes_AS_STRING(bytes) : 0;
  }

So could something be done about this?
msg144624 - Author: STINNER Victor (vstinner) * (Python committer) Date: 2011-09-29 19:49
PEP 393 changed the API:

#define _PyUnicode_AsString PyUnicode_AsUTF8
#define _PyUnicode_AsStringAndSize PyUnicode_AsUTF8AndSize
msg204851 - Author: Alexandre Vassalotti (alexandre.vassalotti) * (Python committer) Date: 2013-11-30 22:14
With PEP 393 implemented, there doesn't seem to be anything left to be done here. Closing as fixed.
History
Date User Action Args
2022-04-11 14:56:34 admin set github: 47048
2013-11-30 22:14:51 alexandre.vassalotti set status: open -> closed
    resolution: fixed
    messages: + msg204851
    stage: needs patch -> resolved
2011-09-29 19:49:54 vstinner set messages: + msg144624
2010-12-07 14:18:03 jak set nosy: + jak
    messages: + msg123552
2010-11-16 16:45:18 belopolsky set nosy: lemburg, loewis, jpe, scoder, vstinner, alexandre.vassalotti, ezio.melotti, bhy
    stage: needs patch
    components: + Interpreter Core
    versions: + Python 3.3, - Python 3.0, Python 3.1
2010-04-03 11:47:20 lemburg set messages: + msg102243
    title: Remove PyUnicode_AsString(), rework PyUnicode_AsStringAndSize(), add PyUnicode_AsChar() -> Remove _PyUnicode_AsString(), rework _PyUnicode_AsStringAndSize(), add PyUnicode_AsChar()
2010-04-02 22:22:38 jpe set nosy: + jpe
    messages: + msg102208
2009-04-27 01:12:21 ajaksu2 set priority: normal
    nosy: + vstinner, ezio.melotti
    type: enhancement
    versions: + Python 3.1
2008-06-06 08:38:09 scoder set nosy: + scoder
    messages: + msg67757
2008-06-05 21:06:58 lemburg set messages: + msg67729
2008-06-05 20:50:08 loewis set messages: + msg67727
    title: Remove PyUnicode_AsString(), rework PyUnicode_AsStringAndSize(), add PyUnicode_AsChar() -> Remove PyUnicode_AsString(), rework PyUnicode_AsStringAndSize(), add PyUnicode_AsChar()
2008-06-05 20:45:34 lemburg set messages: + msg67726
    title: Remove PyUnicode_AsString(), rework PyUnicode_AsStringAndSize(), add PyUnicode_AsChar() -> Remove PyUnicode_AsString(), rework PyUnicode_AsStringAndSize(), add PyUnicode_AsChar()
2008-06-05 19:14:39 alexandre.vassalotti set messages: + msg67721
2008-05-23 15:47:36 loewis set nosy: + loewis
    messages: + msg67251
2008-05-22 17:38:13 bhy set nosy: + bhy
2008-05-10 14:11:13 lemburg set messages: + msg66526
2008-05-09 22:45:14 alexandre.vassalotti set nosy: + alexandre.vassalotti
    messages: + msg66498
2008-05-09 22:43:17 alexandre.vassalotti link issue2807 superseder
2008-05-09 10:31:51 lemburg create