classification
Title: Make PyUnicode_AsWideCharString() increase temporary
Type: enhancement Stage: resolved
Components: Extension Modules, Interpreter Core, Unicode Versions: Python 3.8
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: Arfrever, dabeaz, ezio.melotti, loewis, methane, terry.reedy, vstinner
Priority: normal Keywords:

Created on 2012-10-16 20:14 by dabeaz, last changed 2019-04-13 15:34 by methane. This issue is now closed.

Messages (8)
msg173089 - (view) Author: David Beazley (dabeaz) Date: 2012-10-16 20:14
The PyUnicode_AsWideCharString() function is described as creating a new buffer of type wchar_t allocated by PyMem_Alloc() (which must be freed by the user).   However, if you use this function, it causes the size of the original string object to permanently increase.  For example, suppose you had some extension code like this:

static PyObject *py_receive_wchar(PyObject *self, PyObject *args) {
  PyObject *obj;
  wchar_t *s;
  Py_ssize_t len;

  if (!PyArg_ParseTuple(args, "U", &obj)) {
    return NULL;
  }
  if ((s = PyUnicode_AsWideCharString(obj, &len)) == NULL) {
    return NULL;
  }
  /* Do nothing */
  PyMem_Free(s);
  Py_RETURN_NONE;
}

Now, try an experiment (assume that the above extension function is available as 'receive_wchar'). 

>>> s = "Hell"*1000
>>> len(s)
4000
>>> import sys
>>> sys.getsizeof(s)
4049
>>> receive_wchar(s)
>>> sys.getsizeof(s)
20053
>>>

It seems that PyUnicode_AsWideCharString() may be filling in the wstr field of the associated PyASCIIObject structure from PEP393 (I haven't verified).  Once filled, it never seems to be discarded.

Background:  I am trying to figure out how to convert from Unicode to (wchar_t, int *) that doesn't cause a permanent increase in the memory footprint of the original Unicode object.  Also, I'm trying to stay away from deprecated Unicode APIs.
msg173090 - (view) Author: David Beazley (dabeaz) Date: 2012-10-16 20:17
I should quickly add, is there any way to simply have this function not keep the wchar_t buffer around afterwards?   That would be great.
msg173101 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2012-10-16 21:35
As stated, this is not a bug: there is no memory leak, nor any deviation from documented behavior.

You are right that it fills the wstr pointer, by calling PyUnicode_AsUnicodeAndSize in unicode_aswidechar, and then copying the data to a fresh buffer.

This is merely the simplest implementation; it's certainly possible to improve it. Contributions are welcome.

A number of things need to be considered:
- Computing the wstr size is somewhat expensive if on a 16-bit wchar_t system, since the result may need surrogate pairs.
- I would suggest that if possible, the wstr representation should be returned out of the unicode object (resetting wstr to NULL). This should produce the greatest reuse in code, yet avoid unnecessary copying.
- It's not possible to do so for strings where wstr is shared with the canonical representation (i.e. a UCS-2 string on 16-bit wchar_t, and a UCS-4 string on 32-bit wchar_t).
- I don't think wstr should be cleared if it was already filled when the function got called. Instead, wstr should only be returned if it was originally NULL.
msg173104 - (view) Author: David Beazley (dabeaz) Date: 2012-10-16 22:07
Maybe it's not a bug, but I still think it's undesirable.   Basically, you have a function that allocates a buffer, fills it with data, and allows the buffer to be destroyed.   Yet, as a side effect, it allocates a second buffer, fills it, and permanently attaches it to the original string object.  Thus it makes the size of the string object blow up to a size substantially larger than it was before with no way to reclaim memory other than to delete the whole string.

Maybe this is some sort of rare event that doesn't matter, but maybe there's some bit of C extension code that is trying to pass a wchar_t array off to some external library.  The extension writer is using the PyUnicode_AsWideCharString() function with the understanding that it creates a new array and that you have to destroy it.   They understand that it's not super fast to have to make a copy, but it's better than nothing.  What's unfortunate is that all of this attention to memory management doesn't reward the programmer as a copy gets left behind on the string object anyways.

For instance, I start with a 10 Megabyte string, I pass it through a C extension function, and now the string is mysteriously using 50 Megabytes of memory.

I think the idea of filling wstr, returning it and clearing it (if originally NULL) would definitely work here.   Actually, that's exactly what I want--don't fill in the wstr member if it's not set already.  That way, it's possible for C extensions to temporarily get the wstr buffer, do something, and then toss it away without affecting the original string.

Another suggestion: An API function to simply clear wstr and the UTF-8 representation could also work.   Again, this is for extension writers who want to pull data out of strings, but don't want to leave these memory side effects behind.
msg173106 - (view) Author: David Beazley (dabeaz) Date: 2012-10-16 22:14
Another note: the PyUnicode_AsUTF8String() doesn't leave the UTF-8 encoded byte string behind on the original string object.  I got into this thinking that PyUnicode_AsWideCharString() might have similar behavior.
msg228812 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2014-10-08 19:31
I tried to re-title to describe the enhancement proposal. There are multiple API proposals in the two messages above.
msg228842 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2014-10-09 09:08
I proposed to change PyUnicode_AsWideCharString() to not cache the result: see my issue #22323 "Rewrite PyUnicode_AsWideChar() and PyUnicode_AsWideCharString(): don't cache the result".

See also my issues #22271 "Deprecate PyUnicode_AsUnicode(): emit a DeprecationWarning" and #22324 "Use PyUnicode_AsWideCharString() instead of PyUnicode_AsUnicode()".
msg340165 - (view) Author: Inada Naoki (methane) * (Python committer) Date: 2019-04-13 15:34
fixed by #30863
History
Date User Action Args
2019-04-13 15:34:39methanesetstatus: open -> closed
versions: + Python 3.8, - Python 3.5
messages: + msg340165

resolution: fixed
stage: needs patch -> resolved
2019-04-13 13:10:37methanesetnosy: + methane
2014-10-09 09:08:38vstinnersetmessages: + msg228842
2014-10-08 19:31:36terry.reedysetversions: + Python 3.5, - Python 3.3
type: enhancement

nosy: + terry.reedy
title: PyUnicode_AsWideCharString() increases string size -> Make PyUnicode_AsWideCharString() increase temporary
messages: + msg228812
stage: needs patch
2012-10-16 22:14:40dabeazsetmessages: + msg173106
2012-10-16 22:07:23dabeazsetmessages: + msg173104
2012-10-16 21:40:39Arfreversetnosy: + Arfrever
2012-10-16 21:35:52pitrousetnosy: + vstinner
2012-10-16 21:35:23loewissetnosy: + loewis
messages: + msg173101
2012-10-16 20:17:01dabeazsetmessages: + msg173090
2012-10-16 20:14:08dabeazcreate