array.array of UCS2 values #59240

ronaldoussoren · 2012-06-08T09:22:50Z

BPO	15035
Nosy	@loewis, @ronaldoussoren, @ncoghlan, @tiran, @methane, @skrah

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

GitHub fields:

assignee = None
closed_at = None
created_at = <Date 2012-06-08.09:22:50.343>
labels = ['extension-modules', 'type-bug']
title = 'array.array of UCS2 values'
updated_at = <Date 2019-04-13.11:46:41.362>
user = 'https://github.com/ronaldoussoren'

bugs.python.org fields:

activity = <Date 2019-04-13.11:46:41.362>
actor = 'methane'
assignee = 'none'
closed = False
closed_date = None
closer = None
components = ['Extension Modules']
creation = <Date 2012-06-08.09:22:50.343>
creator = 'ronaldoussoren'
dependencies = []
files = []
hgrepos = []
issue_num = 15035
keywords = []
message_count = 7.0
messages = ['162520', '162521', '162522', '168374', '168376', '168378', '168379']
nosy_count = 7.0
nosy_names = ['loewis', 'ronaldoussoren', 'ncoghlan', 'christian.heimes', 'Arfrever', 'methane', 'skrah']
pr_nums = []
priority = 'high'
resolution = None
stage = None
status = 'open'
superseder = None
type = 'behavior'
url = 'https://bugs.python.org/issue15035'
versions = ['Python 3.4']

ronaldoussoren · 2012-06-08T09:22:49Z

I sometimes use an array.array with format character "u" as a writable backing store for buffers shared with platform APIs that access buffers of UCS2 values. This works fine in Python 3.2 and earlier with a UCS2 (narrow) build of Python, but no longer works with Python 3.3 because the "u" character explicitly selects a UCS4 representation in that version.

One place I use this is with PyObjC on Mac OS X, for example:

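import array
# Assumes the PyObjC CoreFoundation bindings are installed
from CoreFoundation import CFStringCreateMutableWithExternalCharactersNoCopy, kCFAllocatorNull

b = array.array('u', "hello world")
s = CFStringCreateMutableWithExternalCharactersNoCopy(
        None, b, len(b), len(b), kCFAllocatorNull)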

"s" now refers to a mutable Objective-C string that uses "b" as its backing store.

It would be nice if there were a format code that would allow me to do this with Python 3.3, for example b = array.array("U", ...)

(BTW. I'm sorry if this is a duplicate, searching for "array.array" on the tracker results in a lot of hits, most of which have nothing to do with the array module)

skrah · 2012-06-08T09:46:30Z

See also bpo-13072 and the discussion starting at:

http://mail.python.org/pipermail/python-dev/2012-March/117390.html

I think the priority should be "high", since the current behavior
doesn't preserve the status quo. Also, PEP-3118 suggests 'u' for
UCS2 and 'w' for UCS4.

skrah · 2012-06-08T09:48:02Z

Hmm, obviously the discussion starts here:

http://mail.python.org/pipermail/python-dev/2012-March/117376.html

skrah · 2012-08-16T11:47:12Z

This one should be fixed by bpo-13072. Could you check again?

ncoghlan · 2012-08-16T11:54:36Z

As Stefan noted, so long as Py_UNICODE is 16 bits in the Mac OS X builds, then this should now be back to the 3.2 behaviour.

loewis · 2012-08-16T12:07:14Z

It's not back to the 3.2 behavior. In 3.3, Py_UNICODE is always equal to wchar_t, which is a 4-byte type on Darwin. However, CFString is based on UniChar, which is a 2-byte type.

That this worked in 3.2 was by accident - it would work only in "narrow" builds. Python's configure in 3.2 and before wouldn't default to using wchar_t on Darwin since it didn't consider wchar_t "usable", which in turn happened because wchar_t is signed on Darwin, but Py_UNICODE was understood to be unsigned.

Since it's too late to add a 'U' code to 3.3, as a work-around, you would have to use an 'H' array, and initialize it with map(ord, the_string).

Chances are good that a proper UCS-2 array code gets added to 3.4.
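The 'H'-array work-around described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the helper names `ucs2_array` and `ucs2_to_str` are hypothetical and not part of the array module, 'H' is assumed to be 2 bytes wide (true on common platforms, but not guaranteed by C), and the input is assumed to contain only BMP characters, since a code point above 0xFFFF does not fit in a single 16-bit unit.

```python
import array
import sys


def ucs2_array(text):
    """Hypothetical helper: build a writable array of 16-bit code units.

    Assumes every character is in the BMP (ord(ch) <= 0xFFFF); a larger
    code point makes array.array raise OverflowError.
    """
    # 'H' is an unsigned short; its size is checked by the caller below.
    return array.array('H', map(ord, text))


def ucs2_to_str(buf):
    """Hypothetical helper: decode native-endian 16-bit units back to str."""
    codec = 'utf-16-le' if sys.byteorder == 'little' else 'utf-16-be'
    return buf.tobytes().decode(codec)


b = ucs2_array("hello world")

# The buffer is only usable as a UCS2 backing store if items are 2 bytes.
if b.itemsize != 2:
    raise RuntimeError("'H' is not 16 bits wide on this platform")

# The array stays writable, so in-place edits are visible when decoded.
b[0] = ord('J')
text = ucs2_to_str(b)
```

Whether a given platform API accepts such an array in place of a 'u' array depends on the bridge in use, so treat this as a starting point rather than a drop-in replacement.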

ronaldoussoren · 2012-08-16T12:09:01Z

Py_UNICODE is a typedef for wchar_t and that type is 4 bytes long:

>>> import array
>>> a = array.array('u', 'hello world')
>>> a.tobytes()
b'h\x00\x00\x00e\x00\x00\x00l\x00\x00\x00l\x00\x00\x00o\x00\x00\x00 \x00\x00\x00w\x00\x00\x00o\x00\x00\x00r\x00\x00\x00l\x00\x00\x00d\x00\x00\x00'
>>> a = array.array('u', 'bar')
>>> a.tobytes()
b'b\x00\x00\x00a\x00\x00\x00r\x00\x00\x00'
>>> len(a.tobytes())
12

This is with a checkout that was created yesterday.

The issue is not resolved: there is now no easy way to create a UCS2 buffer, while there was in earlier releases of Python (with the default narrow build).

ronaldoussoren added the extension-modules (C modules in the Modules dir) and type-bug (An unexpected behavior, bug, or error) labels Jun 8, 2012

ezio-melotti transferred this issue from another repository Apr 10, 2022
