classification
Title: Clarify status of O(1) indexing semantics of str objects
Type: enhancement Stage: resolved
Components: Documentation Versions: Python 3.4, Python 3.5
process
Status: closed Resolution: later
Dependencies: Superseder:
Assigned To: docs@python Nosy List: Jim.Jewett, Rosuav, docs@python, gvanrossum, ncoghlan, pitrou, python-dev, serhiy.storchaka, vstinner
Priority: normal Keywords:

Created on 2014-06-05 11:02 by ncoghlan, last changed 2014-06-10 23:20 by Jim.Jewett. This issue is now closed.

Files
File name Uploaded Description Edit
issue21667_clarify_str_specification.rst ncoghlan, 2014-06-05 11:29 Add indexing note, remove inaccurate references to "characters" review
Messages (20)
msg219793 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2014-06-05 11:02
Based on the recent python-dev thread, I propose the following "CPython implementation detail" note in the "Strings" entry of https://docs.python.org/3/reference/datamodel.html#objects-values-and-types

"CPython currently guarantees O(1) access to arbitrary code points when indexing and slicing a string. Python implementations are required to index and slice strings as arrays of code points, but are not required to guarantee O(1) access to arbitrary locations within the string. This allows implementations to use variable width encodings for their internal string representation."
msg219794 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2014-06-05 11:05
str[a:b] returns a substring (characters), not an array of code points (numbers).
msg219795 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2014-06-05 11:06
Guido, I think we need your call on whether or not to add a note about string indexing algorithmic complexity to the language reference, and to approve the exact wording of such a note (my proposed wording is in my initial comment on this issue).
msg219796 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2014-06-05 11:08
No, Python doesn't expose Unicode characters in its data model at all, except in those cases where a code point happens to correspond directly with a character. A length 1 str instance represents a Unicode code point, not a Unicode character.
msg219797 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2014-06-05 11:09
Although, you're right, that section of the data model docs misuses the word "character" to mean something other than what it means in the Unicode spec :(
msg219798 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2014-06-05 11:10
"Python implementations are required to ..."

By the way, Python < 3.3 doesn't implement this requirement :-)
msg219799 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2014-06-05 11:17
Saying that ord() and chr() switch between characters and code points is just plain wrong, since characters may be represented as multiple code points.

We may also want to explicitly note that the Unicode normalisation is implementation dependendent, and that CPython doesn't normalise implicitly except where specifically noted (i.e. during lexical analysis).
msg219800 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2014-06-05 11:18
Right, narrow builds have long been broken - that's a large part of why this is now the requirement :)
msg219801 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2014-06-05 11:29
Patch attached that also addresses the characters vs code points confusion.
msg219802 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2014-06-05 11:29
I ducked the Unicode normalisation question for now, since that's a *different* can of worms :)
msg219805 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2014-06-05 12:02
Two things:
- I don't think it's very helpful to use the term "code point" without explaining or introducing it ("character" at least can be understood intuitively)
- The mention of slicing is ambiguous: is slicing suppoded to be O(1)? how is indexing related to slicing?
msg219809 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2014-06-05 12:34
If someone doesn't understand what "Unicode code point" means, that's going to be the least of their problems when it comes to implementing a conformant Python implementation. We could link to http://unicode.org/glossary/#code_point, but that doesn't really add much beyond "value from 0 to 0x10FFFF". If you try to dive into the formal Unicode spec instead, you end up in a twisty maze of definitions of things that are all closely related, but generally not the same thing (code positions, code units, code spaces, abstract characters, glyphs, graphemes, etc).

The main advantage of using the more formal "code point" over the informal "character" is that it discourages people from assuming they know what they are (with the usual mistaken assumption being that Unicode code points correspond directly to glyphs the way ASCII and Extended ASCII printable characters correspond to their glyphs). The rest of the paragraph then provides the mechanical details of the meaningful interpretations of them in Python (as length 1 strings and as numbers in a particular range) and the operations for translating between those two formats (chr and ord).

Fair point about the slicing - it may be better to just talk about indexing.
msg219810 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2014-06-05 12:37
Not sure what implementing a conformant Python implementation has to do with this; the language specification should be readable by any interested programmers, IMO.

> If you try to dive into the formal Unicode spec instead, you end up
> in a twisty maze of definitions of things that are all closely 
> related, but generally not the same thing (code positions, code units, 
> code spaces, abstract characters, glyphs, graphemes, etc).

That makes all the less useful to use the "proper term" instead of the more intuitive alternative :-)
msg219811 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2014-06-05 12:45
Then perhaps we need notes about algorithmic complexity of bytes, bytearray, list and tuple and dict indexing, set.add and set.discard, dict.__delitem__, list.pop, len(), + and += for all basic sequences and containers, memoryview() for bytes, bytearray and array.array, etc, etc.
msg219812 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2014-06-05 12:49
> Then perhaps we need notes about algorithmic complexity of bytes, bytearray, list and tuple and dict indexing, set.add and set.discard, dict.__delitem__, list.pop, len(), + and += for all basic sequences and containers, memoryview() for bytes, bytearray and array.array, etc, etc.

That would be awesome :-) Please open a new issue for that. We can use
for example these data:
https://wiki.python.org/moin/TimeComplexity

Such data should be close to the definition of the type, or even close
to the method. Maybe directly in the documentation of each method?
msg219824 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2014-06-05 16:32
I don't want the O(1) property explicitly denounced in the reference manual. It's fine if the manual is silent on this -- maybe someone can prove that it isn't a problem based on benchmarks of an alternate implementation, but until then, I'm skeptical -- after all we have a bunch of important APIs (indexing, slicing, find()/index(), the re module) that use integer indexes, and some important algorithms/patterns are based off this behavior.

E.g. searching a string for something, returning the position where it's found, and then continuing the search from that position. Even if the basic search uses find() or regex matching, there's still a position being returned and accepted, and if it took O(N) time to find that position in the representation, the whole algorithm could degenerate from O(N) to O(N**2).

I am fine with the changes related to code points.

For the pedants amongst us, surrogates are also code points. A surrogate pair is two code points that encode a single code point. Fortunately we don't have to deal with those any more outside codecs.
msg219936 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2014-06-07 13:22
New changeset 6ffb6909c439 by Nick Coghlan in branch '3.4':
Issue #21667: Clarify string data model description
http://hg.python.org/cpython/rev/6ffb6909c439

New changeset 7c120e77d6f7 by Nick Coghlan in branch 'default':
Merge issue #21667 from 3.4
http://hg.python.org/cpython/rev/7c120e77d6f7
msg219937 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2014-06-07 13:26
I've merged the character->code point clarifications, without the implementation detail section.

For the time being, that leaves "doesn't provide O(1) indexing of strings" as the kind of discrepancy that often makes an appearance in "differences from the CPython reference implementation" section that many alternative implementations include.
msg220208 - (view) Author: Jim Jewett (Jim.Jewett) (Python triager) Date: 2014-06-10 23:02
I think the new wording is an improvement, but keeping the changes minimal left it in an awkward in-between state.

Proposal:

A string is a sequence of Unicode code points.  Strings can include any sequence of code points, including some which are semantically meaningless, or explicitly undefined.

Python doesn't have a :c:type:`char` type; a single code point is represented as a string of length ``1``.  The built-in function :func:`chr` translates an integer in the range ``U+0000 - U+10FFFF`` to the corresponding length ``1`` string object, and :func:`ord` does the reverse.

:meth:`str.encode` provides a concrete representation (as :class:`bytes` in the given text encoding) suitable for transport and communication with non-Python utilities.  :meth:`bytes.decode` decodes such byte sequences into text strings.
msg220211 - (view) Author: Jim Jewett (Jim.Jewett) (Python triager) Date: 2014-06-10 23:20
And even my rewrite showed path dependency; a slight further improvement is to re-order encoding ahead of bytes.  I also added a paragraph that I hope answers the speed issue.

Proposal:

A string is a sequence of Unicode code points.  Strings can include any sequence of code points, including some which are semantically meaningless, or explicitly undefined.

Python doesn't have a :c:type:`char` type; a single code point is represented as a string of length ``1``.  The built-in function :func:`chr` translates an integer in the range ``U+0000 - U+10FFFF`` to the corresponding length ``1`` string object, and :func:`ord` does the reverse.

:meth:`str.encode` provides a concrete representation (in the given text encoding) as a :class:`bytes` object suitable for transport and communication with non-Python utilities.  :meth:`bytes.decode` decodes such byte sequences into text strings.

.. impl-detail::  There are no methods exposing the internal representation of code points within a string.  While the C-API provides some additional constraints on CPython, other implementations are free to use any representation that treats code points (as opposed to either code units or some normalized form of characters) as the unit of measure.
History
Date User Action Args
2014-06-10 23:20:46Jim.Jewettsetmessages: + msg220211
2014-06-10 23:02:39Jim.Jewettsetnosy: + Jim.Jewett
messages: + msg220208
2014-06-07 13:26:31ncoghlansetstatus: open -> closed
type: enhancement
messages: + msg219937

resolution: later
stage: resolved
2014-06-07 13:22:28python-devsetnosy: + python-dev
messages: + msg219936
2014-06-05 16:32:42gvanrossumsetmessages: + msg219824
2014-06-05 12:51:05Rosuavsetnosy: + Rosuav
2014-06-05 12:49:03vstinnersetmessages: + msg219812
2014-06-05 12:45:52serhiy.storchakasetnosy: + serhiy.storchaka
messages: + msg219811
2014-06-05 12:37:41pitrousetmessages: + msg219810
2014-06-05 12:34:08ncoghlansetmessages: + msg219809
2014-06-05 12:02:59pitrousetnosy: + pitrou
messages: + msg219805
2014-06-05 11:29:45ncoghlansetmessages: + msg219802
2014-06-05 11:29:06ncoghlansetfiles: + issue21667_clarify_str_specification.rst

messages: + msg219801
2014-06-05 11:18:43ncoghlansetmessages: + msg219800
2014-06-05 11:17:54ncoghlansetmessages: + msg219799
2014-06-05 11:10:40vstinnersetmessages: + msg219798
2014-06-05 11:09:49ncoghlansetmessages: + msg219797
2014-06-05 11:08:38ncoghlansetmessages: + msg219796
2014-06-05 11:06:08ncoghlansetnosy: + gvanrossum
messages: + msg219795
2014-06-05 11:05:12vstinnersetnosy: + vstinner
messages: + msg219794
2014-06-05 11:02:22ncoghlancreate