This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python Developer Guide.

classification
Title: Expose string width to Python
Type: enhancement Stage:
Components: Library (Lib), Unicode Versions: Python 3.4, Python 3.5
process
Status: closed Resolution: rejected
Dependencies: Superseder:
Assigned To: Nosy List: Rosuav, benjamin.peterson, ezio.melotti, loewis, roysmith, serhiy.storchaka, terry.reedy, vstinner
Priority: normal Keywords:

Created on 2013-04-03 21:54 by Rosuav, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Messages (11)
msg185963 - (view) Author: Chris Angelico (Rosuav) * Date: 2013-04-03 21:54
As of PEP 393, a string's width is recorded in its header - effectively, a marker that says whether the highest codepoint in the string is >0xFFFF, >0xFF, or <=0xFF. This is, on some occasions, useful to know; for instance, when testing string performance, it's handy to be able to very quickly throw something down that, without scanning the contents of all the strings used, can identify the width spread.

A similar facility is provided by Pike, which has a similar flexible string representation: http://pike.lysator.liu.se/generated/manual/modref/ex/7.2_3A_3A/String/width.html accessible to a script as String.width().

Since this is not something frequently needed, it would make sense to hide it away in the sys or inspect modules, or possibly in strings or as a method on the string itself.

Currently, the best way to do this is something like:

def str_width(s):
  width=1
  for ch in map(ord,s):
    if n > 0xFFFF: return 4
    if n > 0xFF: width=2
  return width

which necessitates a scan of the entire string, unless it has an astral character.
msg185964 - (view) Author: Chris Angelico (Rosuav) * Date: 2013-04-03 21:56
And of course, I make a copy/paste error in a trivial piece of example code.

def str_width(s):
  width=1
  for ch in map(ord,s):
    if ch > 0xFFFF: return 4
    if ch > 0xFF: width=2
  return width
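
A minimal sketch of how the stored width can be observed indirectly, assuming CPython 3.3 or later with the PEP 393 representation. The helper name observed_width is hypothetical and not part of any proposal in this issue; it compares sys.getsizeof() for two strings that differ only in length, so the fixed header cost cancels out. Other implementations may report something entirely different.

```python
import sys


def str_width(s):
    """Scan-based width from the message above: 1, 2 or 4."""
    width = 1
    for ch in map(ord, s):
        if ch > 0xFFFF:
            return 4
        if ch > 0xFF:
            width = 2
    return width


def observed_width(ch):
    """Estimate the bytes per character used for a string made only of ch.

    Assumption: CPython 3.3+ (PEP 393). The two strings differ by exactly
    1000 characters, so the header size cancels in the subtraction.
    """
    short = ch * 2
    long = ch * 1002
    return (sys.getsizeof(long) - sys.getsizeof(short)) // 1000


for sample in ("a", "\xe9", "\u20ac", "\U0001F600"):
    print(hex(ord(sample)), str_width(sample), observed_width(sample))
```

On a PEP 393 build the two columns are expected to agree, which is the information the requested feature would return without a scan.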
msg185966 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2013-04-03 22:00
Not sure this is a good idea.  The fact that CPython knows the width is an implementation detail.  The method you suggested might not be too fast but it works, so we would be adding a new function/method just to optimize a fairly uncommon operation that depends on an implementation detail.
msg185969 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2013-04-03 22:09
"This is, on some occasions, useful to know; for instance, when testing string performance, it's handy to be able to very quickly throw something down that, without scanning the contents of all the strings used, can identify the width spread."

When you test string performance, you can get the largest code point manually using: max(map(ord, s)).

Do you have another use case?
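
The max(map(ord, ...)) suggestion can be reduced to the three width categories along these lines. This is a sketch only: the name width_from_max is hypothetical, and treating the empty string as width 1 is an assumption made here because max() raises ValueError on an empty sequence.

```python
def width_from_max(s):
    """Return 1, 2 or 4 depending on the largest code point in s."""
    if not s:
        # Assumption: an empty string is reported as the narrowest width.
        return 1
    top = max(map(ord, s))
    if top > 0xFFFF:
        return 4
    if top > 0xFF:
        return 2
    return 1


print(width_from_max("plain ascii"))
print(width_from_max("caf\xe9 \u20ac"))
print(width_from_max("astral \U0001F600"))
```

Unlike str_width() above, this always scans the whole string, even when an astral character appears early.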
msg185972 - (view) Author: Chris Angelico (Rosuav) * Date: 2013-04-03 22:17
CPython also knows the length of a string, which means that len(s) is a fast operation. I wouldn't expect anyone to rewrite len() as:

def get_string_length(s):
  length=0
  for ch in s:
    length+=1
  return length

even though that works. No, we have a built-in function that's able to simply query the header. (Via a special method, but it's still ultimately the same thing.)

The only other use-case that I have at the moment relates to a query on python-list about a broken MySQL, which was unable to handle astral characters. With a simple and fast way to query the string's width, this could have been checked for at practically zero cost; as it is, he had to scan all his strings to find out what was in them. It's not something that's of immense value, but it's a handy introspection, and one that should cost little or nothing to provide.
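
For the database use case, the scan that currently has to be done by hand might look like the following sketch. The names has_astral and first_astral are hypothetical; the check simply looks for any code point above U+FFFF and assumes Python 3.3 or later, where such code points are single characters.

```python
import re

# Matches any single code point outside the Basic Multilingual Plane.
_ASTRAL = re.compile("[\U00010000-\U0010FFFF]")


def has_astral(s):
    """Return True if s contains at least one astral character."""
    return _ASTRAL.search(s) is not None


def first_astral(s):
    """Return the index of the first astral character, or -1 if none."""
    match = _ASTRAL.search(s)
    return match.start() if match else -1


rows = ["plain text", "snowman \u2603", "emoji \U0001F600 here"]
for row in rows:
    print(repr(row), has_astral(row), first_astral(row))
```

This still costs a pass over every string, which is the expense a header lookup would avoid.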
msg185973 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2013-04-03 22:27
max(map(ord, s)) or your str_width(s) are not much more complicated than s.width(), just slower, so it's just a problem of performance.

len() is a much more common operation so it makes sense to have a fast built-in function.
msg185992 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2013-04-04 01:55
ord(max(s)) == max(map(ord,s)) == ord(max(s, key=ord))
Using a*30000000 and mental counting, the first is clearly fastest (about 2 seconds) with a 3.4 build, which has the optimized string comparison patches from last October. The reduction to 3 categories takes almost no time.
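
Rather than counting mentally, the three expressions can be timed with timeit. This is a rough sketch: the string is much shorter than the one in the message so that it runs quickly, and the numbers depend entirely on the build and the machine, so none are shown here.

```python
import timeit

# Assumption: a shorter test string than the 30,000,000 characters used above.
s = "a" * 200_000

candidates = {
    "ord(max(s))": lambda: ord(max(s)),
    "max(map(ord, s))": lambda: max(map(ord, s)),
    "ord(max(s, key=ord))": lambda: ord(max(s, key=ord)),
}

results = {}
for label, fn in candidates.items():
    results[label] = fn()  # all three should yield the same code point
    elapsed = timeit.timeit(fn, number=3)
    print(f"{label:22} {elapsed:.4f}s for 3 runs")
```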

I'm -1 without some real use case. For most testing, the data are constructed, so we already know the CPython internal width.

There is no comparison in importance between this and len/__len__. bool(x) calls x.__len__ if there is no x.__bool__. Strings and other builtin collection classes have no __bool__.
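
The fallback from bool() to __len__ can be seen with a small sketch. The class Box is a hypothetical example that defines __len__ but no __bool__.

```python
class Box:
    """Container with __len__ but without __bool__ (illustrative only)."""

    def __init__(self, items):
        self._items = list(items)

    def __len__(self):
        return len(self._items)


# str and list rely on __len__ for truth testing; they define no __bool__.
print(hasattr(str, "__bool__"), hasattr(list, "__bool__"))

# Truthiness follows the length because bool() falls back to __len__.
print(bool(Box([])), bool(Box([1, 2])))
print(bool(""), bool("x"))
```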

> .. cost little or nothing
Every addition has a real cost: for developers, write code, write test, test test, write doc, maintain; for users, more to learn and understand -- and forget. I doubt the value of the compute time saved would ever come close to the value of the human time expended. There is also a cost to adding something CPython-specific.
msg185999 - (view) Author: Benjamin Peterson (benjamin.peterson) * (Python committer) Date: 2013-04-04 02:51
This is an implementation detail we don't want to expose. (It might change someday!)
msg186120 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-04-06 09:07
See also the discussion at http://comments.gmane.org/gmane.comp.python.ideas/15640 . I agree with rejection. This is an implementation detail and different Python implementations (including future CPython versions) can have different internal string implementations.
msg186169 - (view) Author: Roy Smith (roysmith) Date: 2013-04-06 22:59
I'm the guy who was searching for astral characters in msg18597.  I should mention that while what I did was certainly inefficient, the database was so much slower that it didn't have any observable impact on the overall process time (a bit over 2 days to insert approximately 200 million rows)

I see this has already been closed, but figured I should record my observation for the historical record.
msg186170 - (view) Author: Roy Smith (roysmith) Date: 2013-04-06 23:01
Um, make that msg185972.
History
Date | User | Action | Args
2022-04-11 14:57:43 | admin | set | github: 61829
2013-04-06 23:01:18 | roysmith | set | messages: + msg186170
2013-04-06 22:59:46 | roysmith | set | nosy: + roysmith; messages: + msg186169
2013-04-06 09:07:51 | serhiy.storchaka | set | nosy: + serhiy.storchaka; messages: + msg186120
2013-04-04 02:51:40 | benjamin.peterson | set | status: open -> closed; nosy: + benjamin.peterson; messages: + msg185999; resolution: rejected
2013-04-04 01:55:20 | terry.reedy | set | nosy: + terry.reedy; messages: + msg185992
2013-04-03 22:27:24 | ezio.melotti | set | messages: + msg185973
2013-04-03 22:17:53 | Rosuav | set | messages: + msg185972
2013-04-03 22:09:19 | vstinner | set | messages: + msg185969
2013-04-03 22:00:53 | ezio.melotti | set | nosy: + loewis, vstinner, ezio.melotti; messages: + msg185966; components: + Unicode; type: enhancement
2013-04-03 21:56:35 | Rosuav | set | messages: + msg185964
2013-04-03 21:54:31 | Rosuav | create |