Issue 17629: Expose string width to Python

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/61829

classification

Title:	Expose string width to Python
Type:	enhancement	Stage:
Components:	Library (Lib), Unicode	Versions:	Python 3.4, Python 3.5

process

Status:	closed	Resolution:	rejected
Dependencies:		Superseder:
Assigned To:		Nosy List:	Rosuav, benjamin.peterson, ezio.melotti, loewis, roysmith, serhiy.storchaka, terry.reedy, vstinner
Priority:	normal	Keywords:

Created on 2013-04-03 21:54 by Rosuav, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Messages (11)
msg185963 - (view)	Author: Chris Angelico (Rosuav) *	Date: 2013-04-03 21:54
As of PEP 393, a string's width is recorded in its header - effectively, a marker that says whether the highest codepoint in the string is >0xFFFF, >0xFF, or <=0xFF. This is, on some occasions, useful to know; for instance, when testing string performance, it's handy to be able to very quickly throw something down that, without scanning the contents of all the strings used, can identify the width spread.

A similar facility is provided by Pike, which has a similar flexible string representation: http://pike.lysator.liu.se/generated/manual/modref/ex/7.2_3A_3A/String/width.html accessible to a script as String.width().

Since this is not something frequently needed, it would make sense to hide it away in the sys or inspect modules, or possibly in strings or as a method on the string itself.

Currently, the best way to do this is something like:

def str_width(s):
    width=1
    for ch in map(ord,s):
        if n > 0xFFFF: return 4
        if n > 0xFF: width=2
    return width

which necessitates a scan of the entire string, unless it has an astral character.
msg185964 - (view)	Author: Chris Angelico (Rosuav) *	Date: 2013-04-03 21:56
And of course, I make a copy/paste error in a trivial piece of example code.

def str_width(s):
    width=1
    for ch in map(ord,s):
        if ch > 0xFFFF: return 4
        if ch > 0xFF: width=2
    return width
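As an illustration of the corrected function above, here is a minimal runnable sketch. The return values 1, 2 and 4 follow the example in the message; the sample strings are assumptions chosen only to exercise each branch.

```python
def str_width(s):
    """Return 1, 2 or 4 depending on the highest code point in s."""
    width = 1
    for ch in map(ord, s):
        if ch > 0xFFFF:
            return 4   # astral character found: no need to scan further
        if ch > 0xFF:
            width = 2  # keep scanning in case an astral character follows
    return width


# Illustrative inputs (assumed, not taken from the issue).
print(str_width("hello"))        # every code point is <= 0xFF
print(str_width("\u0101bc"))     # U+0101 is above 0xFF
print(str_width("a\U0001F600"))  # U+1F600 is above 0xFFFF
```

Note that this reports the narrowest category that could hold the string's code points. It is computed from the contents and does not read anything from the string's internal header.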
msg185966 - (view)	Author: Ezio Melotti (ezio.melotti) *	Date: 2013-04-03 22:00
Not sure this is a good idea. The fact that CPython knows the width is an implementation detail. The method you suggested might not be too fast but it works, so we would be adding a new function/method just to optimize a fairly uncommon operation that depends on an implementation detail.
msg185969 - (view)	Author: STINNER Victor (vstinner) *	Date: 2013-04-03 22:09
"This is, on some occasions, useful to know; for instance, when testing string performance, it's handy to be able to very quickly throw something down that, without scanning the contents of all the strings used, can identify the width spread."

When you test string performance, you can manually get the biggest code point using: max(map(ord, str)). Do you have another use case?
msg185972 - (view)	Author: Chris Angelico (Rosuav) *	Date: 2013-04-03 22:17
CPython also knows the length of a string, which means that len(s) is a fast operation. I wouldn't expect anyone to rewrite len() as:

def get_string_length(s):
    length=0
    for ch in s:
        length+=1
    return length

even though that works. No, we have a built-in function that's able to simply query the header. (Via a special method, but it's still ultimately the same thing.)

The only other use-case that I have at the moment relates to a query on python-list about a broken MySQL, which was unable to handle astral characters. With a simple and fast way to query the string's width, this could have been checked for at practically zero cost; as it is, he had to scan all his strings to find out what was in them.

It's not something that's of immense value, but it's a handy introspection, and one that should cost little or nothing to provide.
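For the MySQL use case described above, the scan-based check might look like the following sketch. The helper name has_astral and the sample rows are hypothetical, invented for illustration; the issue does not show the code that was actually used.

```python
def has_astral(s):
    """Return True if s contains any code point above U+FFFF."""
    # any() stops at the first astral character, but a string without
    # one still has to be scanned to the end.
    return any(ord(ch) > 0xFFFF for ch in s)


# Hypothetical rows about to be inserted into a database.
rows = ["plain", "caf\u00e9", "smile \U0001F600"]
flagged = [r for r in rows if has_astral(r)]
print(flagged)
```

This is the linear-time scan the message refers to; the proposal was to replace it with a constant-time query of the string header.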
msg185973 - (view)	Author: Ezio Melotti (ezio.melotti) *	Date: 2013-04-03 22:27
max(map(ord, s)) or your str_width(s) are not much more complicated than s.width(), just slower, so it's just a problem of performance. len() is a much more common operation so it makes sense to have a fast built-in function.
msg185992 - (view)	Author: Terry J. Reedy (terry.reedy) *	Date: 2013-04-04 01:55
ord(max(s)) == max(map(ord,s)) == ord(max(s, key=ord))

Using a*30000000 and mental counting, the first is clearly fastest (about 2 seconds) with a 3.4 build, which has the optimized string comparison patches from last October. The reduction to 3 categories takes almost no time.

I'm -1 without some real use case. For most testing, the data are constructed, so we already know the CPython internal width.

There is no comparison in importance between this and len/__len__. bool(x) calls x.__len__ if there is no x.__bool__. Strings and other builtin collection classes have no __bool__.

> .. cost little or nothing

Every addition has a real cost: for developers, write code, write test, test test, write doc, maintain; for users, more to learn and understand -- and forget. I doubt the value of the compute time saved would ever come close to the value of the human time expended. There is also a cost to adding something CPython-specific.
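The equivalence stated at the top of this message, and the "reduction to 3 categories", can be sketched as follows. The function names max_code_point and width_from_max are hypothetical, and returning 0 for an empty string is an assumption made here, since max() raises ValueError on an empty sequence.

```python
def max_code_point(s):
    """Return the highest code point in s, or 0 for an empty string."""
    # max() on a str compares characters by code point, so ord(max(s))
    # gives the same result as max(map(ord, s)).
    return ord(max(s)) if s else 0


def width_from_max(cp):
    """Reduce a code point to one of the three width categories."""
    if cp > 0xFFFF:
        return 4
    if cp > 0xFF:
        return 2
    return 1


s = "a" * 1000 + "\u20ac"  # illustrative data, much shorter than in the message
assert ord(max(s)) == max(map(ord, s)) == ord(max(s, key=ord))
print(width_from_max(max_code_point(s)))
```

To compare the speed of the three expressions on a given build, the standard timeit module can be used in place of mental counting; no timings are claimed here.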
msg185999 - (view)	Author: Benjamin Peterson (benjamin.peterson) *	Date: 2013-04-04 02:51
This is an implementation detail we don't want to expose. (It might change someday!)
msg186120 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2013-04-06 09:07
See also the discussion at http://comments.gmane.org/gmane.comp.python.ideas/15640 . I agree with rejection. This is an implementation detail and different Python implementations (including future CPython versions) can have different internal string implementations.
msg186169 - (view)	Author: Roy Smith (roysmith)	Date: 2013-04-06 22:59
I'm the guy who was searching for astral characters in msg18597. I should mention that while what I did was certainly inefficient, the database was so much slower that it didn't have any observable impact on the overall process time (a bit over 2 days to insert approximately 200 million rows).

I see this has already been closed, but figured I should record my observation for the historical record.
msg186170 - (view)	Author: Roy Smith (roysmith)	Date: 2013-04-06 23:01
Um, make that msg185972.

History
Date	User	Action	Args
2022-04-11 14:57:43	admin	set	github: 61829
2013-04-06 23:01:18	roysmith	set	messages: + msg186170
2013-04-06 22:59:46	roysmith	set	nosy: + roysmith messages: + msg186169
2013-04-06 09:07:51	serhiy.storchaka	set	nosy: + serhiy.storchaka messages: + msg186120
2013-04-04 02:51:40	benjamin.peterson	set	status: open -> closed nosy: + benjamin.peterson messages: + msg185999 resolution: rejected
2013-04-04 01:55:20	terry.reedy	set	nosy: + terry.reedy messages: + msg185992
2013-04-03 22:27:24	ezio.melotti	set	messages: + msg185973
2013-04-03 22:17:53	Rosuav	set	messages: + msg185972
2013-04-03 22:09:19	vstinner	set	messages: + msg185969
2013-04-03 22:00:53	ezio.melotti	set	nosy: + loewis, vstinner, ezio.melotti messages: + msg185966 components: + Unicode type: enhancement
2013-04-03 21:56:35	Rosuav	set	messages: + msg185964
2013-04-03 21:54:31	Rosuav	create