Issue 13391: string.strip Does Not Remove Zero-Width-Space (ZWSP)


This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/57600

classification

Title:	string.strip Does Not Remove Zero-Width-Space (ZWSP)
Type:	behavior	Stage:	resolved
Components:	Unicode	Versions:	Python 3.2, Python 3.3, Python 2.7

process

Status:	closed	Resolution:	not a bug
Dependencies:		Superseder:
Assigned To:	ezio.melotti	Nosy List:	ezio.melotti, loewis, mankyd, rhettinger
Priority:	normal	Keywords:

Created on 2011-11-13 00:45 by mankyd, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Messages (12)
msg147538 - (view)	Author: Dave Mankoff (mankyd)	Date: 2011-11-13 00:45
Title pretty much says it all. Simple test case:

>>> len(u' \t\r\n\u200B'.strip())
1

Should be zero. Same problem in Python3:

>>> len(' \t\r\n\u200B'.strip())
1
msg147547 - (view)	Author: Ezio Melotti (ezio.melotti) *	Date: 2011-11-13 03:44
str.strip uses Py_UNICODE_ISSPACE that in turn uses _PyUnicode_IsWhitespace (see Objects/unicodetype_db.h#l3347), and according to the comment there it "Returns 1 for Unicode characters having the bidirectional type 'WS', 'B' or 'S' or the category 'Zs', 0 otherwise." The category of U+200B is 'Cf', and its bidirectional type is 'BN' so 0 is returned and the character is not stripped. OTOH, Unicode defines the White_Space property and assigns it to 26 chars, whereas _PyUnicode_IsWhitespace includes 4 more chars (1C, 1D, 1E, 1F) that should probably be removed. I'll close this issue because str.strip() is correct regarding U+200B. @Martin Do you think those 4 chars should be removed? If so I'll open another issue.
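The rule quoted above can be checked from Python itself. The following is a minimal sketch, assuming Python 3 and the standard `unicodedata` module; the helper name `is_whitespace_by_rule` is made up for illustration and is not part of CPython. It re-states the quoted comment (bidirectional type 'WS', 'B' or 'S', or category 'Zs') and prints the result next to what `str.isspace` reports for a few characters mentioned in this thread.

```python
import unicodedata


def is_whitespace_by_rule(ch):
    """Hypothetical helper: the rule quoted from the source comment.

    True when the bidirectional type is WS, B or S, or when the
    general category is Zs.
    """
    return (unicodedata.bidirectional(ch) in ("WS", "B", "S")
            or unicodedata.category(ch) == "Zs")


# Characters discussed in the thread: space, tab, U+001C, NBSP,
# ZWSP (U+200B), WORD JOINER (U+2060) and ZWNBSP/BOM (U+FEFF).
for ch in (" ", "\t", "\x1c", "\u00a0", "\u200b", "\u2060", "\ufeff"):
    print("U+%04X" % ord(ch),
          unicodedata.category(ch),
          unicodedata.bidirectional(ch),
          is_whitespace_by_rule(ch),
          ch.isspace())
```

For U+200B this reports category 'Cf' and bidirectional type 'BN', which is why neither the rule nor `str.isspace` treats it as whitespace.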
msg147603 - (view)	Author: Dave Mankoff (mankyd)	Date: 2011-11-14 16:14
I appreciated the quick turnaround on this. Perhaps I am misunderstanding the resolution. I understand that strip uses _PyUnicode_IsWhitespace, and that _PyUnicode_IsWhitespace "Returns 1 for Unicode characters having the bidirectional type 'WS', 'B' or 'S' or the category 'Zs', 0 otherwise." However, perhaps this is where the functionality is missing? Upon further inspection, it looks like there may be other missing white-space characters, such as U+FEFF, "Zero Width No-Break Space". Whatever Unicode categories they're in, they're still a form of white-space and should still be removed, no? This was not the behavior I expected from strip(). This affects string.isspace() as well. I now have to put var.strip().strip(u'\u200B\ufeff') anywhere I want to test for whitespace strings in all my future Python code. (I was bitten by exactly this issue in my code, which is what caused me to file the issue in the first place.)
msg147605 - (view)	Author: Ezio Melotti (ezio.melotti) *	Date: 2011-11-14 16:30
I think those shouldn't be considered whitespace, so they shouldn't be stripped either. Even if _PyUnicode_IsWhitespace doesn't match exactly the Unicode definition of White_Space, they both agree that ZWSP and ZWNBSP are not whitespace. ZWNBSP is also used as BOM, and its usage as a zero-width space has been deprecated in favor of WORD JOINER (U+2060). Similarly WJ is not considered a whitespace.
msg147606 - (view)	Author: Dave Mankoff (mankyd)	Date: 2011-11-14 16:39
But why are they not a space? I mean, they literally have the word space in their name and are used as separators between words. I can't really see any reason why you wouldn't want this behavior - there's no time when I would be thankful that strip removed all spaces except for ZWSP and the like. As to deprecation, yes, that is true, but they still exist and will continue to do so. (My issue arose when a 3rd party delivered an all whitespace string to me.) I can't really debate this further as there's not much more to say. I hope the issue will be reconsidered. Thanks again for taking the time to discuss.
msg147634 - (view)	Author: Martin v. Löwis (loewis) *	Date: 2011-11-14 21:33
> But why are they not a space?

Because the Unicode standard says they are not. We have a good tradition in Python to follow standards where they apply, and it appears that the Unicode standard is crystal clear that the characters in question are not white space. Why should we second-guess the Unicode consortium when discussing Unicode questions? See also http://en.wikipedia.org/wiki/Whitespace_character

IOW: get the Unicode consortium to declare them as whitespace, and we happily follow.

Ezio: I do think that _PyUnicode_IsWhitespace should use the White_Space property (from PropList.txt). I'm not quite sure how they computed that property (or whether it's manually curated). Since that's a behavioral change, it can only go into 3.3.
msg147642 - (view)	Author: Dave Mankoff (mankyd)	Date: 2011-11-14 23:32
So I contacted the Unicode Technical Committee about the issue and promptly received a response back. They pointed out that the ZWSP was, once upon a time, considered white space, but that was changed in Unicode 4.0.1: http://www.unicode.org/review/resolved-pri.html#pri21

One particular comment worth noting: "... for historical reasons the general category is still Zs (Space Separator)".

Perhaps this ticket can be changed to a feature request? In addition to stripping out whitespace, it is useful to remove any non-printable characters from a string (or know if a string contains any non-printable characters). Perhaps a boolean keyword parameter, "control_chars", could be added to isspace and strip? Thus:

>>> u' \t\r\n\u200B'.isspace(control_chars=True)
True
msg147644 - (view)	Author: Martin v. Löwis (loewis) *	Date: 2011-11-14 23:40
Making it a feature request would procedurally be ok. However, I'd immediately refuse that as feature creep. Use regular expressions for more advanced stripping than what the .strip method provides.
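The regular-expression route suggested above could look roughly like the sketch below. It assumes Python 3, where `\s` in a `str` pattern matches Unicode whitespace. The names `EXTRA_CHARS` and `strip_with_regex` are hypothetical, and the choice of extra characters (ZWSP, WORD JOINER, ZWNBSP) is an assumption taken from the characters named in this thread, not an agreed-upon set.

```python
import re

# Assumption: these are the extra characters the caller wants treated
# like whitespace at the edges of a string. Adjust to taste.
EXTRA_CHARS = "\u200b\u2060\ufeff"

# Match a run of whitespace/extra characters anchored at the very start
# or the very end of the string.
_EDGE_RE = re.compile(
    r"\A[\s{0}]+|[\s{0}]+\Z".format(re.escape(EXTRA_CHARS))
)


def strip_with_regex(text):
    """Hypothetical helper: strip whitespace plus EXTRA_CHARS from both ends."""
    return _EDGE_RE.sub("", text)


print(repr(strip_with_regex(" \t\u200b hi \u200b\n")))
print(len(strip_with_regex(" \t\r\n\u200b")))
```

Characters in the middle of the string are left alone, so `strip_with_regex("a\u200bb")` still contains the ZWSP.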
msg147645 - (view)	Author: Raymond Hettinger (rhettinger) *	Date: 2011-11-14 23:41
I would also object to the feature creep.
msg147680 - (view)	Author: Dave Mankoff (mankyd)	Date: 2011-11-15 15:07
"Use regular expressions for more advanced stripping than what the .strip method provides." So I guess this brings me back to my original issue. I'm not looking for particularly advanced stripping. I just want to remove all whitespace and other non-printing characters. I personally can never think of a time when I wouldn't want this (especially with isspace). Maybe in some applications, the control characters are useful and shouldn't be stripped, but I would argue that _that_ is the more advanced use case for most people. Thus strip and isspace are now unusable methods in Python for common use cases. This seems unfortunate. I can understand the claims of feature creep. I even understand that having isspace compare itself against non-whitespace characters may seem counter-intuitive on its face. But certainly there must be a satisfactory remedy here.
msg147684 - (view)	Author: Ezio Melotti (ezio.melotti) *	Date: 2011-11-15 15:24
> So I guess this brings me back to my original issue. I'm not looking
> for particularly advanced stripping. I just want to remove all
> whitespace and other non-printing characters.

.strip only strips whitespace. Stripping non-printing characters and additional 'whitespace' is something that is too specific for a builtin method, especially because people might disagree on the characters that are considered whitespace and non-printing.

> Thus strip and isspace are now unusable methods in Python for common
> use cases. This seems unfortunate.

I believe they work fine for the common case -- in fact these methods have been around for years and no one complained. Also Unicode has a number of more or less space-like characters that are not whitespace and whitespace chars that don't look like whitespace. If one needs to strip a different set of (whitespace) chars, it's always possible to pass it to .strip or to define a new function like

def mystrip(s):
    return s.strip().strip(u'\u200B\ufeff')
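One caveat with chaining two `.strip` calls is that it makes a single pass of each, so a string where whitespace and zero-width characters alternate at the edges may not be fully cleaned. The sketch below is one possible way around that, assuming Python 3; `mystrip_all` and `is_blank` are hypothetical names, and the default `extra` set is just the two characters used in the message above.

```python
def mystrip_all(s, extra="\u200b\ufeff"):
    """Hypothetical helper: strip whitespace and `extra` until nothing changes."""
    while True:
        stripped = s.strip().strip(extra)
        if stripped == s:
            return stripped
        s = stripped


def is_blank(s, extra="\u200b\ufeff"):
    """Hypothetical helper: True if `s` holds only whitespace and `extra` chars."""
    return mystrip_all(s, extra) == ""


sample = "\u200b \u200b"
# A single pass leaves the inner space behind; the loop removes it.
print(repr(sample.strip().strip("\u200b\ufeff")))
print(repr(mystrip_all(sample)))
print(is_blank(" \t\r\n\u200b"))
```

The loop stops as soon as a pass removes nothing, so for ordinary strings it costs one extra comparison over the two-call version.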
msg147712 - (view)	Author: Martin v. Löwis (loewis) *	Date: 2011-11-15 22:10
> Thus strip and isspace are now unusable methods in Python for common use cases.

Please recognize that you haven't demonstrated this at all. U+200B is not a character that is common, not even remotely. It's a rare, infrequent, unused character. In addition, in the common use case of string.strip (i.e. get rid of leading and trailing white space for rendering purposes), keeping it causes no harm since it's zero-width.

If you disagree with the resolution of this issue, you will have to write a PEP.

History
Date	User	Action	Args
2022-04-11 14:57:23	admin	set	github: 57600
2011-11-15 22:10:37	loewis	set	messages: + msg147712
2011-11-15 15:24:25	ezio.melotti	set	messages: + msg147684
2011-11-15 15:07:43	mankyd	set	messages: + msg147680
2011-11-14 23:41:47	rhettinger	set	nosy: + rhettinger messages: + msg147645
2011-11-14 23:40:02	loewis	set	messages: + msg147644
2011-11-14 23:32:17	mankyd	set	messages: + msg147642
2011-11-14 21:33:05	loewis	set	messages: + msg147634
2011-11-14 16:39:41	mankyd	set	messages: + msg147606
2011-11-14 16:30:42	ezio.melotti	set	messages: + msg147605
2011-11-14 16:14:54	mankyd	set	messages: + msg147603
2011-11-13 03:44:46	ezio.melotti	set	status: open -> closed assignee: ezio.melotti nosy: + loewis messages: + msg147547 resolution: not a bug stage: resolved
2011-11-13 01:24:32	pitrou	set	versions: + Python 3.3
2011-11-13 00:45:21	mankyd	create