classification
Title: string.strip Does Not Remove Zero-Width-Space (ZWSP)
Type: behavior
Stage: resolved
Components: Unicode
Versions: Python 3.3, Python 3.2, Python 2.7
process
Status: closed
Resolution: not a bug
Dependencies:
Superseder:
Assigned To: ezio.melotti
Nosy List: ezio.melotti, loewis, mankyd, rhettinger
Priority: normal
Keywords:

Created on 2011-11-13 00:45 by mankyd, last changed 2011-11-15 22:10 by loewis. This issue is now closed.

Messages (12)
msg147538 - (view) Author: Dave Mankoff (mankyd) Date: 2011-11-13 00:45
Title pretty much says it all. Simple test case:

>>> len(u' \t\r\n\u200B'.strip())
1

Should be zero.

Same problem in Python3:

>>> len(' \t\r\n\u200B'.strip())
1
msg147547 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2011-11-13 03:44
str.strip uses Py_UNICODE_ISSPACE that in turn uses _PyUnicode_IsWhitespace (see Objects/unicodetype_db.h#l3347), and according to the comment there it "Returns 1 for Unicode characters having the bidirectional type 'WS', 'B' or 'S' or the category 'Zs', 0 otherwise."
The category of U+200B is 'Cf', and its bidirectional type is 'BN' so 0 is returned and the character is not stripped.
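
The rule quoted above can be checked from Python itself. This is a minimal sketch, assuming Python 3 and the standard unicodedata module; the helper name is_whitespace_by_rule is hypothetical and only approximates the C-level check:

```python
import unicodedata


def is_whitespace_by_rule(ch):
    """Approximate the quoted rule: bidi type WS, B or S, or category Zs."""
    return (unicodedata.bidirectional(ch) in ("WS", "B", "S")
            or unicodedata.category(ch) == "Zs")


# Print code point, general category, bidi type, the rule's verdict,
# and what str.isspace() reports for a few sample characters.
for ch in (" ", "\t", "\n", "\u00a0", "\u200b"):
    print("U+%04X" % ord(ch),
          unicodedata.category(ch),
          unicodedata.bidirectional(ch),
          is_whitespace_by_rule(ch),
          ch.isspace())
```

For U+200B the category is Cf and the bidirectional type is BN, so neither branch of the rule applies.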

OTOH, Unicode defines the White_Space property and assigns it to 26 chars, whereas _PyUnicode_IsWhitespace includes 4 more chars (1C, 1D, 1E, 1F) that should probably be removed.
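
One way to see the difference is to compare str.isspace() against a hard-coded copy of the White_Space list. This is a sketch under assumptions: the set below is typed in by hand from recent versions of PropList.txt (U+180E was also listed at the time of this issue), and the result depends on the Unicode data shipped with the interpreter:

```python
import sys

# Code points with White_Space=Yes in recent PropList.txt versions.
# Hand-copied assumption, not read from the Unicode database.
WHITE_SPACE = set(
    list(range(0x0009, 0x000E))
    + [0x0020, 0x0085, 0x00A0, 0x1680]
    + list(range(0x2000, 0x200B))
    + [0x2028, 0x2029, 0x202F, 0x205F, 0x3000]
)

# Code points that str.isspace() accepts but the property list does not.
extra = [cp for cp in range(sys.maxunicode + 1)
         if chr(cp).isspace() and cp not in WHITE_SPACE]
print(["U+%04X" % cp for cp in extra])
```

On a recent CPython the extras are expected to include U+001C through U+001F, the four characters mentioned above.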

I'll close this issue because str.strip() is correct regarding U+200B.

@Martin
Do you think those 4 chars should be removed?
If so I'll open another issue.
msg147603 - (view) Author: Dave Mankoff (mankyd) Date: 2011-11-14 16:14
I appreciated the quick turnaround on this.

Perhaps I am misunderstanding the resolution. I understand that strip uses _PyUnicode_IsWhitespace, and that _PyUnicode_IsWhitespace "Returns 1 for Unicode characters having the bidirectional type 'WS', 'B' or 'S' or the category 'Zs', 0 otherwise." However, perhaps this is where the functionality is missing?

Upon further inspection, it looks like there may be other missing white-space characters, such as U+FEFF, "Zero Width No-Break Space". Whatever Unicode categories they're in, they're still a form of white-space and should still be removed, no?

This was not the behavior I expected from strip(). 
This affects string.isspace() as well. I now have to put var.strip().strip(u'\u200B\ufeff') anywhere I want to test for whitespace strings in all my future Python code. (I was bitten by exactly this issue in my code, which is what caused me to file the issue in the first place.)
msg147605 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2011-11-14 16:30
I think those shouldn't be considered whitespace, so they shouldn't be stripped either.
Even if _PyUnicode_IsWhitespace doesn't match exactly the Unicode definition of White_Space, they both agree that ZWSP and ZWNBSP are not whitespace.  ZWNBSP is also used as BOM, and its usage as a zero-width space has been deprecated in favor of WORD JOINER (U+2060).  Similarly WJ is not considered a whitespace.
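
The three characters mentioned here can be inspected with the standard unicodedata and codecs modules. A small illustration, assuming Python 3:

```python
import codecs
import unicodedata

# ZWSP, ZWNBSP (also the BOM) and WORD JOINER: name, category, isspace().
for ch in ("\u200b", "\ufeff", "\u2060"):
    print("U+%04X" % ord(ch),
          unicodedata.name(ch),
          unicodedata.category(ch),
          ch.isspace())

# The UTF-8 byte order mark decodes to U+FEFF when the plain utf-8 codec is used.
bom_char = codecs.BOM_UTF8.decode("utf-8")
print(bom_char == "\ufeff")
```

All three are format characters (category Cf), and str.isspace() returns False for each of them.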
msg147606 - (view) Author: Dave Mankoff (mankyd) Date: 2011-11-14 16:39
But why are they not a space? I mean, they literally have the word space in their name and are used as separators between words. I can't really see any reason why you wouldn't want this behavior - there's no time when I would be thankful that strip removed all spaces except for ZWSP and the like.

As to deprecation, yes, that is true, but they still exist and will continue to do so. (My issue arose when a 3rd party delivered an all whitespace string to me.)

I can't really debate this further as there's not much more to say. I hope the issue will be reconsidered. Thanks again for taking the time to discuss.
msg147634 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2011-11-14 21:33
> But why are they not a space?

Because the Unicode standard says they are not. We have a good tradition in Python to follow standards where they apply, and it appears that the Unicode standard is crystal clear that the characters in question are *not* white space. Why should we second-guess the Unicode consortium when discussing Unicode questions? See also

http://en.wikipedia.org/wiki/Whitespace_character

IOW: get the Unicode consortium to declare them as whitespace, and we happily follow.

Ezio: I do think that _PyUnicode_IsWhitespace should use the White_Space property (from PropList.txt). I'm not quite sure how they computed that property (or whether it's manually curated). Since that's a behavioral change, it can only go into 3.3.
msg147642 - (view) Author: Dave Mankoff (mankyd) Date: 2011-11-14 23:32
So I contacted the Unicode Technical Committee about the issue and promptly received a response. They pointed out that the ZWSP was once upon a time considered white space, but that was changed in Unicode 4.0.1.

http://www.unicode.org/review/resolved-pri.html#pri21

One particular comment worth noting: "... for historical reasons the general category is still Zs (Space Separator)".

Perhaps this ticket can be changed to a feature request? In addition to stripping out whitespace, it is useful to remove any non-printable characters from a string (or know if a string contains any non-printable characters).

Perhaps a boolean keyword parameter, "control_chars" could be added to isspace and strip? Thus:

>>> u' \t\r\n\u200B'.isspace(control_chars=True)
True
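
No such keyword exists on str.isspace or str.strip. The proposed behavior could be approximated outside of str with a small helper. This is only a sketch: the name is_blank and the choice of categories Cc and Cf as "control characters" are assumptions made for illustration, not part of any Python API:

```python
import unicodedata


def is_blank(s, control_chars=False):
    """Like str.isspace(), optionally also accepting Cc/Cf characters.

    Hypothetical helper; mirrors str.isspace() by returning False for "".
    """
    if not s:
        return False
    return all(
        ch.isspace()
        or (control_chars and unicodedata.category(ch) in ("Cc", "Cf"))
        for ch in s
    )


print(is_blank(" \t\r\n\u200b"))
print(is_blank(" \t\r\n\u200b", control_chars=True))
```

With the default arguments the helper agrees with str.isspace(); the ZWSP is only accepted when control_chars=True is passed.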
msg147644 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2011-11-14 23:40
Making it a feature request would procedurally be ok. However, I'd immediately refuse that as feature creep. Use regular expressions for more advanced stripping than what the .strip method provides.
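
A regular-expression version of such stripping might look like the following. It is a sketch assuming Python 3.3 or later, where the re module understands \uXXXX escapes in patterns; the function name strip_invisible and the choice of extra characters (U+200B and U+FEFF) are illustrative:

```python
import re

# Runs of whitespace, ZWSP or ZWNBSP anchored at either end of the string.
_EDGE_RE = re.compile(r"^[\s\u200b\ufeff]+|[\s\u200b\ufeff]+$")


def strip_invisible(s):
    """Remove leading/trailing whitespace plus U+200B and U+FEFF."""
    return _EDGE_RE.sub("", s)


print(repr(strip_invisible(" \t\r\n\u200b")))
print(repr(strip_invisible("\ufeff \u200b a b \u200b ")))
```

Because the character class mixes \s with the extra code points, whitespace and zero-width characters are removed in any order, while characters in the middle of the string are left alone.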
msg147645 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2011-11-14 23:41
I would also object to the feature creep.
msg147680 - (view) Author: Dave Mankoff (mankyd) Date: 2011-11-15 15:07
"Use regular expressions for more advanced stripping than what the .strip method provides."

So I guess this brings me back to my original issue. I'm not looking for particularly advanced stripping. I just want to remove all whitespace and other non-printing characters. I personally can never think of a time when I wouldn't want this (especially with isspace). Maybe in some applications, the control characters are useful and shouldn't be stripped, but I would argue that _that_ is the more advanced use case for most people.

Thus strip and isspace are now unusable methods in Python for common use cases. This seems unfortunate.

I can understand the claims of feature creep. I even understand that having isspace compare itself against non-whitespace characters may seem counter-intuitive on its face. But certainly there must be a satisfactory remedy here.
msg147684 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2011-11-15 15:24
> So I guess this brings me back to my original issue. I'm not looking 
> for particularly advanced stripping. I just want to remove all 
> whitespace and other non-printing characters.

.strip only strips whitespace.  Stripping non-printing characters and additional 'whitespace' is something that is too specific for a builtin method, especially because people might disagree on the characters that are considered whitespace and non-printing.

> Thus strip and isspace are now unusable methods in Python for common
> use cases. This seems unfortunate.

I believe they work fine for the common case -- in fact these methods have been around for years and no one complained.
Also Unicode has a number of more or less space-like characters that are not whitespace and whitespace chars that don't look like whitespace.
If one needs to strip a different set of (whitespace) chars, it's always possible to pass it to .strip or to define a new function like
def mystrip(s):
    return s.strip().strip(u'\u200B\ufeff')
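
Note that chaining two strip calls removes whitespace first and the extra characters second, so a string in which the two kinds are interleaved at the ends is not fully stripped. A predicate-based variant avoids this. The following is a sketch, assuming Python 3; strip_where and is_invisible are hypothetical names, and treating category Cf as "non-printing" is one possible choice among several:

```python
import unicodedata


def mystrip(s):
    # The two-pass version from above, kept for comparison.
    return s.strip().strip("\u200b\ufeff")


def strip_where(s, pred):
    """Strip leading and trailing characters for which pred(ch) is true."""
    start, end = 0, len(s)
    while start < end and pred(s[start]):
        start += 1
    while end > start and pred(s[end - 1]):
        end -= 1
    return s[start:end]


def is_invisible(ch):
    # Assumption: whitespace plus format characters (category Cf).
    return ch.isspace() or unicodedata.category(ch) == "Cf"


print(repr(mystrip("\u200b \u200b")))
print(repr(strip_where("\u200b \u200b", is_invisible)))
```

Passing the predicate as an argument leaves the decision about which characters count as strippable to the caller.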
msg147712 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2011-11-15 22:10
> Thus strip and isspace are now unusable methods in Python for common use cases.

Please recognize that you haven't demonstrated this at all. U+200B is
*not* a character that is common, not even remotely. It's a rare,
infrequent, unused character. In addition, in the common use case
of string.strip (i.e. get rid of leading and trailing white space for
rendering purposes), keeping it causes no harm since it's zero-width.

If you disagree with the resolution of this issue, you will have to
write a PEP.
History
Date                 User          Action  Args
2011-11-15 22:10:37  loewis        set     messages: + msg147712
2011-11-15 15:24:25  ezio.melotti  set     messages: + msg147684
2011-11-15 15:07:43  mankyd        set     messages: + msg147680
2011-11-14 23:41:47  rhettinger    set     nosy: + rhettinger; messages: + msg147645
2011-11-14 23:40:02  loewis        set     messages: + msg147644
2011-11-14 23:32:17  mankyd        set     messages: + msg147642
2011-11-14 21:33:05  loewis        set     messages: + msg147634
2011-11-14 16:39:41  mankyd        set     messages: + msg147606
2011-11-14 16:30:42  ezio.melotti  set     messages: + msg147605
2011-11-14 16:14:54  mankyd        set     messages: + msg147603
2011-11-13 03:44:46  ezio.melotti  set     status: open -> closed; assignee: ezio.melotti; nosy: + loewis; messages: + msg147547; resolution: not a bug; stage: resolved
2011-11-13 01:24:32  pitrou        set     versions: + Python 3.3
2011-11-13 00:45:21  mankyd        create