
Author belopolsky
Recipients belopolsky, terry.reedy
Date 2013-06-22.17:55:09
Message-id <1371923710.1.0.487747196864.issue18236@psf.upfronthosting.co.za>
Content
It looks like str.isspace() is incorrect.  The proper definition of Unicode whitespace (the White_Space property in PropList.txt) seems to include 26 characters:

# ================================================

0009..000D    ; White_Space # Cc   [5] <control-0009>..<control-000D>
0020          ; White_Space # Zs       SPACE
0085          ; White_Space # Cc       <control-0085>
00A0          ; White_Space # Zs       NO-BREAK SPACE
1680          ; White_Space # Zs       OGHAM SPACE MARK
180E          ; White_Space # Zs       MONGOLIAN VOWEL SEPARATOR
2000..200A    ; White_Space # Zs  [11] EN QUAD..HAIR SPACE
2028          ; White_Space # Zl       LINE SEPARATOR
2029          ; White_Space # Zp       PARAGRAPH SEPARATOR
202F          ; White_Space # Zs       NARROW NO-BREAK SPACE
205F          ; White_Space # Zs       MEDIUM MATHEMATICAL SPACE
3000          ; White_Space # Zs       IDEOGRAPHIC SPACE

# Total code points: 26

http://www.unicode.org/Public/UNIDATA/PropList.txt

Python's str.isspace() uses the following definition: "Whitespace characters are those characters defined in the Unicode character database as “Other” or “Separator” and those with bidirectional property being one of “WS”, “B”, or “S”."
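
One way to read the documented rule is sketched below. This is an assumption-laden illustration, not the actual implementation: the helper name documented_isspace is hypothetical, and "Separator" is taken here to mean general category Zs only. The check at the end compares the sketch with str.isspace() on a handful of sample characters.

```python
import unicodedata

# Bidirectional classes named in the documented definition quoted above.
SPACE_BIDI_CLASSES = {"WS", "B", "S"}


def documented_isspace(ch):
    """Hypothetical helper: one reading of the documented rule.

    Assumption: "Separator" is interpreted as general category Zs.
    """
    return (
        unicodedata.category(ch) == "Zs"
        or unicodedata.bidirectional(ch) in SPACE_BIDI_CLASSES
    )


# Compare the sketch with str.isspace() on a few sample characters.
samples = " \t\n\x1c\x1e\x85\xa0\u2028\u3000aZ0"
for ch in samples:
    print(f"U+{ord(ch):04X}", documented_isspace(ch), ch.isspace())
```

Under this reading, a control character needs no Separator category at all to count as whitespace; a bidirectional class of "B" or "S" is enough.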

Information separators (U+001C..U+001F) are swept in because they have bidirectional property "B" (or "S" in the case of U+001F):

>>> import unicodedata
>>> unicodedata.bidirectional('\N{RS}')
'B'
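
The mismatch can be checked directly with a short script. This is a minimal sketch: the ranges are copied by hand from the PropList.txt excerpt quoted above, and the result depends on the Unicode database bundled with the running interpreter, so the "missing" set in particular may differ between Python versions.

```python
import sys
import unicodedata

# White_Space ranges as quoted above from PropList.txt (inclusive bounds).
WHITE_SPACE_RANGES = [
    (0x0009, 0x000D), (0x0020, 0x0020), (0x0085, 0x0085), (0x00A0, 0x00A0),
    (0x1680, 0x1680), (0x180E, 0x180E), (0x2000, 0x200A), (0x2028, 0x2028),
    (0x2029, 0x2029), (0x202F, 0x202F), (0x205F, 0x205F), (0x3000, 0x3000),
]
white_space = {cp for lo, hi in WHITE_SPACE_RANGES for cp in range(lo, hi + 1)}

# Every code point that str.isspace() accepts in this interpreter.
isspace_true = {cp for cp in range(sys.maxunicode + 1) if chr(cp).isspace()}

# Accepted by isspace() but absent from the quoted White_Space list.
extra = sorted(isspace_true - white_space)
# In the quoted White_Space list but rejected by isspace().
missing = sorted(white_space - isspace_true)

for label, code_points in (("extra", extra), ("missing", missing)):
    for cp in code_points:
        ch = chr(cp)
        print(label, f"U+{cp:04X}",
              unicodedata.category(ch), unicodedata.bidirectional(ch))
```

The category and bidirectional columns show which part of the documented rule pulls each extra character in.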

See also #10587.