Message 191649 - Python tracker

Author	belopolsky
Recipients	belopolsky, terry.reedy
Date	2013-06-22.17:55:09
Message-id	<1371923710.1.0.487747196864.issue18236@psf.upfronthosting.co.za>

Content
It looks like str.isspace() is incorrect. The proper definition of Unicode whitespace (the White_Space property in PropList.txt) seems to include 26 code points:

# ================================================

0009..000D    ; White_Space # Cc   [5] <control-0009>..<control-000D>
0020          ; White_Space # Zs       SPACE
0085          ; White_Space # Cc       <control-0085>
00A0          ; White_Space # Zs       NO-BREAK SPACE
1680          ; White_Space # Zs       OGHAM SPACE MARK
180E          ; White_Space # Zs       MONGOLIAN VOWEL SEPARATOR
2000..200A    ; White_Space # Zs  [11] EN QUAD..HAIR SPACE
2028          ; White_Space # Zl       LINE SEPARATOR
2029          ; White_Space # Zp       PARAGRAPH SEPARATOR
202F          ; White_Space # Zs       NARROW NO-BREAK SPACE
205F          ; White_Space # Zs       MEDIUM MATHEMATICAL SPACE
3000          ; White_Space # Zs       IDEOGRAPHIC SPACE

# Total code points: 26

http://www.unicode.org/Public/UNIDATA/PropList.txt

Python's str.isspace() uses the following definition: "Whitespace characters are those characters defined in the Unicode character database as “Other” or “Separator” and those with bidirectional property being one of “WS”, “B”, or “S”."

The information separators (U+001C..U+001F) are swept in because of their bidirectional property, which is "B" for U+001C..U+001E and "S" for U+001F. For example:

>>> import unicodedata
>>> unicodedata.bidirectional('\N{RS}')
'B'
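
To see the mismatch more concretely, here is a minimal sketch that compares str.isspace() against the White_Space list quoted above. The names WHITE_SPACE_RANGES, WHITE_SPACE and isspace_extras are made up for illustration, and the ranges are copied by hand from the PropList.txt excerpt in this message rather than read from the Unicode database, so the exact result may differ with the Unicode version bundled with a given interpreter.

```python
import sys
import unicodedata

# White_Space ranges copied by hand from the PropList.txt excerpt quoted
# in this message (an assumption: newer Unicode versions may differ).
WHITE_SPACE_RANGES = [
    (0x0009, 0x000D), (0x0020, 0x0020), (0x0085, 0x0085), (0x00A0, 0x00A0),
    (0x1680, 0x1680), (0x180E, 0x180E), (0x2000, 0x200A), (0x2028, 0x2028),
    (0x2029, 0x2029), (0x202F, 0x202F), (0x205F, 0x205F), (0x3000, 0x3000),
]
WHITE_SPACE = {cp for lo, hi in WHITE_SPACE_RANGES for cp in range(lo, hi + 1)}


def isspace_extras():
    """Return code points that str.isspace() accepts but the list above omits."""
    return [
        cp
        for cp in range(sys.maxunicode + 1)
        if chr(cp).isspace() and cp not in WHITE_SPACE
    ]


extras = isspace_extras()
for cp in extras:
    ch = chr(cp)
    # Show general category and bidirectional class for each extra code point.
    print(f"U+{cp:04X} {unicodedata.category(ch)} {unicodedata.bidirectional(ch)}")
```

The information separators should show up in the output with category Cc and a bidirectional class of "B" or "S", which is what pulls them into str.isspace().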

See also #10587.

History
Date	User	Action	Args
2013-06-22 17:55:10	belopolsky	set	recipients: + belopolsky, terry.reedy
2013-06-22 17:55:10	belopolsky	set	messageid: <1371923710.1.0.487747196864.issue18236@psf.upfronthosting.co.za>
2013-06-22 17:55:10	belopolsky	link	issue18236 messages
2013-06-22 17:55:09	belopolsky	create