Issue 18236: str.isspace should use the Unicode White_Space property

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/62436

classification

Title:	str.isspace should use the Unicode White_Space property
Type:	behavior	Stage:	patch review
Components:	Interpreter Core, Unicode	Versions:	Python 3.9

process

Status:	open	Resolution:
Dependencies:		Superseder:
Assigned To:		Nosy List:	Greg Price, belopolsky, ezio.melotti, lemburg, loewis, malin, martin.panter, r.david.murray, serhiy.storchaka, terry.reedy, vstinner
Priority:	normal	Keywords:	patch

Created on 2013-06-17 01:06 by belopolsky, last changed 2022-04-11 14:57 by admin.

Files
File name	Uploaded	Description	Edit
42973dfea391.diff	belopolsky, 2013-06-23 23:15		review

Pull Requests
URL	Status	Linked	Edit
PR 16254	open	Greg Price, 2019-09-18 06:02

Repositories containing patches
https://bitbucket.org/alexander_belopolsky/cpython/commits/branch/issue-18236#issue-18236

Messages (18)
msg191303 - (view)	Author: Alexander Belopolsky (belopolsky) *	Date: 2013-06-17 01:06
ASCII information separators, hex codes 1C through 1F are classified as space:

>>> all(c.isspace() for c in '\N{FS}\N{GS}\N{RS}\N{US}')
True

but int()/float() do not accept strings with leading or trailing separators:

>>> int('123\N{RS}')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: invalid literal for int() with base 10: '123\x1e'

This is probably because corresponding bytes values are not classified as whitespace:

>>> any(c.encode().isspace() for c in '\N{FS}\N{GS}\N{RS}\N{US}')
False
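The mismatch reported above can be enumerated over the whole ASCII range. This is only a minimal sketch using built-in str and bytes methods; the helper name `ascii_isspace_mismatches` is made up for illustration, and the printed result depends on the interpreter version, since this issue proposes changing str.isspace().

```python
def ascii_isspace_mismatches():
    """Return ASCII code points where str.isspace() and bytes.isspace() disagree."""
    mismatches = []
    for code in range(128):
        as_str = chr(code).isspace()
        as_bytes = bytes([code]).isspace()
        if as_str != as_bytes:
            mismatches.append(code)
    return mismatches


# On interpreters that behave as described in this report, the list
# contains the information separators 0x1c through 0x1f.
print([hex(code) for code in ascii_isspace_mismatches()])
```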
msg191612 - (view)	Author: Terry J. Reedy (terry.reedy) *	Date: 2013-06-21 21:54
You stated facts: what is your proposal?

The fact that unicode calls characters 'space' does not make them whitespace as commonly understood, or as defined by C, or even as defined by the Unicode database. Unicode apparently has a WSpace property. According to the table in
https://en.wikipedia.org/wiki/Whitespace_%28computer_science%29
1C - 1F are not included by that definition either. For ascii chars, that table matches the C definition, with \r included.

So I think your implied proposal to treat them as whitespace (in strings but not bytes) should be rejected as invalid.

For 3.x, the manual should specify that it follows the C definition of 'whitespace' (\r included) for bytes and the extended unicode definition for strings.

>>> int('3\r')
3
>>> int('3\u00a0')
3
>>> int('3\u2000')
3
>>> int(b'3\r')
3
>>> int(b'3\u00a0')
Traceback (most recent call last):
  File "<pyshell#10>", line 1, in <module>
    int(b'3\u00a0')
ValueError: invalid literal for int() with base 10: '3\\u00a0'
msg191647 - (view)	Author: Alexander Belopolsky (belopolsky) *	Date: 2013-06-22 16:33
> You stated facts: what is your proposal?

There is a bug somewhere. We cannot simultaneously have

>>> '\N{RS}'.isspace()
True

and not accept '\N{RS}' as whitespace when parsing numbers.

I believe int(x) should be equivalent to int(x.strip()). This is not the case now:

>>> '123\N{RS}'.strip()
'123'
>>> int('123\N{RS}')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: invalid literal for int() with base 10: '123\x1e'

The reason I did not clearly state my proposal is because I am not sure whether bytes.isspace or str.isspace is correct, but I don't see any justification for having them defined differently in the ASCII range.
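The proposed invariant, int(x) behaving like int(x.strip()), can be checked for any given input. This is a rough sketch; `int_agrees_with_strip` is a hypothetical helper, and the result for the RS example depends on the interpreter version.

```python
def int_agrees_with_strip(text):
    """Return True if int(text) and int(text.strip()) give the same outcome.

    A ValueError is treated as the outcome None, so two failures also agree.
    """
    def attempt(value):
        try:
            return int(value)
        except ValueError:
            return None

    return attempt(text) == attempt(text.strip())


print(int_agrees_with_strip(" 123\n"))
# The case from this report: a trailing RECORD SEPARATOR (U+001E).
print(int_agrees_with_strip("123\x1e"))
```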
msg191648 - (view)	Author: Terry J. Reedy (terry.reedy) *	Date: 2013-06-22 17:15
I see your point now. Since RS is not whitespace by any definition I knew of previously, why is RS.isspace True?

Apparent answer: the doc says '''Return true if there are only whitespace characters in the string and there is at least one character, false otherwise. Whitespace characters are those characters defined in the Unicode character database as “Other” or “Separator” and those with bidirectional property being one of “WS”, “B”, or “S”.''' I suspect this is a more expansive definition than WSpace chars, which seems to be the one used by int(), but you could check the int code.

The bytes docs say: "Whenever a bytes or bytearray method needs to interpret the bytes as characters (e.g. the is...() methods, split(), strip()), the ASCII character set is assumed (text strings use Unicode semantics)."

This says to me that str.isxxx and bytes.isxxx should match on ascii chars and not otherwise. That would happen if the bytes methods checked for all-ascii input, decoded to unicode, and used the str method. Since they do not match, bytes must do something different.

I think there is one definite bug: the discrepancy between str.isspace and bytes.isspace. There is possibly another bug: the discrepancy between 'whitespace' for str.isspace and int/float. After pinning down the details, I think you should ask how to resolve these on py-dev, and which versions to patch.
msg191649 - (view)	Author: Alexander Belopolsky (belopolsky) *	Date: 2013-06-22 17:55
It looks like str.isspace() is incorrect. The proper definition of unicode whitespace seems to include 26 characters:

# ================================================

0009..000D    ; White_Space # Cc   [5] <control-0009>..<control-000D>
0020          ; White_Space # Zs       SPACE
0085          ; White_Space # Cc       <control-0085>
00A0          ; White_Space # Zs       NO-BREAK SPACE
1680          ; White_Space # Zs       OGHAM SPACE MARK
180E          ; White_Space # Zs       MONGOLIAN VOWEL SEPARATOR
2000..200A    ; White_Space # Zs  [11] EN QUAD..HAIR SPACE
2028          ; White_Space # Zl       LINE SEPARATOR
2029          ; White_Space # Zp       PARAGRAPH SEPARATOR
202F          ; White_Space # Zs       NARROW NO-BREAK SPACE
205F          ; White_Space # Zs       MEDIUM MATHEMATICAL SPACE
3000          ; White_Space # Zs       IDEOGRAPHIC SPACE

# Total code points: 26

http://www.unicode.org/Public/UNIDATA/PropList.txt

Python's str.isspace() uses the following definition: "Whitespace characters are those characters defined in the Unicode character database as “Other” or “Separator” and those with bidirectional property being one of “WS”, “B”, or “S”."

Information separators are swept in because they have bidirectional property "B":

>>> unicodedata.bidirectional('\N{RS}')
'B'

See also #10587.
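The "Total code points: 26" figure can be reproduced by expanding the ranges in the excerpt. This is a sketch only: `PROPLIST_EXCERPT` and `parse_white_space` are illustrative names, and the excerpt is the 2013 snapshot quoted in this message, not necessarily the current PropList.txt.

```python
PROPLIST_EXCERPT = """\
0009..000D    ; White_Space # Cc   [5] <control-0009>..<control-000D>
0020          ; White_Space # Zs       SPACE
0085          ; White_Space # Cc       <control-0085>
00A0          ; White_Space # Zs       NO-BREAK SPACE
1680          ; White_Space # Zs       OGHAM SPACE MARK
180E          ; White_Space # Zs       MONGOLIAN VOWEL SEPARATOR
2000..200A    ; White_Space # Zs  [11] EN QUAD..HAIR SPACE
2028          ; White_Space # Zl       LINE SEPARATOR
2029          ; White_Space # Zp       PARAGRAPH SEPARATOR
202F          ; White_Space # Zs       NARROW NO-BREAK SPACE
205F          ; White_Space # Zs       MEDIUM MATHEMATICAL SPACE
3000          ; White_Space # Zs       IDEOGRAPHIC SPACE
# Total code points: 26
"""


def parse_white_space(text):
    """Collect code points tagged White_Space in PropList.txt-style lines."""
    code_points = set()
    for line in text.splitlines():
        data = line.split("#", 1)[0].strip()  # drop trailing comments
        if not data:
            continue
        span, prop = (field.strip() for field in data.split(";"))
        if prop != "White_Space":
            continue
        first, _, last = span.partition("..")
        start = int(first, 16)
        end = int(last, 16) if last else start
        code_points.update(range(start, end + 1))
    return code_points


WHITE_SPACE = parse_white_space(PROPLIST_EXCERPT)
print(len(WHITE_SPACE))  # 26, matching the total in the excerpt

# Code points that the running interpreter's str.isspace() accepts but the
# excerpt does not list; the result varies with the Python version.
print([hex(c) for c in range(0x3001) if chr(c).isspace() and c not in WHITE_SPACE])
```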
msg191650 - (view)	Author: Alexander Belopolsky (belopolsky) *	Date: 2013-06-22 18:19
I did a little more investigation and it looks like information separators have been included in whitespace since unicode type was first implemented in Python:

guido 11967 Fri Mar 10 22:52:46 2000 +0000: /* Returns 1 for Unicode characters having the type 'WS', 'B' or 'S',
guido 11967 Fri Mar 10 22:52:46 2000 +0000:    0 otherwise. */
guido 11967 Fri Mar 10 22:52:46 2000 +0000:
guido 11967 Fri Mar 10 22:52:46 2000 +0000: int _PyUnicode_IsWhitespace(register const Py_UNICODE ch)
guido 11967 Fri Mar 10 22:52:46 2000 +0000: {
guido 11967 Fri Mar 10 22:52:46 2000 +0000:     switch (ch) {
guido 11967 Fri Mar 10 22:52:46 2000 +0000:     case 0x0009: /* HORIZONTAL TABULATION */
guido 11967 Fri Mar 10 22:52:46 2000 +0000:     case 0x000A: /* LINE FEED */
guido 11967 Fri Mar 10 22:52:46 2000 +0000:     case 0x000B: /* VERTICAL TABULATION */
guido 11967 Fri Mar 10 22:52:46 2000 +0000:     case 0x000C: /* FORM FEED */
guido 11967 Fri Mar 10 22:52:46 2000 +0000:     case 0x000D: /* CARRIAGE RETURN */
guido 11967 Fri Mar 10 22:52:46 2000 +0000:     case 0x001C: /* FILE SEPARATOR */
guido 11967 Fri Mar 10 22:52:46 2000 +0000:     case 0x001D: /* GROUP SEPARATOR */
guido 11967 Fri Mar 10 22:52:46 2000 +0000:     case 0x001E: /* RECORD SEPARATOR */
guido 11967 Fri Mar 10 22:52:46 2000 +0000:     case 0x001F: /* UNIT SEPARATOR */
guido 11967 Fri Mar 10 22:52:46 2000 +0000:     case 0x0020: /* SPACE */
guido 11967 Fri Mar 10 22:52:46 2000 +0000:     case 0x1680: /* OGHAM SPACE MARK */
guido 11967 Fri Mar 10 22:52:46 2000 +0000:     case 0x2000: /* EN QUAD */
guido 11967 Fri Mar 10 22:52:46 2000 +0000:     case 0x2001: /* EM QUAD */
guido 11967 Fri Mar 10 22:52:46 2000 +0000:     case 0x2002: /* EN SPACE */
guido 11967 Fri Mar 10 22:52:46 2000 +0000:     case 0x2003: /* EM SPACE */
guido 11967 Fri Mar 10 22:52:46 2000 +0000:     case 0x2004: /* THREE-PER-EM SPACE */
guido 11967 Fri Mar 10 22:52:46 2000 +0000:     case 0x2005: /* FOUR-PER-EM SPACE */
guido 11967 Fri Mar 10 22:52:46 2000 +0000:     case 0x2006: /* SIX-PER-EM SPACE */
guido 11967 Fri Mar 10 22:52:46 2000 +0000:     case 0x2007: /* FIGURE SPACE */
guido 11967 Fri Mar 10 22:52:46 2000 +0000:     case 0x2008: /* PUNCTUATION SPACE */
guido 11967 Fri Mar 10 22:52:46 2000 +0000:     case 0x2009: /* THIN SPACE */
guido 11967 Fri Mar 10 22:52:46 2000 +0000:     case 0x200A: /* HAIR SPACE */
guido 11967 Fri Mar 10 22:52:46 2000 +0000:     case 0x2028: /* LINE SEPARATOR */
guido 11967 Fri Mar 10 22:52:46 2000 +0000:     case 0x202F: /* NARROW NO-BREAK SPACE */
guido 11967 Fri Mar 10 22:52:46 2000 +0000:     case 0x3000: /* IDEOGRAPHIC SPACE */
guido 11967 Fri Mar 10 22:52:46 2000 +0000:         return 1;
guido 11967 Fri Mar 10 22:52:46 2000 +0000:     default:
guido 11967 Fri Mar 10 22:52:46 2000 +0000:         return 0;
guido 11967 Fri Mar 10 22:52:46 2000 +0000:     }
guido 11967 Fri Mar 10 22:52:46 2000 +0000: }

(hg blame -u -d -n -r 11967 Objects/unicodectype.c)
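The comment in that original code states the rule as bidirectional type 'WS', 'B' or 'S'. A small sketch of that rule using the unicodedata module shows why the information separators qualify; `legacy_rule_isspace` is a hypothetical name, and this approximates the comment only, not the exact hand-written switch.

```python
import unicodedata

# Bidirectional classes named in the comment of the original implementation.
LEGACY_BIDI_CLASSES = {"WS", "B", "S"}


def legacy_rule_isspace(ch):
    """Approximate the commented rule: bidirectional class WS, B or S."""
    return unicodedata.bidirectional(ch) in LEGACY_BIDI_CLASSES


# FS, GS, RS and US with their bidirectional class and the rule's verdict.
for code in range(0x1C, 0x20):
    ch = chr(code)
    print(hex(code), unicodedata.bidirectional(ch), legacy_rule_isspace(ch))
```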
msg191652 - (view)	Author: Alexander Belopolsky (belopolsky) *	Date: 2013-06-22 18:35
Martin v. Löwis wrote at #13391 (msg147634):

> I do think that _PyUnicode_IsWhitespace should use the White_Space
> property (from PropList.txt). I'm not quite sure how they computed
> that property (or whether it's manually curated). Since that's a
> behavioral change, it can only go into 3.3.

I am adding Martin and Ezio to the "nosy."
msg191687 - (view)	Author: Martin v. Löwis (loewis) *	Date: 2013-06-23 07:59
I stand by that comment: IsWhiteSpace should use the Unicode White_Space property. Since FS/GS/RS/US are not in the White_Space property, it's correct that the int conversion fails. It's incorrect that .isspace() gives true.

There are really several bugs here:
- .isspace doesn't use the White_Space property
- int conversion ultimately uses Py_ISSPACE, which conceptually could deviate from the Unicode properties (as it is byte-based). This is not really an issue, since they indeed match.

I propose to fix this by parsing PropList.txt, and generating _PyUnicode_IsWhitespace based on the White_Space property. For efficiency, it should also generate a fast-lookup array for the ASCII case, or just use _Py_ctype_table (with a comment that this table needs to match PropList White_Space). _Py_ascii_whitespace should go.

Contributions are welcome.
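One way to picture the proposed generation step is a script that turns the White_Space set into a 128-entry ASCII table and falls back to set membership elsewhere. This is a rough sketch under stated assumptions: the code points are the 26 listed earlier in this thread, and every name here (including the emitted C array name) is hypothetical rather than taken from makeunicodedata.py.

```python
# Assumed input: the 26 White_Space code points quoted earlier in this thread.
WHITE_SPACE = (
    set(range(0x0009, 0x000E))
    | {0x0020, 0x0085, 0x00A0, 0x1680, 0x180E}
    | set(range(0x2000, 0x200B))
    | {0x2028, 0x2029, 0x202F, 0x205F, 0x3000}
)


def build_ascii_table(code_points):
    """Return a 128-entry list of 0/1 flags for the ASCII range."""
    return [1 if code in code_points else 0 for code in range(128)]


def emit_c_array(name, table):
    """Format the table as C source text, 16 entries per row."""
    rows = []
    for start in range(0, len(table), 16):
        chunk = ", ".join(str(flag) for flag in table[start:start + 16])
        rows.append("    " + chunk + ",")
    body = "\n".join(rows)
    return "static const unsigned char %s[%d] = {\n%s\n};" % (name, len(table), body)


def is_whitespace(code, table, code_points):
    """Fast path for ASCII via the table, membership test otherwise."""
    return bool(table[code]) if code < 128 else code in code_points


ASCII_TABLE = build_ascii_table(WHITE_SPACE)
print(emit_c_array("hypothetical_ascii_whitespace", ASCII_TABLE))
```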
msg191689 - (view)	Author: Marc-Andre Lemburg (lemburg) *	Date: 2013-06-23 10:54
I agree with Martin. At the time Unicode was added to Python, there was no single Unicode property for white space, so I had to deduce this from the other available properties. Now that we have a white space property in Unicode, we should start using it. Fortunately, the differences between Python's set of white space chars and the ones having the Unicode white space property are minimal.
msg191706 - (view)	Author: Alexander Belopolsky (belopolsky) *	Date: 2013-06-23 15:15
I have updated the title to focus this issue on the behavior of str.isspace(). I'll pick up remaining int/float issues in #10581.
msg191739 - (view)	Author: Alexander Belopolsky (belopolsky) *	Date: 2013-06-23 23:34
I would like someone to review this change:

https://bitbucket.org/alexander_belopolsky/cpython/commits/92c187025d0a8a989d9f81f2cb4c96f4eecb81cb?at=issue-18236

The patch can go in without this optimization, but I think this is the right first step towards removing _Py_ascii_whitespace.

I don't think there is a need to generate ASCII optimization in makeunicodedata. While technically Unicode stability policy only guarantees that White_Space property will not be removed from code points that have it, I think there is little chance that they will ever add White_Space property to another code point in the ASCII range and if they do, I am not sure Python will have to follow.
msg191746 - (view)	Author: Martin v. Löwis (loewis) *	Date: 2013-06-24 05:52
-1 on that patch. It's using trickery to implement the test, and it's not clear that it is actually as efficient as the previous version. The previous version was explicitly modified to use a table lookup for performance reasons. I'd be fine with not generating PyUnicode_IsWhiteSpace at all, but instead hand-coding it. I suspect that we might want to use more of PropList at some point, so an effort to parse it might not be wasted.
msg221919 - (view)	Author: Alexander Belopolsky (belopolsky) *	Date: 2014-06-29 23:48
For future reference, the code discussed above is in the following portion of the patch:

-#define Py_UNICODE_ISSPACE(ch) \
-    ((ch) < 128U ? _Py_ascii_whitespace[(ch)] : _PyUnicode_IsWhitespace(ch))
+#define Py_UNICODE_ISSPACE(ch) \
+    ((ch) == ' ' || \
+     ((ch) < 128U ? (ch) - 0x9U < 5U : _PyUnicode_IsWhitespace(ch)))
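The `(ch) - 0x9U < 5U` test relies on unsigned wraparound: values below 0x09 wrap to very large numbers and fail the comparison. The following is a sketch of that arithmetic, assuming a 32-bit unsigned character type; `patched_ascii_isspace` is a hypothetical name, and the legacy set is the ASCII part of the 2000-era switch statement quoted earlier.

```python
UINT32_MASK = 0xFFFFFFFF  # assumption: ch is a 32-bit unsigned value


def patched_ascii_isspace(ch):
    """Emulate `(ch) == ' ' || (ch) - 0x9U < 5U` with unsigned wraparound."""
    return ch == 0x20 or ((ch - 0x9) & UINT32_MASK) < 5


# ASCII entries of the original switch statement quoted earlier in the thread.
LEGACY_ASCII_SPACES = set(range(0x09, 0x0E)) | set(range(0x1C, 0x20)) | {0x20}

patched = {ch for ch in range(128) if patched_ascii_isspace(ch)}

# Code points the legacy set accepts but the patched expression rejects.
print(sorted(hex(ch) for ch in LEGACY_ASCII_SPACES - patched))
```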
msg230183 - (view)	Author: Martin Panter (martin.panter) *	Date: 2014-10-28 22:40
As uncovered in Issue 12855, str.splitlines() currently considers the FS, GS and RS (1C–1E), but not the US (1F), to be line breaks. It might be surprising if these are no longer considered white space but are still considered line breaks.
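The splitlines() behavior mentioned here is easy to observe directly. A short sketch follows; the isspace() column varies with the interpreter version, since that is what this issue would change.

```python
sample = "a\x1cb\x1dc\x1ed\x1fe"

# FS, GS and RS split the string; US does not.
print(sample.splitlines())

for code in range(0x1C, 0x20):
    ch = chr(code)
    breaks_line = len(("x" + ch + "y").splitlines()) == 2
    print(hex(code), "line break:", breaks_line, "isspace:", ch.isspace())
```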
msg349213 - (view)	Author: Greg Price (Greg Price) *	Date: 2019-08-08 04:51
I've gone and made a patch for this change:
https://github.com/gnprice/cpython/commit/7dab9d879

Most of the work happens in the script Tools/unicode/makeunicodedata.py, and along the way I made several changes there that I found made it somewhat nicer to work on, and I think will help other people reading that script too. So I'd like to try to merge those improvements first.

I've filed #37760 for those preparatory changes, and posted several PRs (GH-15128, GH-15129, GH-15130) as bite-sized pieces. These PRs can go in in any order.

Please take a look! Reviews appreciated.
msg349214 - (view)	Author: Ma Lin (malin) *	Date: 2019-08-08 05:16
Greg, could you try this code after your patch?

>>> import re
>>> re.match(r'\s', '\x1e')
<re.Match object; span=(0, 1), match='\x1e'>    # <- before patch
msg349233 - (view)	Author: Greg Price (Greg Price) *	Date: 2019-08-08 13:21
Good question! With the patch:

>>> import re
>>> re.match(r'\s', '\x1e')
>>>

In other words, the definition of the regexp r'\s' follows along. Good to know.
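That r'\s' "follows along" can be checked on any interpreter by comparing it with str.isspace() over a range of code points. This is a sketch only; `regex_disagreements` is a hypothetical helper and the default limit simply covers the code points discussed in this thread.

```python
import re

WHITESPACE_RE = re.compile(r"\s")


def regex_disagreements(limit=0x3001):
    """Return code points below `limit` where r'\\s' and str.isspace() differ."""
    return [
        code
        for code in range(limit)
        if bool(WHITESPACE_RE.fullmatch(chr(code))) != chr(code).isspace()
    ]


# An empty list means the two definitions agree on the scanned range.
print(regex_disagreements())
```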
msg350414 - (view)	Author: Greg Price (Greg Price) *	Date: 2019-08-25 00:33
> I've gone and made a patch for this change

Update:

* The preparatory changes in #37760 are now almost all merged; GH-15265 is the one piece remaining, and I'd be grateful for a review.

  It's a generally straightforward and boring change that converts the main data structures of makeunicodedata.py from using length-18 tuples as records to using a dataclass, which I think makes subsequent changes that add features to that script much easier both to write and to review.

* I have a slightly updated version of the fix itself, which differs mainly by adding a test: https://github.com/gnprice/cpython/commit/9b3bf6739

  Comments welcome there too.

History
Date	User	Action	Args
2022-04-11 14:57:47	admin	set	github: 62436
2019-09-18 06:02:16	Greg Price	set	keywords: + patch stage: needs patch -> patch review pull_requests: + pull_request15849
2019-08-25 00:33:01	Greg Price	set	messages: + msg350414
2019-08-08 13:21:55	Greg Price	set	messages: + msg349233
2019-08-08 05:16:40	malin	set	nosy: + malin messages: + msg349214
2019-08-08 04:51:18	Greg Price	set	nosy: + Greg Price messages: + msg349213 versions: + Python 3.9, - Python 3.5
2014-10-28 22:40:05	martin.panter	set	nosy: + martin.panter messages: + msg230183
2014-06-29 23:48:58	belopolsky	set	messages: + msg221919
2014-06-29 23:38:42	belopolsky	set	keywords: - patch, needs review assignee: belopolsky -> stage: commit review -> needs patch versions: + Python 3.5, - Python 3.3, Python 3.4
2013-07-01 11:51:15	r.david.murray	set	nosy: + r.david.murray
2013-06-24 08:11:49	serhiy.storchaka	set	nosy: + serhiy.storchaka
2013-06-24 05:52:28	loewis	set	messages: + msg191746
2013-06-23 23:34:30	belopolsky	set	keywords: + needs review messages: + msg191739 components: + Interpreter Core, Unicode stage: commit review
2013-06-23 23:22:13	belopolsky	set	files: - 3ed5bb7fcee9.diff
2013-06-23 23:15:24	belopolsky	set	files: + 42973dfea391.diff
2013-06-23 15:16:02	belopolsky	set	files: - 5c934626d44d.diff
2013-06-23 15:15:29	belopolsky	set	files: + 3ed5bb7fcee9.diff
2013-06-23 15:15:05	belopolsky	set	assignee: belopolsky messages: + msg191706 title: int() and float() do not accept strings with trailing separators -> str.isspace should use the Unicode White_Space property
2013-06-23 14:13:54	belopolsky	set	files: + 5c934626d44d.diff keywords: + patch
2013-06-23 13:48:14	belopolsky	set	hgrepos: + hgrepo201
2013-06-23 11:44:16	vstinner	set	nosy: + vstinner
2013-06-23 10:54:21	lemburg	set	nosy: + lemburg messages: + msg191689
2013-06-23 07:59:48	loewis	set	messages: + msg191687
2013-06-22 18:35:27	belopolsky	set	nosy: + loewis, ezio.melotti messages: + msg191652
2013-06-22 18:19:51	belopolsky	set	messages: + msg191650
2013-06-22 17:55:10	belopolsky	set	messages: + msg191649
2013-06-22 17:15:05	terry.reedy	set	type: enhancement -> behavior messages: + msg191648 versions: + Python 3.3
2013-06-22 16:33:59	belopolsky	set	messages: + msg191647
2013-06-21 21:54:51	terry.reedy	set	versions: + Python 3.4 nosy: + terry.reedy messages: + msg191612 type: behavior -> enhancement
2013-06-17 01:06:30	belopolsky	create