classification
Title: str.isspace() for U+00A0 and U+202F differs from document
Type: behavior Stage: resolved
Components: Documentation, Unicode Versions: Python 3.8, Python 3.7
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: docs@python Nosy List: Greg Price, Jun, Jun_, SilentGhost, benjamin.peterson, docs@python, ezio.melotti, lemburg, miss-islington, vstinner
Priority: normal Keywords: patch

Created on 2019-04-02 06:36 by Jun, last changed 2019-09-09 18:41 by miss-islington. This issue is now closed.

Pull Requests
URL Status Linked
PR 15019 merged Greg Price, 2019-08-03 08:10
PR 15296 merged Greg Price, 2019-08-15 03:17
PR 15301 merged Greg Price, 2019-08-15 04:48
PR 15332 merged miss-islington, 2019-08-19 09:53
PR 15806 merged miss-islington, 2019-09-09 16:37
PR 15807 merged miss-islington, 2019-09-09 16:37
PR 15808 merged benjamin.peterson, 2019-09-09 16:51
Messages (14)
msg339317 - Author: Jun (Jun) Date: 2019-04-02 06:36
I was looking for a list of Unicode code points for which str.isspace() returns True.

According to https://docs.python.org/3/library/stdtypes.html#str.isspace, it's 
"Whitespace characters are those characters defined in the Unicode character database as “Other” or “Separator” and those with bidirectional property being one of “WS”, “B”, or “S”."

However, for U+202F (https://www.fileformat.info/info/unicode/char/202f/index.htm), which is a "Separator" and whose bidirectional property is "CS", str.isspace() returns True, which it shouldn't if we follow the definition above.

>>> "\u202f".isspace()
True

I'm not sure whether the documentation or the behavior should be updated, but at least the two should be consistent.
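
A minimal sketch for inspecting the two code points from the issue title, assuming only the standard unicodedata module. The helper name describe is made up for illustration; it reports the general category, the bidirectional class, and the str.isspace() result side by side.

```python
import unicodedata


def describe(ch):
    """Return (general category, bidirectional class, isspace result) for one character.

    Hypothetical helper, not part of any library.
    """
    return unicodedata.category(ch), unicodedata.bidirectional(ch), ch.isspace()


# The two code points named in the issue title.
for ch in ("\u00a0", "\u202f"):
    print(f"U+{ord(ch):04X}", unicodedata.name(ch), *describe(ch))
```

Both characters are in category Zs (space separator) with bidirectional class CS, which is neither "WS", "B", nor "S", yet isspace() is True for both.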
msg339318 - Author: SilentGhost (SilentGhost) * (Python triager) Date: 2019-04-02 06:59
I think you have to read that "and" as "or". It's sufficient that '\u202f' is a separator for it to be considered a whitespace character.
msg339336 - Author: Jun_ (Jun_) Date: 2019-04-02 14:32
Do you mean the statement should be read as follows?

Whitespace characters are characters that satisfy either one of:
1. Character type is "Other"
2. Character type is "Separator"
3. Characters with "WS", "B", or "S" bidirectional property

If that's the case, this also does not reflect the behavior, as most characters in "Other" are not whitespace characters, and in fact str.isspace() returns False for those characters.
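
One way to check that claim is to count the code points in the "Other" categories by their isspace() result. This is a rough sketch: it assumes "Other" means every general category starting with "C" (Cc, Cf, Cs, Co, Cn), and the function name other_category_counts is hypothetical.

```python
import sys
import unicodedata


def other_category_counts():
    """Split "Other" (C*) code points by their isspace() result.

    Returns the list of whitespace characters and the count of the rest.
    """
    space, non_space = [], 0
    for cp in range(sys.maxunicode + 1):
        ch = chr(cp)
        # Assumption: "Other" covers all categories whose code starts with "C".
        if unicodedata.category(ch).startswith("C"):
            if ch.isspace():
                space.append(ch)
            else:
                non_space += 1
    return space, non_space


space, non_space = other_category_counts()
print(len(space), "whitespace;", non_space, "not whitespace")
print([f"U+{ord(c):04X}" for c in space])
```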
msg339339 - Author: SilentGhost (SilentGhost) * (Python triager) Date: 2019-04-02 14:56
According to the comment for _PyUnicode_IsWhitespace, it's supposed to include the Zs category, plus the documented BIDI properties. So, I'm not sure where "Other" came from.
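
The rule in that comment can be written out and compared against str.isspace() for every code point. This is a sketch under the assumption that the interpreter's str methods and its unicodedata module are built from the same Unicode database version, as in CPython; is_space_by_rule is an illustrative name, not an existing API.

```python
import sys
import unicodedata


def is_space_by_rule(ch):
    """Rule from the comment: category Zs, or bidirectional class WS, B, or S."""
    return (unicodedata.category(ch) == "Zs"
            or unicodedata.bidirectional(ch) in ("WS", "B", "S"))


# Collect every code point where str.isspace() and the rule disagree.
mismatches = [
    cp for cp in range(sys.maxunicode + 1)
    if chr(cp).isspace() != is_space_by_rule(chr(cp))
]
print(len(mismatches), "code points disagree with the rule")
```

If the comment describes the implementation accurately, the mismatch list should be empty.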
msg348947 - Author: Greg Price (Greg Price) * Date: 2019-08-03 08:18
The actual behavior turns out to match that comment. See attached PR, which adds a test confirming that and also corrects the documentation.

(A related issue is #18236 -- we should probably adjust the definition to match the one Unicode now provides. But meanwhile we'll want to correct the docs.)
msg349678 - Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-08-14 11:05
New changeset 6bccbe7dfb998af862a183f2c36f0d4603af2c29 by Victor Stinner (Greg Price) in branch 'master':
bpo-36502: Correct documentation of str.isspace() (GH-15019)
https://github.com/python/cpython/commit/6bccbe7dfb998af862a183f2c36f0d4603af2c29
msg349947 - Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-08-19 09:53
New changeset 8c1c426a631ba02357112657193f82c58d3e08b4 by Victor Stinner (Greg Price) in branch '3.8':
bpo-36502: Correct documentation of str.isspace() (GH-15019) (GH-15296)
https://github.com/python/cpython/commit/8c1c426a631ba02357112657193f82c58d3e08b4
msg349948 - Author: miss-islington (miss-islington) Date: 2019-08-19 10:10
New changeset 0fcdd8d6d67f57733203fc79e6a07a89b924a390 by Miss Islington (bot) in branch '3.7':
bpo-36502: Correct documentation of str.isspace() (GH-15019) (GH-15296)
https://github.com/python/cpython/commit/0fcdd8d6d67f57733203fc79e6a07a89b924a390
msg349950 - Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-08-19 10:14
The str.isspace() documentation has been fixed. Thanks, Greg Price, for the fix! I'm closing the issue.
msg349983 - Author: Greg Price (Greg Price) * Date: 2019-08-20 01:33
Thanks Victor for the reviews and merges!

(Unmarking 2.7, because https://docs.python.org/2/library/stdtypes.html seems to not have this issue.)
msg351526 - Author: Benjamin Peterson (benjamin.peterson) * (Python committer) Date: 2019-09-09 16:37
New changeset 64c6ac74e254d31f93fcc74bf02b3daa7d3e3f25 by Benjamin Peterson (Greg Price) in branch 'master':
bpo-36502: Update link to UAX #44, the Unicode doc on the UCD. (GH-15301)
https://github.com/python/cpython/commit/64c6ac74e254d31f93fcc74bf02b3daa7d3e3f25
msg351536 - Author: Benjamin Peterson (benjamin.peterson) * (Python committer) Date: 2019-09-09 17:10
New changeset 58d61efd4cdece3b026868a66d829001198d29b1 by Benjamin Peterson in branch '2.7':
[2.7] bpo-36502: Update link to UAX GH-44, the Unicode doc on the UCD. (GH-15808)
https://github.com/python/cpython/commit/58d61efd4cdece3b026868a66d829001198d29b1
msg351545 - Author: miss-islington (miss-islington) Date: 2019-09-09 18:40
New changeset 0a86da87da82c4a28d7ec91eb54c0b9ca40bbea7 by Miss Islington (bot) in branch '3.7':
bpo-36502: Update link to UAX GH-44, the Unicode doc on the UCD. (GH-15301)
https://github.com/python/cpython/commit/0a86da87da82c4a28d7ec91eb54c0b9ca40bbea7
msg351546 - Author: miss-islington (miss-islington) Date: 2019-09-09 18:41
New changeset c1c04cbc24c11cd7a47579af3faffee05a16acd7 by Miss Islington (bot) in branch '3.8':
bpo-36502: Update link to UAX GH-44, the Unicode doc on the UCD. (GH-15301)
https://github.com/python/cpython/commit/c1c04cbc24c11cd7a47579af3faffee05a16acd7
History
Date User Action Args
2019-09-09 18:41:16 miss-islington set messages: + msg351546
2019-09-09 18:40:08 miss-islington set messages: + msg351545
2019-09-09 17:10:10 benjamin.peterson set messages: + msg351536
2019-09-09 16:51:56 benjamin.peterson set pull_requests: + pull_request15459
2019-09-09 16:37:32 miss-islington set pull_requests: + pull_request15458
2019-09-09 16:37:26 miss-islington set pull_requests: + pull_request15457
2019-09-09 16:37:16 benjamin.peterson set nosy: + benjamin.peterson
messages: + msg351526
2019-08-20 01:33:56 Greg Price set messages: + msg349983
versions: - Python 2.7
2019-08-19 10:14:33 vstinner set status: open -> closed
resolution: fixed
messages: + msg349950
stage: patch review -> resolved
2019-08-19 10:10:23 miss-islington set nosy: + miss-islington
messages: + msg349948
2019-08-19 09:53:53 miss-islington set pull_requests: + pull_request15050
2019-08-19 09:53:40 vstinner set messages: + msg349947
2019-08-15 04:48:01 Greg Price set pull_requests: + pull_request15026
2019-08-15 03:17:09 Greg Price set pull_requests: + pull_request15019
2019-08-14 11:05:23 vstinner set messages: + msg349678
2019-08-03 08:18:56 Greg Price set nosy: + Greg Price
messages: + msg348947
2019-08-03 08:10:30 Greg Price set keywords: + patch
stage: patch review
pull_requests: + pull_request14836
2019-04-05 18:38:44 terry.reedy set title: The behavior of str.isspace() for U+00A0 and U+202F is different from what is documented -> str.isspace() for U+00A0 and U+202F differs from document
versions: - Python 3.5, Python 3.6
2019-04-02 14:56:28 SilentGhost set messages: + msg339339
versions: + Python 3.7, Python 3.8
2019-04-02 14:32:57 Jun_ set nosy: + Jun_
messages: + msg339336
2019-04-02 06:59:53 SilentGhost set nosy: + SilentGhost
messages: + msg339318
2019-04-02 06:45:21 xtreak set nosy: + lemburg
2019-04-02 06:36:07 Jun create