classification
Title: str.isidentifier() does not work with non-BMP non-canonicalized strings on Windows
Type: behavior Stage: resolved
Components: Interpreter Core Versions: Python 3.9
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: serhiy.storchaka, vstinner
Priority: normal Keywords: patch

Created on 2020-05-11 17:50 by serhiy.storchaka, last changed 2020-05-12 17:27 by vstinner. This issue is now closed.

Pull Requests
URL Status Linked
PR 20035 closed serhiy.storchaka, 2020-05-11 17:55
PR 20053 merged serhiy.storchaka, 2020-05-12 10:39
Messages (6)
msg368637 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2020-05-11 17:50
>>> import _testcapi
>>> u = '\U0001d580\U0001d593\U0001d58e\U0001d588\U0001d594\U0001d589\U0001d58a'
>>> u.isidentifier()
True
>>> _testcapi.unicode_legacy_string(u).isidentifier()
False
msg368651 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2020-05-11 21:05
It's maybe time to speed up the deprecation of the legacy C API using Py_UNICODE...
msg368652 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2020-05-11 21:06
My previous change on this function:

commit f3e7ea5b8c220cd63101e419d529c8563f9c6115
Author: Victor Stinner <vstinner@python.org>
Date:   Tue Feb 11 14:29:33 2020 +0100

    bpo-39500: Document PyUnicode_IsIdentifier() function (GH-18397)
    
    PyUnicode_IsIdentifier() does not call Py_FatalError() anymore if the
    string is not ready.
msg368705 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2020-05-12 07:40
I am not sure that the changes in issue39500 were correct. It is easier to catch a bug if the function crashes consistently when you pass a non-canonicalized string than if it silently returns a wrong result for specific input on a particular platform.

Alternatively, you could reimplement correct handling of surrogate pairs in PyUnicode_IsIdentifier().
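
To illustrate what handling surrogate pairs means here, the following is a minimal pure-Python sketch, not the C code from the fix. It assumes that a legacy string on Windows is held as 16-bit UTF-16 code units, so each non-BMP character occupies two units. The helper names utf16_units and join_surrogates are hypothetical and exist only for this example.

```python
import struct


def utf16_units(s):
    """Return the 16-bit UTF-16 code units of s (hypothetical helper)."""
    data = s.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(data) // 2), data))


def join_surrogates(units):
    """Combine high/low surrogate pairs into full code points (hypothetical helper)."""
    code_points = []
    i = 0
    while i < len(units):
        unit = units[i]
        is_high = 0xD800 <= unit <= 0xDBFF
        has_low = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
        if is_high and has_low:
            low = units[i + 1]
            code_points.append(0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00))
            i += 2
        else:
            # BMP character or an unpaired surrogate: keep it unchanged.
            code_points.append(unit)
            i += 1
    return code_points


u = '\U0001d580\U0001d593\U0001d58e\U0001d588\U0001d594\U0001d589\U0001d58a'
units = utf16_units(u)

# Treating every 16-bit unit as its own character yields lone surrogates,
# which are never valid identifier characters.
per_unit = "".join(chr(unit) for unit in units)

# Joining the pairs first recovers the original non-BMP characters.
combined = "".join(chr(cp) for cp in join_surrogates(units))

print(len(u), len(units))
print(per_unit.isidentifier(), combined.isidentifier())
```

A check that inspects each 16-bit unit separately sees only surrogates in this string, which would explain a False result on a platform with a 16-bit wchar_t; joining the pairs before the check avoids that.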
msg368729 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2020-05-12 13:18
New changeset 5650e76f63a6f4ec55d00ec13f143d84a2efee39 by Serhiy Storchaka in branch 'master':
bpo-40596: Fix str.isidentifier() for non-canonicalized strings containing non-BMP characters on Windows. (GH-20053)
https://github.com/python/cpython/commit/5650e76f63a6f4ec55d00ec13f143d84a2efee39
msg368739 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2020-05-12 17:27
Thanks for the fix Serhiy!
History
Date                 User              Action  Args
2020-05-12 17:27:57  vstinner          set     messages: + msg368739
2020-05-12 13:19:12  serhiy.storchaka  set     status: open -> closed; resolution: fixed; stage: patch review -> resolved
2020-05-12 13:18:10  serhiy.storchaka  set     messages: + msg368729
2020-05-12 10:39:32  serhiy.storchaka  set     pull_requests: + pull_request19362
2020-05-12 07:40:30  serhiy.storchaka  set     messages: + msg368705
2020-05-11 21:06:17  vstinner          set     messages: + msg368652
2020-05-11 21:05:11  vstinner          set     messages: + msg368651
2020-05-11 17:55:50  serhiy.storchaka  set     keywords: + patch; stage: patch review; pull_requests: + pull_request19345
2020-05-11 17:50:16  serhiy.storchaka  create