Title: Make tokenize recognize Other_ID_Start and Other_ID_Continue chars
Type: behavior Stage: patch review
Components: Library (Lib) Versions: Python 3.11, Python 3.10, Python 3.9
Status: open Resolution:
Dependencies: 12731 12734 Superseder:
Assigned To: terry.reedy Nosy List: Joshua.Landau, iritkatriel, meador.inge, terry.reedy
Priority: normal Keywords: patch

Created on 2015-05-14 13:00 by Joshua.Landau, last changed 2022-04-11 14:58 by admin.

File name Uploaded Description Edit
issue24194-v0.patch meador.inge, 2016-05-11 01:45 review
Messages (5)
msg243188 - (view) Author: Joshua Landau (Joshua.Landau) * Date: 2015-05-14 13:00
This is valid:

    ℘· = 1
    #>>> 1

But this gives an error token:

    from io import BytesIO
    from tokenize import tokenize

    stream = BytesIO("℘·".encode("utf-8"))
    print(*tokenize(stream.readline), sep="\n")
    #>>> TokenInfo(type=56 (ENCODING), string='utf-8', start=(0, 0), end=(0, 0), line='')
    #>>> TokenInfo(type=53 (ERRORTOKEN), string='℘', start=(1, 0), end=(1, 1), line='℘·')
    #>>> TokenInfo(type=53 (ERRORTOKEN), string='·', start=(1, 1), end=(1, 2), line='℘·')
    #>>> TokenInfo(type=0 (ENDMARKER), string='', start=(2, 0), end=(2, 0), line='')

This is a continuation of an earlier issue that I'm not able to reopen, so I thought I should report it anew.

It is tokenize that is wrong - Other_ID_Start and Other_ID_Continue are documented to be valid in identifiers.
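The claim that the compiler itself accepts this name can be checked without tokenize. The following is only a minimal sketch: the variable names are illustrative, and the name is spelled with escapes (U+2118 followed by U+00B7) so it does not depend on font or input support.

```python
import unicodedata

# The opening example, spelled with escapes: U+2118 followed by U+00B7.
name = "\u2118\u00b7"

# str.isidentifier() applies the language's identifier rules.
is_valid = name.isidentifier()

# Assigning through exec() confirms that the compiler accepts the name too.
namespace = {}
exec(name + " = 1", namespace)

# Identifiers are NFKC-normalized while parsing, so look up the normalized form.
value = namespace[unicodedata.normalize("NFKC", name)]

print(is_valid, value)
```

If both checks pass, the ERRORTOKEN output shown earlier comes from tokenize alone rather than from the language definition.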
msg265286 - (view) Author: Meador Inge (meador.inge) * (Python committer) Date: 2016-05-11 01:45
Attached is a first cut patch for this.  (CC'd haypo as a unicode expert).
msg313851 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2018-03-15 00:55
I closed #1693050 as a duplicate of #12731 (the \w issue).  I left #9712 closed, closed #32987, and marked both as duplicates of this.

In msg313814 of the latter, Serhiy indicates which start and continue identifier characters are currently matched by \W for re and regex.  He gives a fix there that he says requires the \w issue to be fixed.  It is similar to the posted patch.  He says that without \w fixed, another 2000+ chars need to be added.  Perhaps the v0 patch needs more tests; I don't know.

He also says that re support for properties, #12734,  would make things even better.

Three of the characters in the patch are too obscure for Firefox on Windows and print as boxes.  Some others I do not recognize, and I could not type any of them.  I thought we had a policy of using \u or \U escapes even in tests to avoid such problems.  (I notice that there are already non-ascii chars in the context.)
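As an illustration of the escape-based spelling, here is a minimal sketch (not taken from the patch; `SAMPLE` and `describe` are hypothetical names) that writes the opening example with \u escapes and uses unicodedata to report what each character is, so the source stays readable even when the glyphs do not render.

```python
import unicodedata

# The opening example written with escapes instead of literal glyphs.
SAMPLE = "\u2118\u00b7"


def describe(text):
    """Return (code point, general category, name) for each character."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.category(ch), unicodedata.name(ch))
        for ch in text
    ]


for row in describe(SAMPLE):
    print(*row)
```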
msg410718 - (view) Author: Irit Katriel (iritkatriel) * (Python committer) Date: 2022-01-16 19:57
Reproduced on 3.11.
msg410731 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2022-01-16 23:10
Updated doc link, which appears to be the same:

Updated property list linked in above:

Relevant content for this issue:


2118          ; Other_ID_Start # Sm       SCRIPT CAPITAL P
212E          ; Other_ID_Start # So       ESTIMATED SYMBOL
# Total code points: 6

00B7          ; Other_ID_Continue # Po       MIDDLE DOT
0387          ; Other_ID_Continue # Po       GREEK ANO TELEIA
1369..1371    ; Other_ID_Continue # No   [9] ETHIOPIC DIGIT ONE..ETHIOPIC DIGIT NINE
19DA          ; Other_ID_Continue # No       NEW TAI LUE THAM DIGIT ONE
# Total code points: 12

Codepoints of '℘·' opening example: 
'0x2118' Other_ID_Start     Sm  SCRIPT CAPITAL P
'0xb7'   Other_ID_Continue  Po  MIDDLE DOT

Except for the two Mongolian start characters, Meador's patch hardcodes the 'Other' characters, thereby adding them without waiting for re to be fixed.  While this will miss new additions without manual updates, it is better than missing everything for however many years.  I will make a PR with the additions and look at the new tests.
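The hardcoding approach can be sketched roughly as follows. This is not the attached patch: the constant names and the `match_name` helper are hypothetical, the character lists contain only the code points quoted in the excerpt above, and the sketch assumes the pure-Python tokenizer, whose name pattern is based on \w+.

```python
import re

# Only the code points quoted in the PropList excerpt above; the full
# property lists contain more entries.
OTHER_ID_START = "\u2118\u212E"
OTHER_ID_CONTINUE = "\u00B7\u0387\u1369-\u1371\u19DA"

# Hypothetical widened name pattern: \w plus the hardcoded 'Other' characters.
NAME = re.compile(r"[\w" + OTHER_ID_START + OTHER_ID_CONTINUE + r"]+")


def match_name(text):
    """Return the identifier at the start of text, or None if there is none."""
    m = NAME.match(text)
    # The character class is deliberately loose (it also admits leading digits
    # and continue-only characters), so str.isidentifier() makes the final call.
    if m and m.group().isidentifier():
        return m.group()
    return None


print(match_name("\u2118\u00b7 = 1"))
```

Deferring to str.isidentifier() keeps the hardcoded lists from having to encode the start/continue distinction themselves, at the cost of a second check per candidate name.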
