
classification
Title: tokenize yields an ERRORTOKEN if the identifier starts with a non-ASCII char
Type: behavior Stage: test needed
Components: Versions: Python 3.1, Python 3.2, Python 3.4
process
Status: closed Resolution: fixed
Dependencies: Superseder: Make tokenize recognize Other_ID_Start and Other_ID_Continue chars (view: 24194)
Assigned To: Nosy List: Joshua.Landau, benjamin.peterson, flox, terry.reedy
Priority: normal Keywords:

Created on 2010-08-30 07:42 by flox, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Messages (5)
msg115201 - Author: Florent Xicluna (flox) (Python committer) Date: 2010-08-30 07:42
from io import BytesIO
from tokenize import tokenize, tok_name

sample = 'éléphants = "un éléphant, deux éléphants, ..."\nprint(éléphants)\n'
sampleb = sample.encode('utf-8')

exec(sample)
# output: un éléphant, deux éléphants, ...
exec(sampleb)
# output: un éléphant, deux éléphants, ...

module = BytesIO()
module.write(sampleb)
module.seek(0)

# Tokenizing the same source, however, splits the leading 'é' off as an
# ERRORTOKEN (see the output below):
for tok in tokenize(module.readline):
    print(tok_name[tok.type], tok)


# output:
ENCODING TokenInfo(type=57, string='utf-8', start=(0, 0), end=(0, 0), line='')
ERRORTOKEN TokenInfo(type=54, string='é', start=(1, 0), end=(1, 1), line='éléphants = "un éléphant, deux éléphants, ..."\n')
NAME TokenInfo(type=1, string='léphants', start=(1, 1), end=(1, 9), line='éléphants = "un éléphant, deux éléphants, ..."\n')
OP TokenInfo(type=53, string='=', start=(1, 10), end=(1, 11), line='éléphants = "un éléphant, deux éléphants, ..."\n')
STRING TokenInfo(type=3, string='"un éléphant, deux éléphants, ..."', start=(1, 12), end=(1, 46), line='éléphants = "un éléphant, deux éléphants, ..."\n')
NEWLINE TokenInfo(type=4, string='\n', start=(1, 46), end=(1, 47), line='éléphants = "un éléphant, deux éléphants, ..."\n')
NAME TokenInfo(type=1, string='print', start=(2, 0), end=(2, 5), line='print(éléphants)\n')
OP TokenInfo(type=53, string='(', start=(2, 5), end=(2, 6), line='print(éléphants)\n')
ERRORTOKEN TokenInfo(type=54, string='é', start=(2, 6), end=(2, 7), line='print(éléphants)\n')
NAME TokenInfo(type=1, string='léphants', start=(2, 7), end=(2, 15), line='print(éléphants)\n')
OP TokenInfo(type=53, string=')', start=(2, 15), end=(2, 16), line='print(éléphants)\n')
NEWLINE TokenInfo(type=4, string='\n', start=(2, 16), end=(2, 17), line='print(éléphants)\n')
ENDMARKER TokenInfo(type=0, string='', start=(3, 0), end=(3, 0), line='')
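
The report does not quote the pattern tokenize used, but the output above is consistent with an ASCII-only first-character class such as r'[a-zA-Z_]\w*': the leading 'é' is rejected, while \w (Unicode-aware by default in Python 3) accepts 'é' in later positions, which is why 'léphants' still comes out as a NAME. A minimal sketch of that reading (the pattern is an assumption, not quoted from tokenize.py):

import re

# Assumed pre-fix shape of the Name pattern (not quoted in the report).
name = re.compile(r'[a-zA-Z_]\w*')

print(name.match('éléphants'))          # None: 'é' is not in [a-zA-Z_]
print(name.match('léphants').group())   # 'léphants': \w matches 'é' here
print('éléphants'.isidentifier())       # True: a valid Python 3 identifier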
msg115218 - Author: Benjamin Peterson (benjamin.peterson) (Python committer) Date: 2010-08-30 14:41
Fixed in r84364.
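
Assuming the fix switched the Name pattern to a Unicode-aware one built on \w (the change in r84364 is not quoted here, but msg240544 below confirms the fix is regex-based), the leading 'é' is now matched as part of the name instead of falling through to ERRORTOKEN:

import re

# Hypothetical post-fix pattern: \w+ is Unicode-aware in Python 3.
name = re.compile(r'\w+')
print(name.match('éléphants').group())  # 'éléphants'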
msg240544 - Author: Joshua Landau (Joshua.Landau) Date: 2015-04-12 06:08
This doesn't seem to be a complete fix; the regex used does not include Other_ID_Start or Other_ID_Continue from

https://docs.python.org/3.5/reference/lexical_analysis.html#identifiers

Hence tokenize does not accept '℘·'.

Credit to modchan from http://stackoverflow.com/a/29586366/1763356.
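
A minimal demonstration of the remaining gap: U+2118 (SCRIPT CAPITAL P) is in Other_ID_Start and U+00B7 (MIDDLE DOT) is in Other_ID_Continue, so '℘·' is a legal identifier, yet neither character is a word character, so any \w-based pattern rejects it:

import re
import unicodedata

print('℘·'.isidentifier())        # True: legal identifier per the reference
print(unicodedata.category('℘'))  # 'Sm': a math symbol, not a letter
print(re.match(r'\w', '℘'))       # None: \w does not cover Other_ID_Start
print(re.match(r'\w', '·'))       # None: nor Other_ID_Continue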
msg313846 - Author: Terry J. Reedy (terry.reedy) (Python committer) Date: 2018-03-15 00:12
Joshua opened #24194 as a duplicate of this issue because he could not reopen this one.  I am leaving #24194 open as the superseder for this, since Serhiy has already added two dependencies there, and because this issue seems in turn to be a duplicate of #1693050 (which I will close along with #32987).
msg313847 - Author: Terry J. Reedy (terry.reedy) (Python committer) Date: 2018-03-15 00:18
Actually, #1693050 and #12731, both about \w, are duplicates of each other.
History
Date                 User               Action  Args
2022-04-11 14:57:05  admin              set     github: 53921
2018-03-15 00:18:48  terry.reedy        set     messages: + msg313847
2018-03-15 00:12:21  terry.reedy        set     superseder: Make tokenize recognize Other_ID_Start and Other_ID_Continue chars; messages: + msg313846; nosy: + terry.reedy
2015-04-12 06:08:01  Joshua.Landau      set     nosy: + Joshua.Landau; messages: + msg240544; versions: + Python 3.4
2010-08-30 14:41:43  benjamin.peterson  set     status: open -> closed; nosy: + benjamin.peterson; messages: + msg115218; resolution: fixed
2010-08-30 07:42:55  flox               create