This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

Title: Unicode control characters are not allowed as identifiers
Type: behavior
Stage:
Components: Unicode
Versions: Python 3.0, Python 3.1
Status: closed
Resolution: wont fix
Dependencies:
Superseder:
Assigned To:
Nosy List: baijum, ezio.melotti, loewis, mrabarnett
Priority: normal
Keywords:

Created on 2009-02-24 11:53 by baijum, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Attached file, uploaded by baijum on 2009-02-24 11:53: File with Unicode control character in identifier
Messages (7)
msg82664 - Author: Baiju M (baijum) Date: 2009-02-24 11:53
I tried to use Zero-width joiner (U+200D) as part of an identifier.
It produces an exception like this:

SyntaxError: invalid character in identifier

I have attached the Python file that produces this error.

Zero-width joiner (U+200D) is a Unicode control character:
msg82666 - Author: Martin v. Löwis (loewis) * (Python committer) Date: 2009-02-24 16:21
Why do you think this is a bug?
msg82820 - Author: Baiju M (baijum) Date: 2009-02-27 06:47
On a further look at this issue, I understand that Python cannot allow all
Unicode control characters in identifiers.  But for many international
languages, without some control characters like ZWJ & ZWNJ [1], it won't
be possible to construct all characters with proper visual
representation.  So, if Python really wants to support international
characters in identifiers (for some reason), ZWJ & ZWNJ are unavoidable,
and maybe some other characters as well.
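
The behaviour described here can be checked from Python itself. The following is a minimal sketch, assuming a Python 3 interpreter; the identifier "a\u200db" is a hypothetical stand-in for the name in the attached file, which is not reproduced in this thread.

```python
import unicodedata

# ZWNJ (U+200C) and ZWJ (U+200D) both have general category Cf (Format).
for ch in ("\u200c", "\u200d"):
    print(f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.category(ch))

# Hypothetical identifier with a ZWJ in the middle (an assumption, not the
# name used in the attached file).
candidate = "a\u200db"

# str.isidentifier() applies the language's identifier rules without compiling.
print(candidate.isidentifier())
```

Both characters report category Cf, and the candidate name is rejected by str.isidentifier().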

msg82821 - Author: Baiju M (baijum) Date: 2009-02-27 07:24
I think RFC 3454 [1] can be used as a basis for selecting the control
characters that are valid in an identifier.

msg82822 - Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2009-02-27 07:48
Valid identifiers should begin with a letter or '_' and contain only
letters, numbers and '_'. This probably means that only the Unicode
characters that belong to the categories Ll, Lu (Letter Lower/Upper
case), Nd (Number, Decimal Digit) and Pc (Punctuation, Connector) - and
possibly other categories like Lm, Lt, No and Nl - are valid.

Some examples:
>>> a－b = 5 # U+FF0D, Cat: Pd, FULLWIDTH HYPHEN-MINUS
SyntaxError: invalid character in identifier
>>> a＃ = 5 # U+FF03, Cat: Po, FULLWIDTH NUMBER SIGN
SyntaxError: invalid character in identifier
SyntaxError: invalid character in identifier
>>> a＿b = 5 # U+FF3F, Cat: Pc, FULLWIDTH LOW LINE
>>> a＿b
>>> a﹍b﹎c﹏d = 5 # U+FE4D, U+FE4E, U+FE4F, Cat: Pc
>>> a﹍b﹎c﹏d
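
The categories quoted in these examples can be looked up with the standard unicodedata module. This is a rough sketch rather than a restatement of the parser's rules: the helper non_ascii_categories is a hypothetical name introduced here, and the candidates are written with escapes so the non-ASCII characters survive copying.

```python
import unicodedata

# Names from the examples above, spelled with explicit escapes.
candidates = [
    "a\uff0db",                # FULLWIDTH HYPHEN-MINUS
    "a\uff03",                 # FULLWIDTH NUMBER SIGN
    "a\uff3fb",                # FULLWIDTH LOW LINE
    "a\ufe4db\ufe4ec\ufe4fd",  # DASHED, CENTRELINE and WAVY LOW LINE
]

def non_ascii_categories(text):
    """Return (code point, general category) for each non-ASCII character."""
    return [(f"U+{ord(ch):04X}", unicodedata.category(ch))
            for ch in text if ord(ch) > 0x7F]

for name in candidates:
    # Show the categories next to the verdict of str.isidentifier().
    print(non_ascii_categories(name), name.isidentifier())
```

The Pd and Po characters are rejected, while the Pc characters are accepted, which matches the results shown in the session above.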
msg82842 - Author: Matthew Barnett (mrabarnett) * (Python triager) Date: 2009-02-27 16:54
The definition of a word in the new re module (actually targeted at
Python 2.7) is currently a sequence of L&, N&, M& and Pc.

I suppose ideally we want the definitions of a word and an identifier to
be basically the same, except that an identifier can't start with N&.
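
The difference between a word and an identifier can be illustrated with the standard library. This sketch uses the stdlib re module's \w as an approximation; it is an assumption that \w is close enough for the illustration, since it is not the exact L&, N&, M&, Pc definition mentioned above. The helper looks_like_word is a hypothetical name.

```python
import re

def looks_like_word(text):
    """True if the whole string matches \\w+ in the stdlib re module."""
    return re.fullmatch(r"\w+", text) is not None

for text in ("abc1", "1abc", "_x"):
    # A leading digit is fine for a word but not for an identifier.
    print(text, looks_like_word(text), text.isidentifier())
```

Only "1abc" differs between the two checks: it is a word, but it is not an identifier because it starts with a digit.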
msg82858 - Author: Martin v. Löwis (loewis) * (Python committer) Date: 2009-02-27 18:32
See PEP 3131 for the specification of what constitutes an identifier in Python.

Closing this as "won't fix".
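
One consequence of PEP 3131 is that identifiers are converted to NFKC normal form while parsing, which is why a name written with FULLWIDTH LOW LINE ends up as an ordinary underscore name. The following is a small sketch of that effect; the variable names source, namespace and result are illustrative only.

```python
import unicodedata

# Assign through a name spelled with FULLWIDTH LOW LINE (U+FF3F), then read
# it back through the plain ASCII spelling.
source = "a\uff3fb = 5\nresult = a_b\n"

namespace = {}
exec(source, namespace)

# The name stored in the namespace is the NFKC-normalized form.
print(unicodedata.normalize("NFKC", "a\uff3fb") == "a_b")
print("a_b" in namespace, namespace["result"])
```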
Date                 User          Action  Args
2022-04-11 14:56:46  admin         set     github: 49608
2009-02-27 18:32:17  loewis        set     status: open -> closed; resolution: wont fix; messages: + msg82858
2009-02-27 16:54:59  mrabarnett    set     nosy: + mrabarnett; messages: + msg82842
2009-02-27 07:48:19  ezio.melotti  set     messages: + msg82822
2009-02-27 07:24:12  baijum        set     messages: + msg82821
2009-02-27 06:47:51  baijum        set     messages: + msg82820
2009-02-24 17:56:16  ezio.melotti  set     nosy: + ezio.melotti
2009-02-24 16:21:44  loewis        set     nosy: + loewis; messages: + msg82666
2009-02-24 11:53:50  baijum        create