classification
Title: Unicode control characters are not allowed as identifiers
Type: behavior
Stage:
Components: Unicode
Versions: Python 3.0, Python 3.1
process
Status: closed
Resolution: wont fix
Dependencies:
Superseder:
Assigned To:
Nosy List: baijum, ezio.melotti, loewis, mrabarnett
Priority: normal
Keywords:

Created on 2009-02-24 11:53 by baijum, last changed 2009-02-27 18:32 by loewis. This issue is now closed.

Files
File name: identifier.py
Uploaded: baijum, 2009-02-24 11:53
Description: File with Unicode control character in identifier
Messages (7)
msg82664 - Author: Baiju M (baijum) Date: 2009-02-24 11:53
I tried to use the zero-width joiner (U+200D) as part of an identifier.
It produces an exception like this:

SyntaxError: invalid character in identifier

I have attached the Python file which produces this error.

Zero-width joiner (U+200D) is a Unicode control character:
http://en.wikipedia.org/wiki/Unicode_control_characters
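
The report can be reproduced without the attachment. The following is a minimal sketch, assuming Python 3; the identifier `a` + U+200D + `b` is a hypothetical stand-in, since the contents of identifier.py are not shown in this thread.

```python
import unicodedata

ZWJ = "\u200d"  # ZERO WIDTH JOINER
name = "a" + ZWJ + "b"  # hypothetical identifier, not taken from identifier.py

# General category of the joiner as recorded in the Unicode database.
zwj_category = unicodedata.category(ZWJ)

# str.isidentifier() reports whether the string is a valid identifier.
accepted = name.isidentifier()

# Compiling an assignment to that name is expected to raise SyntaxError.
try:
    compile(name + " = 1", "<example>", "exec")
    compiled = True
except SyntaxError:
    compiled = False

print(zwj_category, accepted, compiled)
```

Note that the Unicode database lists U+200D under the general category Cf (Other, Format) rather than Cc (Other, Control); the exact wording of the SyntaxError message may differ between Python versions.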
msg82666 - Author: Martin v. Löwis (loewis) * (Python committer) Date: 2009-02-24 16:21
Why do you think this is a bug?
msg82820 - Author: Baiju M (baijum) Date: 2009-02-27 06:47
On a further look at this issue, I understand that Python cannot allow all
Unicode control characters in identifiers.  But for many international
languages, without some control characters like ZWJ & ZWNJ [1], it won't
be possible to construct all characters with proper visual
representation.  So, if Python really wants to support international
characters in identifiers (for some reason), ZWJ & ZWNJ are unavoidable,
and maybe some other characters as well.

 [1] http://en.wikipedia.org/wiki/Zero-width_joiner
     http://en.wikipedia.org/wiki/Zero-width_non-joiner
msg82821 - Author: Baiju M (baijum) Date: 2009-02-27 07:24
I think RFC 3454 [1] can be used as a basis for selecting the control
characters that are valid in identifiers.

 [1] http://www.rfc-editor.org/rfc/rfc3454.txt
msg82822 - Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2009-02-27 07:48
Valid identifiers should begin with a letter or '_' and contain only
letters, numbers and '_'. This probably means that only the Unicode
characters that belong to the categories Ll, Lu (Letter Lower/Upper
case), Nd (Number, Decimal Digit) and Pc (Punctuation, Connector) - and
possibly other categories like Lm, Lt, No and Nl - are valid.

Some examples:
>>> a－b = 5 # U+FF0D, Cat: Pd, FULLWIDTH HYPHEN-MINUS
SyntaxError: invalid character in identifier
>>> a＃ = 5 # U+FF03, Cat: Po, FULLWIDTH NUMBER SIGN
SyntaxError: invalid character in identifier
>>> a）b = 5 # U+FF09, Cat: Pe, FULLWIDTH RIGHT PARENTHESIS
SyntaxError: invalid character in identifier
>>> a＿b = 5 # U+FF3F, Cat: Pc, FULLWIDTH LOW LINE
>>> a＿b
5
>>> a﹍b﹎c﹏d = 5 # U+FE4D, U+FE4E, U+FE4F, Cat: Pc
>>> a﹍b﹎c﹏d
5
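
The same observations can be checked programmatically. This is a rough sketch, assuming Python 3: it looks up the general category of each code point mentioned in the examples and asks `str.isidentifier()` whether the character is accepted between `a` and `b`. The `samples` and `results` names are illustrative only.

```python
import unicodedata

# Code points from the examples above, each placed between "a" and "b".
samples = ["\uff0d", "\uff03", "\uff09", "\uff3f", "\ufe4d", "\ufe4e", "\ufe4f"]

results = {}
for ch in samples:
    candidate = "a" + ch + "b"
    # Store (general category, accepted as part of an identifier?).
    results[ch] = (unicodedata.category(ch), candidate.isidentifier())

for ch, (category, ok) in results.items():
    print(f"U+{ord(ch):04X} {category} {unicodedata.name(ch)}: {ok}")
```

Only the characters in category Pc are accepted here; the Pd, Po and Pe characters are rejected, which matches the interactive session.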
msg82842 - Author: Matthew Barnett (mrabarnett) * Date: 2009-02-27 16:54
The definition of a word in the new re module (actually targeted at
Python 2.7) is currently a sequence of L&, N&, M& and Pc.

I suppose ideally we want the definitions of a word and an identifier to
be basically the same, except that an identifier can't start with N&.
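
A minimal sketch of that proposal, not the re module's actual code: it assumes that "L&", "N&" and "M&" mean "any general category starting with L, N or M", and the function names are hypothetical.

```python
import unicodedata


def is_word_char(ch):
    """True if ch is in a letter, number or mark category, or in Pc."""
    category = unicodedata.category(ch)
    return category[0] in "LNM" or category == "Pc"


def is_word(text):
    """A word is a non-empty sequence of word characters."""
    return bool(text) and all(is_word_char(ch) for ch in text)


def is_identifier_like(text):
    """Same as a word, except that it must not start with a number."""
    return is_word(text) and not unicodedata.category(text[0]).startswith("N")


print(is_word("1abc"), is_identifier_like("1abc"), is_identifier_like("_x1"))
```

Under this reading the zero-width joiner (category Cf) is excluded from both words and identifiers. The sketch would also accept a leading combining mark, which Python's real identifier rule does not.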
msg82858 - Author: Martin v. Löwis (loewis) * (Python committer) Date: 2009-02-27 18:32
See PEP 3131 for a specification of what an identifier is in Python.

Closing this as "won't fix".
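
For reference, PEP 3131 defines identifiers in terms of general categories (Lu, Ll, Lt, Lm, Lo, Nl and the underscore for the first character; additionally Mn, Mc, Nd and Pc for the rest) and normalizes identifiers to NFKC. The sketch below approximates that rule; it is a simplification, not the interpreter's implementation: it ignores the Other_ID_Start and Other_ID_Continue properties and normalizes before checking. The name `pep3131_like` is hypothetical.

```python
import unicodedata

ID_START_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}
ID_CONTINUE_CATEGORIES = ID_START_CATEGORIES | {"Mn", "Mc", "Nd", "Pc"}


def pep3131_like(text):
    """Approximate PEP 3131 check; ignores Other_ID_Start/Other_ID_Continue."""
    # Identifiers are compared in NFKC form, so normalize first (simplification).
    text = unicodedata.normalize("NFKC", text)
    if not text:
        return False
    first, rest = text[0], text[1:]
    if first != "_" and unicodedata.category(first) not in ID_START_CATEGORIES:
        return False
    return all(unicodedata.category(ch) in ID_CONTINUE_CATEGORIES for ch in rest)


# The fullwidth and small-form low lines normalize to the ASCII underscore.
normalized = unicodedata.normalize("NFKC", "a\ufe4db\ufe4ec\ufe4fd")
print(normalized, pep3131_like("a\uff3fb"), pep3131_like("a\u200db"))
```

Under this approximation U+200D is rejected because its category, Cf, is in neither set, which is consistent with the behavior reported in this issue.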
History
Date User Action Args
2009-02-27 18:32:17 loewis set status: open -> closed, resolution: wont fix, messages: + msg82858
2009-02-27 16:54:59 mrabarnett set nosy: + mrabarnett, messages: + msg82842
2009-02-27 07:48:19 ezio.melotti set messages: + msg82822
2009-02-27 07:24:12 baijum set messages: + msg82821
2009-02-27 06:47:51 baijum set messages: + msg82820
2009-02-24 17:56:16 ezio.melotti set nosy: + ezio.melotti
2009-02-24 16:21:44 loewis set nosy: + loewis, messages: + msg82666
2009-02-24 11:53:50 baijum create