Message 142502 - Python tracker


Author	terry.reedy
Recipients	ezio.melotti, mrabarnett, tchrist, terry.reedy
Date	2011-08-19.22:50:56
Message-id	<1313794258.42.0.875776753601.issue12753@psf.upfronthosting.co.za>
In-reply-to

Content

I verified that the test file raises the quoted SyntaxError on 3.2 on Win7. This:

>>> "\N{LATIN CAPITAL LETTER GHA}"
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 0-27: unknown Unicode character name

is most likely a result of this:

>>> unicodedata.lookup("LATIN CAPITAL LETTER GHA")
Traceback (most recent call last):
  File "<pyshell#1>", line 1, in <module>
    unicodedata.lookup("LATIN CAPITAL LETTER GHA")
KeyError: "undefined character name 'LATIN CAPITAL LETTER GHA'"

Although the lookup comes first in nametests.py, it is never executed because of the later SyntaxError.
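A minimal sketch of why the earlier lookup never runs: the whole file is compiled before any statement executes, so a bad \N{} escape anywhere in it aborts everything. The source string, the file name "<nametests-sketch>", and the character name used here are made up for illustration and are not taken from nametests.py.

```python
# Two-line source: a harmless assignment, then a literal with an unknown name.
# The character name is deliberately bogus (an assumption for this sketch).
source = 'executed = True\nbad = "\\N{NO SUCH CHARACTER NAME}"\n'

namespace = {}
try:
    # compile() must succeed for the whole source before exec() runs line 1.
    exec(compile(source, "<nametests-sketch>", "exec"), namespace)
    outcome = "ran"
except SyntaxError:
    outcome = "SyntaxError"

# The first line was never executed, even though it precedes the bad literal.
print(outcome, "executed" in namespace)
```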

The Reference for string literals says:
"\N{name} Character named name in the Unicode database"

The doc for unicodedata says
"This module provides access to the Unicode Character Database (UCD) which defines character properties for all Unicode characters. The data contained in this database is compiled from the UCD version 6.0.0.

The module uses the same names and symbols as defined by Unicode Standard Annex #44, “Unicode Character Database”." 
http://www.unicode.org/reports/tr44/tr44-6.html

So the question is, what are the 'names' therein defined?
All such should be valid inputs to 
"unicodedata.lookup(name) Look up character by name."

The annex refers to http://www.unicode.org/Public/6.0.0/ucd/
This contains NamesList.txt, derived from UnicodeData.txt. Unicodedata must be using just the latter. The ucd directory also contains NameAliases.txt, NamedSequences.txt, and the file of provisional named sequences.

As best I can tell, the annex plus files are a bit ambiguous as to 'Unicode character name'. The following quote seems neutral: "the Unicode Character Database (UCD), a collection of data files which contain the Unicode character code points and character names." The following: "Unicode character names constitute a special case. Formally, they are values of the Name property." points toward UnicodeData.txt, which lists the Name property along with others. However, "Unicode character name, as published in the Unicode names list," indirectly points toward including aliases. NamesList.txt says it contains the "Final Unicode 6.0 names list." (but one which "should not be parsed for machine-readable information"). It includes all 11 aliases in NameAliases.txt.

My current opinion is that adding the aliases might be done in current releases. It certainly would serve any user who does not know to misspell 'FTHORA' as 'FHTORA' for just one of the 17 'FTHORA' chars.

Adding named sequences is definitely a feature request. The definition of .lookup(name) would be enlarged to "Look up character by name, alias, or named sequence" with reference to the specific files. The meaning of \N{} would also have to be enlarged.
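A rough sketch of what the enlarged lookup could mean, written as a wrapper rather than a change to unicodedata itself. The function name extended_lookup and the two small tables are hypothetical; they hold only the alias and the named sequence mentioned in this message, whereas a real implementation would presumably be generated from NameAliases.txt and NamedSequences.txt.

```python
import unicodedata

# Hypothetical tables with only the two entries quoted in this message.
ALIASES = {"LATIN CAPITAL LETTER GHA": "\u01a2"}
NAMED_SEQUENCES = {
    "LATIN CAPITAL LETTER A WITH MACRON AND GRAVE": "\u0100\u0300",
}

def extended_lookup(name):
    """Look up a character by name, alias, or named sequence."""
    try:
        # Ordinary Name property values are handled by the existing function.
        return unicodedata.lookup(name)
    except KeyError:
        pass
    key = name.upper()
    # Fall back to aliases first, then to named sequences.
    for table in (ALIASES, NAMED_SEQUENCES):
        if key in table:
            return table[key]
    raise KeyError("undefined character name %r" % name)
```

Note that a named sequence returns a string of more than one code point, which is why the wording "Look up character" would need enlarging as well.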

Minimal test code might be:

from unicodedata import lookup
assert lookup("LATIN CAPITAL LETTER GHA") == "\u01a2"
assert lookup("LATIN CAPITAL LETTER A WITH MACRON AND GRAVE") == "\u0100\u0300"
plus a test that "\N{LATIN CAPITAL LETTER GHA}" and
"\N{LATIN CAPITAL LETTER A WITH MACRON AND GRAVE}" compile without error (I have no idea how to write that).
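One possible way to write such a compile test is to build the literal as source text and hand it to the built-in compile(), so that a bad name surfaces as a catchable SyntaxError instead of stopping the test file itself. This is only a sketch; the helper name literal_compiles and the file name "<nametest>" are invented here.

```python
def literal_compiles(name):
    # Build the source text of a string literal that uses a backslash-N
    # escape with the given name, e.g. "\N{LATIN SMALL LETTER A}".
    source = '"\\N{' + name + '}"'
    try:
        compile(source, "<nametest>", "eval")
    except SyntaxError:
        # Unknown names are rejected by the 'unicodeescape' codec at
        # compile time and reported as a SyntaxError.
        return False
    return True

# A name from UnicodeData.txt compiles; a made-up name does not.
print(literal_compiles("LATIN SMALL LETTER A"))
print(literal_compiles("NO SUCH CHARACTER NAME"))
```

The two names from the proposed test could then be checked with literal_compiles("LATIN CAPITAL LETTER GHA") and literal_compiles("LATIN CAPITAL LETTER A WITH MACRON AND GRAVE"); whether they pass depends on the alias and named-sequence support discussed above.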

---
> "If you look at the ICU UCharacter class, you can see that they provide a more"

More what ;-)
I presume ICU = International Components for Unicode, icu-project.org/
"Offers a portable set of C/C++ and Java libraries for Unicode support, software internationalization (I18N) and globalization (G11N)."
[appears to be free, open source, and possibly usable within Python]

History
Date	User	Action	Args
2011-08-19 22:50:58	terry.reedy	set	recipients: + terry.reedy, ezio.melotti, mrabarnett, tchrist
2011-08-19 22:50:58	terry.reedy	set	messageid: <1313794258.42.0.875776753601.issue12753@psf.upfronthosting.co.za>
2011-08-19 22:50:57	terry.reedy	link	issue12753 messages
2011-08-19 22:50:56	terry.reedy	create