Index: Doc/library/unicodedata.rst
===================================================================
--- Doc/library/unicodedata.rst (revision 87144)
+++ Doc/library/unicodedata.rst (working copy)
@@ -13,56 +13,164 @@
    single: character
    pair: Unicode; database
 
-This module provides access to the Unicode Character Database which defines
-character properties for all Unicode characters. The data in this database is
-based on the :file:`UnicodeData.txt` file version 5.2.0 which is publicly
-available from ftp://ftp.unicode.org/.
+This module provides access to the Unicode Character Database (UCD) which
+defines character properties for all Unicode characters. The data contained in
+this database is compiled from the `UCD version 6.0.0
+`_.
 
-The module uses the same names and symbols as defined by the UnicodeData File
-Format 5.2.0 (see http://www.unicode.org/reports/tr44/tr44-4.html).
-It defines the following functions:
+The module uses the same names and symbols as defined by Unicode Standard Annex
+#44, `"Unicode Character Database (UCD)"
+`_. It defines the following
+functions:
 
 
 .. function:: lookup(name)
 
-   Look up character by name. If a character with the given name is found, return
-   the corresponding character. If not found, :exc:`KeyError` is raised.
+   Look up character by name. If a character with the given name is found,
+   return the corresponding character. If not found, :exc:`KeyError` is raised.
+   For example::
 
+      >>> unicodedata.lookup('PILCROW SIGN')
+      '¶'
 
+   The characters returned by this function are the same as those produced by
+   the ``\N`` escape sequence in string literals::
+
+      >>> unicodedata.lookup('MIDDLE DOT') == '\N{MIDDLE DOT}'
+      True
+
 .. function:: name(chr[, default])
 
    Returns the name assigned to the character *chr* as a string. If no name
    is defined, *default* is returned, or, if not given, :exc:`ValueError` is
-   raised.
+   raised. For example::
 
+      >>> unicodedata.name('Ӝ')
+      'CYRILLIC CAPITAL LETTER ZHE WITH DIAERESIS'
+      >>> unicodedata.name('\uFFFF', 'no name')
+      'no name'
+
 
 .. function:: decimal(chr[, default])
 
    Returns the decimal value assigned to the character *chr* as integer. If no
    such value is defined, *default* is returned, or, if not given,
-   :exc:`ValueError` is raised.
+   :exc:`ValueError` is raised. For example::
 
+      >>> unicodedata.decimal('\N{ARABIC-INDIC DIGIT NINE}')
+      9
+      >>> unicodedata.decimal('\N{SUPERSCRIPT NINE}', -1)
+      -1
+
+
 
 .. function:: digit(chr[, default])
 
    Returns the digit value assigned to the character *chr* as integer. If no
    such value is defined, *default* is returned, or, if not given,
-   :exc:`ValueError` is raised.
+   :exc:`ValueError` is raised. For example::
 
+      >>> unicodedata.digit('\N{SUPERSCRIPT NINE}')
+      9
+      >>> unicodedata.digit('\N{ROMAN NUMERAL NINE}', -1)
+      -1
+
+
 
 .. function:: numeric(chr[, default])
 
    Returns the numeric value assigned to the character *chr* as float. If no
    such value is defined, *default* is returned, or, if not given,
    :exc:`ValueError` is raised.
 
+   >>> unicodedata.numeric('½')
+   0.5
+   >>> unicodedata.numeric('\N{ROMAN NUMERAL TEN THOUSAND}')
+   10000.0
+
+
 
 .. function:: category(chr)
 
-   Returns the general category assigned to the character *chr* as
-   string.
+   Returns the general category assigned to the character *chr* as a string.
+   General category names consist of two letters. The first letter is always
+   uppercase and denotes one of seven major categories: Letter (L), Mark (M),
+   Number (N), Punctuation (P), Symbol (S), Separator (Z), and Other (C). The
+   second letter is always lowercase and further subdivides major categories
+   into minor subcategories.
 
+   +--------------------------------------------------------------------------+
+   | **General Categories**                                                   |
+   +----+-------------+------------------+------------------------------------+
+   |Name|Major        |Minor             |Examples                            |
+   +====+=============+==================+====================================+
+   |Lu  | Letter      | uppercase        |                                    |
+   +----+-------------+------------------+------------------------------------+
+   |Ll  | Letter      | lowercase        |                                    |
+   +----+-------------+------------------+------------------------------------+
+   |Lt  | Letter      | titlecase        |                                    |
+   +----+-------------+------------------+------------------------------------+
+   |Lm  | Letter      | modifier         |                                    |
+   +----+-------------+------------------+------------------------------------+
+   |Lo  | Letter      | other            |                                    |
+   +----+-------------+------------------+------------------------------------+
+   |Mn  | Mark        | nonspacing       |                                    |
+   +----+-------------+------------------+------------------------------------+
+   |Mc  | Mark        | spacing combining|                                    |
+   +----+-------------+------------------+------------------------------------+
+   |Me  | Mark        | enclosing        |                                    |
+   +----+-------------+------------------+------------------------------------+
+   |Nd  | Number      | decimal digit    |                                    |
+   +----+-------------+------------------+------------------------------------+
+   |Nl  | Number      | letter           |                                    |
+   +----+-------------+------------------+------------------------------------+
+   |No  | Number      | other            |                                    |
+   +----+-------------+------------------+------------------------------------+
+   |Pc  | Punctuation | connector        |                                    |
+   +----+-------------+------------------+------------------------------------+
+   |Pd  | Punctuation | dash             |                                    |
+   +----+-------------+------------------+------------------------------------+
+   |Ps  | Punctuation | open             |                                    |
+   +----+-------------+------------------+------------------------------------+
+   |Pe  | Punctuation | close            |                                    |
+   +----+-------------+------------------+------------------------------------+
+   |Pi  | Punctuation | initial quote    |                                    |
+   +----+-------------+------------------+------------------------------------+
+   |Pf  | Punctuation | final quote      |                                    |
+   +----+-------------+------------------+------------------------------------+
+   |Po  | Punctuation | other            |                                    |
+   +----+-------------+------------------+------------------------------------+
+   |Sm  | Symbol      | math             |                                    |
+   +----+-------------+------------------+------------------------------------+
+   |Sc  | Symbol      | currency         |                                    |
+   +----+-------------+------------------+------------------------------------+
+   |Sk  | Symbol      | modifier         |                                    |
+   +----+-------------+------------------+------------------------------------+
+   |So  | Symbol      | other            |                                    |
+   +----+-------------+------------------+------------------------------------+
+   |Zs  | Separator   | space            |                                    |
+   +----+-------------+------------------+------------------------------------+
+   |Zl  | Separator   | line             |                                    |
+   +----+-------------+------------------+------------------------------------+
+   |Zp  | Separator   | paragraph        |                                    |
+   +----+-------------+------------------+------------------------------------+
+   |Cc  | Other       | control          |                                    |
+   +----+-------------+------------------+------------------------------------+
+   |Cf  | Other       | format           |                                    |
+   +----+-------------+------------------+------------------------------------+
+   |Cs  | Other       | surrogate        |                                    |
+   +----+-------------+------------------+------------------------------------+
+   |Co  | Other       | private use      |                                    |
+   +----+-------------+------------------+------------------------------------+
 
+   The following example program produces code point counts by major category:
+
+   .. literalinclude:: ../includes/unistat.py
+
+   ::
+
+      Counter({'C': 1004868, 'L': 100520, 'S': 5508, 'M': 1498, 'N': 1100, 'P': 598, 'Z': 20})
+
 .. function:: bidirectional(chr)
 
    Returns the bidirectional category assigned to the character *chr* as
@@ -158,4 +266,3 @@
    'Lu'
    >>> unicodedata.bidirectional('\u0660') # 'A'rabic, 'N'umber
    'AN'
-
Index: Doc/includes/unistat.py
===================================================================
--- Doc/includes/unistat.py (revision 0)
+++ Doc/includes/unistat.py (revision 0)
@@ -0,0 +1,9 @@
+import unicodedata
+from collections import Counter
+
+catcount = Counter()
+for i in range(0x110000):
+    cat = unicodedata.category(chr(i))[0]
+    catcount[cat] += 1
+
+print(catcount)

Property changes on: Doc/includes/unistat.py
___________________________________________________________________
Added: svn:keywords
   + Id
Added: svn:eol-style
   + native
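
The ``decimal()``, ``digit()`` and ``numeric()`` examples in the patch rely on those three properties being progressively more permissive. The script below is a minimal sketch for checking that relationship against the characters the patch uses. It is not part of the patch, and the helper name ``numeric_properties`` is hypothetical.

```python
import unicodedata


def numeric_properties(ch):
    """Return (decimal, digit, numeric) for ch, with None where undefined.

    Hypothetical helper for illustration; it only wraps the three
    documented unicodedata functions and their *default* argument.
    """
    return (
        unicodedata.decimal(ch, None),
        unicodedata.digit(ch, None),
        unicodedata.numeric(ch, None),
    )


for char_name in (
    'ARABIC-INDIC DIGIT NINE',   # a true decimal digit
    'SUPERSCRIPT NINE',          # a digit, but not a decimal digit
    'ROMAN NUMERAL NINE',        # numeric only
):
    ch = unicodedata.lookup(char_name)
    print(char_name, numeric_properties(ch))

# ARABIC-INDIC DIGIT NINE (9, 9, 9.0)
# SUPERSCRIPT NINE (None, 9, 9.0)
# ROMAN NUMERAL NINE (None, None, 9.0)
```

Passing ``None`` as *default* avoids the :exc:`ValueError` path, so one call per property is enough to see which values a character defines.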