Index: Doc/library/unicodedata.rst
===================================================================
--- Doc/library/unicodedata.rst	(revision 87144)
+++ Doc/library/unicodedata.rst	(working copy)
@@ -13,62 +13,201 @@
    single: character
    pair: Unicode; database
 
-This module provides access to the Unicode Character Database which defines
-character properties for all Unicode characters. The data in this database is
-based on the :file:`UnicodeData.txt` file version 5.2.0 which is publicly
-available from ftp://ftp.unicode.org/.
+This module provides access to the Unicode Character Database (UCD) which
+defines character properties for all Unicode characters. The data contained in
+this database is compiled from the `UCD version 6.0.0
+`_.
 
-The module uses the same names and symbols as defined by the UnicodeData File
-Format 5.2.0 (see http://www.unicode.org/reports/tr44/tr44-4.html).
-It defines the following functions:
+The module uses the same names and symbols as defined by Unicode Standard Annex
+#44, `"Unicode Character Database (UCD)"
+`_. It defines the following
+functions:
 
 
 .. function:: lookup(name)
 
-   Look up character by name. If a character with the given name is found, return
-   the corresponding character. If not found, :exc:`KeyError` is raised.
+   Look up character by name. If a character with the given name is found,
+   return the corresponding character. If not found, :exc:`KeyError` is raised.
+   For example,::
 
+      >>> unicodedata.lookup('PILCROW SIGN')
+      '¶'
 
+   The characters returned by this function are the same as those produced by
+   ``\N`` escape sequence in string literals::
+
+      >>> unicodedata.lookup('MIDDLE DOT') == '\N{MIDDLE DOT}'
+      True
+
 .. function:: name(chr[, default])
 
    Returns the name assigned to the character *chr* as a string. If no
    name is defined, *default* is returned, or, if not given, :exc:`ValueError` is
-   raised.
+   raised. For example,::
 
+      >>> unicodedata.name('Ӝ')
+      'CYRILLIC CAPITAL LETTER ZHE WITH DIAERESIS'
+      >>> unicodedata.name('\uFFFF', 'no name')
+      'no name'
+
 
 .. function:: decimal(chr[, default])
 
    Returns the decimal value assigned to the character *chr* as integer.
    If no such value is defined, *default* is returned, or, if not given,
-   :exc:`ValueError` is raised.
+   :exc:`ValueError` is raised. For example,::
 
+      >>> unicodedata.decimal('\N{ARABIC-INDIC DIGIT NINE}')
+      9
+      >>> unicodedata.decimal('\N{SUPERSCRIPT NINE}', -1)
+      -1
+
+
 
 .. function:: digit(chr[, default])
 
    Returns the digit value assigned to the character *chr* as integer.
    If no such value is defined, *default* is returned, or, if not given,
-   :exc:`ValueError` is raised.
+   :exc:`ValueError` is raised. For example,::
 
+      >>> unicodedata.digit('\N{SUPERSCRIPT NINE}')
+      9
+      >>> unicodedata.digit('\N{ROMAN NUMERAL NINE}', -1)
+      -1
+
+
 
 .. function:: numeric(chr[, default])
 
    Returns the numeric value assigned to the character *chr* as float.
    If no such value is defined, *default* is returned, or, if not given,
    :exc:`ValueError` is raised.
 
+      >>> unicodedata.numeric('½')
+      0.5
+      >>> unicodedata.numeric('\N{ROMAN NUMERAL TEN THOUSAND}')
+      10000.0
+
+
 
 .. function:: category(chr)
 
-   Returns the general category assigned to the character *chr* as
-   string.
+   Returns the general category assigned to the character *chr* as string.
+   General category names consist of two letters. The first letter is always
+   uppercase and denotes one of seven major categories: Letter (L), Mark (M),
+   Number (N), Punctuation (P), Symbol (S), Separator (Z), and Other (C). The
+   second letter is always lowercase and further subdivides major categories
+   into minor subcategories.
 
+   +--------------------------------------------------------------------------+
+   |                          **General Categories**                          |
+   +----+-------------+------------------+------------------------------------+
+   |Name|Major        |Minor             |Examples                            |
+   +====+=============+==================+====================================+
+   |Lu  | Letter      | uppercase        |                                    |
+   +----+-------------+------------------+------------------------------------+
+   |Ll  | Letter      | lowercase        |                                    |
+   +----+-------------+------------------+------------------------------------+
+   |Lt  | Letter      | titlecase        |                                    |
+   +----+-------------+------------------+------------------------------------+
+   |Lm  | Letter      | modifier         |                                    |
+   +----+-------------+------------------+------------------------------------+
+   |Lo  | Letter      | other            |                                    |
+   +----+-------------+------------------+------------------------------------+
+   |Mn  | Mark        | nonspacing       |                                    |
+   +----+-------------+------------------+------------------------------------+
+   |Mc  | Mark        | spacing combining|                                    |
+   +----+-------------+------------------+------------------------------------+
+   |Me  | Mark        | enclosing        |                                    |
+   +----+-------------+------------------+------------------------------------+
+   |Nd  | Number      | decimal digit    |                                    |
+   +----+-------------+------------------+------------------------------------+
+   |Nl  | Number      | letter           |                                    |
+   +----+-------------+------------------+------------------------------------+
+   |No  | Number      | other            |                                    |
+   +----+-------------+------------------+------------------------------------+
+   |Pc  | Punctuation | connector        |                                    |
+   +----+-------------+------------------+------------------------------------+
+   |Pd  | Punctuation | dash             |                                    |
+   +----+-------------+------------------+------------------------------------+
+   |Ps  | Punctuation | open             |                                    |
+   +----+-------------+------------------+------------------------------------+
+   |Pe  | Punctuation | close            |                                    |
+   +----+-------------+------------------+------------------------------------+
+   |Pi  | Punctuation | initial quote    |                                    |
+   +----+-------------+------------------+------------------------------------+
+   |Pf  | Punctuation | final quote      |                                    |
+   +----+-------------+------------------+------------------------------------+
+   |Po  | Punctuation | other            |                                    |
+   +----+-------------+------------------+------------------------------------+
+   |Sm  | Symbol      | math             |                                    |
+   +----+-------------+------------------+------------------------------------+
+   |Sc  | Symbol      | currency         |                                    |
+   +----+-------------+------------------+------------------------------------+
+   |Sk  | Symbol      | modifier         |                                    |
+   +----+-------------+------------------+------------------------------------+
+   |So  | Symbol      | other            |                                    |
+   +----+-------------+------------------+------------------------------------+
+   |Zs  | Separator   | space            |                                    |
+   +----+-------------+------------------+------------------------------------+
+   |Zl  | Separator   | line             |                                    |
+   +----+-------------+------------------+------------------------------------+
+   |Zp  | Separator   | paragraph        |                                    |
+   +----+-------------+------------------+------------------------------------+
+   |Cc  | Other       | control          |                                    |
+   +----+-------------+------------------+------------------------------------+
+   |Cf  | Other       | format           |                                    |
+   +----+-------------+------------------+------------------------------------+
+   |Cs  | Other       | surrogate        |                                    |
+   +----+-------------+------------------+------------------------------------+
+   |Co  | Other       | private use      |                                    |
+   +----+-------------+------------------+------------------------------------+
+   |Cn  | Other       | not assigned     |                                    |
+   +----+-------------+------------------+------------------------------------+
 
+   The following example program produces code point counts by major category:
+
+   .. literalinclude:: ../includes/unistat.py
+
+   ::
+
+      Counter({'C': 1004868, 'L': 100520, 'S': 5508, 'M': 1498, 'N': 1100, 'P': 598, 'Z': 20})
+
 .. function:: bidirectional(chr)
 
-   Returns the bidirectional category assigned to the character *chr* as
-   string. If no such value is defined, an empty string is returned.
+   Returns the bidirectional class assigned to the character *chr* as
+   string. If no such value is defined, an empty string is returned. For example,::
 
+      >>> unicodedata.bidirectional('\u0660') # 'A'rabic, 'N'umber
+      'AN'
 
+   Bidirectional class names returned by this function have the following meaning:
+
+   ===== =========================
+   Class Description
+   ===== =========================
+   AL    Arabic Letter
+   AN    Arabic Number
+   B     Paragraph Separator
+   BN    Boundary Neutral
+   CS    Common Separator
+   EN    European Number
+   ES    European Separator
+   ET    European Terminator
+   L     Left To Right
+   LRE   Left To Right Embedding
+   LRO   Left To Right Override
+   NSM   Nonspacing Mark
+   ON    Other Neutral
+   PDF   Pop Directional Format
+   R     Right To Left
+   RLE   Right To Left Embedding
+   RLO   Right To Left Override
+   S     Segment Separator
+   WS    White Space
+   ===== =========================
+
+
 .. function:: combining(chr)
 
    Returns the canonical combining class assigned to the character *chr*
@@ -80,21 +219,37 @@
    Returns the east asian width assigned to the character *chr* as
    string.
 
+   ==== ============
+   Code Description
+   ==== ============
+   A    Ambiguous
+   F    Fullwidth
+   H    Halfwidth
+   N    Neutral
+   Na   Narrow
+   W    Wide
+   ==== ============
 
 .. function:: mirrored(chr)
 
    Returns the mirrored property assigned to the character *chr* as integer.
    Returns ``1`` if the character has been identified as a "mirrored"
-   character in bidirectional text, ``0`` otherwise.
+   character in bidirectional text, ``0`` otherwise. For example,::
 
+      >>> unicodedata.mirrored('>')
+      1
+
 
 .. function:: decomposition(chr)
 
    Returns the character decomposition mapping assigned to the character
    *chr* as string. An empty string is returned in case no such mapping is
-   defined.
+   defined. For example,::
 
+      >>> unicodedata.decomposition('è')
+      '0065 0300'
+
 
 .. function:: normalize(form, unistr)
 
    Return the normal form *form* for the Unicode string *unistr*. Valid values for
@@ -156,6 +311,3 @@
    ValueError: not a decimal
    >>> unicodedata.category('A') # 'L'etter, 'u'ppercase
    'Lu'
-   >>> unicodedata.bidirectional('\u0660') # 'A'rabic, 'N'umber
-   'AN'
-
Index: Doc/includes/unistat.py
===================================================================
--- Doc/includes/unistat.py	(revision 0)
+++ Doc/includes/unistat.py	(revision 0)
@@ -0,0 +1,9 @@
+import unicodedata
+from collections import Counter
+
+catcount = Counter()
+for i in range(0x110000):
+    cat = unicodedata.category(chr(i))[0]
+    catcount[cat] += 1
+
+print(catcount)

Property changes on: Doc/includes/unistat.py
___________________________________________________________________
Added: svn:keywords
   + Id
Added: svn:eol-style
   + native
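
The doctest examples added by this patch can be spot-checked with a short script. The following is a minimal sketch and is not part of the patch: the helper name `major_category_counts` is hypothetical, and it simply mirrors the loop in Doc/includes/unistat.py. The per-category counts depend on the Unicode version bundled with the interpreter, so the sketch checks only the total and the set of major categories, not the exact Counter output quoted in the patch.

```python
import unicodedata
from collections import Counter


def major_category_counts(limit=0x110000):
    """Count code points by the first letter of their general category.

    Hypothetical helper (not in the patch); same logic as unistat.py.
    """
    return Counter(unicodedata.category(chr(i))[0] for i in range(limit))


# Values taken from the doctests in the patch. These character properties
# are stable across Unicode versions, so they should hold on Python 3.
assert unicodedata.lookup('MIDDLE DOT') == '\N{MIDDLE DOT}'
assert unicodedata.decimal('\N{SUPERSCRIPT NINE}', -1) == -1
assert unicodedata.digit('\N{SUPERSCRIPT NINE}') == 9
assert unicodedata.digit('\N{ROMAN NUMERAL NINE}', -1) == -1
assert unicodedata.numeric('\N{ROMAN NUMERAL TEN THOUSAND}') == 10000.0
assert unicodedata.bidirectional('\u0660') == 'AN'
assert unicodedata.mirrored('>') == 1
assert unicodedata.decomposition('\u00e8') == '0065 0300'

counts = major_category_counts()
# Every code point has exactly one general category, so the counts
# sum to the size of the code space regardless of Unicode version.
assert sum(counts.values()) == 0x110000
print(sorted(counts))
```

The three numeric accessors nest: a character with a decimal value also has a digit and a numeric value, while SUPERSCRIPT NINE has only digit and numeric values and ROMAN NUMERAL NINE has only a numeric value, which is why the patch's examples pass a default of -1 for those cases.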