Index: Doc/library/unicodedata.rst
===================================================================
--- Doc/library/unicodedata.rst (revision 87160)
+++ Doc/library/unicodedata.rst (working copy)
@@ -18,58 +18,196 @@
 this database is compiled from the `UCD version 6.0.0 `_.

-The module uses the same names and symbols as defined by Unicode
-Standard Annex #44, `"Unicode Character Database"
-`_. It defines the
-following functions:
+The module uses the same names and symbols as defined by Unicode Standard Annex
+#44, `"Unicode Character Database (UCD)"
+`_. It defines the following
+functions:

 .. function:: lookup(name)

-   Look up character by name. If a character with the given name is found, return
-   the corresponding character. If not found, :exc:`KeyError` is raised.
+   Look up character by name. If a character with the given name is found,
+   return the corresponding character. If not found, :exc:`KeyError` is raised.
+   For example::

+      >>> unicodedata.lookup('PILCROW SIGN')
+      '¶'

+   The characters returned by this function are the same as those produced by
+   the ``\N`` escape sequence in string literals::
+
+      >>> unicodedata.lookup('MIDDLE DOT') == '\N{MIDDLE DOT}'
+      True
+
 .. function:: name(chr[, default])

    Returns the name assigned to the character *chr* as a string. If no
    name is defined, *default* is returned, or, if not given, :exc:`ValueError` is
-   raised.
+   raised. For example::

+      >>> unicodedata.name('Ӝ')
+      'CYRILLIC CAPITAL LETTER ZHE WITH DIAERESIS'
+      >>> unicodedata.name('\uFFFF', 'no name')
+      'no name'
+
 .. function:: decimal(chr[, default])

    Returns the decimal value assigned to the character *chr* as integer.
    If no such value is defined, *default* is returned, or, if not given,
-   :exc:`ValueError` is raised.
+   :exc:`ValueError` is raised. For example::

+      >>> unicodedata.decimal('\N{ARABIC-INDIC DIGIT NINE}')
+      9
+      >>> unicodedata.decimal('\N{SUPERSCRIPT NINE}', -1)
+      -1
+
+
 .. function:: digit(chr[, default])

    Returns the digit value assigned to the character *chr* as integer.
    If no such value is defined, *default* is returned, or, if not given,
-   :exc:`ValueError` is raised.
+   :exc:`ValueError` is raised. For example::

+      >>> unicodedata.digit('\N{SUPERSCRIPT NINE}')
+      9
+      >>> unicodedata.digit('\N{ROMAN NUMERAL NINE}', -1)
+      -1
+
+
 .. function:: numeric(chr[, default])

    Returns the numeric value assigned to the character *chr* as float.
    If no such value is defined, *default* is returned, or, if not given,
    :exc:`ValueError` is raised.

+      >>> unicodedata.numeric('½')
+      0.5
+      >>> unicodedata.numeric('\N{ROMAN NUMERAL TEN THOUSAND}')
+      10000.0
+
+
 .. function:: category(chr)

-   Returns the general category assigned to the character *chr* as
-   string.
+   Returns the general category assigned to the character *chr* as string.
+   General category names consist of two letters. The first letter is always
+   uppercase and denotes one of seven major categories: Letter (L), Mark (M),
+   Number (N), Punctuation (P), Symbol (S), Separator (Z), and Other (C). The
+   second letter is always lowercase and further subdivides major categories
+   into minor subcategories.

+   +--------------------------------------------------------------------------+
+   | **General Categories**                                                   |
+   +----+-------------+------------------+------------------------------------+
+   |Name|Major        |Minor             |Examples                            |
+   +====+=============+==================+====================================+
+   |Lu  | Letter      | uppercase        | 'A', 'Z', 'Ω'                      |
+   +----+-------------+------------------+------------------------------------+
+   |Ll  | Letter      | lowercase        | 'a', 'z', 'ω'                      |
+   +----+-------------+------------------+------------------------------------+
+   |Lt  | Letter      | titlecase        | 'Dž', 'Lj', 'ῼ'                      |
+   +----+-------------+------------------+------------------------------------+
+   |Lm  | Letter      | modifier         | 'ʰ', 'ʲ', 'ʶ'                      |
+   +----+-------------+------------------+------------------------------------+
+   |Lo  | Letter      | other            | 'ƻ', 'א' ,'ث'                      |
+   +----+-------------+------------------+------------------------------------+
+   |Mn  | Mark        | nonspacing       | '\\u0300' (GRAVE ACCENT)           |
+   +----+-------------+------------------+------------------------------------+
+   |Mc  | Mark        | spacing combining| 'ः' (DEVANAGARI SIGN VISARGA)      |
+   +----+-------------+------------------+------------------------------------+
+   |Me  | Mark        | enclosing        | '\\u20DD' (ENCLOSING CIRCLE)       |
+   +----+-------------+------------------+------------------------------------+
+   |Nd  | Number      | decimal digit    | '1', '١', '१'                      |
+   +----+-------------+------------------+------------------------------------+
+   |Nl  | Number      | letter           | 'Ⅸ' (ROMAN NUMERAL NINE)           |
+   +----+-------------+------------------+------------------------------------+
+   |No  | Number      | other            | '²' (SUPERSCRIPT TWO)              |
+   +----+-------------+------------------+------------------------------------+
+   |Pc  | Punctuation | connector        | '_' (ASCII UNDERSCORE)             |
+   +----+-------------+------------------+------------------------------------+
+   |Pd  | Punctuation | dash             | '-' (ASCII HYPHEN-MINUS)           |
+   +----+-------------+------------------+------------------------------------+
+   |Ps  | Punctuation | open             | '(', '[', '{'                      |
+   +----+-------------+------------------+------------------------------------+
+   |Pe  | Punctuation | close            | ')', ']', '}'                      |
+   +----+-------------+------------------+------------------------------------+
+   |Pi  | Punctuation | initial quote    | '«', '‘', '⸠'                      |
+   +----+-------------+------------------+------------------------------------+
+   |Pf  | Punctuation | final quote      | '»', '’', '⸡'                      |
+   +----+-------------+------------------+------------------------------------+
+   |Po  | Punctuation | other            | '!', '"', '¿'                      |
+   +----+-------------+------------------+------------------------------------+
+   |Sm  | Symbol      | math             | '+', '=', '±'                      |
+   +----+-------------+------------------+------------------------------------+
+   |Sc  | Symbol      | currency         | '$', '£', '¥'                      |
+   +----+-------------+------------------+------------------------------------+
+   |Sk  | Symbol      | modifier         | '\\u00B8' (CEDILLA)                |
+   +----+-------------+------------------+------------------------------------+
+   |So  | Symbol      | other            | '☹' (FACE), '�' (REPLACEMENT CHAR) |
+   +----+-------------+------------------+------------------------------------+
+   |Zs  | Separator   | space            | ' ' (ASCII SPACE)                  |
+   +----+-------------+------------------+------------------------------------+
+   |Zl  | Separator   | line             | '\\u2028' (LINE SEPARATOR)         |
+   +----+-------------+------------------+------------------------------------+
+   |Zp  | Separator   | paragraph        | '\\u2029' (PARAGRAPH SEPARATOR)    |
+   +----+-------------+------------------+------------------------------------+
+   |Cc  | Other       | control          | '\\0' (NULL), '\\t' (TAB)          |
+   +----+-------------+------------------+------------------------------------+
+   |Cf  | Other       | format           | '\\u00AD' (SOFT HYPHEN)            |
+   +----+-------------+------------------+------------------------------------+
+   |Cs  | Other       | surrogate        | '\\uD800' - '\\uDFFF'              |
+   +----+-------------+------------------+------------------------------------+
+   |Co  | Other       | private use      | '\\uE000' - '\\uF8FF'              |
+   +----+-------------+------------------+------------------------------------+
+   |Cn  | Other       | not assigned     | '\\uFFFF'                          |
+   +----+-------------+------------------+------------------------------------+

+   The following example program produces code point counts by major category:
+
+   .. literalinclude:: ../includes/unistat.py
+
+   ::
+
+      Counter({'C': 1004868, 'L': 100520, 'S': 5508, 'M': 1498, 'N': 1100, 'P': 598, 'Z': 20})
+
 .. function:: bidirectional(chr)

-   Returns the bidirectional category assigned to the character *chr* as
-   string. If no such value is defined, an empty string is returned.
+   Returns the bidirectional class assigned to the character *chr* as
+   string. If no such value is defined, an empty string is returned. For example::

+      >>> unicodedata.bidirectional('\u0660') # 'A'rabic, 'N'umber
+      'AN'

+   Bidirectional class names returned by this function have the following meanings:
+
+   ===== =========================
+   Class Description
+   ===== =========================
+   AL    Arabic Letter
+   AN    Arabic Number
+   B     Paragraph Separator
+   BN    Boundary Neutral
+   CS    Common Separator
+   EN    European Number
+   ES    European Separator
+   ET    European Terminator
+   L     Left To Right
+   LRE   Left To Right Embedding
+   LRO   Left To Right Override
+   NSM   Nonspacing Mark
+   ON    Other Neutral
+   PDF   Pop Directional Format
+   R     Right To Left
+   RLE   Right To Left Embedding
+   RLO   Right To Left Override
+   S     Segment Separator
+   WS    White Space
+   ===== =========================
+
+
 .. function:: combining(chr)

    Returns the canonical combining class assigned to the character *chr*
@@ -81,21 +219,37 @@

    Returns the east asian width assigned to the character *chr* as
    string.

+   ==== ============
+   Code Description
+   ==== ============
+   A    Ambiguous
+   F    Fullwidth
+   H    Halfwidth
+   N    Neutral
+   Na   Narrow
+   W    Wide
+   ==== ============

 .. function:: mirrored(chr)

    Returns the mirrored property assigned to the character *chr* as integer.
    Returns ``1`` if the character has been identified as a "mirrored"
-   character in bidirectional text, ``0`` otherwise.
+   character in bidirectional text, ``0`` otherwise. For example::

+      >>> unicodedata.mirrored('>')
+      1
+
 .. function:: decomposition(chr)

    Returns the character decomposition mapping assigned to the character
    *chr* as string. An empty string is returned in case no such mapping is
-   defined.
+   defined. For example::

+      >>> unicodedata.decomposition('è')
+      '0065 0300'
+
 .. function:: normalize(form, unistr)

    Return the normal form *form* for the Unicode string *unistr*. Valid values for
@@ -157,6 +311,3 @@
    ValueError: not a decimal
    >>> unicodedata.category('A') # 'L'etter, 'u'ppercase
    'Lu'
-   >>> unicodedata.bidirectional('\u0660') # 'A'rabic, 'N'umber
-   'AN'
-
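
The property lookups documented in this patch can be exercised together from a short script. The following is only a minimal sketch: it is not part of the patch and it is not the ``unistat.py`` file the patch includes, whose contents are not shown here. The helper names ``describe`` and ``major_category_counts`` are hypothetical and chosen for illustration; only the ``unicodedata`` and ``collections.Counter`` calls are standard library.

```python
import unicodedata
from collections import Counter


def describe(ch):
    """Gather the per-character properties described above into one dict.

    Hypothetical helper for illustration; it passes a default wherever the
    documented function accepts one, so unnamed or non-numeric characters
    do not raise ValueError.
    """
    return {
        "name": unicodedata.name(ch, "<unnamed>"),
        "category": unicodedata.category(ch),
        "bidirectional": unicodedata.bidirectional(ch),
        "east_asian_width": unicodedata.east_asian_width(ch),
        "mirrored": unicodedata.mirrored(ch),
        "decomposition": unicodedata.decomposition(ch),
        "numeric": unicodedata.numeric(ch, None),
    }


def major_category_counts(text):
    """Count characters by the first (major) letter of their general category."""
    return Counter(unicodedata.category(ch)[0] for ch in text)


if __name__ == "__main__":
    # Print each property of a few sample characters, one property per line.
    for ch in (">", "\u00e8", "\u0660"):
        for key, value in describe(ch).items():
            print(f"{ch!a:10} {key:18} {value!r}")
    print(major_category_counts("Hi, 42!"))
```

Counting only the major category letter keeps the result to at most seven keys, which matches the shape of the ``Counter`` output quoted in the patch, though the figures there come from scanning every code point rather than a sample string.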