Index: Doc/library/unicodedata.rst
===================================================================
--- Doc/library/unicodedata.rst	(revision 87144)
+++ Doc/library/unicodedata.rst	(working copy)
@@ -13,62 +13,201 @@
    single: character
    pair: Unicode; database
 
-This module provides access to the Unicode Character Database which defines
-character properties for all Unicode characters. The data in this database is
-based on the :file:`UnicodeData.txt` file version 5.2.0 which is publicly
-available from ftp://ftp.unicode.org/.
+This module provides access to the Unicode Character Database (UCD), which
+defines character properties for all Unicode characters. The data contained in
+this database is compiled from the `UCD version 6.0.0
+<http://www.unicode.org/Public/6.0.0/ucd>`_.
 
-The module uses the same names and symbols as defined by the UnicodeData File
-Format 5.2.0 (see http://www.unicode.org/reports/tr44/tr44-4.html).
-It defines the following functions:
+The module uses the same names and symbols as defined by Unicode Standard Annex
+#44, `"Unicode Character Database (UCD)"
+<http://www.unicode.org/reports/tr44/tr44-6.html>`_.  It defines the following
+functions:
 
 
 .. function:: lookup(name)
 
-   Look up character by name.  If a character with the given name is found, return
-   the corresponding character.  If not found, :exc:`KeyError` is raised.
+   Look up character by name.  If a character with the given name is found,
+   return the corresponding character.  If not found, :exc:`KeyError` is raised.
+   For example::
 
+      >>> unicodedata.lookup('PILCROW SIGN')
+      '¶'
 
+   The characters returned by this function are the same as those produced by
+   the ``\N{name}`` escape sequence in string literals::
+
+      >>> unicodedata.lookup('MIDDLE DOT') == '\N{MIDDLE DOT}'
+      True
+
 .. function:: name(chr[, default])
 
    Returns the name assigned to the character *chr* as a string. If no
    name is defined, *default* is returned, or, if not given, :exc:`ValueError` is
-   raised.
+   raised.  For example::
 
+      >>> unicodedata.name('Ӝ')
+      'CYRILLIC CAPITAL LETTER ZHE WITH DIAERESIS'
 
+      >>> unicodedata.name('\uFFFF', 'no name')
+      'no name'
+
 .. function:: decimal(chr[, default])
 
    Returns the decimal value assigned to the character *chr* as integer.
    If no such value is defined, *default* is returned, or, if not given,
-   :exc:`ValueError` is raised.
+   :exc:`ValueError` is raised.  For example::
 
+      >>> unicodedata.decimal('\N{ARABIC-INDIC DIGIT NINE}')
+      9
 
+      >>> unicodedata.decimal('\N{SUPERSCRIPT NINE}', -1)
+      -1
+
+
 .. function:: digit(chr[, default])
 
    Returns the digit value assigned to the character *chr* as integer.
    If no such value is defined, *default* is returned, or, if not given,
-   :exc:`ValueError` is raised.
+   :exc:`ValueError` is raised.  For example::
 
+      >>> unicodedata.digit('\N{SUPERSCRIPT NINE}')
+      9
 
+      >>> unicodedata.digit('\N{ROMAN NUMERAL NINE}', -1)
+      -1
+
+
 .. function:: numeric(chr[, default])
 
    Returns the numeric value assigned to the character *chr* as float.
    If no such value is defined, *default* is returned, or, if not given,
-   :exc:`ValueError` is raised.
+   :exc:`ValueError` is raised.  For example::
 
+      >>> unicodedata.numeric('½')
+      0.5
 
+      >>> unicodedata.numeric('\N{ROMAN NUMERAL TEN THOUSAND}')
+      10000.0
+
+
 .. function:: category(chr)
 
-   Returns the general category assigned to the character *chr* as
-   string.
+   Returns the general category assigned to the character *chr* as a string.
+   General category names consist of two letters.  The first letter is always
+   uppercase and denotes one of seven major categories: Letter (L), Mark (M),
+   Number (N), Punctuation (P), Symbol (S), Separator (Z), and Other (C).  The
+   second letter is always lowercase and further subdivides major categories
+   into minor subcategories.
 
+   +--------------------------------------------------------------------------+
+   | **General Categories**                                                   |
+   +----+-------------+------------------+------------------------------------+
+   |Name|Major        |Minor             |Examples                            |
+   +====+=============+==================+====================================+
+   |Lu  | Letter      | uppercase        |                                    |
+   +----+-------------+------------------+------------------------------------+
+   |Ll  | Letter      | lowercase        |                                    |
+   +----+-------------+------------------+------------------------------------+
+   |Lt  | Letter      | titlecase        |                                    |
+   +----+-------------+------------------+------------------------------------+
+   |Lm  | Letter      | modifier         |                                    |
+   +----+-------------+------------------+------------------------------------+
+   |Lo  | Letter      | other            |                                    |
+   +----+-------------+------------------+------------------------------------+
+   |Mn  | Mark        | nonspacing       |                                    |
+   +----+-------------+------------------+------------------------------------+
+   |Mc  | Mark        | spacing combining|                                    |
+   +----+-------------+------------------+------------------------------------+
+   |Me  | Mark        | enclosing        |                                    |
+   +----+-------------+------------------+------------------------------------+
+   |Nd  | Number      | decimal digit    |                                    |
+   +----+-------------+------------------+------------------------------------+
+   |Nl  | Number      | letter           |                                    |
+   +----+-------------+------------------+------------------------------------+
+   |No  | Number      | other            |                                    |
+   +----+-------------+------------------+------------------------------------+
+   |Pc  | Punctuation | connector        |                                    |
+   +----+-------------+------------------+------------------------------------+
+   |Pd  | Punctuation | dash             |                                    |
+   +----+-------------+------------------+------------------------------------+
+   |Ps  | Punctuation | open             |                                    |
+   +----+-------------+------------------+------------------------------------+
+   |Pe  | Punctuation | close            |                                    |
+   +----+-------------+------------------+------------------------------------+
+   |Pi  | Punctuation | initial quote    |                                    |
+   +----+-------------+------------------+------------------------------------+
+   |Pf  | Punctuation | final quote      |                                    |
+   +----+-------------+------------------+------------------------------------+
+   |Po  | Punctuation | other            |                                    |
+   +----+-------------+------------------+------------------------------------+
+   |Sm  | Symbol      | math             |                                    |
+   +----+-------------+------------------+------------------------------------+
+   |Sc  | Symbol      | currency         |                                    |
+   +----+-------------+------------------+------------------------------------+
+   |Sk  | Symbol      | modifier         |                                    |
+   +----+-------------+------------------+------------------------------------+
+   |So  | Symbol      | other            |                                    |
+   +----+-------------+------------------+------------------------------------+
+   |Zs  | Separator   | space            |                                    |
+   +----+-------------+------------------+------------------------------------+
+   |Zl  | Separator   | line             |                                    |
+   +----+-------------+------------------+------------------------------------+
+   |Zp  | Separator   | paragraph        |                                    |
+   +----+-------------+------------------+------------------------------------+
+   |Cc  | Other       | control          |                                    |
+   +----+-------------+------------------+------------------------------------+
+   |Cf  | Other       | format           |                                    |
+   +----+-------------+------------------+------------------------------------+
+   |Cs  | Other       | surrogate        |                                    |
+   +----+-------------+------------------+------------------------------------+
+   |Co  | Other       | private use      |                                    |
+   +----+-------------+------------------+------------------------------------+
+   |Cn  | Other       | not assigned     |                                    |
+   +----+-------------+------------------+------------------------------------+
 
+   The following example program produces code point counts by major category:
+
+   .. literalinclude:: ../includes/unistat.py
+
+   Sample output::
+
+      Counter({'C': 1004868, 'L': 100520, 'S': 5508, 'M': 1498, 'N': 1100, 'P': 598, 'Z': 20})
+
 .. function:: bidirectional(chr)
 
-   Returns the bidirectional category assigned to the character *chr* as
-   string. If no such value is defined, an empty string is returned.
+   Returns the bidirectional class assigned to the character *chr* as a string.
+   If no such value is defined, an empty string is returned.  For example::
 
+      >>> unicodedata.bidirectional('\u0660') # 'A'rabic, 'N'umber
+      'AN'
 
+   The bidirectional class names returned by this function mean the following:
+
+   =====  =========================
+   Class  Description
+   =====  =========================
+   AL     Arabic Letter
+   AN     Arabic Number
+   B      Paragraph Separator
+   BN     Boundary Neutral
+   CS     Common Separator
+   EN     European Number
+   ES     European Separator
+   ET     European Terminator
+   L      Left To Right
+   LRE    Left To Right Embedding
+   LRO    Left To Right Override
+   NSM    Nonspacing Mark
+   ON     Other Neutral
+   PDF    Pop Directional Format
+   R      Right To Left
+   RLE    Right To Left Embedding
+   RLO    Right To Left Override
+   S      Segment Separator
+   WS     White Space
+   =====  =========================
+
+
 .. function:: combining(chr)
 
    Returns the canonical combining class assigned to the character *chr*
@@ -80,21 +219,37 @@
    Returns the east asian width assigned to the character *chr* as
    string.
 
+   ====  ===========
+   Code  Description
+   ====  ===========
+   A     Ambiguous
+   F     Fullwidth
+   H     Halfwidth
+   N     Neutral
+   Na    Narrow
+   W     Wide
+   ====  ===========
 
 .. function:: mirrored(chr)
 
    Returns the mirrored property assigned to the character *chr* as
    integer. Returns ``1`` if the character has been identified as a "mirrored"
-   character in bidirectional text, ``0`` otherwise.
+   character in bidirectional text, ``0`` otherwise.  For example::
 
+      >>> unicodedata.mirrored('>')
+      1
 
+
 .. function:: decomposition(chr)
 
    Returns the character decomposition mapping assigned to the character
    *chr* as string. An empty string is returned in case no such mapping is
-   defined.
+   defined.  For example::
 
+      >>> unicodedata.decomposition('è')
+      '0065 0300'
 
+
 .. function:: normalize(form, unistr)
 
    Return the normal form *form* for the Unicode string *unistr*. Valid values for
@@ -156,6 +311,3 @@
    ValueError: not a decimal
    >>> unicodedata.category('A')  # 'L'etter, 'u'ppercase
    'Lu'
-   >>> unicodedata.bidirectional('\u0660') # 'A'rabic, 'N'umber
-   'AN'
-
Index: Doc/includes/unistat.py
===================================================================
--- Doc/includes/unistat.py	(revision 0)
+++ Doc/includes/unistat.py	(revision 0)
@@ -0,0 +1,9 @@
+import unicodedata
+from collections import Counter
+
+catcount = Counter()
+for i in range(0x110000):
+    cat = unicodedata.category(chr(i))[0]
+    catcount[cat] += 1
+
+print(catcount)

Property changes on: Doc/includes/unistat.py
___________________________________________________________________
Added: svn:keywords
   + Id
Added: svn:eol-style
   + native