Index: Doc/howto/unicode.rst
===================================================================
--- Doc/howto/unicode.rst	(revision 86507)
+++ Doc/howto/unicode.rst	(working copy)
@@ -4,14 +4,12 @@
   Unicode HOWTO
 *****************

-:Release: 1.11
+:Release: 1.12

-This HOWTO discusses Python 2.x's support for Unicode, and explains
+This HOWTO discusses Python support for Unicode, and explains
 various problems that people commonly encounter when trying to work
-with Unicode. (This HOWTO has not yet been updated to cover the 3.x
-versions of Python.)
+with Unicode.

-
 Introduction to Unicode
 =======================

@@ -65,7 +63,7 @@
 goal was to have Unicode contain the alphabets for every single human language.
 It turns out that even 16 bits isn't enough to meet that goal, and the modern
 Unicode specification uses a wider range of codes, 0-1,114,111 (0x10ffff in
-base-16).
+base 16).

 There's a related ISO standard, ISO 10646. Unicode and ISO 10646 were
 originally separate efforts, but the specifications were merged with the 1.1
@@ -90,7 +88,7 @@
 The Unicode standard describes how characters are represented by **code
 points**. A code point is an integer value, usually denoted in base 16. In the
 standard, a code point is written using the notation U+12ca to mean the
-character with value 0x12ca (4810 decimal). The Unicode standard contains a lot
+character with value 0x12ca (4,810 decimal). The Unicode standard contains a lot
 of tables listing characters and their corresponding code points::

     0061 'a'; LATIN SMALL LETTER A
@@ -117,10 +115,10 @@
 ---------

 To summarize the previous section: a Unicode string is a sequence of code
-points, which are numbers from 0 to 0x10ffff. This sequence needs to be
-represented as a set of bytes (meaning, values from 0-255) in memory. The rules
-for translating a Unicode string into a sequence of bytes are called an
-**encoding**.
+points, which are numbers from 0 to 0x10ffff (1,114,111 decimal). This
+sequence needs to be represented as a set of bytes (meaning, values
+from 0-255) in memory. The rules for translating a Unicode string
+into a sequence of bytes are called an **encoding**.

 The first encoding you might think of is an array of 32-bit integers. In this
 representation, the string "Python" would look like this::
@@ -265,7 +263,7 @@
     UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 0:
                         unexpected code byte
     >>> b'\x80abc'.decode("utf-8", "replace")
-    '\ufffdabc'
+    '�abc'
     >>> b'\x80abc'.decode("utf-8", "ignore")
     'abc'

@@ -281,10 +279,10 @@
 built-in :func:`ord` function that takes a one-character Unicode string and
 returns the code point value::

-    >>> chr(40960)
-    '\ua000'
-    >>> ord('\ua000')
-    40960
+    >>> chr(57344)
+    '\ue000'
+    >>> ord('\ue000')
+    57344

 Converting to Bytes
 -------------------
@@ -326,7 +324,8 @@

 In Python source code, specific Unicode code points can be written using the
 ``\u`` escape sequence, which is followed by four hex digits giving the code
-point. The ``\U`` escape sequence is similar, but expects 8 hex digits, not 4::
+point. The ``\U`` escape sequence is similar, but expects eight base 16
+digits, not four::

     >>> s = "a\xac\u1234\u20ac\U00008000"
               ^^^^ two-digit hex escape
@@ -465,18 +464,17 @@

 Reading Unicode from a file is therefore simple::

-    f = open('unicode.rst', encoding='utf-8')
-    for line in f:
-        print(repr(line))
+    with open('unicode.rst', encoding='utf-8') as f:
+        for line in f:
+            print(repr(line))

 It's also possible to open files in update mode, allowing both reading and
 writing::

-    f = open('test', encoding='utf-8', mode='w+')
-    f.write('\u4500 blah blah blah\n')
-    f.seek(0)
-    print(repr(f.readline()[:1]))
-    f.close()
+    with open('test', encoding='utf-8', mode='w+') as f:
+        f.write('\u4500 blah blah blah\n')
+        f.seek(0)
+        print(repr(f.readline()[:1]))

 The Unicode character U+FEFF is used as a byte-order mark (BOM), and is often
 written as the first character of a file in order to assist with autodetection
@@ -513,14 +511,13 @@
 automatically converted to the right encoding for you::

     filename = 'filename\u4500abc'
-    f = open(filename, 'w')
-    f.write('blah\n')
-    f.close()
+    with open(filename, 'w') as f:
+        f.write('blah\n')

 Functions in the :mod:`os` module such as :func:`os.stat` will also accept Unicode
 filenames.

-:func:`os.listdir`, which returns filenames, raises an issue: should it return
+Function :func:`os.listdir`, which returns filenames, raises an issue: should it return
 the Unicode version of filenames, or should it return byte strings containing
 the encoded versions? :func:`os.listdir` will do both, depending on whether you
 provided the directory path as a byte string or a Unicode string. If you pass a
@@ -569,14 +566,6 @@
 two different kinds of strings. There is no automatic encoding or decoding if
 you do e.g. ``str + bytes``, a :exc:`TypeError` is raised for this expression.

-It's easy to miss such problems if you only test your software with data that
-doesn't contain any accents; everything will seem to work, but there's actually
-a bug in your program waiting for the first user who attempts to use characters
-> 127. A second tip, therefore, is:
-
-    Include characters > 127 and, even better, characters > 255 in your test
-    data.
-
 When using data coming from a web browser or some other untrusted source, a
 common technique is to check for illegal characters in a string before using the
 string in a generated command line or storing it in a database. If you're doing
@@ -594,8 +583,8 @@
         if '/' in filename:
             raise ValueError("'/' not allowed in filenames")
         unicode_name = filename.decode(encoding)
-        f = open(unicode_name, 'r')
-        # ... return contents of file ...
+        with open(unicode_name, 'r') as f:
+            # ... return contents of file ...

 However, if an attacker could specify the ``'base64'`` encoding, they could pass
 ``'L2V0Yy9wYXNzd2Q='``, which is the base-64 encoded form of the string
@@ -610,7 +599,7 @@
 Applications in Python" are available at
 and discuss questions of character encodings as well as how to internationalize
-and localize an application.
+and localize an application. These slides cover Python 2.x only.


 Revision History and Acknowledgements