# HG changeset patch # Parent 56f71f02206ebe3413caa53b614403617ac544ed Issue #19548: Documentation improvements for “codecs” diff -r 56f71f02206e Doc/library/codecs.rst --- a/Doc/library/codecs.rst Tue Dec 16 18:17:18 2014 -0800 +++ b/Doc/library/codecs.rst Mon Dec 22 11:11:52 2014 +0000 @@ -18,8 +18,10 @@ pair: stackable; streams This module defines base classes for standard Python codecs (encoders and -decoders) and provides access to the internal Python codec registry which -manages the codec and error handling lookup process. +decoders) and provides access to the internal Python codec registry, which +manages the codec and error handling lookup process. Most codecs encode text +to bytes, but there are also codecs that encode text to text, and +bytes to bytes. It defines the following functions: @@ -46,82 +48,62 @@ .. function:: register(search_function) Register a codec search function. Search functions are expected to take one - argument, the encoding name in all lower case letters, and return a - :class:`CodecInfo` object having the following attributes: - - * ``name`` The name of the encoding; - - * ``encode`` The stateless encoding function; - - * ``decode`` The stateless decoding function; - - * ``incrementalencoder`` An incremental encoder class or factory function; - - * ``incrementaldecoder`` An incremental decoder class or factory function; - - * ``streamwriter`` A stream writer class or factory function; - - * ``streamreader`` A stream reader class or factory function. - - The various functions or classes take the following arguments: - - *encode* and *decode*: These must be functions or methods which have the same - interface as the :meth:`~Codec.encode`/:meth:`~Codec.decode` methods of Codec - instances (see :ref:`Codec Interface `). The functions/methods - are expected to work in a stateless mode. 
- - *incrementalencoder* and *incrementaldecoder*: These have to be factory - functions providing the following interface: - - ``factory(errors='strict')`` - - The factory functions must return objects providing the interfaces defined by - the base classes :class:`IncrementalEncoder` and :class:`IncrementalDecoder`, - respectively. Incremental codecs can maintain state. - - *streamreader* and *streamwriter*: These have to be factory functions providing - the following interface: - - ``factory(stream, errors='strict')`` - - The factory functions must return objects providing the interfaces defined by - the base classes :class:`StreamReader` and :class:`StreamWriter`, respectively. - Stream codecs can maintain state. - - Possible values for errors are - - * ``'strict'``: raise an exception in case of an encoding error - * ``'replace'``: replace malformed data with a suitable replacement marker, - such as ``'?'`` or ``'\ufffd'`` - * ``'ignore'``: ignore malformed data and continue without further notice - * ``'xmlcharrefreplace'``: replace with the appropriate XML character - reference (for encoding only) - * ``'backslashreplace'``: replace with backslashed escape sequences (for - encoding only) - * ``'namereplace'``: replace with ``\N{...}`` escape sequences (for - encoding only) - * ``'surrogateescape'``: on decoding, replace with code points in the Unicode - Private Use Area ranging from U+DC80 to U+DCFF. These private code - points will then be turned back into the same bytes when the - ``surrogateescape`` error handler is used when encoding the data. - (See :pep:`383` for more.) - - as well as any other error handling name defined via :func:`register_error`. - - In case a search function cannot find a given encoding, it should return - ``None``. + argument, being the encoding name in all lower case letters, and return a + :class:`CodecInfo` object. In case a search function cannot find + a given encoding, it should return ``None``. 
Search function registration + is not reversible, which can cause problems when a module is reloaded. .. function:: lookup(encoding) Looks up the codec info in the Python codec registry and returns a - :class:`CodecInfo` object as defined above. + :class:`CodecInfo` object as defined below. Encodings are first looked up in the registry's cache. If not found, the list of registered search functions is scanned. If no :class:`CodecInfo` object is found, a :exc:`LookupError` is raised. Otherwise, the :class:`CodecInfo` object is stored in the cache and returned to the caller. + +.. class:: CodecInfo(encode, decode, streamreader=None, streamwriter=None, incrementalencoder=None, incrementaldecoder=None, name=None) + + Codec details returned when looking up the codec registry. The constructor + arguments are stored in attributes of the same name: + + + .. attribute:: name + + The name of the encoding. + + + .. attribute:: encode + decode + + The stateless encoding and decoding functions. These must be + functions or methods which have the same interface as + the :meth:`~Codec.encode` and :meth:`~Codec.decode` methods of Codec + instances (see :ref:`Codec Interface `). + The functions or methods are expected to work in a stateless mode. + + + .. attribute:: incrementalencoder + incrementaldecoder + + Incremental encoder and decoder classes or factory functions. + These have to provide the interface defined by the base classes + :class:`IncrementalEncoder` and :class:`IncrementalDecoder`, + respectively. Incremental codecs can maintain state. + + + .. attribute:: streamwriter + streamreader + + Stream writer and reader classes or factory functions. These have to + provide the interface defined by the base classes + :class:`StreamWriter` and :class:`StreamReader`, respectively. + Stream codecs can maintain state. + + To simplify access to the various codecs, the module provides these additional functions which use :func:`lookup` for the codec lookup: @@ -177,12 +159,12 @@ ..
function:: register_error(name, error_handler) Register the error handling function *error_handler* under the name *name*. - *error_handler* will be called during encoding and decoding in case of an error, - when *name* is specified as the errors parameter. + The *error_handler* argument will be called during encoding and decoding + in case of an error, when *name* is specified as the errors parameter. - For encoding *error_handler* will be called with a :exc:`UnicodeEncodeError` + For encoding, *error_handler* will be called with a :exc:`UnicodeEncodeError` instance, which contains information about the location of the error. The - error handler must either raise this or a different exception or return a + error handler must either raise this or a different exception, or return a tuple with a replacement for the unencodable part of the input and a position where encoding should continue. The replacement may be either :class:`str` or :class:`bytes`. If the replacement is bytes, the encoder will simply copy @@ -192,7 +174,7 @@ relative to the end of the input string. If the resulting position is out of bound an :exc:`IndexError` will be raised. - Decoding and translating works similar, except :exc:`UnicodeDecodeError` or + Decoding and translating work similarly, except :exc:`UnicodeDecodeError` or :exc:`UnicodeTranslateError` will be passed to the handler and that the replacement from the error handler will be put into the output directly. @@ -213,8 +195,7 @@ .. function:: replace_errors(exception) Implements the ``replace`` error handling: malformed data is replaced with a - suitable replacement character such as ``'?'`` in bytestrings and - ``'\ufffd'`` in Unicode strings. + suitable replacement marker such as ``b'?'`` or ``'\ufffd'``. .. function:: ignore_errors(exception) @@ -241,29 +222,23 @@ ..
versionadded:: 3.5 -To simplify working with encoded files or stream, the module also defines these -utility functions: +To simplify working with encoded files and streams, +the module also defines these utility functions: -.. function:: open(filename, mode[, encoding[, errors[, buffering]]]) +.. function:: open(filename, mode='r'[, encoding[, errors[, buffering]]]) Open an encoded file using the given *mode* and return a wrapped version - providing transparent encoding/decoding. The default file mode is ``'r'`` - meaning to open the file in read mode. + providing transparent encoding/decoding. The default file mode is + ``'r'``, meaning to open the file in read mode. .. note:: - The wrapped version's methods will accept and return strings only. Bytes - arguments will be rejected. - - .. note:: - - Files are always opened in binary mode, even if no binary mode was - specified. This is done to avoid data loss due to encodings using 8-bit - values. This means that no automatic conversion of ``b'\n'`` is done - on reading and writing. + Underlying encoded files are always opened in binary mode. + No automatic conversion of ``'\n'`` is done on reading and writing. *encoding* specifies the encoding which is to be used for the file. + Only the codecs whose encoded form is bytes are supported. *errors* may be given to define the error handling. It defaults to ``'strict'`` which causes a :exc:`ValueError` to be raised in case an encoding error occurs. @@ -275,11 +250,15 @@ .. function:: EncodedFile(file, data_encoding, file_encoding=None, errors='strict') Return a wrapped version of file which provides transparent encoding - translation. + translation. The original file is closed when the wrapped version is + closed. - Bytes written to the wrapped file are interpreted according to the given - *data_encoding* and then written to the original file as bytes using the - *file_encoding*. 
+ Data written to the wrapped file is decoded according to the given + *data_encoding* and then encoded to the original file using + *file_encoding*. Bytes read from the original file are decoded + according to *file_encoding*, and the result is encoded + using *data_encoding*. Reading is not supported when the wrapped file is + a text stream. If *file_encoding* is not given, it defaults to *data_encoding*. @@ -291,14 +270,16 @@ .. function:: iterencode(iterator, encoding, errors='strict', **kwargs) Uses an incremental encoder to iteratively encode the input provided by - *iterator*. This function is a :term:`generator`. *errors* (as well as any + *iterator*. This function is a :term:`generator`. + The *errors* argument (as well as any other keyword argument) is passed through to the incremental encoder. .. function:: iterdecode(iterator, encoding, errors='strict', **kwargs) Uses an incremental decoder to iteratively decode the input provided by - *iterator*. This function is a :term:`generator`. *errors* (as well as any + *iterator*. This function is a :term:`generator`. + The *errors* argument (as well as any other keyword argument) is passed through to the incremental decoder. @@ -317,9 +298,10 @@ BOM_UTF32_BE BOM_UTF32_LE - These constants define various encodings of the Unicode byte order mark (BOM) - used in UTF-16 and UTF-32 data streams to indicate the byte order used in the - stream or file and in UTF-8 as a Unicode signature. :const:`BOM_UTF16` is either + These constants define various byte sequences, + being Unicode byte order marks (BOMs) for several encodings. They are + used in UTF-16 and UTF-32 data streams to indicate the byte order used, + and in UTF-8 as a Unicode signature. 
:const:`BOM_UTF16` is either :const:`BOM_UTF16_BE` or :const:`BOM_UTF16_LE` depending on the platform's native byte order, :const:`BOM` is an alias for :const:`BOM_UTF16`, :const:`BOM_LE` for :const:`BOM_UTF16_LE` and :const:`BOM_BE` for @@ -344,8 +326,8 @@ The :class:`Codec` class defines the interface for stateless encoders/decoders. -To simplify and standardize error handling, the :meth:`~Codec.encode` and -:meth:`~Codec.decode` methods may implement different error handling schemes by +To simplify and standardize error handling, +codecs may implement different error handling schemes by providing the *errors* string argument. The following string values are defined and implemented by all standard Python codecs: @@ -357,13 +339,13 @@ | ``'strict'`` | Raise :exc:`UnicodeError` (or a subclass); | | | this is the default. | +-------------------------+-----------------------------------------------+ -| ``'ignore'`` | Ignore the character and continue with the | -| | next. | +| ``'ignore'`` | Ignore the malformed data and continue | +| | without further notice. | +-------------------------+-----------------------------------------------+ | ``'replace'`` | Replace with a suitable replacement | -| | character; Python will use the official | +| | marker; Python will use the official | | | U+FFFD REPLACEMENT CHARACTER for the built-in | -| | Unicode codecs on decoding and '?' on | +| | Unicode codecs on decoding, and '?' on | | | encoding. | +-------------------------+-----------------------------------------------+ | ``'xmlcharrefreplace'`` | Replace with the appropriate XML character | @@ -375,8 +357,12 @@ | ``'namereplace'`` | Replace with ``\N{...}`` escape sequences | | | (only for encoding). | +-------------------------+-----------------------------------------------+ -| ``'surrogateescape'`` | Replace byte with surrogate U+DCxx, as defined| -| | in :pep:`383`. 
| +| ``'surrogateescape'`` | On decoding, replace byte with individual | +| | surrogate code ranging from U+DC80 to U+DCFF. | +| | This code will then be turned back into the | +| | same byte when the ``surrogateescape`` error | +| | handler is used when encoding the data. (See | +| | :pep:`383` for more.) | +-------------------------+-----------------------------------------------+ In addition, the following error handlers are specific to Unicode encoding @@ -471,7 +457,7 @@ define in order to be compatible with the Python codec registry. -.. class:: IncrementalEncoder([errors]) +.. class:: IncrementalEncoder(errors='strict') Constructor for an :class:`IncrementalEncoder` instance. @@ -513,7 +499,8 @@ .. method:: reset() Reset the encoder to the initial state. The output is discarded: call - ``.encode('', final=True)`` to reset the encoder and to get the output. + ``.encode(object, final=True)``, passing an empty byte or text string + if necessary, to reset the encoder and to get the output. .. method:: IncrementalEncoder.getstate() @@ -541,7 +528,7 @@ define in order to be compatible with the Python codec registry. -.. class:: IncrementalDecoder([errors]) +.. class:: IncrementalDecoder(errors='strict') Constructor for an :class:`IncrementalDecoder` instance. @@ -619,7 +606,7 @@ compatible with the Python codec registry. -.. class:: StreamWriter(stream[, errors]) +.. class:: StreamWriter(stream, errors='strict') Constructor for a :class:`StreamWriter` instance. @@ -627,7 +614,7 @@ additional keyword arguments, but only the ones defined here are used by the Python codec registry. - *stream* must be a file-like object open for writing binary data. + *stream* must be a file-like object open for writing. The :class:`StreamWriter` may implement different error handling schemes by providing the *errors* keyword argument. These parameters are predefined: @@ -660,7 +647,7 @@ .. 
method:: writelines(list) Writes the concatenated list of strings to the stream (possibly by reusing - the :meth:`write` method). + the :meth:`write` method). Does not work on byte encoders. .. method:: reset() @@ -686,7 +673,7 @@ compatible with the Python codec registry. -.. class:: StreamReader(stream[, errors]) +.. class:: StreamReader(stream, errors='strict') Constructor for a :class:`StreamReader` instance. @@ -694,7 +681,7 @@ additional keyword arguments, but only the ones defined here are used by the Python codec registry. - *stream* must be a file-like object open for reading (binary) data. + *stream* must be a file-like object open for reading. The :class:`StreamReader` may implement different error handling schemes by providing the *errors* keyword argument. These parameters are defined: @@ -717,17 +704,20 @@ Decodes data from the stream and returns the resulting object. - *chars* indicates the number of characters to read from the - stream. :func:`read` will never return more than *chars* characters, but - it might return less, if there are not enough characters available. + The *chars* argument indicates the number of decoded + characters or bytes to return. The :func:`read` method will + never return more data than requested, but it might return less, + if there is not enough available. - *size* indicates the approximate maximum number of bytes to read from the - stream for decoding purposes. The decoder can modify this setting as + The *size* argument indicates the approximate maximum + number of encoded bytes or characters to read + for decoding. The decoder can modify this setting as appropriate. The default value -1 indicates to read and decode as much as - possible. *size* is intended to prevent having to decode huge files in - one step. + possible. This parameter is intended to + prevent having to decode huge files in one step. 
- *firstline* indicates that it would be sufficient to only return the first + The *firstline* flag indicates that + it would be sufficient to only return the first line, if there are decoding errors on later lines. The method should use a greedy read strategy meaning that it should read @@ -770,8 +760,7 @@ In addition to the above methods, the :class:`StreamReader` must also inherit all other methods and attributes from the underlying stream. -The next two base classes are included for convenience. They are not needed by -the codec registry, but may provide useful in practice. +The next two concrete classes are included for convenience. .. _stream-reader-writer: @@ -803,7 +792,7 @@ StreamRecoder Objects ^^^^^^^^^^^^^^^^^^^^^ -The :class:`StreamRecoder` provide a frontend - backend view of encoding data +The :class:`StreamRecoder` translates data from one encoding to another, which is sometimes useful when dealing with different encoding environments. The design is such that one can use the factory functions returned by the @@ -813,22 +802,20 @@ .. class:: StreamRecoder(stream, encode, decode, Reader, Writer, errors) Creates a :class:`StreamRecoder` instance which implements a two-way conversion: - *encode* and *decode* work on the frontend (the input to :meth:`read` and output - of :meth:`write`) while *Reader* and *Writer* work on the backend (reading and - writing to the stream). + *encode* and *decode* work on the frontend — the data visible to + code calling :meth:`read` and :meth:`write`, while *Reader* and *Writer* + work on the backend — the data in *stream*. - You can use these objects to do transparent direct recodings from e.g. Latin-1 + You can use these objects to do transparent transcodings from e.g. Latin-1 to UTF-8 and back. - *stream* must be a file-like object. + The *stream* argument must be a file-like object. - *encode*, *decode* must adhere to the :class:`Codec` interface. 
*Reader*, + The *encode* and *decode* arguments must + adhere to the :class:`Codec` interface. *Reader* and *Writer* must be factory functions or classes providing objects of the :class:`StreamReader` and :class:`StreamWriter` interface respectively. - *encode* and *decode* are needed for the frontend translation, *Reader* and - *Writer* for the backend translation. - Error handling is done in the same way as defined for the stream readers and writers. @@ -1241,23 +1228,26 @@ +--------------------+---------+---------------------------+ | punycode | | Implements :rfc:`3492` | +--------------------+---------+---------------------------+ -| raw_unicode_escape | | Produce a string that is | -| | | suitable as raw Unicode | -| | | literal in Python source | -| | | code | +| raw_unicode_escape | | Latin-1 encoding with | +| | | ``\uXXXX`` and | +| | | ``\UXXXXXXXX`` for other | +| | | code points. Backslashes | +| | | are not escaped, so may | +| | | be decoded differently. | +| | | It is used in the Python | +| | | pickle protocol. | +--------------------+---------+---------------------------+ | undefined | | Raise an exception for | -| | | all conversions. Can be | -| | | used as the system | -| | | encoding if no automatic | -| | | coercion between byte and | -| | | Unicode strings is | -| | | desired. | +| | | all conversions, except | +| | | decoding ``b''`` | +--------------------+---------+---------------------------+ -| unicode_escape | | Produce a string that is | -| | | suitable as Unicode | -| | | literal in Python source | -| | | code | +| unicode_escape | | Encoding suitable as the | +| | | contents of a Unicode | +| | | literal in ASCII-encoded | +| | | Python source code, | +| | | except that quotes are | +| | | not escaped. Decodes from | +| | | Latin-1 source code. 
| +--------------------+---------+---------------------------+ | unicode_internal | | Return the internal | | | | representation of the | @@ -1272,7 +1262,7 @@ ^^^^^^^^^^^^^^^^^ The following codecs provide binary transforms: :term:`bytes-like object` -to :class:`bytes` mappings. +to :class:`bytes` mappings. They are not supported by :meth:`bytes.decode`. .. tabularcolumns:: |l|L|L|L| @@ -1327,7 +1317,7 @@ ^^^^^^^^^^^^^^^ The following codec provides a text transform: a :class:`str` to :class:`str` -mapping. +mapping. It is not supported by :meth:`str.encode`. .. tabularcolumns:: |l|l|L| diff -r 56f71f02206e Lib/codecs.py --- a/Lib/codecs.py Tue Dec 16 18:17:18 2014 -0800 +++ b/Lib/codecs.py Mon Dec 22 11:11:52 2014 +0000 @@ -341,8 +341,7 @@ """ Creates a StreamWriter instance. - stream must be a file-like object open for writing - (binary) data. + stream must be a file-like object open for writing. The StreamWriter may use different error handling schemes by providing the errors keyword argument. These @@ -416,8 +415,7 @@ """ Creates a StreamReader instance. - stream must be a file-like object open for reading - (binary) data. + stream must be a file-like object open for reading. The StreamReader may use different error handling schemes by providing the errors keyword argument. These @@ -445,13 +443,12 @@ """ Decodes data from the stream self.stream and returns the resulting object. - chars indicates the number of characters to read from the - stream. read() will never return more than chars - characters, but it might return less, if there are not enough - characters available. + chars indicates the number of decoded characters or bytes to + return. read() will never return more data than requested, + but it might return less, if there is not enough available. - size indicates the approximate maximum number of bytes to - read from the stream for decoding purposes. 
The decoder + size indicates the approximate maximum number of encoded + bytes or characters to read for decoding. The decoder can modify this setting as appropriate. The default value -1 indicates to read and decode as much as possible. size is intended to prevent having to decode huge files in one @@ -462,7 +459,7 @@ will be returned, the rest of the input will be kept until the next call to read(). - The method should use a greedy read strategy meaning that + The method should use a greedy read strategy, meaning that it should read as much data as is allowed within the definition of the encoding and the given size, e.g. if optional encoding endings or state markers are available @@ -597,7 +594,7 @@ def readlines(self, sizehint=None, keepends=True): """ Read all lines available on the input stream - and return them as list of lines. + and return them as a list. Line breaks are implemented using the codec's decoder method and are included in the list entries. @@ -745,19 +742,18 @@ class StreamRecoder: - """ StreamRecoder instances provide a frontend - backend - view of encoding data. + """ StreamRecoder instances translate data from one encoding to another. They use the complete set of APIs returned by the codecs.lookup() function to implement their task. - Data written to the stream is first decoded into an - intermediate format (which is dependent on the given codec - combination) and then written to the stream using an instance - of the provided Writer class. + Data written to the StreamRecoder is first decoded into an + intermediate format (depending on the "decode" codec) and then + written to the underlying stream using an instance of the provided + Writer class. - In the other direction, data is read from the underlying stream using a - Reader instance and then return encoded data to the caller. + In the other direction, data is read from the underlying stream using + a Reader instance and then encoded and returned to the caller.
""" # Optional attributes set by the file wrappers below @@ -769,22 +765,17 @@ """ Creates a StreamRecoder instance which implements a two-way conversion: encode and decode work on the frontend (the - input to .read() and output of .write()) while - Reader and Writer work on the backend (reading and - writing to the stream). + data visible to .read() and .write()) while Reader and Writer + work on the backend (the data in stream). - You can use these objects to do transparent direct - recodings from e.g. latin-1 to utf-8 and back. + You can use these objects to do transparent + transcodings from e.g. latin-1 to utf-8 and back. stream must be a file-like object. - encode, decode must adhere to the Codec interface, Reader, + encode and decode must adhere to the Codec interface; Reader and Writer must be factory functions or classes providing the - StreamReader, StreamWriter interface resp. - - encode and decode are needed for the frontend translation, - Reader and Writer for the backend translation. Unicode is - used as intermediate encoding. + StreamReader and StreamWriter interfaces resp. Error handling is done in the same way as defined for the StreamWriter/Readers. @@ -859,20 +850,17 @@ ### Shortcuts -def open(filename, mode='rb', encoding=None, errors='strict', buffering=1): +def open(filename, mode='r', encoding=None, errors='strict', buffering=1): """ Open an encoded file using the given mode and return a wrapped version providing transparent encoding/decoding. Note: The wrapped version will only accept the object format defined by the codecs, i.e. Unicode objects for most builtin - codecs. Output is also codec dependent and will usually be - Unicode as well. + codecs. - Files are always opened in binary mode, even if no binary mode - was specified. This is done to avoid data loss due to encodings - using 8-bit values. The default file mode is 'rb' meaning to - open the file in binary read mode. + Underlying encoded files are always opened in binary mode. 
+ The default file mode is 'r', meaning to open the file in read mode. encoding specifies the encoding which is to be used for the file. @@ -908,13 +896,13 @@ """ Return a wrapped version of file which provides transparent encoding translation. - Strings written to the wrapped file are interpreted according - to the given data_encoding and then written to the original - file as string using file_encoding. The intermediate encoding + Data written to the wrapped file is decoded according + to the given data_encoding and then encoded to the underlying + file using file_encoding. The intermediate data type will usually be Unicode but depends on the specified codecs. - Strings are read from the file using file_encoding and then - passed back to the caller as string using data_encoding. + Bytes read from the file are decoded using file_encoding and then + passed back to the caller encoded using data_encoding. If file_encoding is not given, it defaults to data_encoding. diff -r 56f71f02206e Modules/_codecsmodule.c --- a/Modules/_codecsmodule.c Tue Dec 16 18:17:18 2014 -0800 +++ b/Modules/_codecsmodule.c Mon Dec 22 11:11:52 2014 +0000 @@ -54,9 +54,9 @@ "register(search_function)\n\ \n\ Register a codec search function. Search functions are expected to take\n\ -one argument, the encoding name in all lower case letters, and return\n\ -a tuple of functions (encoder, decoder, stream_reader, stream_writer)\n\ -(or a CodecInfo object)."); +one argument, the encoding name in all lower case letters, and either\n\ +return None, or a tuple of functions (encoder, decoder, stream_reader,\n\ +stream_writer) (or a CodecInfo object)."); static PyObject *codec_register(PyObject *self, PyObject *search_function)