# HG changeset patch # Parent 56f71f02206ebe3413caa53b614403617ac544ed Issue #19548: Documentation improvements for “codecs” diff -r 56f71f02206e Doc/library/codecs.rst --- a/Doc/library/codecs.rst Tue Dec 16 18:17:18 2014 -0800 +++ b/Doc/library/codecs.rst Thu Dec 18 02:10:33 2014 +0000 @@ -18,8 +18,10 @@ pair: stackable; streams This module defines base classes for standard Python codecs (encoders and -decoders) and provides access to the internal Python codec registry which -manages the codec and error handling lookup process. +decoders) and provides access to the internal Python codec registry, which +manages the codec and error handling lookup process. Most codecs encode text +to bytes. There are also special codecs that encode text to text, or +bytes to bytes. It defines the following functions: @@ -47,52 +49,76 @@ Register a codec search function. Search functions are expected to take one argument, the encoding name in all lower case letters, and return a - :class:`CodecInfo` object having the following attributes: + :class:`CodecInfo` object. In case a search function cannot find + a given encoding, it should return ``None``. Search function registration + is not reversible, which doesn't play well with module reloading. - * ``name`` The name of the encoding; - * ``encode`` The stateless encoding function; +.. function:: lookup(encoding) - * ``decode`` The stateless decoding function; + Looks up the codec info in the Python codec registry and returns a + :class:`CodecInfo` object as defined below. - * ``incrementalencoder`` An incremental encoder class or factory function; + Encodings are first looked up in the registry's cache. If not found, the list of + registered search functions is scanned. If no :class:`CodecInfo` object is + found, a :exc:`LookupError` is raised. Otherwise, the :class:`CodecInfo` object + is stored in the cache and returned to the caller. 
- * ``incrementaldecoder`` An incremental decoder class or factory function; - * ``streamwriter`` A stream writer class or factory function; +.. class:: CodecInfo(encode, decode, streamreader=None, streamwriter=None, incrementalencoder=None, incrementaldecoder=None, name=None) - * ``streamreader`` A stream reader class or factory function. + Codec details when looking up the codec registry. The constructor + arguments are stored in attributes of the same name: - The various functions or classes take the following arguments: - *encode* and *decode*: These must be functions or methods which have the same - interface as the :meth:`~Codec.encode`/:meth:`~Codec.decode` methods of Codec - instances (see :ref:`Codec Interface <codec-objects>`). The functions/methods - are expected to work in a stateless mode. + .. data:: name - *incrementalencoder* and *incrementaldecoder*: These have to be factory - functions providing the following interface: + The name of the encoding. - ``factory(errors='strict')`` - The factory functions must return objects providing the interfaces defined by - the base classes :class:`IncrementalEncoder` and :class:`IncrementalDecoder`, - respectively. Incremental codecs can maintain state. + .. data:: encode + decode - *streamreader* and *streamwriter*: These have to be factory functions providing - the following interface: + The stateless encoding and decoding functions. These must be + functions or methods which have the same interface as + the :meth:`~Codec.encode` and :meth:`~Codec.decode` methods of Codec + instances (see :ref:`Codec Interface <codec-objects>`). + The functions or methods are expected to work in a stateless mode. - ``factory(stream, errors='strict')`` - The factory functions must return objects providing the interfaces defined by - the base classes :class:`StreamReader` and :class:`StreamWriter`, respectively. - Stream codecs can maintain state. + .. 
data:: incrementalencoder + incrementaldecoder + + Incremental encoder and decoder classes or factory functions. + These have to accept an *errors* argument that defaults to *strict*: + + ``factory(errors='strict')`` + + The factory functions must return objects providing the interfaces + defined by the base classes :class:`IncrementalEncoder` and + :class:`IncrementalDecoder`, respectively. Incremental codecs can + maintain state. + + + .. data:: streamwriter + streamreader + + Stream reader and writer classes or factory functions. These have to + provide the following interface (also defaulting to the *strict* + error handler): + + ``factory(stream, errors='strict')`` + + The factory functions must return objects providing the interfaces + defined by the base classes :class:`StreamReader` and + :class:`StreamWriter`, respectively. Stream codecs can maintain state. + Possible values for errors are * ``'strict'``: raise an exception in case of an encoding error * ``'replace'``: replace malformed data with a suitable replacement marker, - such as ``'?'`` or ``'\ufffd'`` + such as ``b'?'`` or ``'\ufffd'`` * ``'ignore'``: ignore malformed data and continue without further notice * ``'xmlcharrefreplace'``: replace with the appropriate XML character reference (for encoding only) @@ -108,20 +134,6 @@ as well as any other error handling name defined via :func:`register_error`. - In case a search function cannot find a given encoding, it should return - ``None``. - - -.. function:: lookup(encoding) - - Looks up the codec info in the Python codec registry and returns a - :class:`CodecInfo` object as defined above. - - Encodings are first looked up in the registry's cache. If not found, the list of - registered search functions is scanned. If no :class:`CodecInfo` object is - found, a :exc:`LookupError` is raised. Otherwise, the :class:`CodecInfo` object - is stored in the cache and returned to the caller. 
- To simplify access to the various codecs, the module provides these additional functions which use :func:`lookup` for the codec lookup: @@ -213,8 +225,7 @@ .. function:: replace_errors(exception) Implements the ``replace`` error handling: malformed data is replaced with a - suitable replacement character such as ``'?'`` in bytestrings and - ``'\ufffd'`` in Unicode strings. + suitable replacement marker such as ``b'?'`` or ``'\ufffd'``. .. function:: ignore_errors(exception) @@ -241,29 +252,23 @@ .. versionadded:: 3.5 -To simplify working with encoded files or stream, the module also defines these -utility functions: +To simplify working with encoded files or streams, +the module also defines these utility functions: -.. function:: open(filename, mode[, encoding[, errors[, buffering]]]) +.. function:: open(filename, mode='r'[, encoding[, errors[, buffering]]]) Open an encoded file using the given *mode* and return a wrapped version - providing transparent encoding/decoding. The default file mode is ``'r'`` - meaning to open the file in read mode. + providing transparent encoding/decoding. The default file mode is + ``'r'``, meaning to open the file in read mode. .. note:: - The wrapped version's methods will accept and return strings only. Bytes - arguments will be rejected. - - .. note:: - - Files are always opened in binary mode, even if no binary mode was - specified. This is done to avoid data loss due to encodings using 8-bit - values. This means that no automatic conversion of ``b'\n'`` is done - on reading and writing. + Underlying encoded files are always opened in binary mode. + No automatic conversion of ``b'\n'`` is done on reading and writing. *encoding* specifies the encoding which is to be used for the file. + Only the codecs whose encoded form is bytes are supported. *errors* may be given to define the error handling. It defaults to ``'strict'`` which causes a :exc:`ValueError` to be raised in case an encoding error occurs. 
@@ -275,11 +280,13 @@ .. function:: EncodedFile(file, data_encoding, file_encoding=None, errors='strict') Return a wrapped version of file which provides transparent encoding - translation. + translation. The original file is closed when the wrapped version is + closed. - Bytes written to the wrapped file are interpreted according to the given - *data_encoding* and then written to the original file as bytes using the - *file_encoding*. + Data written to the wrapped file is decoded according to the given + *data_encoding* and then encoded to the original file using + *file_encoding*. Data read from the file is first decoded according to + *file_encoding* and then encoded with *data_encoding*. If *file_encoding* is not given, it defaults to *data_encoding*. @@ -317,7 +324,8 @@ BOM_UTF32_BE BOM_UTF32_LE - These constants define various encodings of the Unicode byte order mark (BOM) + These constants define various byte sequences being + Unicode byte order marks (BOMs) for several encodings. They are used in UTF-16 and UTF-32 data streams to indicate the byte order used in the stream or file and in UTF-8 as a Unicode signature. :const:`BOM_UTF16` is either :const:`BOM_UTF16_BE` or :const:`BOM_UTF16_LE` depending on the platform's @@ -513,7 +521,8 @@ .. method:: reset() Reset the encoder to the initial state. The output is discarded: call - ``.encode('', final=True)`` to reset the encoder and to get the output. + ``.encode(object, final=True)``, passing an empty byte or text string + if necessary, to reset the encoder and to get the output. .. method:: IncrementalEncoder.getstate() @@ -627,7 +636,7 @@ additional keyword arguments, but only the ones defined here are used by the Python codec registry. - *stream* must be a file-like object open for writing binary data. + *stream* must be a file-like object open for writing. The :class:`StreamWriter` may implement different error handling schemes by providing the *errors* keyword argument. 
These parameters are predefined: @@ -660,7 +669,7 @@ .. method:: writelines(list) Writes the concatenated list of strings to the stream (possibly by reusing - the :meth:`write` method). + the :meth:`write` method). Does not work on byte encoders. .. method:: reset() @@ -694,7 +703,7 @@ additional keyword arguments, but only the ones defined here are used by the Python codec registry. - *stream* must be a file-like object open for reading (binary) data. + *stream* must be a file-like object open for reading. The :class:`StreamReader` may implement different error handling schemes by providing the *errors* keyword argument. These parameters are defined: @@ -717,17 +726,20 @@ Decodes data from the stream and returns the resulting object. - *chars* indicates the number of characters to read from the - stream. :func:`read` will never return more than *chars* characters, but - it might return less, if there are not enough characters available. + The *chars* argument indicates the number of decoded + characters or bytes to return. The :func:`read` method will + never return more data than requested, but it might return less, + if there is not enough available. - *size* indicates the approximate maximum number of bytes to read from the - stream for decoding purposes. The decoder can modify this setting as + The *size* argument indicates the approximate maximum + number of bytes or characters to read + for decoding purposes. The decoder can modify this setting as appropriate. The default value -1 indicates to read and decode as much as - possible. *size* is intended to prevent having to decode huge files in - one step. + possible. This parameter is intended to + prevent having to decode huge files in one step. - *firstline* indicates that it would be sufficient to only return the first + The *firstline* flag indicates that + it would be sufficient to only return the first line, if there are decoding errors on later lines. 
The method should use a greedy read strategy meaning that it should read @@ -770,8 +782,7 @@ In addition to the above methods, the :class:`StreamReader` must also inherit all other methods and attributes from the underlying stream. -The next two base classes are included for convenience. They are not needed by -the codec registry, but may provide useful in practice. +The next two concrete classes are included for convenience. .. _stream-reader-writer: @@ -803,7 +814,7 @@ StreamRecoder Objects ^^^^^^^^^^^^^^^^^^^^^ -The :class:`StreamRecoder` provide a frontend - backend view of encoding data +The :class:`StreamRecoder` translates data from one encoding to another, which is sometimes useful when dealing with different encoding environments. The design is such that one can use the factory functions returned by the @@ -815,19 +826,20 @@ Creates a :class:`StreamRecoder` instance which implements a two-way conversion: *encode* and *decode* work on the frontend (the input to :meth:`read` and output of :meth:`write`) while *Reader* and *Writer* work on the backend (reading and - writing to the stream). + writing to the underlying stream). - You can use these objects to do transparent direct recodings from e.g. Latin-1 + You can use these objects to do transparent transcodings from e.g. Latin-1 to UTF-8 and back. - *stream* must be a file-like object. + The *stream* argument must be a file-like object. - *encode*, *decode* must adhere to the :class:`Codec` interface. *Reader*, + The *encode* and *decode* arguments must + adhere to the :class:`Codec` interface. *Reader* and *Writer* must be factory functions or classes providing objects of the :class:`StreamReader` and :class:`StreamWriter` interface respectively. - *encode* and *decode* are needed for the frontend translation, *Reader* and - *Writer* for the backend translation. + The *encode* and *decode* arguments are needed for the + frontend translation, *Reader* and *Writer* for the backend translation. 
Error handling is done in the same way as defined for the stream readers and writers. @@ -1241,23 +1253,19 @@ +--------------------+---------+---------------------------+ | punycode | | Implements :rfc:`3492` | +--------------------+---------+---------------------------+ -| raw_unicode_escape | | Produce a string that is | -| | | suitable as raw Unicode | -| | | literal in Python source | -| | | code | +| raw_unicode_escape | | Produce a string that uses | +| | | Unicode escapes to encode | +| | | non-Latin-1 code points. | +| | | It is used in the Python pickle protocol. | +--------------------+---------+---------------------------+ | undefined | | Raise an exception for | -| | | all conversions. Can be | -| | | used as the system | -| | | encoding if no automatic | -| | | coercion between byte and | -| | | Unicode strings is | -| | | desired. | +| | | all conversions | +--------------------+---------+---------------------------+ | unicode_escape | | Produce a string that is | -| | | suitable as Unicode | -| | | literal in Python source | -| | | code | +| | | suitable as a Unicode | +| | | literal in ASCII-encoded | +| | | Python source code, except that quote | +| | | marks are not escaped | +--------------------+---------+---------------------------+ | unicode_internal | | Return the internal | | | | representation of the | @@ -1272,7 +1280,7 @@ ^^^^^^^^^^^^^^^^^ The following codecs provide binary transforms: :term:`bytes-like object` -to :class:`bytes` mappings. +to :class:`bytes` mappings. They are not supported by :meth:`bytes.decode`. .. tabularcolumns:: |l|L|L|L| @@ -1327,7 +1335,7 @@ ^^^^^^^^^^^^^^^ The following codec provides a text transform: a :class:`str` to :class:`str` -mapping. +mapping. It is not supported by :meth:`str.encode`. .. tabularcolumns:: |l|l|L| diff -r 56f71f02206e Lib/codecs.py --- a/Lib/codecs.py Tue Dec 16 18:17:18 2014 -0800 +++ b/Lib/codecs.py Thu Dec 18 02:10:33 2014 +0000 @@ -341,8 +341,7 @@ """ Creates a StreamWriter instance. 
- stream must be a file-like object open for writing - (binary) data. + stream must be a file-like object open for writing. The StreamWriter may use different error handling schemes by providing the errors keyword argument. These @@ -416,8 +415,7 @@ """ Creates a StreamReader instance. - stream must be a file-like object open for reading - (binary) data. + stream must be a file-like object open for reading. The StreamReader may use different error handling schemes by providing the errors keyword argument. These @@ -445,13 +443,12 @@ """ Decodes data from the stream self.stream and returns the resulting object. - chars indicates the number of characters to read from the - stream. read() will never return more than chars - characters, but it might return less, if there are not enough - characters available. + chars indicates the number of decoded characters or bytes to + return. read() will never return more data than requested, + but it might return less, if there is not enough available. - size indicates the approximate maximum number of bytes to - read from the stream for decoding purposes. The decoder + size indicates the approximate maximum number of bytes or + characters to read for decoding purposes. The decoder can modify this setting as appropriate. The default value -1 indicates to read and decode as much as possible. size is intended to prevent having to decode huge files in one @@ -462,7 +459,7 @@ will be returned, the rest of the input will be kept until the next call to read(). - The method should use a greedy read strategy meaning that + The method should use a greedy read strategy, meaning that it should read as much data as is allowed within the definition of the encoding and the given size, e.g. if optional encoding endings or state markers are available @@ -597,7 +594,7 @@ def readlines(self, sizehint=None, keepends=True): """ Read all lines available on the input stream - and return them as list of lines. + and return them as a list. 
Line breaks are implemented using the codec's decoder method and are included in the list entries. @@ -745,19 +742,18 @@ class StreamRecoder: - """ StreamRecoder instances provide a frontend - backend - view of encoding data. + """ StreamRecoder instances translate data from one encoding to another. They use the complete set of APIs returned by the codecs.lookup() function to implement their task. - Data written to the stream is first decoded into an - intermediate format (which is dependent on the given codec - combination) and then written to the stream using an instance + Data written to the StreamRecoder instance is first decoded into an + intermediate format (depending on the "decode" codec) + and then written to the underlying stream using an instance of the provided Writer class. - In the other direction, data is read from the stream using a - Reader instance and then return encoded data to the caller. + In the other direction, data is read from the underlying stream using + a Reader instance and then encoded and returned to the caller. """ # Optional attributes set by the file wrappers below @@ -771,20 +767,19 @@ conversion: encode and decode work on the frontend (the input to .read() and output of .write()) while Reader and Writer work on the backend (reading and - writing to the stream). + writing to the underlying stream). - You can use these objects to do transparent direct - recodings from e.g. latin-1 to utf-8 and back. + You can use these objects to do transparent + transcodings from e.g. latin-1 to utf-8 and back. stream must be a file-like object. - encode, decode must adhere to the Codec interface, Reader, + encode and decode must adhere to the Codec interface; Reader and Writer must be factory functions or classes providing the - StreamReader, StreamWriter interface resp. + StreamReader and StreamWriter interfaces resp. encode and decode are needed for the frontend translation, - Reader and Writer for the backend translation. 
Unicode is - used as intermediate encoding. + Reader and Writer for the backend translation. Error handling is done in the same way as defined for the StreamWriter/Readers. @@ -859,20 +854,17 @@ ### Shortcuts -def open(filename, mode='rb', encoding=None, errors='strict', buffering=1): +def open(filename, mode='r', encoding=None, errors='strict', buffering=1): """ Open an encoded file using the given mode and return a wrapped version providing transparent encoding/decoding. - Note: The wrapped version will only accept the object format - defined by the codecs, i.e. Unicode objects for most builtin - codecs. Output is also codec dependent and will usually be - Unicode as well. + Note: The wrapped version will only accept and output the object + format defined by the codecs, i.e. Unicode objects for most builtin + codecs. - Files are always opened in binary mode, even if no binary mode - was specified. This is done to avoid data loss due to encodings - using 8-bit values. The default file mode is 'rb' meaning to - open the file in binary read mode. + Underlying encoded files are always opened in binary mode. + The default file mode is 'r', meaning to open the file in read mode. encoding specifies the encoding which is to be used for the file. @@ -908,13 +900,13 @@ """ Return a wrapped version of file which provides transparent encoding translation. - Strings written to the wrapped file are interpreted according - to the given data_encoding and then written to the original - file as string using file_encoding. The intermediate encoding + Data written to the wrapped file is decoded according + to the given data_encoding and then encoded to the underlying + file using file_encoding. The intermediate data type will usually be Unicode but depends on the specified codecs. - Strings are read from the file using file_encoding and then - passed back to the caller as string using data_encoding. 
+ Data read from the file is decoded using file_encoding and then + passed back to the caller encoded using data_encoding. If file_encoding is not given, it defaults to data_encoding. diff -r 56f71f02206e Modules/_codecsmodule.c --- a/Modules/_codecsmodule.c Tue Dec 16 18:17:18 2014 -0800 +++ b/Modules/_codecsmodule.c Thu Dec 18 02:10:33 2014 +0000 @@ -54,9 +54,9 @@ "register(search_function)\n\ \n\ Register a codec search function. Search functions are expected to take\n\ -one argument, the encoding name in all lower case letters, and return\n\ -a tuple of functions (encoder, decoder, stream_reader, stream_writer)\n\ -(or a CodecInfo object)."); +one argument, the encoding name in all lower case letters, and either\n\ +return None, or a tuple of functions (encoder, decoder, stream_reader,\n\ +stream_writer) (or a CodecInfo object)."); static PyObject *codec_register(PyObject *self, PyObject *search_function)