Index: Include/unicodeobject.h
===================================================================
--- Include/unicodeobject.h (revision 86478)
+++ Include/unicodeobject.h (working copy)
@@ -737,7 +737,7 @@
     const char *errors          /* error handling */
     );
 
-/* Encodes a Unicode object and returns the result as Python string
+/* Encodes a Unicode object and returns the result as Python bytes
    object. */
 
 PyAPI_FUNC(PyObject*) PyUnicode_AsEncodedString(
Index: Doc/c-api/unicode.rst
===================================================================
--- Doc/c-api/unicode.rst (revision 86477)
+++ Doc/c-api/unicode.rst (working copy)
@@ -48,7 +48,6 @@
 The following APIs are really C macros and can be used to do fast checks and to
 access internal read-only data of Unicode objects:
 
-
 .. c:function:: int PyUnicode_Check(PyObject *o)
 
    Return true if the object *o* is a Unicode object or an instance of a Unicode
@@ -84,7 +83,6 @@
    Return a pointer to the internal buffer of the object. *o* has to be a
    :c:type:`PyUnicodeObject` (not checked).
 
-
 .. c:function:: int PyUnicode_ClearFreeList()
 
    Clear the free list. Return the total number of freed items.
@@ -97,7 +95,6 @@
 are available through these macros which are mapped to C functions depending on
 the Python configuration.
 
-
 .. c:function:: int Py_UNICODE_ISSPACE(Py_UNICODE ch)
 
    Return 1 or 0 depending on whether *ch* is a whitespace character.
@@ -194,10 +191,41 @@
    Return the character *ch* converted to a double. Return ``-1.0`` if this is not
    possible. This macro does not raise exceptions.
 
+.. c:function:: Py_UNICODE PyUnicode_GetMax()
 
+   Maximum ordinal for a Unicode character.
+
+
 Plain Py_UNICODE
 """"""""""""""""
 
+The following utility functions are useful when manipulating C arrays
+of plain Py_UNICODE characters. These functions operate similarly to
+the eponymous ANSI C functions.
+
+.. c:function:: size_t Py_UNICODE_strlen(const Py_UNICODE *u)
+
+.. c:function:: Py_UNICODE* Py_UNICODE_strcpy(Py_UNICODE *s1, const Py_UNICODE *s2)
+
+.. c:function:: Py_UNICODE* Py_UNICODE_strcat(Py_UNICODE *s1, const Py_UNICODE *s2)
+
+.. c:function:: Py_UNICODE* Py_UNICODE_strncpy(Py_UNICODE *s1, const Py_UNICODE *s2,
+                                               size_t n)
+
+.. c:function:: int Py_UNICODE_strcmp(const Py_UNICODE *s1, const Py_UNICODE *s2)
+
+.. c:function:: int Py_UNICODE_strncmp(const Py_UNICODE *s1, const Py_UNICODE *s2,
+                                       size_t n)
+
+.. c:function:: Py_UNICODE* Py_UNICODE_strchr(const Py_UNICODE *s, Py_UNICODE c)
+
+
+.. c:function:: Py_UNICODE* Py_UNICODE_strrchr(const Py_UNICODE *s, Py_UNICODE c)
+
+
+
+Creating Unicode Objects
+""""""""""""""""""""""""
 To create Unicode objects and access their basic sequence properties, use these
 APIs:
 
@@ -227,7 +255,18 @@
    Create a Unicode object from an UTF-8 encoded null-terminated char buffer
    *u*.
 
+.. c:function:: PyObject* PyUnicode_FromOrdinal(int ordinal)
 
+   Create a Unicode object from the given Unicode code point ordinal.
+   The ordinal must be in range(0x110000). A :exc:`ValueError` is raised if
+   it is not.
+
+   On narrow Python builds, the result is a string of length 1 for
+   ordinal in range(0x10000) and a string of length 2 for ordinal in
+   range(0x10000, 0x110000). In the latter case, the two string units
+   form a UTF-16 surrogate pair. On wide Python builds, the result is
+   always a string of length 1.
+
 .. c:function:: PyObject* PyUnicode_FromFormat(const char *format, ...)
 
    Take a C :c:func:`printf`\ -style *format* string and a variable number of
@@ -528,7 +567,22 @@
    using the Python codec registry. Return *NULL* if an exception was raised by
    the codec.
 
+.. c:function:: PyObject* PyUnicode_AsDecodedObject(PyObject *unicode, const char *encoding, const char *errors)
 
+   Create a Unicode object by decoding the encoded Unicode object
+   *unicode*. *encoding* and *errors* have the same meaning as the
+   parameters of the same name in the :func:`unicode` built-in
+   function. The codec to be used is looked up using the Python codec
+   registry. Return *NULL* if an exception was raised by the codec.
+   Note that Python codecs do not accept Unicode objects for decoding,
+   so this method is only useful with user or 3rd party codecs.
+
+.. c:function:: PyObject* PyUnicode_AsDecodedUnicode(PyObject *s, const char *encoding, const char *errors)
+
+   Same as :c:func:`PyUnicode_AsDecodedObject`, but raises a
+   :exc:`TypeError` if the decoder returns an object with a type other than
+   :class:`str`.
+
 .. c:function:: PyObject* PyUnicode_Encode(const Py_UNICODE *s, Py_ssize_t size, const char *encoding, const char *errors)
 
    Encode the :c:type:`Py_UNICODE` buffer of the given size and return a Python
@@ -537,7 +591,6 @@
    to be used is looked up using the Python codec registry. Return *NULL* if an
    exception was raised by the codec.
 
-
 .. c:function:: PyObject* PyUnicode_AsEncodedString(PyObject *unicode, const char *encoding, const char *errors)
 
    Encode a Unicode object and return the result as Python bytes object.
@@ -546,13 +599,65 @@
    using the Python codec registry. Return *NULL* if an exception was raised by
    the codec.
 
+.. c:function:: PyObject* PyUnicode_AsEncodedObject(PyObject *unicode, const char *encoding, const char *errors)
 
-UTF-8 Codecs
-""""""""""""
+   Use :c:func:`PyUnicode_AsEncodedString` instead.
 
+   Same as :c:func:`PyUnicode_AsEncodedString`, but without shortcuts
+   for common built-in encodings and without checking the type of the
+   object returned by encoding via the codec registry. This method is
+   only useful with user or 3rd party codecs that encode strings into
+   something other than bytes.
+
+.. c:function:: PyObject* PyUnicode_AsEncodedUnicode(PyObject *unicode, const char *encoding, const char *errors)
+
+   Use :c:func:`PyUnicode_AsEncodedString` instead.
+
+   Same as :c:func:`PyUnicode_AsEncodedObject`, but raises
+   :exc:`TypeError` if encoding via the codec registry returns an
+   object other than a string. This method is only useful with user or
+   3rd party codecs that encode strings into strings.
+
+.. c:function:: int PyUnicode_EncodeDecimal(Py_UNICODE *s, Py_ssize_t length,
+                                            char *output, const char *errors)
+
+   Takes a Unicode string holding a decimal value and writes it into
+   an output buffer using standard ASCII digit codes.
+
+   The output buffer has to provide at least length+1 bytes of storage
+   area. The output string is 0-terminated.
+
+   The encoder converts whitespace to ' ', decimal characters to their
+   corresponding ASCII digits and all other Latin-1 characters except
+   \0 as-is. Characters outside this range (Unicode ordinals 1-256)
+   are treated as errors. This includes embedded NULL bytes.
+
+   Error handling is defined by the errors argument:
+
+   NULL or "strict": raise a ValueError
+   "ignore": ignore the wrong characters (these are not copied to the
+     output buffer)
+   "replace": replaces illegal characters with '?'
+
+   Returns 0 on success, -1 on failure.
+
+
+.. c:function:: PyObject* PyUnicode_BuildEncodingMap(PyObject* unicode)
+
+   Build an 8-bit encoding map from a 256-character string *unicode*.
+   Returns an ``EncodingMap`` object.
+
+UTF-8 Codec
+"""""""""""
+
+UTF-8 is the default encoding in Python.
+
+.. c:function:: const char* PyUnicode_GetDefaultEncoding()
+
+   Returns "utf-8".
+
 These are the UTF-8 codec APIs:
 
-
 .. c:function:: PyObject* PyUnicode_DecodeUTF8(const char *s, Py_ssize_t size, const char *errors)
 
    Create a Unicode object by decoding *size* bytes of the UTF-8 encoded string
@@ -980,7 +1085,31 @@
 
    Concat two strings giving a new Unicode string.
 
+.. c:function:: void PyUnicode_Append(PyObject **pleft, PyObject *right)
 
+   Concat two strings and put the result in *\*pleft*. Sets *\*pleft* to
+   *NULL* on error.
+
+.. c:function:: void PyUnicode_AppendAndDel(PyObject **pleft, PyObject *right)
+
+   Concat two strings and put the result in *\*pleft* and drop the right
+   object. Sets *\*pleft* to *NULL* on error.
+
+
+.. c:function:: int PyUnicode_Resize(PyObject **unicode, Py_ssize_t length)
+
+   Resize an already allocated Unicode object to the new size *length*.
+
+   *\*unicode* is modified to point to the new (resized) object and 0 is
+   returned on success.
+
+   This API may only be called by the function which also called the
+   Unicode constructor. The refcount on the object must be 1. Otherwise,
+   an error is returned.
+
+   Error handling is implemented as follows: an exception is set, -1
+   is returned and *\*unicode* is left untouched.
+
 .. c:function:: PyObject* PyUnicode_Split(PyObject *s, PyObject *sep, Py_ssize_t maxsplit)
 
    Split a string giving a list of Unicode strings. If sep is *NULL*, splitting
@@ -988,14 +1117,33 @@
    separator. At most *maxsplit* splits will be done. If negative, no limit is
    set. Separators are not included in the resulting list.
 
+.. c:function:: PyObject* PyUnicode_RSplit(PyObject *s, PyObject *sep, Py_ssize_t maxsplit)
 
+   Split a string giving a list of Unicode strings.
+
+   If *sep* is *NULL*, splitting will be done at all whitespace
+   substrings. Otherwise, splits occur at the given separator.
+
+   At most *maxsplit* splits will be done. If negative, no limit is set.
+   Unlike :c:func:`PyUnicode_Split`, :c:func:`PyUnicode_RSplit` splits from
+   the end of the string.
+
+   Separators are not included in the resulting list.
+
 .. c:function:: PyObject* PyUnicode_Splitlines(PyObject *s, int keepend)
 
    Split a Unicode string at line breaks, returning a list of Unicode strings.
    CRLF is considered to be one line break. If *keepend* is 0, the Line break
    characters are not included in the resulting strings.
 
+.. c:function:: PyObject* PyUnicode_Partition(PyObject *s, PyObject *sep)
 
+   Partition a string using a given separator.
+
+.. c:function:: PyObject* PyUnicode_RPartition(PyObject *s, PyObject *sep)
+
+   Partition a string using a given separator, searching from the end of the string.
+
 .. c:function:: PyObject* PyUnicode_Translate(PyObject *str, PyObject *table, const char *errors)
 
    Translate a string by applying a character mapping table to it and return the
@@ -1089,7 +1237,12 @@
    *element* has to coerce to a one element Unicode string. ``-1`` is returned if
    there was an error.
 
+.. c:function:: int PyUnicode_IsIdentifier(PyObject *s)
 
+   Check whether argument *s* is a valid identifier and return true or
+   false accordingly. This function always succeeds.
+
+
 .. c:function:: void PyUnicode_InternInPlace(PyObject **string)
 
    Intern the argument *\*string* in place. The argument must be the address of a
@@ -1102,7 +1255,13 @@
    of this function as reference-count-neutral; you own the object after the call
    if and only if you owned it before the call.)
 
+.. c:function:: void PyUnicode_InternImmortal(PyObject **string)
 
+   Use :c:func:`PyUnicode_InternInPlace` instead.
+
+   Same as :c:func:`PyUnicode_InternInPlace`, but the interned string
+   will never be released.
+
 .. c:function:: PyObject* PyUnicode_InternFromString(const char *v)
 
    A combination of :c:func:`PyUnicode_FromString` and