Author tchrist
Recipients Rhamphoryncus, amaury.forgeotdarc, belopolsky, doerwalter, eric.smith, ezio.melotti, georg.brandl, lemburg, loewis, pitrou, rhettinger, stutzbach, tchrist, vstinner
Date 2011-08-16.11:04:40
Message-id <26743.1313492664@chthon>
In-reply-to <1313485930.8.0.601749695449.issue10542@psf.upfronthosting.co.za>
Content
>Ezio Melotti <ezio.melotti@gmail.com> added the comment:

>I think the 4 macros:
> #define _Py_UNICODE_ISSURROGATE
> #define _Py_UNICODE_ISHIGHSURROGATE
> #define _Py_UNICODE_ISLOWSURROGATE
> #define _Py_UNICODE_JOIN_SURROGATES
>are quite straightforward and can avoid using the trailing _.

For what it's worth, I've seen Unicode documentation that prefers the
terms "lead surrogate" and "trail surrogate" as being clearer than the
terms "high surrogate" and "low surrogate".

For example, from the Unicode BOM FAQ at http://unicode.org/faq/utf_bom.html:

    Q: What are surrogates?

    A: Surrogates are code points from two special ranges of Unicode values, reserved for use as the leading, and
       trailing values of paired code units in UTF-16. Leading, also called high, surrogates are from D800₁₆ to DBFF₁₆,
       and trailing, or low, surrogates are from DC00₁₆ to DFFF₁₆. They are called surrogates, since they do not
       represent characters directly, but only as a pair.

BTW, considering recent discussions, you might want to read:

    Q: Are there any 16-bit values that are invalid?

    A: The two values FFFE₁₆ and FFFF₁₆ as well as the 32 values from FDD0₁₆ to FDEF₁₆ represent noncharacters. They are
       invalid in interchange, but may be freely used internal to an implementation. Unpaired surrogates are invalid as
       well, i.e. any value in the range D800₁₆ to DBFF₁₆ not followed by a value in the range DC00₁₆ to DFFF₁₆, or any
       value in the range DC00₁₆ to DFFF₁₆ not preceded by a value in the range D800₁₆ to DBFF₁₆. [AF]

and also the answer to:

    Q: Are there any paired surrogates that are invalid?

whose answer I here omit for brevity, as it is a table.
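
To make the two quoted rules concrete, here is a minimal sketch in C of a checker for a buffer of UTF-16 code units. The function name is hypothetical and is not taken from any library. It only covers what the quoted answer states (unpaired surrogates and the 16-bit noncharacters); it does not implement the omitted table of invalid paired surrogates.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns 1 if s[0..n) contains no unpaired surrogate
 * and none of the 16-bit noncharacters (FDD0..FDEF, FFFE, FFFF), else 0.
 * Supplementary noncharacters are deliberately not checked here. */
static inline int utf16_ok_for_interchange(const uint16_t *s, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        uint16_t u = s[i];
        if (u >= 0xD800 && u <= 0xDBFF) {
            /* A lead surrogate must be followed by a trail surrogate. */
            if (i + 1 >= n || s[i + 1] < 0xDC00 || s[i + 1] > 0xDFFF)
                return 0;
            i++; /* skip the trail we just validated */
        } else if (u >= 0xDC00 && u <= 0xDFFF) {
            return 0; /* trail surrogate with no preceding lead */
        } else if (u >= 0xFFFE || (u >= 0xFDD0 && u <= 0xFDEF)) {
            return 0; /* 16-bit noncharacter */
        }
    }
    return 1;
}
```

A buffer holding 0x0041, 0xD83D, 0xDE00 passes, while one that ends in a lone 0xD83D or contains 0xFFFE does not.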

I suspect that you guys are increasingly sold on the answer to the FAQ right after that one. :)

    Q: Because supplementary characters are uncommon, does that mean I can ignore them?

    A: Just because supplementary characters (expressed with surrogate pairs in UTF-16) are uncommon does 
       not mean that they should be neglected. They include:

        * emoji symbols and emoticons, for interoperating with Japanese mobile phones
        * uncommon (but not unused) CJK characters, important for personal and place names
        * variation selectors for ideographic variation sequences
        * important symbols for mathematics
        * numerous minority scripts and historic scripts, important for some user communities

Another example of using "lead" and "trail" surrogates is in the first
sentence from http://icu-project.org/apiref/icu4j/com/ibm/icu/text/UTF16.html

    * Naming: For clarity, High and Low surrogates are called Lead and Trail in the API, which gives a better sense of
      their ordering in a string. offset16 and offset32 are used to distinguish offsets to UTF-16 boundaries vs offsets
      to UTF-32 boundaries. int char32 is used to contain UTF-32 characters, as opposed to char16, which is a UTF-16
      code unit.
    * Roundtripping Offsets: You can always roundtrip from a UTF-32 offset to a UTF-16 offset and back. Because of the
      difference in structure, you can roundtrip from a UTF-16 offset to a UTF-32 offset and back if and only if
      bounds(string, offset16) != TRAIL.
    * Exceptions: The error checking will throw an exception if indices are out of bounds. Other than that, all
      methods will behave reasonably, even if unmatched surrogates or out-of-bounds UTF-32 values are present.
      UCharacter.isLegal() can be used to check for validity if desired.
    * Unmatched Surrogates: If the string contains unmatched surrogates, then these are counted as one UTF-32 value.
      This matches their iteration behavior, which is vital. It also matches common display practice as missing glyphs
      (see the Unicode Standard Section 5.4, 5.5).
    * Optimization: The method implementations may need optimization if the compiler doesn't fold static final methods.
      Since surrogate pairs will form an exceeding small percentage of all the text in the world, the singleton case
      should always be optimized for.
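
The "Unmatched Surrogates" rule quoted above can be sketched in C as a code point counter. This is only an illustration under the assumption that a well-formed pair counts once and any unmatched surrogate counts as one value of its own; the function name is made up and is not the ICU API.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: number of code points in the UTF-16 buffer s[0..n),
 * counting each unmatched surrogate as a single value. */
static inline size_t utf16_count_code_points(const uint16_t *s, size_t n)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        /* A lead followed by a trail forms one code point: skip the trail. */
        if ((s[i] & 0xFC00) == 0xD800 &&
            i + 1 < n && (s[i + 1] & 0xFC00) == 0xDC00)
            i++;
        count++;
    }
    return count;
}
```

The common case (no surrogate at all) costs one mask-and-compare per code unit, which is in the spirit of the "Optimization" note.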

You can also see this reflected in the utf.h file from the ICU project as part of their C API in ICU4C:

    #define     U_SENTINEL   (-1)
            This value is intended for sentinel values for APIs that (take or) return single code points (UChar32). 
    #define     U_IS_UNICODE_NONCHAR(c)
            Is this code point a Unicode noncharacter? 
    #define     U_IS_UNICODE_CHAR(c)
            Is c a Unicode code point value (0..U+10ffff) that can be assigned a character? 
    #define     U_IS_BMP(c)   ((uint32_t)(c)<=0xffff)
            Is this code point a BMP code point (U+0000..U+ffff)? 
    #define     U_IS_SUPPLEMENTARY(c)   ((uint32_t)((c)-0x10000)<=0xfffff)
            Is this code point a supplementary code point (U+10000..U+10ffff)? 
    #define     U_IS_LEAD(c)   (((c)&0xfffffc00)==0xd800)
            Is this code point a lead surrogate (U+d800..U+dbff)? 
    #define     U_IS_TRAIL(c)   (((c)&0xfffffc00)==0xdc00)
            Is this code point a trail surrogate (U+dc00..U+dfff)? 
    #define     U_IS_SURROGATE(c)   (((c)&0xfffff800)==0xd800)
            Is this code point a surrogate (U+d800..U+dfff)? 
    #define     U_IS_SURROGATE_LEAD(c)   (((c)&0x400)==0)
            Assuming c is a surrogate code point (U_IS_SURROGATE(c)), is it a lead surrogate? 
    #define     U_IS_SURROGATE_TRAIL(c)   (((c)&0x400)!=0)
            Assuming c is a surrogate code point (U_IS_SURROGATE(c)), is it a trail surrogate?
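
The mask tests in these macros are just compact spellings of the range tests. A small, self-contained C sketch (the function name is hypothetical) that checks this exhaustively over the whole code point range:

```c
#include <stdint.h>

/* Hypothetical check: returns 1 if the mask-based tests agree with the
 * plain range tests for every value in 0..0x10FFFF, else 0. */
static inline int surrogate_masks_match_ranges(void)
{
    for (uint32_t c = 0; c <= 0x10FFFF; c++) {
        int lead_mask  = (c & 0xFFFFFC00u) == 0xD800u;  /* top bits 110110 */
        int trail_mask = (c & 0xFFFFFC00u) == 0xDC00u;  /* top bits 110111 */
        int any_mask   = (c & 0xFFFFF800u) == 0xD800u;  /* top bits 11011  */

        int lead_range  = c >= 0xD800u && c <= 0xDBFFu;
        int trail_range = c >= 0xDC00u && c <= 0xDFFFu;
        int any_range   = c >= 0xD800u && c <= 0xDFFFu;

        if (lead_mask != lead_range || trail_mask != trail_range ||
            any_mask != any_range)
            return 0;
    }
    return 1;
}
```

The 0x400 bit is what separates lead from trail once a value is known to be a surrogate, which is why U_IS_SURROGATE_LEAD and U_IS_SURROGATE_TRAIL can test that single bit.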

Another one is:

    http://www.opensource.apple.com/source/WebCore/WebCore-1C25/icu/unicode/utf16.h

which contains:

    #define U16_IS_SINGLE(c) !U_IS_SURROGATE(c)
    #define U16_IS_LEAD(c) (((c)&0xfffffc00)==0xd800)
    #define U16_IS_TRAIL(c) (((c)&0xfffffc00)==0xdc00)
    #define U16_IS_SURROGATE(c) U_IS_SURROGATE(c)
    #define U16_IS_SURROGATE_LEAD(c) (((c)&0x400)==0)
    #define U16_SURROGATE_OFFSET ((0xd800<<10UL)+0xdc00-0x10000)
    #define U16_GET_SUPPLEMENTARY(lead, trail) \
    #define U16_LEAD(supplementary) (UChar)(((supplementary)>>10)+0xd7c0)
    #define U16_TRAIL(supplementary) (UChar)(((supplementary)&0x3ff)|0xdc00)
    #define U16_LENGTH(c) ((uint32_t)(c)<=0xffff ? 1 : 2)
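
The arithmetic behind U16_LEAD, U16_TRAIL and U16_SURROGATE_OFFSET may be easier to follow as plain functions. The sketch below uses hypothetical function names and assumes the input to the split functions is a supplementary code point (U+10000..U+10FFFF); the join follows the usual (lead << 10) + trail - offset form.

```c
#include <stdint.h>

/* 0xD7C0 is 0xD800 - (0x10000 >> 10): subtracting 0x10000 from the code
 * point is folded into the constant. */
static inline uint16_t lead_of(uint32_t cp)
{
    return (uint16_t)((cp >> 10) + 0xD7C0);
}

/* The low 10 bits are unaffected by subtracting 0x10000. */
static inline uint16_t trail_of(uint32_t cp)
{
    return (uint16_t)((cp & 0x3FF) | 0xDC00);
}

/* Inverse operation; offset has the same value as U16_SURROGATE_OFFSET. */
static inline uint32_t join_surrogates(uint16_t lead, uint16_t trail)
{
    const uint32_t offset = (0xD800u << 10) + 0xDC00u - 0x10000u;
    return ((uint32_t)lead << 10) + (uint32_t)trail - offset;
}
```

For U+1F600 this gives the pair 0xD83D, 0xDE00, and joining that pair gives U+1F600 back.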

In fact, you might want to read over that file, as it has embedded documentation
for these, and has other macros for being careful about surrogates.  For example,
here's one in full:

    /**
     * Get a code point from a string at a random-access offset,
     * without changing the offset.
     * "Unsafe" macro, assumes well-formed UTF-16.
     *
     * The offset may point to either the lead or trail surrogate unit
     * for a supplementary code point, in which case the macro will read
     * the adjacent matching surrogate as well.
     * The result is undefined if the offset points to a single, unpaired surrogate.
     * Iteration through a string is more efficient with U16_NEXT_UNSAFE or U16_NEXT.
     *
     * @param s const UChar * string
     * @param i string offset
     * @param c output UChar32 variable
     * @see U16_GET
     * @stable ICU 2.4
     */
    #define U16_GET_UNSAFE(s, i, c) { \
        (c)=(s)[i]; \
        if(U16_IS_SURROGATE(c)) { \
            if(U16_IS_SURROGATE_LEAD(c)) { \
                (c)=U16_GET_SUPPLEMENTARY((c), (s)[(i)+1]); \
            } else { \
                (c)=U16_GET_SUPPLEMENTARY((s)[(i)-1], (c)); \
            } \
        } \
    }
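
For contrast with the "unsafe" macro, here is a minimal sketch of a bounds-checked variant written as a function. It is not the ICU U16_GET macro; the name and signature are assumptions for illustration. It returns an unpaired surrogate unchanged instead of producing an undefined result, and the caller must still guarantee i < len.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: code point at random-access offset i in s[0..len).
 * 0x35FDC00 is (0xD800 << 10) + 0xDC00 - 0x10000, the surrogate offset. */
static inline uint32_t utf16_get(const uint16_t *s, size_t len, size_t i)
{
    uint32_t c = s[i];
    if ((c & 0xFFFFFC00u) == 0xD800u) {
        /* Lead surrogate: join with the following trail, if there is one. */
        if (i + 1 < len && (s[i + 1] & 0xFC00) == 0xDC00)
            c = (c << 10) + s[i + 1] - 0x35FDC00u;
    } else if ((c & 0xFFFFFC00u) == 0xDC00u) {
        /* Trail surrogate: join with the preceding lead, if there is one. */
        if (i > 0 && (s[i - 1] & 0xFC00) == 0xD800)
            c = ((uint32_t)s[i - 1] << 10) + c - 0x35FDC00u;
    }
    return c;
}
```

Pointing at either half of a pair yields the same supplementary code point, matching the behavior the macro's comment describes for well-formed input.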

So keeping your preamble bits, I might have considered doing it
this way if it were me doing it:

    #define _Py_UNICODE_IS_SURROGATE
    #define _Py_UNICODE_IS_LEAD_SURROGATE
    #define _Py_UNICODE_IS_TRAIL_SURROGATE
    #define _Py_UNICODE_JOIN_SURROGATES
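
To show how those four names might look with bodies, here is a sketch. The bodies are my own assumption for illustration, not the definitions from the patch under discussion, and each test macro evaluates its argument twice.

```c
#include <stdint.h>

/* Illustrative bodies only; the real patch may define these differently. */
#define _Py_UNICODE_IS_SURROGATE(ch)       (0xD800 <= (ch) && (ch) <= 0xDFFF)
#define _Py_UNICODE_IS_LEAD_SURROGATE(ch)  (0xD800 <= (ch) && (ch) <= 0xDBFF)
#define _Py_UNICODE_IS_TRAIL_SURROGATE(ch) (0xDC00 <= (ch) && (ch) <= 0xDFFF)

/* Assumes lead and trail have already been checked with the macros above. */
#define _Py_UNICODE_JOIN_SURROGATES(lead, trail) \
    (0x10000 + ((((uint32_t)(lead) & 0x03FF) << 10) | \
                ((uint32_t)(trail) & 0x03FF)))
```

With these, _Py_UNICODE_JOIN_SURROGATES(0xD83D, 0xDE00) evaluates to 0x1F600.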

But I also come from a culture that uses more underscores than you guys tend
to, as some of the macro names below from the utf8.h file show.  I find
that most projects use more underscores in uppercase names than Python does. :)

--tom

#define UTF_START_MARK(len) (((len) >  7) ? 0xFF : (0xFE << (7-(len))))
#define UTF_START_MASK(len) (((len) >= 7) ? 0x00 : (0x1F >> ((len)-2)))
#define UTF_CONTINUATION_MARK           0x80
#define UTF_ACCUMULATION_SHIFT          6
#define UTF_CONTINUATION_MASK           ((U8)0x3f)
#define UNISKIP(uv) ( (uv) < 0x80           ? 1 : \
#define UNISKIP(uv) ( (uv) < 0x80           ? 1 : \
#define NATIVE_IS_INVARIANT(c)          UNI_IS_INVARIANT(NATIVE8_TO_UNI(c))
#define IN_BYTES (CopHINTS_get(PL_curcop) & HINT_BYTES)
#define UNICODE_SURROGATE_FIRST         0xD800
#define UNICODE_SURROGATE_LAST          0xDFFF
#define UNICODE_REPLACEMENT             0xFFFD
#define UNICODE_BYTE_ORDER_MARK         0xFEFF
#define PERL_UNICODE_MAX        0x10FFFF
#define UNICODE_WARN_SURROGATE     0x0001       /* UTF-16 surrogates */
#define UNICODE_WARN_NONCHAR       0x0002       /* Non-char code points */
#define UNICODE_WARN_SUPER         0x0004       /* Above 0x10FFFF */
#define UNICODE_WARN_FE_FF         0x0008       
#define UNICODE_DISALLOW_SURROGATE 0x0010
#define UNICODE_DISALLOW_NONCHAR   0x0020
#define UNICODE_DISALLOW_SUPER     0x0040
#define UNICODE_DISALLOW_FE_FF     0x0080
#define UNICODE_WARN_ILLEGAL_INTERCHANGE \
#define UNICODE_DISALLOW_ILLEGAL_INTERCHANGE \
#define UNICODE_ALLOW_SURROGATE 0
#define UNICODE_ALLOW_SUPER     0
#define UNICODE_ALLOW_ANY       0
#define UNICODE_IS_SURROGATE(c)         ((c) >= UNICODE_SURROGATE_FIRST && \
#define UNICODE_IS_REPLACEMENT(c)       ((c) == UNICODE_REPLACEMENT)
#define UNICODE_IS_BYTE_ORDER_MARK(c)   ((c) == UNICODE_BYTE_ORDER_MARK)
#define UNICODE_IS_NONCHAR(c)           ((c >= 0xFDD0 && c <= 0xFDEF) \
#define UNICODE_IS_SUPER(c)             ((c) > PERL_UNICODE_MAX)
#define UNICODE_IS_FE_FF(c)             ((c) > 0x7FFFFFFF)
#define UNICODE_GREEK_CAPITAL_LETTER_SIGMA      0x03A3
#define UNICODE_GREEK_SMALL_LETTER_FINAL_SIGMA  0x03C2
#define UNICODE_GREEK_SMALL_LETTER_SIGMA        0x03C3
#define GREEK_SMALL_LETTER_MU                   0x03BC
#define GREEK_CAPITAL_LETTER_MU 0x039C  /* Upper and title case of MICRON */
#define LATIN_CAPITAL_LETTER_Y_WITH_DIAERESIS 0x0178    /* Also is title case */
#define LATIN_CAPITAL_LETTER_SHARP_S    0x1E9E
#define UNI_DISPLAY_ISPRINT     0x0001
#define UNI_DISPLAY_BACKSLASH   0x0002
#define UNI_DISPLAY_QQ          (UNI_DISPLAY_ISPRINT|UNI_DISPLAY_BACKSLASH)
#define UNI_DISPLAY_REGEX       (UNI_DISPLAY_ISPRINT|UNI_DISPLAY_BACKSLASH)
#define LATIN_SMALL_LETTER_SHARP_S      0x00DF
#define LATIN_SMALL_LETTER_Y_WITH_DIAERESIS 0x00FF
#define MICRO_SIGN 0x00B5
#define LATIN_CAPITAL_LETTER_A_WITH_RING_ABOVE 0x00C5
#define LATIN_SMALL_LETTER_A_WITH_RING_ABOVE 0x00E5
#define ANYOF_FOLD_SHARP_S(node, input, end)    \
#define SHARP_S_SKIP 2

PS: Those won't always make sense for lack of continuation lines and enclosing ifdefs.
History
Date                 User     Action  Args
2011-08-16 11:04:43  tchrist  set     recipients: + tchrist, lemburg, loewis, doerwalter, georg.brandl, rhettinger, amaury.forgeotdarc, belopolsky, Rhamphoryncus, pitrou, vstinner, eric.smith, stutzbach, ezio.melotti
2011-08-16 11:04:42  tchrist  link    issue10542 messages
2011-08-16 11:04:40  tchrist  create