classification
Title: Use macros for surrogates in unicodeobject.c
Type: Stage:
Components: Versions: Python 3.3
process
Status: closed Resolution: duplicate
Dependencies: Superseder: Py_UNICODE_NEXT and other macros for surrogates
View: 10542
Assigned To: Nosy List: belopolsky, benjamin.peterson, ezio.melotti, lemburg, loewis, pitrou, tchrist, terry.reedy, vstinner
Priority: normal Keywords: patch

Created on 2011-08-15 10:06 by vstinner, last changed 2011-08-16 13:22 by benjamin.peterson. This issue is now closed.

Files
File name Uploaded Description Edit
unicode_macros.patch vstinner, 2011-08-15 10:06 review
Messages (7)
msg142108 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2011-08-15 10:06
A lot of code is duplicated in unicodeobject.c to manipulate ("encode/decode") surrogates. Each function has from one to three different implementations. The new decode_ucs4() function adds a new implementation. Attached patch replaces this code by macros.

I think that only the implementations of IS_HIGH_SURROGATE and IS_LOW_SURROGATE are important for speed. ((ch & 0xFFFFFC00UL) == 0xD800) (from decode_ucs4) is *a little bit* faster than (0xD800 <= ch && ch <= 0xDBFF) on my CPU (Atom Z520 @ 1.3 GHz): running test_unicode 4 times takes ~54 sec instead of ~57 sec (-3%).

These 3 macros have to be checked, I wrote the first one:

#define IS_SURROGATE(ch) (((ch) & 0xFFFFF800UL) == 0xD800)
#define IS_HIGH_SURROGATE(ch) (((ch) & 0xFFFFFC00UL) == 0xD800)
#define IS_LOW_SURROGATE(ch) (((ch) & 0xFFFFFC00UL) == 0xDC00)

I added cast to Py_UCS4 in COMBINE_SURROGATES to avoid integer overflow if Py_UNICODE is 16 bits (narrow build). It's maybe useless.

#define COMBINE_SURROGATES(ch1, ch2) \
 (((((Py_UCS4)(ch1) & 0x3FF) << 10) | ((Py_UCS4)(ch2) & 0x3FF)) + 0x10000)

HIGH_SURROGATE and LOW_SURROGATE require that their ordinal argument has been preproceed to fit in [0; 0xFFFF]. I added this requirement in the comment of these macros. It would be better to have only one macro to do the two operations, but because "*p++" (dereference and increment) is usually used, I prefer to avoid one unique macro (I don't like passing *p++ in a macro using its argument more than once).

Or we may add a third macro using HIGH_SURROGATE and LOW_SURROGATE.

I rewrote the main loop of PyUnicode_EncodeUTF16() to avoid an useless test on ch2 on narrow build.

I also added a IS_NONBMP macro just because I prefer macro over hardcoded constants.
msg142109 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2011-08-15 10:10
We may use the following unlikely macro for IS_SURROGATE, IS_HIGH_SURROGATE and IS_LOW_SURROGATE:

#define likely(x)	__builtin_expect(!!(x), 1)
#define unlikely(x)	__builtin_expect(!!(x), 0)

I suppose that we should use microbenchmarks to validate these macros?

Should I open a new issue for this idea?
msg142111 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2011-08-15 10:23
This has been proposed already in #10542 (the issue also has patches).
msg142125 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2011-08-15 12:11
> HIGH_SURROGATE and LOW_SURROGATE require that their ordinal argument
> has been preproceed to fit in [0; 0xFFFF]. I added this requirement in
> the comment of these macros.

The macros should preprocess the argument themselves. It will make the
code even simpler.
Otherwise +1.
msg142148 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2011-08-15 21:44
> This has been proposed already in #10542 (the issue also has patches).

The two issues are different: this issue is only a refactoring, whereas #10542 adds a new "feature" (function/macro: Py_UNICODE_NEXT).
msg142170 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2011-08-16 08:19
#10542 proposes the following macros to factor out common code:
 #define _Py_UNICODE_ISSURROGATE
 #define _Py_UNICODE_ISHIGHSURROGATE
 #define _Py_UNICODE_ISLOWSURROGATE
 #define _Py_UNICODE_JOIN_SURROGATES
and to avoid checking for narrow/wide builds and recombine surrogates manually (so still refactoring):
 #define _Py_UNICODE_NEXT
 #define _Py_UNICODE_PUT_NEXT

Your patch proposes the same 4 macros:
 #define IS_SURROGATE
 #define IS_HIGH_SURROGATE
 #define IS_LOW_SURROGATE
 #define COMBINE_SURROGATES
+ 3 additional macros:
 #define IS_NONBMP
 #define HIGH_SURROGATE
 #define LOW_SURROGATE

So the two issue looks quite similar to me.
msg142174 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2011-08-16 08:31
Ezio Melotti wrote:
> 
> Ezio Melotti <ezio.melotti@gmail.com> added the comment:
> 
> #10542 proposes the following macros to factor out common code:
>  #define _Py_UNICODE_ISSURROGATE
>  #define _Py_UNICODE_ISHIGHSURROGATE
>  #define _Py_UNICODE_ISLOWSURROGATE
>  #define _Py_UNICODE_JOIN_SURROGATES
> and to avoid checking for narrow/wide builds and recombine surrogates manually (so still refactoring):
>  #define _Py_UNICODE_NEXT
>  #define _Py_UNICODE_PUT_NEXT
> 
> Your patch proposes the same 4 macros:
>  #define IS_SURROGATE
>  #define IS_HIGH_SURROGATE
>  #define IS_LOW_SURROGATE
>  #define COMBINE_SURROGATES
> + 3 additional macros:
>  #define IS_NONBMP
>  #define HIGH_SURROGATE
>  #define LOW_SURROGATE
> 
> So the two issue looks quite similar to me.

Can we please fold this issue into #10542. We've already had the
discussion there.

Thanks,
-- 
Marc-Andre Lemburg
eGenix.com

________________________________________________________________________
2011-10-04: PyCon DE 2011, Leipzig, Germany                49 days to go

::: Try our new mxODBC.Connect Python Database Interface for free ! ::::

   eGenix.com Software, Skills and Services GmbH  Pastor-Loeh-Str.48
    D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg
           Registered at Amtsgericht Duesseldorf: HRB 46611
               http://www.egenix.com/company/contact/
History
Date User Action Args
2011-08-16 13:22:49benjamin.petersonsetstatus: open -> closed
resolution: duplicate
superseder: Py_UNICODE_NEXT and other macros for surrogates
2011-08-16 08:31:38lemburgsetmessages: + msg142174
2011-08-16 08:19:06ezio.melottisetmessages: + msg142170
2011-08-15 21:44:59vstinnersetmessages: + msg142148
2011-08-15 21:39:51vstinnersetnosy: + belopolsky
2011-08-15 12:11:28pitrousetmessages: + msg142125
2011-08-15 10:23:32ezio.melottisetmessages: + msg142111
2011-08-15 10:10:08vstinnersetmessages: + msg142109
2011-08-15 10:06:24vstinnercreate