Message 124860 - Python tracker

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

Author	belopolsky
Recipients	Rhamphoryncus, amaury.forgeotdarc, belopolsky, doerwalter, eric.smith, ezio.melotti, georg.brandl, lemburg, loewis, pitrou, rhettinger, stutzbach, vstinner
Date	2010-12-29.17:36:15
SpamBayes Score	5.551115e-17
Marked as misclassified	No
Message-id	<AANLkTinUhFG=mihzLv+xfqX2xrpc-ke1Gq-8wYCqt6pT@mail.gmail.com>
In-reply-to	<4D1B2767.30209@egenix.com>

Content
On Wed, Dec 29, 2010 at 7:19 AM, Marc-Andre Lemburg <report@bugs.python.org> wrote: .. > * The macros still need some more attention to enhance their performance. > Although I made your suggested change from '-' to '&', I seriously doubt that this would make any difference on modern CPUs. Why do you think these macros are performance critical? Users with lots of supplementary characters in their files are probably better off with a wide build where Py_UNICODE_NEXT() is just ptr++ and can hardly be further optimized. Higher performance algorithms are possible, but those should probably do some loop unrolling and/or process more than one character at a time. At this point, however it is too soon to worry about optimization before we even know where these macros will be used. > For consistency, I'd choose names Py_UNICODE_READ_NEXT() > and Py_UNICODE_WRITE_NEXT() instead of Py_UNICODE_NEXT() and > Py_UNICODE_PUT_NEXT(). > I would leave it for you and Raymond to reach a consensus. My understanding is that Raymond does not want "next" in the name, so your suggestion still conflicts with that. I would mildly prefer GET/PUT over READ/WRITE because the latter suggests multiple characters. As discussed before, the macro prefix does not imply the return value. Compare this to Py_UNICODE_ISSPACE() and friends or pretty much any other Py_UNICODE_ macro. Note that I added a leading underscore to Py_UNICODE_JOIN_SURROGATES and other new macros, so there is no immediate pressure to get the names perfect. > * The macros need to be carefully documented, both in unicodeobject.h > and the general docs. > I've added a description above _Py_UNICODE_NEXT macros. I would really like to see these macros in private use for a while before they are published for general audience. I'll add a comment describing _Py_UNICODE_JOIN_SURROGATES. The remaining macros seem to be fairly self-explanatory (unlike, say Py_UNICODE_ISDIGIT or Py_UNICODE_ISTITLE which are not documented in unicodeobject.h.) Explicit downcasts would probably make sense, for example (ptr)++ = (Py_UNICODE)ch instead of (ptr)++ = ch, but I don't think we need explicit casts say in Py_UCS4 code = (ch) - 0x10000; where they can mask coding errors. I also looked for the use of casts elsewhere in unicodeobject.h and the following does not look right: #define Py_UNICODE_ISSPACE(ch) \ ((ch) < 128U ? _Py_ascii_whitespace[(ch)] : _PyUnicode_IsWhitespace(ch)) It looks like this won't work right if ch is a signed char. > Same for your _Py_UNICODE_NEXT() to make sure that the return > value is indeed a Py_UNICODE value. > The return value of _Py_UNICODE_NEXT() is not Py_UNICODE, it is Py_UCS4 and as far as I can see, every conditional branch in narrow case has an explicit cast. In the wide case, I don't think we want an explicit cast because ptr should already be Py_UCS4* and if it is not, it may be a coding error that we don't want to mask. > * In general, we should probably be clear on the allowed input > and define the output types in the documentation. I agree. I'll add a note that ptr and end should be Py_UNICODE*. I am not sure what we should say about ch argument. If we add casts, the macro will accept anything, but we should probably document it as expecting Py_UCS4.

On Wed, Dec 29, 2010 at 7:19 AM, Marc-Andre Lemburg
<report@bugs.python.org> wrote:
..
> * The macros still need some more attention to enhance their performance.
>
Although I made your suggested change from '-' to '&', I seriously
doubt that this would make any difference on modern CPUs.  Why do you
think these macros are performance critical?  Users with lots of
supplementary characters in their files are probably better off with a
wide build where Py_UNICODE_NEXT() is just *ptr++ and can hardly be
further optimized.  Higher performance algorithms are possible, but
those should probably do some loop unrolling and/or process more than
one character at a time.  At this point, however it is too soon to
worry about optimization before we even know where these macros will
be used.

> * For consistency, I'd choose names Py_UNICODE_READ_NEXT()
>  and Py_UNICODE_WRITE_NEXT() instead of Py_UNICODE_NEXT() and
>  Py_UNICODE_PUT_NEXT().
>

I would leave it for you and Raymond to reach a consensus.  My
understanding is that Raymond does not want "next" in the name, so
your suggestion still conflicts with that.  I would mildly prefer
GET/PUT over READ/WRITE because the latter suggests multiple
characters.

As discussed before, the macro prefix does not imply the return value.
 Compare this to Py_UNICODE_ISSPACE() and friends or pretty much any
other Py_UNICODE_ macro.   Note that I added a leading underscore to
Py_UNICODE_JOIN_SURROGATES and other new macros, so there is no
immediate pressure to get the names perfect.

> * The macros need to be carefully documented, both in unicodeobject.h
>  and the general docs.
>

I've added a description above _Py_UNICODE_*NEXT macros.  I would
really like to see these macros in private use for a while before they
are published for general audience.  I'll add a comment describing
_Py_UNICODE_JOIN_SURROGATES.  The remaining macros seem to be fairly
self-explanatory (unlike, say Py_UNICODE_ISDIGIT or Py_UNICODE_ISTITLE
which are not documented in unicodeobject.h.)

Explicit downcasts would probably make sense, for example *(ptr)++ =
(Py_UNICODE)ch instead of *(ptr)++ = ch, but I don't think we need
explicit casts say in Py_UCS4 code = (ch) - 0x10000; where they can
mask coding errors.

I also looked for the use of casts elsewhere in unicodeobject.h and
the following does not look right:

#define Py_UNICODE_ISSPACE(ch) \
    ((ch) < 128U ? _Py_ascii_whitespace[(ch)] : _PyUnicode_IsWhitespace(ch))

It looks like this won't work right if ch is a signed char.

> * Same for your _Py_UNICODE_NEXT() to make sure that the return
>  value is indeed a Py_UNICODE value.
>

The return value of _Py_UNICODE_NEXT()  is *not* Py_UNICODE, it is
Py_UCS4 and as far as I can see, every conditional branch in narrow
case has an explicit cast.  In the wide case, I don't think we want an
explicit cast because ptr should already be Py_UCS4* and if it is not,
it may be a coding error that we don't want to mask.

> * In general, we should probably be clear on the allowed input
>  and define the output types in the documentation.

I agree.  I'll add a note that ptr and end should be Py_UNICODE*.  I
am not sure what we should say about ch argument.  If we add casts,
the macro will accept anything, but we should probably document it as
expecting Py_UCS4.

History
Date	User	Action	Args
2010-12-29 17:36:17	belopolsky	set	recipients: + belopolsky, lemburg, loewis, doerwalter, georg.brandl, rhettinger, amaury.forgeotdarc, Rhamphoryncus, pitrou, vstinner, eric.smith, stutzbach, ezio.melotti
2010-12-29 17:36:15	belopolsky	link	issue10542 messages
2010-12-29 17:36:15	belopolsky	create