Message 122564 - Python tracker

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

Author	lemburg
Recipients	Rhamphoryncus, amaury.forgeotdarc, belopolsky, eric.smith, ezio.melotti, lemburg, loewis, pitrou, rhettinger, vstinner
Date	2010-11-27.22:03:18
SpamBayes Score	2.0718427e-12
Marked as misclassified	No
Message-id	<1290895400.38.0.59035357378.issue10542@psf.upfronthosting.co.za>
In-reply-to

Content
I like the idea and thanks for putting work into this. Some comments: * when using macro variables, always put the variables in parens in the expansion; this avoids precedence issues, weird syntax errors, etc. - even if it may not be necessary * a function would be cleaner, but since this code is very performance sensitive, I'd opt for the macro version, unless someone can prove that a function would be just as fast in benchmarks * the macros should be documented in the unicodeobject.h header file and clearly mention that ptr and end should be side-effect free and that ptr must an lvalue * please use the faster bitmask operators for joining surrogates, i.e. ucs4 = ((((high & 0x03FF) << 10) \| (low & 0x03FF)) + 0x00010000); * the Py_UNICODE_JOIN_SURROGATES() macro should use Py_UCS4 as prefix since it returns Py_UCS4 values, i.e. Py_UCS4_JOIN_SURROGATES() * same for the Py_UNICODE_NEXT() macro, i.e. Py_UCS4_NEXT() * in order to make the macro easier to understand, please rename it to Py_UCS4_READ_CODE_POINT(); that's a little more typing, but still a lot less than without the macro :-) * this version should be slightly faster and is also easier to read: #define Py_UCS4_READ_CODE_POINT(ptr, end) \ ((Py_UNICODE_ISHIGHSURROGATE((ptr)[0]) && \ (ptr) < (end) && \ Py_UNICODE_ISLOWSURROGATE((ptr)[1])) ? \ Py_UNICODE_JOIN_SURROGATES((ptr)++, (ptr)++) : \ (Py_UCS4)*(ptr)++) I haven't tested it, but you get the idea. BTW: You only focus on UCS2 builds. Please also make sure that these changes work on UCS4 builds, e.g. Py_UCS2_READ_CODE_POINT() will also work on UCS4 builds and join code points there. Note that UCS4 builds currently don't join surrogates, so a high and low surrogates appear as two code points, which they are, but given the experience with UCS2 builds, may not be what the user expects. So for the purpose of consistency we should be careful with auto-joining surrogates in UCS2. It does make sence for ord() and the various string methods, but should be done with care in other cases. In any case, we should clearly document where these macros are used and warn about the implications of using them in the wrong places.

I like the idea and thanks for putting work into this.

Some comments:
 
 * when using macro variables, always put the variables in parens in the expansion; this avoids precedence issues, weird syntax errors, etc. - even if it may not be necessary
 
 * a function would be cleaner, but since this code is very performance sensitive, I'd opt for the macro version, unless someone can prove that a function would be just as fast in benchmarks
 
 * the macros should be documented in the unicodeobject.h header file and clearly mention that ptr and end should be side-effect free and that ptr must an lvalue
 
 * please use the faster bitmask operators for joining surrogates, i.e. ucs4 = ((((high & 0x03FF) << 10) | (low & 0x03FF)) + 0x00010000);

 * the Py_UNICODE_JOIN_SURROGATES() macro should use Py_UCS4 as prefix since it returns Py_UCS4 values, i.e. Py_UCS4_JOIN_SURROGATES()

 * same for the Py_UNICODE_NEXT() macro, i.e. Py_UCS4_NEXT()
 
 * in order to make the macro easier to understand, please rename it to Py_UCS4_READ_CODE_POINT(); that's a little more typing, but still a lot less than without the macro :-)

 * this version should be slightly faster and is also easier to read:
 
#define Py_UCS4_READ_CODE_POINT(ptr, end) \
    ((Py_UNICODE_ISHIGHSURROGATE((ptr)[0]) && \
      (ptr) < (end) && \
      Py_UNICODE_ISLOWSURROGATE((ptr)[1])) ? \
      Py_UNICODE_JOIN_SURROGATES((ptr)++, (ptr)++) : \
      (Py_UCS4)*(ptr)++)

   I haven't tested it, but you get the idea.
   
BTW: You only focus on UCS2 builds. Please also make sure that these changes work on UCS4 builds, e.g. Py_UCS2_READ_CODE_POINT() will also work on UCS4 builds and join code points there.

Note that UCS4 builds currently don't join surrogates, so a high and low surrogates appear as two code points, which they are, but given the experience with UCS2 builds, may not be what the user expects. So for the purpose of consistency we should be careful with auto-joining surrogates in UCS2.

It does make sence for ord() and the various string methods, but should be done with care in other cases.

In any case, we should clearly document where these macros are used and warn about the implications of using them in the wrong places.

History
Date	User	Action	Args
2010-11-27 22:03:20	lemburg	set	recipients: + lemburg, loewis, rhettinger, amaury.forgeotdarc, belopolsky, Rhamphoryncus, pitrou, vstinner, eric.smith, ezio.melotti
2010-11-27 22:03:20	lemburg	set	messageid: <1290895400.38.0.59035357378.issue10542@psf.upfronthosting.co.za>
2010-11-27 22:03:19	lemburg	link	issue10542 messages
2010-11-27 22:03:18	lemburg	create