Message 123757 - Python tracker

Author	stutzbach
Recipients	Rhamphoryncus, amaury.forgeotdarc, belopolsky, eric.smith, ezio.melotti, lemburg, loewis, pitrou, rhettinger, stutzbach, vstinner
Date	2010-12-10.23:09:24
Message-id	<1292022566.21.0.51776102591.issue10542@psf.upfronthosting.co.za>
In-reply-to

Content
In bltinmodule.c, it looks like some of the indentation doesn't line up?

Bikeshedding aside, it looks good to me.

I agree with Eric Smith that the first part of a macro's name usually refers to the type of the first argument (or the type the first argument points to).  Examples:

 - Py_UNICODE_ISSPACE : Py_UNICODE -> int
 - Py_UNICODE_TOLOWER : Py_UNICODE -> Py_UNICODE
 - Py_UNICODE_strlen: Py_UNICODE * -> size_t

This is true elsewhere in the code as well:

 - PyList_GET_SIZE : PyListObject * -> Py_ssize_t

Yes, I know there are some unfortunate exceptions. ;-)

I agree that it would be nice if something in the name hinted that the return type was Py_UCS4 though.

Marc-Andre Lemburg wrote:
>  The first argument of the macro can be any array, not just
> Py_UNICODE*, but also Py_UCS4* or even int*.

It's true that macros in C do not have any type safety.  While technically passing a Py_UCS4 * will work, on a UCS2 build it would needlessly check the Py_UCS4 data for surrogates.  I think we should discourage that.

You can also technically pass a PyListObject * to PyTuple_GET_SIZE, but that's also not a good idea. ;-)
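
To illustrate the point about type safety, here is a minimal sketch. Every name in it (var_head_t, list_t, tuple_t, TUPLE_GET_SIZE, tuple_get_size) is a hypothetical stand-in, not the real CPython definition. It only shows why a cast-based macro accepts the "wrong" pointer type silently, while a function with a typed parameter would not.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical shared header, loosely modelled on a variable-size object. */
typedef struct { size_t size; } var_head_t;

/* Two unrelated container types that both start with the shared header. */
typedef struct { var_head_t head; int *items; size_t allocated; } list_t;
typedef struct { var_head_t head; int items[4]; } tuple_t;

/* Macro version: the cast accepts any pointer, so nothing stops a caller
   from passing a list_t * where a tuple_t * was intended. */
#define TUPLE_GET_SIZE(op) (((var_head_t *)(op))->size)

/* Function version: the parameter type documents and enforces the intent.
   Passing a list_t * here draws a compiler diagnostic. */
static size_t tuple_get_size(const tuple_t *t)
{
    assert(t != NULL);
    return t->head.size;
}
```

With these definitions, TUPLE_GET_SIZE(&some_list) compiles and returns the list's size, which is exactly the "works, but not a good idea" situation: the name is the only thing telling the caller what type belongs there.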

Alexander Belopolsky wrote:
> The issue is that once in the process of reading the codepoint, it
> is determined whether the code point is BMP or non-BMP.  Testing the
> result again in order to write it is somewhat wasteful.  I don't
> think this would matter in practice, but would like to hear
> alternative opinions before moving further.

If the common pattern is:

         ch = Py_UNICODE_NEXT(rp, end);
         uc = Py_UNICODE_SOME_TRANSFORMATION(ch);
         Py_UNICODE_PUT_NEXT(wp, uc);

The second check for surrogates in Py_UNICODE_PUT_NEXT is necessary, unless you can prove that Py_UNICODE_SOME_TRANSFORMATION will never transform characters < 0x10000 into characters >= 0x10000 or vice versa.

Can we prove that will always be the case, for current and future versions of Unicode, for all or almost all of the transformations we care about?

Answering that question and figuring out what to do about it are probably more trouble than they're worth.  If a particular point proves to be a bottleneck, we can always specialize the code there later.
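
For concreteness, here is a rough sketch of the read/transform/write pattern on a UCS2-style build, where strings are arrays of 16-bit units. The helpers next_code_point and put_code_point are hypothetical stand-ins for the proposed macros, not their actual definitions, and to_bold is an invented transformation chosen only because it maps BMP characters to non-BMP ones.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint16_t unit_t;  /* stand-in for Py_UNICODE on a UCS2 build */
typedef uint32_t ucs4_t;  /* stand-in for Py_UCS4 */

/* Read one code point from [*pp, end), joining a valid surrogate pair. */
static ucs4_t next_code_point(const unit_t **pp, const unit_t *end)
{
    const unit_t *p = *pp;
    ucs4_t ch;
    assert(p < end);
    ch = *p++;
    if (ch >= 0xD800 && ch <= 0xDBFF && p < end &&
        *p >= 0xDC00 && *p <= 0xDFFF) {
        ch = 0x10000 + ((ch - 0xD800) << 10) + (ucs4_t)(*p++ - 0xDC00);
    }
    *pp = p;
    return ch;
}

/* Write one code point at *pp, splitting it into a surrogate pair if
   needed.  The caller must have reserved room for two units. */
static void put_code_point(unit_t **pp, ucs4_t ch)
{
    unit_t *p = *pp;
    if (ch >= 0x10000) {  /* the "second check" discussed above */
        ch -= 0x10000;
        *p++ = (unit_t)(0xD800 + (ch >> 10));
        *p++ = (unit_t)(0xDC00 + (ch & 0x3FF));
    } else {
        *p++ = (unit_t)ch;
    }
    *pp = p;
}

/* Hypothetical transformation that crosses the BMP boundary:
   'A'..'Z' map to MATHEMATICAL BOLD CAPITAL A..Z (U+1D400..U+1D419). */
static ucs4_t to_bold(ucs4_t ch)
{
    return (ch >= 'A' && ch <= 'Z') ? 0x1D400 + (ch - 'A') : ch;
}

/* Read, transform, write.  out must have room for 2 * n units.
   Returns the number of units written. */
static size_t transform(const unit_t *in, size_t n, unit_t *out)
{
    const unit_t *rp = in, *end = in + n;
    unit_t *wp = out;
    while (rp < end) {
        ucs4_t ch = next_code_point(&rp, end);
        put_code_point(&wp, to_bold(ch));
    }
    return (size_t)(wp - out);
}
```

Here the input 'A' is a single BMP unit, but its transformed value needs a surrogate pair, so the writer cannot reuse what the reader learned about the input. Skipping the check in put_code_point would only be safe for transformations known never to cross the 0x10000 boundary.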
