Message 183446 - Python tracker


Author	serhiy.storchaka
Recipients	ezio.melotti, lemburg, serhiy.storchaka, vstinner, wiml
Date	2013-03-04.13:20:33
Message-id	<1362403233.83.0.893127781964.issue15866@psf.upfronthosting.co.za>
In-reply-to

Content

I prefer a slightly different form (simpler for me):

                for (p = collstart; p < collend;) {
                    Py_UCS4 ch = *p++;
                    if ((0xD800 <= ch && ch <= 0xDBFF) &&
                        (p < collend) &&
                        (0xDC00 <= *p && *p <= 0xDFFF)) {
                        ch = ((((ch & 0x03FF) << 10) |
                               ((Py_UCS4)*p++ & 0x03FF)) + 0x10000);
                    }
                    str += sprintf(str, "&#%d;", (int)ch);
                }
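
For anyone who wants to try the loop outside the codec, here is a minimal standalone sketch of the same logic. It is not the patch itself: the function name `write_charrefs`, the use of `uint16_t` to stand in for narrow-build code units, and the bounds checking with `snprintf` are all my assumptions for illustration.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper, not CPython code: writes "&#N;" references for the
   UTF-16 code units in [p, end) into out, joining valid surrogate pairs.
   Returns the number of characters written (excluding the NUL),
   or -1 if out is too small. */
static int
write_charrefs(const uint16_t *p, const uint16_t *end,
               char *out, size_t outsize)
{
    size_t used = 0;

    if (outsize == 0)
        return -1;
    out[0] = '\0';
    while (p < end) {
        uint32_t ch = *p++;
        /* A high surrogate followed by a low surrogate: combine them. */
        if (0xD800 <= ch && ch <= 0xDBFF &&
            p < end &&
            0xDC00 <= *p && *p <= 0xDFFF) {
            ch = (((ch & 0x03FF) << 10) |
                  ((uint32_t)*p++ & 0x03FF)) + 0x10000;
        }
        int n = snprintf(out + used, outsize - used, "&#%lu;",
                         (unsigned long)ch);
        if (n < 0 || (size_t)n >= outsize - used)
            return -1;          /* output would not fit */
        used += (size_t)n;
    }
    return (int)used;
}
```

A lone surrogate (one without a matching partner next to it) is not combined and is written as its own reference, which is what the `p < collend` check in the loop above is for.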

Please also look at the loop above ("determine replacement size"); it needs the same correction. It would be simpler to use a fixed-size buffer (``char buffer[2+29+1+1];``), as the charmap encoder does. Perhaps the charmap encoder should be fixed too, with the common code extracted into a separate function.
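
To show what I mean by keeping the size computation in step with the output loop, here is a rough sketch that formats each reference into a fixed-size buffer and adds up the lengths. The names `charrefs_size` and `CHARREF_BUFSIZE` are made up for this example, and reading 2+29+1+1 as "&#", up to 29 digits, ";" and a trailing NUL is my interpretation of that size.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Assumed meaning of the size: "&#" + up to 29 digits + ";" + NUL. */
#define CHARREF_BUFSIZE (2 + 29 + 1 + 1)

/* Hypothetical helper: returns the total length of the "&#N;" references
   for the UTF-16 code units in [p, end). A valid surrogate pair counts as
   ONE reference, exactly as in the output loop. */
static size_t
charrefs_size(const uint16_t *p, const uint16_t *end)
{
    char buffer[CHARREF_BUFSIZE];
    size_t total = 0;

    while (p < end) {
        uint32_t ch = *p++;
        if (0xD800 <= ch && ch <= 0xDBFF &&
            p < end &&
            0xDC00 <= *p && *p <= 0xDFFF) {
            ch = (((ch & 0x03FF) << 10) |
                  ((uint32_t)*p++ & 0x03FF)) + 0x10000;
        }
        /* The longest possible reference here is far below the buffer
           size, so snprintf never truncates. */
        total += (size_t)snprintf(buffer, sizeof buffer, "&#%lu;",
                                  (unsigned long)ch);
    }
    return total;
}
```

If the size pass counts a pair as two references while the output pass writes one (or the other way round), the allocated size and the written length disagree, so both passes have to apply the same pairing rule.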

I am unsure about '\ud83d\udc9d' on a wide build. Is it right to encode it as b'&#128157;' rather than b'&#55357;&#56477;'? That would be compatible with the narrow build but would break compatibility with 3.3+. Which is the lesser evil?
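
To make the two candidate results concrete, here is a small sketch over 32-bit code points, as a wide build stores them. The function `format_wide` and its `join_pairs` flag are hypothetical and exist only to put both behaviours side by side; nothing like this flag is proposed for the codec.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical illustration only. Formats 32-bit code points as "&#N;".
   With join_pairs != 0, an adjacent high/low surrogate pair is merged into
   one reference; with join_pairs == 0, every code point is written as is.
   Returns the length written, or -1 if out is too small. */
static int
format_wide(const uint32_t *p, const uint32_t *end, int join_pairs,
            char *out, size_t outsize)
{
    size_t used = 0;

    if (outsize == 0)
        return -1;
    out[0] = '\0';
    while (p < end) {
        uint32_t ch = *p++;
        if (join_pairs &&
            0xD800 <= ch && ch <= 0xDBFF &&
            p < end &&
            0xDC00 <= *p && *p <= 0xDFFF) {
            ch = (((ch & 0x03FF) << 10) | (*p++ & 0x03FF)) + 0x10000;
        }
        int n = snprintf(out + used, outsize - used, "&#%lu;",
                         (unsigned long)ch);
        if (n < 0 || (size_t)n >= outsize - used)
            return -1;          /* output would not fit */
        used += (size_t)n;
    }
    return (int)used;
}
```

For the input {0xD83D, 0xDC9D}, joining gives "&#128157;" (0x3D << 10 | 0x9D, plus 0x10000, is 0x1F49D), while not joining gives "&#55357;&#56477;".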

History
Date	User	Action	Args
2013-03-04 13:20:33	serhiy.storchaka	set	recipients: + serhiy.storchaka, lemburg, vstinner, ezio.melotti, wiml
2013-03-04 13:20:33	serhiy.storchaka	set	messageid: <1362403233.83.0.893127781964.issue15866@psf.upfronthosting.co.za>
2013-03-04 13:20:33	serhiy.storchaka	link	issue15866 messages
2013-03-04 13:20:33	serhiy.storchaka	create