Message 122816 - Python tracker



Author	belopolsky
Recipients	belopolsky, eric.smith, ezio.melotti, lemburg, mark.dickinson, skrah, vstinner
Date	2010-11-29.15:39:49
SpamBayes Score	1.0455525e-12
Marked as misclassified	No
Message-id	<AANLkTi=Yfy=wpLDwrP2jTFCbsPWbXGjz0pG_mcg_1iL0@mail.gmail.com>
In-reply-to	<4CF3754C.4080705@egenix.com>

Content

On Mon, Nov 29, 2010 at 4:41 AM, Marc-Andre Lemburg
<report@bugs.python.org> wrote:
..
> It would be better to copy and iterate over the Unicode string first,
> replacing any decimal code points with ASCII ones and then call the
> UTF-8 encoder.
>

Good idea.

> The code as it stands is very inefficient, since it will most likely
> run the memcpy() part for every code point after the first non-ASCII
> decimal one.
>

I doubt there are measurable gains from this optimization, but doing
the conversion on Unicode characters results in a cleaner API.  The new
patch, issue10557a.diff, implements
_PyUnicode_NormalizeDecimal(Py_UNICODE *s, Py_ssize_t length), which is
documented as follows:

/* Strip leading and trailing space and convert code points that have
   decimal digit property to the corresponding ASCII digit code point.

   Returns a new Unicode string on success, NULL on failure.
*/
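
To illustrate the behaviour described in that comment, here is a rough
Python-level sketch.  It is not the C code from issue10557a.diff:
normalize_decimal is a hypothetical name, and using unicodedata.decimal()
for the lookup is an assumption about an equivalent way to test the
decimal digit property.

```python
import unicodedata


def normalize_decimal(s: str) -> str:
    """Hypothetical sketch: strip surrounding whitespace and map every
    code point with the decimal digit property to its ASCII digit."""
    out = []
    for ch in s.strip():
        # unicodedata.decimal() returns the digit value for code points
        # with the decimal digit property, or the given default otherwise.
        value = unicodedata.decimal(ch, None)
        out.append(str(value) if value is not None else ch)
    return "".join(out)


# ARABIC-INDIC DIGIT ONE, TWO and FIVE around an ASCII decimal point.
text = " \u0661\u0662.\u0665 "
ascii_text = normalize_decimal(text)
# Once normalized, the result contains only ASCII digits and can be
# handed to an ASCII/UTF-8 based parser.
encoded = ascii_text.encode("utf-8")
```

Under these assumptions, all other characters are passed through
unchanged, so any rejection of invalid input is still left to the
numeric parser that runs afterwards.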

Note that I used the deprecated _PyUnicode_AsStringAndSize() in
floatobject.c not only because it is convenient, but also because I
believe that in the future numerical value parsers should be converted
to operate on Unicode characters.  When this happens, the use of
_PyUnicode_AsStringAndSize() can be removed.

Files
File name	Uploaded
issue10557a.diff	belopolsky, 2010-11-29.15:39:49

History
Date	User	Action	Args
2010-11-29 15:39:52	belopolsky	set	recipients: + belopolsky, lemburg, mark.dickinson, vstinner, eric.smith, ezio.melotti, skrah
2010-11-29 15:39:49	belopolsky	link	issue10557 messages
2010-11-29 15:39:49	belopolsky	create