Author belopolsky
Recipients belopolsky, eric.smith, ezio.melotti, lemburg, mark.dickinson, skrah, vstinner
Date 2010-11-29.15:39:49
Message-id <AANLkTi=Yfy=wpLDwrP2jTFCbsPWbXGjz0pG_mcg_1iL0@mail.gmail.com>
In-reply-to <4CF3754C.4080705@egenix.com>
Content
On Mon, Nov 29, 2010 at 4:41 AM, Marc-Andre Lemburg
<report@bugs.python.org> wrote:
..
> It would be better to copy and iterate over the Unicode string first,
> replacing any decimal code points with ASCII ones and then call the
> UTF-8 encoder.
>

Good idea.

> The code as it stands is very inefficient, since it will most likely
> run the memcpy() part for every code point after the first non-ASCII
> decimal one.
>

I doubt there are measurable gains from this optimization, but doing
the conversion on Unicode characters results in a cleaner API.  The new
patch, issue10557a.diff, implements
_PyUnicode_NormalizeDecimal(Py_UNICODE *s, Py_ssize_t length), which is
defined as follows:

/* Strip leading and trailing space and convert code points that have
   decimal digit property to the corresponding ASCII digit code point.

   Returns a new Unicode string on success, NULL on failure.
*/
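
To illustrate the intended behaviour, here is a minimal Python-level
sketch of the semantics described in the comment above. It is not the C
code from the patch. The name normalize_decimal is hypothetical, and the
sketch assumes that "space" means whatever str.strip() removes and that
the decimal digit property is what unicodedata.decimal() reports.

```python
import unicodedata


def normalize_decimal(s: str) -> str:
    """Strip surrounding whitespace and map decimal digits to ASCII.

    Hypothetical helper: mirrors the documented behaviour only, not the
    actual C implementation in the patch.
    """
    out = []
    for ch in s.strip():
        # unicodedata.decimal() returns the digit value for code points
        # with the decimal digit property, or the default (-1) otherwise.
        value = unicodedata.decimal(ch, -1)
        out.append(chr(ord("0") + value) if value >= 0 else ch)
    return "".join(out)


# ARABIC-INDIC DIGIT ONE, TWO and FIVE around an ASCII decimal point.
print(normalize_decimal("  \u0661\u0662.\u0665  "))  # prints 12.5
```

Once the string has been normalized this way, the result contains only
ASCII digits wherever a decimal digit appeared, so an ASCII-only
numeric parser can consume it directly.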

Note that I used the deprecated _PyUnicode_AsStringAndSize() in
floatobject.c not only because it is convenient, but also because I
believe that numerical value parsers should eventually be converted
to operate on Unicode characters.  When this happens, the use of
_PyUnicode_AsStringAndSize() can be removed.
Files
File name Uploaded
issue10557a.diff belopolsky, 2010-11-29.15:39:49
History
Date                 User        Action  Args
2010-11-29 15:39:52  belopolsky  set     recipients: + belopolsky, lemburg, mark.dickinson, vstinner, eric.smith, ezio.melotti, skrah
2010-11-29 15:39:49  belopolsky  link    issue10557 messages
2010-11-29 15:39:49  belopolsky  create