Message 123403 - Python tracker

Author	belopolsky
Recipients	belopolsky, eric.smith, ezio.melotti, lemburg, mark.dickinson, skrah
Date	2010-12-04.21:03:17
SpamBayes Score	5.551115e-17
Marked as misclassified	No
Message-id	<AANLkTi=BjOg2-LdrPp6fiGin9SE89ST+J92o2Q4sof5u@mail.gmail.com>
In-reply-to	<1291493516.72.0.216748407371.issue10557@psf.upfronthosting.co.za>

Content

On Sat, Dec 4, 2010 at 3:11 PM, Mark Dickinson <report@bugs.python.org> wrote:
>
> Mark Dickinson <dickinsm@gmail.com> added the comment:
>.. One issue is that we'd still need the char* -> double operations, partly because
> PyOS_string_to_double is part of the public API, and partly to continue to support
> creation of a float from a bytes instance.
>

I thought about it.  I see two solutions:

1. Retain PyOS_string_to_double unchanged and add PyOS_unicode_to_double.
2. Reimplement PyOS_string_to_double as a UTF-8 decode whose result is
passed to PyOS_unicode_to_double.
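
To make option 2 concrete, here is a rough Python model of the layering.
It is only a sketch: unicode_to_double() is a hypothetical stand-in for the
proposed PyOS_unicode_to_double (which does not exist), and it simply
borrows float() for the actual parsing.

```python
def unicode_to_double(text: str) -> float:
    """Hypothetical stand-in for the proposed PyOS_unicode_to_double."""
    # The real proposal would parse Py_UNICODE data in C; float() is
    # borrowed here only so the sketch runs.
    return float(text)


def string_to_double(data: bytes) -> float:
    """Option 2: decode the bytes as UTF-8, then delegate to the str parser."""
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError as exc:
        # Report undecodable input the same way as any other bad literal.
        raise ValueError(f"could not convert bytes to float: {data!r}") from exc
    return unicode_to_double(text)
```

With this layering only one parser has to be maintained, at the cost of a
decode step for every char* caller.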

> The other issue is that for floats, it's difficult to separate the parser from the base
> conversion;  to be useful, we'd probably end up making the whole of dtoa.c
> Py_UNICODE aware.

That's what I had in mind.  Naively, it looks like we just need to
replace the char type with Py_UNICODE in several places, assuming
exotic digit conversion is still handled separately.

>  (One of the return values from the dtoa.c parser is a pointer to the significant digits
> in the original input string;  so the base-conversion calculation itself needs access
> to portions of the original string.)
>

Maybe we should start with int().  It is simpler, but will probably
reveal some of the same difficulties as float().

> Ideally, for float(string), we'd have a zero-copy setup that operated directly on the
> unicode input (read-only);  but I think that achieving that right now is going to be
> messy, and involve dtoa.c knowing far more about Unicode than I'd be comfortable
> with.
>

This is clearly a 3.3-ish project.  Hopefully, in time, people will
realize that decimal digits are just [0-9], and numeric experts will
not be required to know about Unicode beyond the 127th code point. :-)

> N.B. If we didn't have to deal with alternative digits, it *really* would be much simpler.
>

We still don't.  I've already separated this out, and we can keep it
this way as long as people are willing to pay the price for supporting
alternative digits.

One thing we could improve is to fail earlier on non-digits in
PyUnicode_TransformDecimalToASCII(), to speed up not-uncommon code like
this:

for line in f:
   try:
       n = int(line)
   except ValueError:
       pass
   ...
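
A rough Python model of the fail-early idea follows.  It is a sketch under
assumptions, not the C implementation: transform_decimal_to_ascii() is a
hypothetical name, ASCII characters are passed through for the later parser
to judge, and any non-ASCII character that is neither whitespace nor a
decimal digit aborts the scan immediately.  str.isascii() needs Python 3.7
or later.

```python
import unicodedata


def transform_decimal_to_ascii(text: str) -> str:
    """Map Unicode decimal digits to ASCII, failing on the first bad character."""
    out = []
    for index, ch in enumerate(text):
        if ch.isascii():
            # Leave ASCII alone; int()/float() will validate it later.
            out.append(ch)
            continue
        if ch.isspace():
            # Non-ASCII whitespace becomes a plain space.
            out.append(" ")
            continue
        digit = unicodedata.decimal(ch, None)
        if digit is None:
            # Fail early: nothing after this point can make the input valid.
            raise ValueError(f"invalid character {ch!r} at position {index}")
        out.append(str(digit))
    return "".join(out)
```

In the loop above, a line of non-ASCII text that is clearly not a number is
rejected at its first offending character instead of being copied in full
first.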

> Perhaps a compromise option is available, that does a preliminary pass on the
> Unicode string and only makes a copy if non-European digits are discovered.

Hmm.  That would require changing the signature of
PyUnicode_TransformDecimalToASCII() to take a PyObject* instead of the
buffer.  I knew we shouldn't have rushed to make it public.  We can
still do it in the boilerplate of longobject.c and friends.
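
For illustration, the preliminary-pass compromise might look like this in
Python.  This is a sketch only: normalize_digits() is a hypothetical helper,
and "no non-European digits" is approximated by a cheap all-ASCII check
(str.isascii(), Python 3.7+).  Returning the original object is what taking
a PyObject* rather than a buffer would make possible at the C level.

```python
import unicodedata


def normalize_digits(text: str) -> str:
    """Return text itself when it is pure ASCII; otherwise build a converted copy."""
    if text.isascii():
        # Common case: no alternative digits possible, so no copy is made.
        return text
    out = []
    for ch in text:
        digit = unicodedata.decimal(ch, None)
        # Replace decimal digits by their ASCII form; keep everything else.
        out.append(ch if digit is None else str(digit))
    return "".join(out)
```

Callers can compare the result with the input by identity to tell whether a
copy was made.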

History
Date	User	Action	Args
2010-12-04 21:03:19	belopolsky	set	recipients: + belopolsky, lemburg, mark.dickinson, eric.smith, ezio.melotti, skrah
2010-12-04 21:03:18	belopolsky	link	issue10557 messages
2010-12-04 21:03:17	belopolsky	create