Message 128776 - Python tracker

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

Author	vstinner
Recipients	eric.smith, ezio.melotti, lemburg, mark.dickinson, ron_adam, vstinner, ysj.ray
Date	2011-02-18.07:36:10
SpamBayes Score	2.220446e-16
Marked as misclassified	No
Message-id	<1298014565.16462.7.camel@marge>
In-reply-to	<1297951632.53.0.32370830023.issue7330@psf.upfronthosting.co.za>

Content
> > It looks like your patch fixes #10829: you should add tests for that, you can just reuse the tests of my patch (attached to #10829). > > Sorry, but I think my patch doesn't fix #10829. Ah ok, so don't add failing tests :-) > > You should also avoid the creation of a temporary unicode object (it can be slow if precision is large) using PySequence_GetSlice(). Py_UNICODE_COPY() does already truncate the string because you can pass an arbitrary length. > > In order to use Py_UNICODE_COPY, I have to create a unicode object with required length first. No you don't. You can copy a substring of the input string with Py_UNICODE_COPY: just pass a smaller length. > > With your patch, "%.200s" truncates the input string to 200 characters, but I think that it should truncate to 200 bytes, as printf does. > > Sorry, I don't understand. The result of PyUnicode_FromFormatV() is a unicode object. Then how to truncate to 200 bytes? You can truncate the input char* on the call to PyUnicode_DecodeUTF8: pass a size smaller than strlen(s). case 's': { /* UTF-8 / const char s = va_arg(count, const char); PyObject str = PyUnicode_DecodeUTF8(s, strlen(s), "replace"); if (!str) goto fail; n += PyUnicode_GET_SIZE(str); /* Remember the str and switch to the next slot / callresult++ = str; break; } I don't know if we should truncate to a number of bytes, or a number of characters. > > I don't like this change because I hate having to compute manually strings length. It should that it would be easier if you format directly strings with width and precision at step 3, instead of doing it at step 4: so you can just read the length of the formatted string, and it avoids having to handle width/precision in two steps (which may be inconsistent :-/). > > Do you mean combine step 3 and step 4 together? Currently step 3 is just to compute the biggest width value and step 4 is to compute exact width and do the real format work. Only by doing real format we can get the exact width of a string. So I have to compute each width twice in both step 3 and step 4. Is combining the two steps in to one a good idea? "Do you mean combine step 3 and step 4 together?" Yes, but I am no more sure that it is the right thing to do. > > Can you add tests for "%.s"? I would like to know if "%.s" is different than "%s" :-) > > Err, '%.s' causes unexpected result both with and without my patch. Maybe it's still another bug? If the fix (always have the same behaviour) is short, it would be nice to include it in your patch.

> > It looks like your patch fixes #10829: you should add tests for that, you can just reuse the tests of my patch (attached to #10829).
> 
> Sorry, but I think my patch doesn't fix #10829.

Ah ok, so don't add failing tests :-)

> > You should also avoid the creation of a temporary unicode object (it can be slow if precision is large) using PySequence_GetSlice(). Py_UNICODE_COPY() does already truncate the string because you can pass an arbitrary length.
> 
> In order to use Py_UNICODE_COPY, I have to create a unicode object with required length first.

No you don't. You can copy a substring of the input string with
Py_UNICODE_COPY: just pass a smaller length.

> > With your patch, "%.200s" truncates the input string to 200 *characters*, but I think that it should truncate to 200 *bytes*, as printf does.
> 
> Sorry, I don't understand. The result of PyUnicode_FromFormatV() is a unicode object. Then how to truncate to 200 *bytes*?

You can truncate the input char* on the call to PyUnicode_DecodeUTF8:
pass a size smaller than strlen(s).

case 's':
{
    /* UTF-8 */
    const char *s = va_arg(count, const char*);
    PyObject *str = PyUnicode_DecodeUTF8(s, strlen(s), "replace");
    if (!str)
        goto fail;
    n += PyUnicode_GET_SIZE(str);
    /* Remember the str and switch to the next slot */
    *callresult++ = str;
    break;
}

I don't know if we should truncate to a number of bytes, or a number of
characters.

> > I don't like this change because I hate having to compute manually strings length. It should that it would be easier if you format directly strings with width and precision at step 3, instead of doing it at step 4: so you can just read the length of the formatted string, and it avoids having to handle width/precision in two steps (which may be inconsistent :-/).
> 
> Do you mean combine step 3 and step 4 together? Currently step 3 is just to compute the biggest width value and step 4 is to compute exact width and do the real format work. Only by doing real format we can get the exact width of a string. So I have to compute each width twice in both step 3 and step 4. Is combining the two steps in to one a good idea?

"Do you mean combine step 3 and step 4 together?"

Yes, but I am no more sure that it is the right thing to do.

> > Can you add tests for "%.s"? I would like to know if "%.s" is different than "%s" :-)
> 
> Err, '%.s' causes unexpected result both with and without my patch. Maybe it's still another bug?

If the fix (always have the same behaviour) is short, it would be nice
to include it in your patch.

History
Date	User	Action	Args
2011-02-18 07:36:12	vstinner	set	recipients: + vstinner, lemburg, mark.dickinson, eric.smith, ron_adam, ezio.melotti, ysj.ray
2011-02-18 07:36:11	vstinner	link	issue7330 messages
2011-02-18 07:36:10	vstinner	create