Message 135368 - Python tracker

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

Author	Ben.Wolfson
Recipients	Ben.Wolfson, eric.araujo, eric.smith, mark.dickinson
Date	2011-05-06.20:40:15
SpamBayes Score	1.110223e-16
Marked as misclassified	No
Message-id	<1304714416.55.0.30233037943.issue12014@psf.upfronthosting.co.za>
In-reply-to

Content
Here's my use case. I'm writing a python version of the ruby library HighLine for CLI interaction, to be called, uncreatively, PyLine. One of the moderately neat things about the library is that it allows for color information to be embedded in the strings one passes to its methods, so, if h is a HighLine object, you could say: h.say "<%= color('this will be red', :red) %> but this won't be" So I wanted to be able to provide some kind of similar facility and realized that the __getitem__ method supported by format(), along with some __getattribute__ trickery, would work: so if p is a PyLine object, you could say: p.say("{colors.red.bold.on_black[this will be bold with red text on a black background]} but this will be just be regular text") Thus: >>> effectize_string("{colors.red.bold.on_black[this will be bold with red text on a black background]} but this will just be regular text") '\x1b[31m\x1b[1m\x1b[40mthis will be bold with red text on a black background\x1b[0m but this will just be regular text\x1b[0m' Obviously, I'll already have to watch out for stray "]"s in the string passed to the object's __getitem__, so you might think, well, it's not much more work to also have to watch out for stray ":", "!", "}", and "{" (but, oddly I won't need to watch out for match "{" and "}"!). But it's obvious that something here should change. For one thing, as it stands, the documentation is wrong; it is not the case that an index_string can contain any character except ']'. But the documentation describes the way things rationally ought to be; there's a good reason not to allow a ']' in the index_string (and one can see why simplicity suggests not allowing for escapes, though I think that ideally there would be an escaping mechanism). But there's no reason not to allow stray "{", "}", ":", and "!" in the index_string. The only reason it's true at this point that "it doesn't know they're in an index field when it's doing the parsing for ':' or '!'" is that (assuming one takes the grammar in the documentation to be accurate) the parser is written incorrectly. It contains, for instance, incorrect comments (in string_format.h:parse_field): <code> /* Search for the field name. it's terminated by the end of the string, or a ':' or '!' / field_name->ptr = str->ptr; while (str->ptr < str->end) { switch (c = (str->ptr++)) { case ':': case '!': break; default: continue; } break; } </code> (hopefully <code> does the right thing here...) That's the culprit for the mishandling of ":" and "!", but it is simply not the case---again, according to the grammar given in the documentation---that the field name can be delimited this way, in two ways.* And, given that no nested expansion is done in the field_name part of the replacement, there's no real reason to retain the present parsing strategy; none of !, :, {, or } has any semantic significance in this part of of the replacement string, so why should the parsing code treat them specially? Surely, even if you think my use case is not so great, there's value in doing it right. The ":" and "!" problem is not super hard to get around. Witness the following dirty hack: <code> void advance_beyond_field(SubString str) { if (str->ptr > str->end) return; switch (++str->ptr) { case '[': while(str->ptr < str->end && (str->ptr) != ']') str->ptr++; advance_beyond_field(str); break; case '.': while(str->ptr < str->end) switch(++str->ptr) { case ':': case '!': str->ptr--; return; case '[': advance_beyond_field(str); str->ptr--; break; default: continue; } break; default: return; } } </code> Followed by replacing the switch statement as above thus: <code> switch (c = (str->ptr++)) { case '.': case '[': str->ptr -= 2; advance_beyond_field(str); continue; case ':': case '!': break; default: continue; } </code> Of course, there is already in the FieldNameIterator plumbing a more certain mechanism for actually getting the fields out. Then one can do this: >>> "{0[:]}".format({":":4}) '4' >>> "{0[{ : ! }]}".format({"{ : ! }":4}) '4' (One can also pass such formatting-exercising test suites as test_nntplib, test_string, and test_collections.) Though still not this: >>> "{0[{]}".format({"{":4}) Traceback (most recent call last): File "<stdin>", line 1, in <module> ValueError: unmatched '{' in format Even though the stray "{" in the square brackets has no semantic significance, it still gets picked up; the culprit is apparently in MarkupIterator_next, whose initial bracket-detecting while loop is not square-bracket aware. One couldn't just add a "case '[':" to its switch statement because '[' could be a fill character, but since a fill character can only occur after a ':', and a field_name can't* occur after a ':' or a '!' a flag for whether a '[' is significant could presumably get around that. Something like (not even remotely tested): <code> case '[': if (bracket_significant) while(self->str.ptr < self->str.end && self->str.ptr != ']') { self->str.ptr++; } continue; case ':': case '!': bracket_significant = 0; </code> bracket_significant having been initialized to 1. If something like the above works, then it seems to me that it would take a very small benefit to outweigh the effort necessary to do this right. the second way I don't really care about: the grammar identifies an "attribute_name" as an identifier, but in fact any string will work and will be passed to getattr: >>> "{0.4}".format(2) Traceback (most recent call last): File "<stdin>", line 1, in <module> AttributeError: 'int' object has no attribute '4' >>> "{0.-}".format(2) Traceback (most recent call last): File "<stdin>", line 1, in <module> AttributeError: 'int' object has no attribute '-' Neither "4" nor "-" is a valid Python identifier, so the actual parsing code disagrees with the documented grammar here as well. Here I think it would be better just to change the documented grammar.

Here's my use case.

I'm writing a python version of the ruby library HighLine for CLI interaction, to be called, uncreatively, PyLine. One of the moderately neat things about the library is that it allows for color information to be embedded in the strings one passes to its methods, so, if h is a HighLine object, you could say:

h.say "<%= color('this will be red', :red) %> but this won't be"

So I wanted to be able to provide some kind of similar facility and realized that the __getitem__ method supported by format(), along with some __getattribute__ trickery, would work: so if p is a PyLine object, you could say:

p.say("{colors.red.bold.on_black[this will be bold with red text on a black background]} but this will be just be regular text")

Thus:

>>> effectize_string("{colors.red.bold.on_black[this will be bold with red text on a black background]} but this will just be regular text")
'\x1b[31m\x1b[1m\x1b[40mthis will be bold with red text on a black background\x1b[0m but this will just be regular text\x1b[0m'

Obviously, I'll already have to watch out for stray "]"s in the string passed to the object's __getitem__, so you might think, well, it's not much more work to also have to watch out for stray ":", "!", "}", and "{" (but, oddly I won't need to watch out for *match* "{" and "}"!).

But it's obvious that something here should change. For one thing, as it stands, the documentation is wrong; it is not the case that an index_string can contain any character except ']'. But the documentation describes the way things rationally ought to be; there's a good reason not to allow a ']' in the index_string (and one can see why simplicity suggests not allowing for *escapes*, though I think that ideally there would be an escaping mechanism). But there's no reason not to allow stray "{", "}", ":", and "!" in the index_string. The only reason it's true at this point that "it doesn't know they're in an index field when it's doing the parsing for ':' or '!'" is that (assuming one takes the grammar in the documentation to be accurate) the parser is written incorrectly.

It contains, for instance, incorrect comments (in string_format.h:parse_field):
<code>
    /* Search for the field name.  it's terminated by the end of
       the string, or a ':' or '!' */
    field_name->ptr = str->ptr;
    while (str->ptr < str->end) {
        switch (c = *(str->ptr++)) {
        case ':':
        case '!':
            break;
        default:
            continue;
        }
        break;
    }
</code>

(hopefully &lt;code&gt; does the right thing here...)

That's the culprit for the mishandling of ":" and "!", but it is simply not the case---again, according to the grammar given in the documentation---that the field name can be delimited this way, in two ways.*

And, given that no nested expansion is done in the field_name part of the replacement, there's no real reason to retain the present parsing strategy; none of !, :, {, or } has any semantic significance in this part of of the replacement string, so why should the parsing code treat them specially? Surely, even if you think my use case is not so great, there's value in doing it right.

The ":" and "!" problem is not super hard to get around. Witness the following dirty hack:

<code>
void 
advance_beyond_field(SubString *str)
{
    if (str->ptr > str->end) return;
    switch (*++str->ptr) {
    case '[':
        while(str->ptr < str->end && *(str->ptr) != ']') 
            str->ptr++;
        advance_beyond_field(str);
        break;
    case '.':
        while(str->ptr < str->end)
            switch(*++str->ptr) {
            case ':':
            case '!':
                str->ptr--;
                return;
            case '[':
                advance_beyond_field(str);
                str->ptr--;
                break;
            default:
                continue;
            }
        break;
    default:
        return;
    }       
}
</code>
Followed by replacing the switch statement as above thus:
<code>
        switch (c = *(str->ptr++)) {
        case '.':
        case '[':
            str->ptr -= 2;
            advance_beyond_field(str);
            continue;
        case ':':
        case '!':
            break;
        default:
            continue;
        }
</code>
Of course, there is already in the FieldNameIterator plumbing a more certain mechanism for actually getting the fields out.

Then one can do this:

>>> "{0[:]}".format({":":4})
'4'
>>> "{0[{ : ! }]}".format({"{ : ! }":4})
'4'

(One can also pass such formatting-exercising test suites as test_nntplib, test_string, and test_collections.)

Though still not this:

>>> "{0[{]}".format({"{":4})
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: unmatched '{' in format

Even though the stray "{" in the square brackets has no semantic significance, it still gets picked up; the culprit is apparently in MarkupIterator_next, whose initial bracket-detecting while loop is not square-bracket aware. One couldn't just add a "case '[':" to its switch statement because '[' could be a fill character, but since a fill character can only occur after a ':', and a field_name *can't* occur after a ':' or a '!' a flag for whether a '[' is significant could presumably get around that. Something like (not even remotely tested):
<code>
        case '[':
            if (bracket_significant)
                while(self->str.ptr < self->str.end && *self->str.ptr != ']') {
                    self->str.ptr++;
                }
            continue;
        case ':':
        case '!':
            bracket_significant = 0;
</code>
bracket_significant having been initialized to 1.

If something like the above works, then it seems to me that it would take a very small benefit to outweigh the effort necessary to do this right.

* the second way I don't really care about: the grammar identifies an "attribute_name" as an identifier, but in fact any string will work and will be passed to getattr: 
>>> "{0.4}".format(2)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'int' object has no attribute '4'
>>> "{0.-}".format(2)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'int' object has no attribute '-'
Neither "4" nor "-" is a valid Python identifier, so the actual parsing code disagrees with the documented grammar here as well. Here I think it would be better just to change the documented grammar.

History
Date	User	Action	Args
2011-05-06 20:40:16	Ben.Wolfson	set	recipients: + Ben.Wolfson, mark.dickinson, eric.smith, eric.araujo
2011-05-06 20:40:16	Ben.Wolfson	set	messageid: <1304714416.55.0.30233037943.issue12014@psf.upfronthosting.co.za>
2011-05-06 20:40:16	Ben.Wolfson	link	issue12014 messages
2011-05-06 20:40:15	Ben.Wolfson	create