Issue 12014: str.format parses replacement field incorrectly

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python Developer's Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/56223

classification

Title:	str.format parses replacement field incorrectly
Type:	behavior	Stage:	patch review
Components:	Interpreter Core	Versions:	Python 3.2, Python 3.3, Python 3.4, Python 2.7

process

Status:	closed	Resolution:	fixed
Dependencies:		Superseder:
Assigned To:	eric.smith	Nosy List:	Ben.Wolfson, barry, belopolsky, benjamin.peterson, eric.araujo, eric.smith, flox, mark.dickinson, ncoghlan, petri.lehtinen, r.david.murray, rhettinger
Priority:	normal	Keywords:	needs review, patch

Created on 2011-05-06 01:45 by Ben.Wolfson, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Files
File name	Uploaded	Description	Edit
strformat.diff	Ben.Wolfson, 2011-05-10 22:07		review
strformat-as-documented.diff	Ben.Wolfson, 2011-07-07 00:04	bring behavior of str.format into line with documentation	review
strformat-just-identifiers-please.diff	Ben.Wolfson, 2011-07-07 00:05	use only identifiers or integers in the field_name part of a replacement field.	review
strformat-no-braces.diff	Ben.Wolfson, 2012-05-25 00:05	patch for current codebase/no braces in index string	review

Messages (45)
msg135258 - (view)	Author: Ben Wolfson (Ben.Wolfson)	Date: 2011-05-06 01:45
As near as I can make out from <http://docs.python.org/library/string.html#formatstrings>, the following should return the string "hi":

"{0[!]}".format({"!":"hi"})

We have a "{", followed by a field name, followed by a "}", the field name consisting of an arg_name, which is 0, a "[", an element index, and a "]". The element index, which the docs say may be any source character except "]", is here "!". And, according to the docs, "An expression of the form '.name' selects the named attribute using getattr(), while an expression of the form '[index]' does an index lookup using __getitem__()".

However, it doesn't work:

>>> "{0[!]}".format({"!":"hi"})
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: Missing ']' in format string

The same thing happens with other strings that are significant in other places in the string-formatting DSL:

>>> "{0[:]}".format({":":"hi"})
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: Missing ']' in format string

If there are more characters the error message changes:

>>> class spam:
...     def __getitem__(self, k): return "hi"
...
>>> "{0[this works as expected]}".format(spam())
'hi'
>>> "{0[I love spam! it is very tasty.]}".format(spam())
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: expected ':' after format specifier
>>> "{0[.]}".format(spam()) # periods are ok
'hi'
>>> "{0[although curly braces, }, are not square brackets, they also don't work here]}".format(spam())

Left square brackets work fine, though:

>>> "{0[[]}".format(spam())
'hi'

The failure of the expected result with curly braces presumably indicates at least part of the cause of the other failures: namely, that they stem from supporting providing flags to one replacement field using another, as in "{1:<{0}}". Which is quite useful. But it obviously isn't universally supported in the case of index fields anyway:

>>> "{0[recursive {1[spam]}]}".format(spam(), spam())
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: Only '.' or '[' may follow ']' in format field specifier

(Note that this is a very strange error message itself, as is the following, but since one isn't, according to the grammar, allowed to include a "]" where I've got one anyway, perhaps that's to be expected:

>>> "{0[recursive {1[spam].lower} ]}".format(spam(), spam())
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'str' object has no attribute 'lower} ]'

)

But, even if that would explain why one can't use a "{" in the index field, it wouldn't explain why one can't use a "!" or ":", since if those aren't already part of a replacement field, as indicated by some initial "{", they couldn't have the significance that they do when they are part of that field.
msg135267 - (view)	Author: Eric V. Smith (eric.smith) *	Date: 2011-05-06 08:47
I haven't had time to completely review this, I will do so later today. But let me just say that the string is first parsed for replacement strings inside curly braces. There's no issue with that, here. Next, the string is parsed for conversion and format_spec, looking for "!" and ":" respectively. In your first example that gives:

field_name: '0['
conversion: ']'

It then tries to parse the field_name and gives you the first error.
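
A minimal Python sketch of the split described in this message, for illustration only. It models the behavior as described here, not the C code itself, and `naive_split` is a hypothetical helper name:

```python
def naive_split(markup):
    """Split the text between the braces at the first ':' or '!'.

    Hypothetical model of the behavior described above: the scan knows
    nothing about square brackets, so a '!' or ':' inside an index ends
    the field name early.
    """
    for i, ch in enumerate(markup):
        if ch in ":!":
            return markup[:i], ch, markup[i + 1:]
    return markup, "", ""


# "0[!]" is what sits between the braces in "{0[!]}".
print(naive_split("0[!]"))  # field name is cut off at the '!'
print(naive_split("0[a]"))  # no ':' or '!', so the field name survives intact
```

With the field name cut down to '0[', the later index parse has no closing bracket to find, which matches the "Missing ']'" error in the report.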
msg135299 - (view)	Author: Ben Wolfson (Ben.Wolfson)	Date: 2011-05-06 14:41
The semantics the docs suggest for index fields (namely that whatever is in the index field is just passed to getitem) do seem to be right, no other processing is done here, for instance:

>>> d = {"{0}":"hi"}
>>> "{0[{0}]}".format(d)
'hi'
>>> import string
>>> list(string.Formatter().parse("{0[{0}]}"))
[('', '0[{0}]', '', None)]
>>>

Which is what you'd expect, but makes me think that treating "!" and ":" in the index field separately is definitely wrong.
msg135339 - (view)	Author: Eric V. Smith (eric.smith) *	Date: 2011-05-06 17:23
> but makes me think that treating "!" and ":" in the index field separately is definitely wrong. But it doesn't know they're in an index field when it's doing the parsing for ':' or '!'. It might be possible to change this so that the field name is fully parsed first, but I'm not sure the benefit would justify the effort. What's your use case where you need this feature?
msg135344 - (view)	Author: Eric V. Smith (eric.smith) *	Date: 2011-05-06 17:35
Note also that the nested expansion is only allowed in the format_spec part, per the documentation. Your last examples are attempting to do it in the field_name, which leads to the errors you see. Your very last example doesn't look right to me. I'll have to investigate why it's giving you that error message.
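
A short illustration of nested expansion inside the format_spec, the one place the documentation allows it. The variable names are only for the example:

```python
# The inner field supplies the width for the outer field's format_spec.
aligned = "{0:>{1}}".format("hi", 6)
print(repr(aligned))

# Same idea, using the form quoted earlier in the thread: "{1:<{0}}".
padded = "{1:<{0}}|".format(4, "ab")
print(repr(padded))
```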
msg135355 - (view)	Author: Ben Wolfson (Ben.Wolfson)	Date: 2011-05-06 18:21
My last examples were actually just attempting to figure out what triggered the unexpected behavior. I don't want to do expansion inside the field_name part! (I'll have a reply to your previous comment about use-cases shortly.)
msg135368 - (view)	Author: Ben Wolfson (Ben.Wolfson)	Date: 2011-05-06 20:40
Here's my use case. I'm writing a Python version of the Ruby library HighLine for CLI interaction, to be called, uncreatively, PyLine. One of the moderately neat things about the library is that it allows for color information to be embedded in the strings one passes to its methods, so, if h is a HighLine object, you could say:

h.say "<%= color('this will be red', :red) %> but this won't be"

So I wanted to be able to provide some kind of similar facility and realized that the __getitem__ method supported by format(), along with some __getattribute__ trickery, would work: so if p is a PyLine object, you could say:

p.say("{colors.red.bold.on_black[this will be bold with red text on a black background]} but this will just be regular text")

Thus:

>>> effectize_string("{colors.red.bold.on_black[this will be bold with red text on a black background]} but this will just be regular text")
'\x1b[31m\x1b[1m\x1b[40mthis will be bold with red text on a black background\x1b[0m but this will just be regular text\x1b[0m'

Obviously, I'll already have to watch out for stray "]"s in the string passed to the object's __getitem__, so you might think, well, it's not much more work to also have to watch out for stray ":", "!", "}", and "{" (but, oddly, I won't need to watch out for matched "{" and "}"!).

But it's obvious that something here should change. For one thing, as it stands, the documentation is wrong; it is not the case that an index_string can contain any character except ']'. But the documentation describes the way things rationally ought to be; there's a good reason not to allow a ']' in the index_string (and one can see why simplicity suggests not allowing for escapes, though I think that ideally there would be an escaping mechanism). But there's no reason not to allow stray "{", "}", ":", and "!" in the index_string.

The only reason it's true at this point that "it doesn't know they're in an index field when it's doing the parsing for ':' or '!'" is that (assuming one takes the grammar in the documentation to be accurate) the parser is written incorrectly. It contains, for instance, incorrect comments (in string_format.h:parse_field):

<code>
    /* Search for the field name.  it's terminated by the end of
       the string, or a ':' or '!' */
    field_name->ptr = str->ptr;
    while (str->ptr < str->end) {
        switch (c = *(str->ptr++)) {
        case ':':
        case '!':
            break;
        default:
            continue;
        }
        break;
    }
</code>

(hopefully <code> does the right thing here...)

That's the culprit for the mishandling of ":" and "!", but it is simply not the case---again, according to the grammar given in the documentation---that the field name can be delimited this way, in two ways.* And, given that no nested expansion is done in the field_name part of the replacement, there's no real reason to retain the present parsing strategy; none of !, :, {, or } has any semantic significance in this part of the replacement string, so why should the parsing code treat them specially? Surely, even if you think my use case is not so great, there's value in doing it right.

The ":" and "!" problem is not super hard to get around. Witness the following dirty hack:

<code>
void
advance_beyond_field(SubString *str)
{
    if (str->ptr > str->end) return;
    switch (*++str->ptr) {
    case '[':
        while (str->ptr < str->end && *(str->ptr) != ']') str->ptr++;
        advance_beyond_field(str);
        break;
    case '.':
        while (str->ptr < str->end)
            switch (*++str->ptr) {
            case ':': case '!':
                str->ptr--;
                return;
            case '[':
                advance_beyond_field(str);
                str->ptr--;
                break;
            default:
                continue;
            }
        break;
    default:
        return;
    }
}
</code>

Followed by replacing the switch statement as above thus:

<code>
        switch (c = *(str->ptr++)) {
        case '.': case '[':
            str->ptr -= 2;
            advance_beyond_field(str);
            continue;
        case ':':
        case '!':
            break;
        default:
            continue;
        }
</code>

Of course, there is already in the FieldNameIterator plumbing a more certain mechanism for actually getting the fields out.

Then one can do this:

>>> "{0[:]}".format({":":4})
'4'
>>> "{0[{ : ! }]}".format({"{ : ! }":4})
'4'

(One can also pass such formatting-exercising test suites as test_nntplib, test_string, and test_collections.)

Though still not this:

>>> "{0[{]}".format({"{":4})
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: unmatched '{' in format

Even though the stray "{" in the square brackets has no semantic significance, it still gets picked up; the culprit is apparently in MarkupIterator_next, whose initial bracket-detecting while loop is not square-bracket aware. One couldn't just add a "case '[':" to its switch statement because '[' could be a fill character, but since a fill character can only occur after a ':', and a field_name can't* occur after a ':' or a '!' a flag for whether a '[' is significant could presumably get around that. Something like (not even remotely tested):

<code>
        case '[':
            if (bracket_significant)
                while (self->str.ptr < self->str.end && *self->str.ptr != ']') {
                    self->str.ptr++;
                }
            continue;
        case ':': case '!':
            bracket_significant = 0;
</code>

bracket_significant having been initialized to 1.

If something like the above works, then it seems to me that it would take a very small benefit to outweigh the effort necessary to do this right.

* the second way I don't really care about: the grammar identifies an "attribute_name" as an identifier, but in fact any string will work and will be passed to getattr:

>>> "{0.4}".format(2)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'int' object has no attribute '4'
>>> "{0.-}".format(2)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'int' object has no attribute '-'

Neither "4" nor "-" is a valid Python identifier, so the actual parsing code disagrees with the documented grammar here as well. Here I think it would be better just to change the documented grammar.
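
The use case in this message can be sketched roughly as follows. This is not PyLine's code: the `Colors` class, its `CODES` table, and the choice of `__getattr__` (rather than the `__getattribute__` trickery mentioned above) are all assumptions made for illustration. The index text avoids ":", "!" and braces so the example does not depend on how a given interpreter version parses them.

```python
class Colors:
    """Hypothetical stand-in for the `colors` object in the example above."""

    # Assumed ANSI SGR codes, matching the escape sequences shown above.
    CODES = {"red": "31", "bold": "1", "on_black": "40"}
    RESET = "\x1b[0m"

    def __init__(self, prefix=""):
        self._prefix = prefix

    def __getattr__(self, name):
        # Only called for names not found normally; each attribute access
        # in the field name appends one escape sequence.
        try:
            code = self.CODES[name]
        except KeyError:
            raise AttributeError(name) from None
        return Colors(self._prefix + "\x1b[" + code + "m")

    def __getitem__(self, text):
        # str.format passes the raw text between '[' and ']' as the key.
        return self._prefix + text + self.RESET


out = "{colors.red.bold.on_black[bold red on black]} plain".format(colors=Colors())
print(repr(out))
```

Because the text to be colored travels through the index string, any character the format parser treats specially inside the brackets limits what can be written there, which is the motivation given above.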
msg135747 - (view)	Author: Ben Wolfson (Ben.Wolfson)	Date: 2011-05-10 22:07
Actually, that's the wrong place in MarkupIterator_next to include that loop. The attached diff has it in the right place.

The results of "make test" here are:

328 tests OK.
1 test failed:
    test_unicode
25 tests skipped:
    test_codecmaps_cn test_codecmaps_hk test_codecmaps_jp
    test_codecmaps_kr test_codecmaps_tw test_curses test_dbm_gnu
    test_epoll test_gdb test_largefile test_msilib test_ossaudiodev
    test_readline test_smtpnet test_socketserver test_startfile
    test_timeout test_tk test_ttk_guionly test_urllib2net
    test_urllibnet test_winreg test_winsound test_xmlrpc_net
    test_zipfile64
1 skip unexpected on darwin:
    test_readline
make: [test] Error 1 (ignored)

test_unicode fails because it expects "{0[}".format() to raise an IndexError; instead, it raises a ValueError ("unmatched '{' in format") because it interprets the "}" as an index. This can be avoided by changing the line

while (self->str.ptr < self->str.end && *self->str.ptr != ']') {

to

while (self->str.ptr < self->str.end-1 && *self->str.ptr != ']') {

In which case the test passes as is, or, obviously, by changing the expected exception in test_unicode.py.
msg137523 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2011-06-03 15:13
The documentation is, in principle, wrong. The actual authority for the "correct" implementation is PEP3101, which says the following: The str.format() function will have a minimalist parser which only attempts to figure out when it is "done" with an identifier (by finding a '.' or a ']', or '}', etc.). Changing that specification would require a discussion on python-dev.
msg137524 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2011-06-03 15:14
Note that the PEP also explicitly addresses your concern about getattr, as well (validation of the name is delegated to the object's __getattr__).
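
A small illustration of that delegation: the attribute name is handed to the object as-is, even when it is not a valid identifier. `Echo` is a hypothetical class written for this example:

```python
class Echo:
    """Hypothetical object whose __getattr__ accepts any attribute name."""

    def __getattr__(self, name):
        return "attr:" + name


# Neither "4" nor "-" is a valid identifier, but both reach __getattr__.
digit_attr = "{0.4}".format(Echo())
dash_attr = "{0.-}".format(Echo())
print(digit_attr, dash_attr)
```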
msg137560 - (view)	Author: Ben Wolfson (Ben.Wolfson)	Date: 2011-06-03 17:08
Hm. As I interpret this:

    The str.format() function will have a minimalist parser which only attempts to figure out when it is "done" with an identifier (by finding a '.' or a ']', or '}', etc.).

The present implementation is at variance with both the documentation and the PEP, since the present implementation does not in fact figure out when it's "done" with an identifier that way. However, this statement is actually a very thin reed on which to make any decisions: a real authority shouldn't say "etc." like that! And, of course, we have to add an implicit "depending on what it's currently looking at" to the parenthetical, because the two strings "{0[a.b]}" and "{0[a].b}" are, and should be, treated differently. In particular, although one could "find" a '.' in the element_index in the former string, the "minimalist parser" should not (and does not) conclude that it's done with the identifier there:

>>> "{0[a.b]}".format({"a.b":1})
'1'

Instead it treats the '.' as just another character with no particular syntactic significance, the same way it does 'a' and 'b'. It's a shame that the PEP doesn't go into more detail than it does about this sort of thing.

The same should go for '}', when we're looking at an element_index field. It should be treated as just another character with no particular syntactic significance. At present that is not the case:

>>> "{0[a}b]}".format({"a}b":1})
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: Missing ']' in format string

If the attached patch were used, the above expression would evaluate to '1' just as did the first one.

Now, given the fact that the PEP actually says quite little about how this sort of thing is to be handled, and given (as demonstrated above with the case of the '.' character) that we can't take the little list it gives as indicating when it's done with an identifier regardless of context, I don't think this change would constitute a change to the specification; it does, admittedly, constitute an interpretation of the specification, but then, so does the present implementation, and the present implementation is at variance with the PEP anyway, as regards the characters ':' and '!'.

The paragraph prior to the one quoted by R. David Murray reads:

    Because keys are not quote-delimited, it is not possible to specify arbitrary dictionary keys (e.g., the strings "10" or ":-]") from within a format string.

I take it that this means (in the first place) that, because a sequence of digits is interpreted as a number, the following will fail:

'{0[10]}'.format({"10":4})

And indeed it does.

The second example is rather unfortunate, though: is the reason one can't use that key because it contains a colon? Or because it contains a right square bracket? Even if the present patch is accepted one couldn't use a right square bracket, since a parser that could figure out where to draw the lines in something like this:

'{0[foo ] bar]}'

would not be very minimalist. However, as I have noted previously, there is no reason to rule out colons and exclamation points in the element_index field.

The PEP doesn't actually take up this question in detail. (It hardly does so at all.) However, according to what I think the most reasonable interpretation of the PEP is, the present implementation is at variance with the PEP. The present implementation is certainly at variance with the documentation, which represents to some extent an interpretation and specification of the PEP. Consequently, to the extent that changing a specification requires discussion on python-dev, it seems to me that the present implementation is already a de facto change to the specification, while accepting the attached patch would bring the implementation into greater accord with the specification---so that (to conclude cheekily) not accepting the patch is what should require discussion on python-dev. However, if it is thought necessary, I'll be happy to start the discussion.
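
The two uncontested points in this message (an all-digit index is looked up as an integer, and a '.' inside brackets is an ordinary character) can be checked with a short script. The variable names are only for the example:

```python
# An all-digit index string is converted to an int before the lookup.
by_int = "{0[10]}".format({10: "int key"})
print(by_int)

# A dict keyed by the string "10" therefore cannot be reached this way.
try:
    "{0[10]}".format({"10": 4})
    reached = True
except KeyError:
    reached = False
print(reached)

# A '.' inside the brackets is just part of the key.
dotted = "{0[a.b]}".format({"a.b": 1})
print(dotted)
```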
msg137567 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2011-06-03 17:41
I agree that the current situation is a bit murky and ought to be clarified, but I'm going to leave it to Eric to point the way forward, as he is far more knowledgeable about this area than I.
msg137568 - (view)	Author: Petri Lehtinen (petri.lehtinen) *	Date: 2011-06-03 18:25
I've played around with the str.format() code for a few weeks now, to investigate its poor performance compared to the % operator. Having written a few parsers before, I would change it to parse each part separately:

1. field_name
2a. if followed by '[': element_index (anything until ']')
2b. elif followed by '.': attribute_name
3. if followed by '!': conversion
4. if followed by ':': format_spec (anything until '}')

It seems to me that the documentation also suggests this behavior, and that this bug report is correct.

When it comes to parsing identifiers, it seems to me that stopping at '.', ']', and '}' is not enough. In field_name, '[', ':' and '!' would also be needed, and ':' and '!' in attribute_name. It's a shame that PEP3101 is so vague on this subject.
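
The part-by-part strategy listed above could look roughly like this in Python. This is a sketch of the idea only, not the CPython parser or any of the attached patches; `parse_field`, its return shape, and its error messages are all hypothetical:

```python
def parse_field(markup):
    """Parse the text between the braces of one replacement field.

    Returns (arg_name, accessors, conversion, format_spec), where
    accessors is a list of ("index", text) / ("attr", text) pairs.
    """
    i, n = 0, len(markup)

    # 1. arg_name: everything up to '.', '[', '!' or ':'
    start = i
    while i < n and markup[i] not in ".[!:":
        i += 1
    arg_name = markup[start:i]

    # 2. any number of [element_index] / .attribute_name accessors
    accessors = []
    while i < n and markup[i] in ".[":
        if markup[i] == "[":
            end = markup.find("]", i + 1)  # anything until ']'
            if end == -1:
                raise ValueError("missing ']' in replacement field")
            accessors.append(("index", markup[i + 1:end]))
            i = end + 1
        else:
            start = i = i + 1
            while i < n and markup[i] not in ".[!:":
                i += 1
            accessors.append(("attr", markup[start:i]))

    # 3. optional conversion, a single character after '!'
    conversion = None
    if i < n and markup[i] == "!":
        if i + 1 >= n:
            raise ValueError("missing conversion character")
        conversion = markup[i + 1]
        i += 2

    # 4. optional format_spec: everything after ':'
    format_spec = ""
    if i < n:
        if markup[i] != ":":
            raise ValueError("unexpected text in replacement field")
        format_spec = markup[i + 1:]
    return arg_name, accessors, conversion, format_spec


print(parse_field("0[!]!r"))
print(parse_field("0[:].x:>5"))
```

Because the index is consumed before the scan for '!' or ':' begins, those characters are harmless inside the brackets.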
msg137576 - (view)	Author: Eric V. Smith (eric.smith) *	Date: 2011-06-03 19:47
PEP 3101 defines format strings as intermingled character data and markup. Markup defines replacement fields and is delimited by braces. Only after markup is extracted does the PEP talk about interpreting the contents of the markup. So, given "{0[a}b]}" the parser first parses out the character data and the markup. The first piece of markup is "{0[a}". That gives a syntax error because it's missing a right bracket. I realize you'd like the parser to find the markup as the entire string, but that's not how I read the PEP.
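
The reading described in this message, where markup is delimited by braces before its contents are interpreted, can be sketched as a depth-counting scan. This is an assumption-laden model, not the C implementation: `find_markup` is a hypothetical name, '{{' and '}}' escapes are ignored, and the error messages are illustrative only.

```python
def find_markup(fmt):
    """Return the outermost brace-delimited spans of a format string.

    Braces are matched purely by nesting depth; square brackets are
    not taken into account.
    """
    spans, depth, start = [], 0, None
    for i, ch in enumerate(fmt):
        if ch == "{":
            if depth == 0:
                start = i
            depth += 1
        elif ch == "}":
            if depth == 0:
                raise ValueError("unmatched '}' in format string")
            depth -= 1
            if depth == 0:
                spans.append(fmt[start:i + 1])
    if depth:
        raise ValueError("unmatched '{' in format string")
    return spans


print(find_markup("hello, {0}"))
print(find_markup("{1:<{0}}"))

# Under this model the first markup in "{0[a}b]}" ends at the first '}',
# leaving "b]}" behind, so the scan stops with an error.
try:
    find_markup("{0[a}b]}")
except ValueError as exc:
    print("ValueError:", exc)
```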
msg137588 - (view)	Author: Ben Wolfson (Ben.Wolfson)	Date: 2011-06-03 21:08
""" PEP 3101 defines format strings as intermingled character data and markup. Markup defines replacement fields and is delimited by braces. Only after markup is extracted does the PEP talk about interpreting the contents of the markup. So, given "{0[a}b]}" the parser first parses out the character data and the markup. The first piece of markup is "{0[a}". That gives a syntax error because it's missing a right bracket. """ The intermingling of character data and markup is irrelevant; character data is defined as "data which is transferred unchanged from the format string to the output string", and nothing in "{0[a]}" is transferred unchanged. Two parts of the PEP suggest that the markup in the above should be "{0[a}" rather than "{0[a}]}": Brace characters ('curly braces') are used to indicate a replacement field within the string: [...] Braces can be escaped by doubling: and Note that the doubled '}' at the end, which would normally be escaped, is not escaped in this case. The reason is because the '{{' and '}}' syntax for escapes is only applied when used outside of a format field. Within a format field, the brace characters always have their normal meaning. The first statement obviously doesn't mean that the exclusive use of braces in a format string is to indicate replacement fields, since it's immediately acknowledged that sometimes braces can occur without indicating a replacement field, when they're escaped. The second occurs specifically in the context of talking about escaping braces, so the following interpretation remains available: within a format field, a brace is a brace is a brace---that is, a pair of braces is a pair of braces, not an escape for a single brace. In fact, though the following argument may appear Jesuitical, it does, I think, hold water: The second quotation above mentions braces within a format field. What is a format field? 
Well, we know that "The element with the braces is called a 'field'", but "format field" is more specific; the whole thing between braces isn't (necessarily!) the format field. And we know that

    Fields consist of a 'field name', which can either be simple or compound, and an optional 'format specifier'.

So, perhaps a format field is the part of the broader field where the format specifier lives. And lo, it's in the part of the PEP talking about "Format Specifiers" that we get the second quotation above.

    Each field can also specify an optional set of 'format specifiers' which can be used to adjust the format of that field. Format specifiers follow the field name, with a colon (':') character separating the two:

So even if you think that the claim that "within a format field, the brace characters always have their normal meaning" means not "the brace characters aren't escaped" but "the brace characters indicate a replacement field", that statement could just mean that they only have this significance in part of the replacement field---the part having to do with the formatting of the replacement text---and not the whole replacement field. So that, for instance, the following does what you'd expect:

>>> "{0[{4}]}".format({"{4}":3})
'3'

And it does do what you'd expect, in the current implementation---that is, the braces here don't have the meaning of introducing a replacement field [they're kinda-sorta parsed as if they introduced a replacement field but that is obviously not their semantics], but are instead just treated as braces. They also aren't escaped:

>>> "{0[{{4}}]}".format({"{{4}}":3})
'3'
msg137594 - (view)	Author: Eric V. Smith (eric.smith) *	Date: 2011-06-03 21:56
The intermingling of character data and markup is far from irrelevant: that's exactly what str.format() does! I don't see how it can be irrelevant to a discussion of how the string is parsed.

Note that there are no restrictions, in general, on what's in a format specifier. Braces can be in format specifiers, if they make sense for that type. For example:

>>> from datetime import datetime
>>> format(datetime.now(), '{}%Y-%m-%d}{')
'{}2011-06-03}{'

It's definitely true that you can have valid format specifiers that cannot be represented in strings parsed by str.format(). The PEP talks about both format specifiers in the abstract (stand alone) and format specifiers contained in str.format() strings.

The current implementation of str.format() finds matched pairs of braces and calls what's inside "markup", then parses that markup. This indeed restricts what's inside the markup. I believe the implementation is compliant with the PEP.

It's also true that other interpretations of the PEP are possible. I'm just not sure the benefit to be gained justifies changing all of the extant str.format() implementations, in addition to explaining the different behavior.

Many useful features for str.format() were rejected in order to keep the implementation and documentation simple.

I'm not saying change and improvement is impossible. I'm just not convinced it's worthwhile.
msg137601 - (view)	Author: Ben Wolfson (Ben.Wolfson)	Date: 2011-06-03 22:17
str.format doesn't intermingle character data and markup. The PEP is quite clear about the terms in this case, at least: the argument to str.format consists of character data (passed through unchanged) and markup (processed). That's what it means to say that "Character data is data which is transferred unchanged from the format string to the output string". In "My name is {0}", "My name is " is transferred unchanged from the format string to the output string when the string is formatted.

We're talking about how the markup is defined.

"""
The current implementation of str.format() finds matched pairs of braces and calls what's inside "markup", then parses that markup.
"""

This is false, as I demonstrated.

>>> d = {"{0}": "spam"}
>>> # a matched pair of braces. What's inside is considered markup.
...
>>> "{0}".format(d)
"{'{0}': 'spam'}"
>>> # a matched pair of braces. Inside is a matched pair of braces, and what's inside of that is not considered markup.
...
>>> "{0[{0}]}".format(d)
'spam'
>>>

"""
It's also true that other interpretations of the PEP are possible. I'm just not sure the benefit to be gained justifies changing all of the extant str.format() implementations, in addition to explaining the different behavior.
"""

Well, the beauty of it is, you wouldn't have to explain the different behavior, because the patch makes it the case that the explanation already in the documentation is correct. It is currently not correct. That's why I found out about this current state of affairs: I read the documentation's explanation and believed it, and only after digging into the code did I understand the actual behavior. It is also not a difficult change to make, would be backwards-compatible (anyway I rather doubt anyone was relying on a "{0[:]}".format(whatever) raising an exception [1]), and relaxes a restriction that is not well motivated by the text of the PEP, is not consistently applied in the implementation (see above), and is confusing and limits the usefulness of the format method. It is true that I don't know where else, beyond the implementation in string_format.h, modifications would need to be made, but I'd be willing to undertake the task.

[1] and given that the present implementation does that, it's already noncompliant with the PEP, regardless of what one makes of curly braces.
msg137603 - (view)	Author: Eric V. Smith (eric.smith) *	Date: 2011-06-03 22:20
From the PEP: "Format strings consist of intermingled character data and markup."
msg137607 - (view)	Author: Eric V. Smith (eric.smith) *	Date: 2011-06-03 22:52
"""
>>> d = {"{0}": "spam"}
>>> # a matched pair of braces. What's inside is considered markup.
...
>>> "{0}".format(d)
"{'{0}': 'spam'}"
>>> # a matched pair of braces. Inside is a matched pair of braces, and what's inside of that is not considered markup.
"""

I'm not sure what you're getting at. "{0}" (which is indeed "markup") is replaced by str(d), which is "{'{0}': 'spam'}".

"""
...
>>> "{0[{0}]}".format(d)
'spam'
>>>
"""

Again, I'm not sure what you're getting at. The inner "{0}" is not interpreted (per the PEP). So the entire string is replaced by d['{0}'], or 'spam'.

Let me try to explain it again. str.format() parses the string, looking for matched sets of braces. In your last example above, the very first character '{' is matched to the very last character '}'. They match, in the sense that all of the nested ones inside match. Once the markup is separated from the character data, the interpretation of what's inside the markup is then done. In this example, there is no character data.

I apologize if I'm explaining this poorly.
msg137615 - (view)	Author: Ben Wolfson (Ben.Wolfson)	Date: 2011-06-04 00:36
""" From the PEP: "Format strings consist of intermingled character data and markup." """ I know. Here is an example of a format string: "hello, {0}" Here is the character data from that format string: "hello, " Here is the markup: "{0}" This follows directly from the definition of "character data", which I've quoted several times now. In the following expression: "{0}".format(1) there is NO character data, because there is NOTHING which is "which is transferred unchanged from the format string to the output string". The "{0}" doesn't appear in the output string at all. And the 1 isn't transferred unchanged: it has str() called on it. Since there is nothing which meets the definition of character data, there is nothing which is character data in the string, regarded as a format string. It is pure markup---it consists solely of a replacement field delimited by curly braces. I really don't see why this matters at all, but, nevertheless, I apologize if I'm explaining it poorly. """ Again, I'm not sure what you're getting at. The inner "{0}" is not interpreted (per the PEP). So the entire string is replaced by d['{0}'], or 'spam'. Let me try to explain it again. str.format() parses the string, looking for matched sets of braces. In your last example above, the very first character '{' is matched to the very last character '}'. They match, in sense that all of the nested ones inside match. Once the markup is separated from the character data, the interpretation of what's inside the markup is then done. In this example, there is no character data. """ Yes, there is no character data. And I understand perfectly what is happening. Here's the problem: your description of what the implementation does is incorrect. You say that """ The current implementation of str.format() finds matched pairs of braces and call what's inside "markup", then parse that markup. 
""" Now, the only reason for thinking that this: "{0[}]}" should be treated differently from this: "{0[a]}" is that inside square brackets curly brackets indicate replacement fields. If you want to justify what the current implementation does as an implementation of the PEP and an interpretation of what the PEP says, you have to think that. But if you think that, then the current implementation should not treat this: "{0[{0}]}" the way it does, because it does not treat the interior curly braces as indications of a replacement field---or rather, it does at one point in the source (in MarkupIterator_next) and it doesn't at another (in FieldNameIterator). I agree that what the current implementation does in the last example is in fact correct. But if it's correct in the one case, it's incorrect in the other, and vice versa. There is no justification, in terms of the PEP, for the present behavior.
msg137617 - (view)	Author: Eric V. Smith (eric.smith) *	Date: 2011-06-04 01:03
We're going to have to agree to disagree. I believe that "{0[}]}" is the markup "{0[}" followed by the character data "]}".
msg137656 - (view)	Author: Petri Lehtinen (petri.lehtinen) *	Date: 2011-06-04 18:03
> PEP 3101 defines format strings as intermingled character data and markup. Markup defines replacement fields and is delimited by braces. Only after markup is extracted does the PEP talk about interpreting the contents of the markup.
>
> So, given "{0[a}b]}" the parser first parses out the character data and the markup. The first piece of markup is "{0[a}". That gives a syntax error because it's missing a right bracket.
>
> I realize you'd like the parser to find the markup as the entire string, but that's not how I read the PEP.

This is a good point, although the support of further replacement fields inside format_specifiers requires the parser to count matching braces, if the markup is to be extracted before it's interpreted.

But disallowing unmatched '}' inside the replacement field still doesn't explain why this shouldn't work:

'{0[!]!r}'.format({'!': 'foo'})

I'm completely fine with disallowing '}', but it seems to me that there's absolutely no reason not to parse the element_index and later fields correctly with respect to '!' and ':'.
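A small probe for the cases raised here, written so that it runs on interpreters both with and without the fix. The helper name `try_format` is an assumption of this sketch. Per this thread, older interpreters raise ValueError for "!" or ":" inside the brackets, while the closing message suggests 3.4 and later accept them; the probe reports whichever happens rather than assuming one outcome.

```python
def try_format(template, *args, **kwargs):
    """Return ('ok', result) on success or ('error', exception class name)."""
    try:
        return ("ok", template.format(*args, **kwargs))
    except (ValueError, KeyError, IndexError, TypeError) as exc:
        return ("error", type(exc).__name__)


# '!' inside the index, followed by a real conversion specifier
print(try_format("{0[!]!r}", {"!": "foo"}))
# ':' inside the index
print(try_format("{0[:]}", {":": "hi"}))
# '.' inside the index, which the original report says already worked
print(try_format("{0[.]}", {".": "hi"}))
```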
msg139955 - (view)	Author: Ben Wolfson (Ben.Wolfson)	Date: 2011-07-06 22:50
This patch differs from the previous one; its goal is to bring the actual behavior of the interpreter into line with the documentation (with the exception of using only decimal integers, rather than any integers, wherever the documentation for str.format currently has "integer": this does, however, conform with current behavior).
msg139958 - (view)	Author: Ben Wolfson (Ben.Wolfson)	Date: 2011-07-06 22:59
And here is a patch for Greg Ewing's proposal: http://mail.python.org/pipermail/python-dev/2011-June/111934.html

Again, decimal integers rather than any kind of integers are used.

Both patches alter the exceptions expected in various places in test_unicode's test_format:

"{0.}".format() raises a ValueError (because the format string is invalid) rather than an IndexError (because there is no argument)
"{0[}".format(), likewise.
"{0]}".format() raises a ValueError (because the format string is invalid) rather than a KeyError (because "0]" is taken to be the name of a keyword argument---meaning that the test suite was testing the actual behavior of the implementation rather than the documented behavior).
"{c]}".format(), likewise.

In this patch, "{0[{1}]}".format('abcdef', 4) raises a ValueError rather than a TypeError, because "{1}", being neither a decimalinteger nor an identifier, invalidates the replacement field.

Both patches also add tests for constructions like this:

"{[0]}".format([3]) --> '3'
"{.__class__}".format(3) --> "<type 'int'>"

This conforms with the documentation (and current behavior), since in it arg_name is defined to be optional, but it is not currently covered in test_format, as far as I could tell.
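The constructions with an omitted arg_name can be tried directly. This is a minimal sketch for Python 3, where the class repr is "<class 'int'>" rather than the Python 2 form "<type 'int'>" quoted above; the `Point` class is a made-up example type.

```python
class Point:
    """Hypothetical example type with a single attribute."""

    def __init__(self, x):
        self.x = x


# arg_name omitted: the field refers to the next positional argument
print("{[0]}".format([3]))          # index lookup on the first argument
print("{.x}".format(Point(7)))      # attribute lookup on the first argument
print("{.__class__}".format(3))     # Python 3 spelling of the class repr
```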
msg139959 - (view)	Author: Raymond Hettinger (rhettinger) *	Date: 2011-07-06 23:24
Please stick with "integer" instead of "decimalinteger". The effort to make the docs more precise has the unintended effect of making them harder to understand.
msg139962 - (view)	Author: Ben Wolfson (Ben.Wolfson)	Date: 2011-07-07 00:04
undo integer -> decimalinteger in docs
msg139963 - (view)	Author: Ben Wolfson (Ben.Wolfson)	Date: 2011-07-07 00:05
(same as previous)
msg148542 - (view)	Author: Ben Wolfson (Ben.Wolfson)	Date: 2011-11-29 07:00
I just noticed that the patch labelled strformat-as-document is actually the same as the other one, owing to my incompetence. Anyway, as far as I can tell the patches would have to be reworked in the light of recent changes anyway. I am willing to do this if there's actually interest. Otherwise, is there anything else I can do here? Is it necessary to write a PEP or take this to python-ideas or something?
msg148642 - (view)	Author: Petri Lehtinen (petri.lehtinen) *	Date: 2011-11-30 07:53
> I just noticed that the patch labelled strformat-as-document is
> actually the same as the other one, owing to my incompetence.

All three patches look different to me.

> Anyway, as far as I can tell the patches would have to be reworked
> in the light of recent changes anyway. I am willing to do this if
> there's actually interest. Otherwise, is there anything else I can
> do here? Is it necessary to write a PEP or take this to python-ideas
> or something?

There's still interest, at least from me :)

In my opinion we should have the documented behavior (integer or identifier as field_name), AND braces should be disallowed inside the format string, with the exception of one level of nesting in the format_spec part.

This should probably be taken to python-dev once more, as the previous discussion didn't reach consensus, except that the current approach is bad and something needs to be done.
msg148667 - (view)	Author: Ben Wolfson (Ben.Wolfson)	Date: 2011-11-30 15:49
"All three patches look different to me." Yeah, I verified that later; I'm not sure what made me think otherwise except that I eyeballed them sloppily. (It's still true that they'd need to target a different file for 3.3 now.)
msg155718 - (view)	Author: Ben Wolfson (Ben.Wolfson)	Date: 2012-03-14 03:42
Just curious whether there have been any developments here, since the first 3.3 alpha has been released.
msg161163 - (view)	Author: Ben Wolfson (Ben.Wolfson)	Date: 2012-05-19 21:24
Ping!
msg161169 - (view)	Author: Eric V. Smith (eric.smith) *	Date: 2012-05-19 22:57
I'll look at it when I'm done with PEP 420.
msg161368 - (view)	Author: Petri Lehtinen (petri.lehtinen) *	Date: 2012-05-22 18:28
Ben,

As I've said, I think that we should go for the documented behavior with the addition of not allowing braces inside the format string (with the exception of format_spec). So AFAICS, index_string would become

index_string ::= <any source character except "]" or "{" or "}"> +

> Anyway, as far as I can tell the patches would have to be reworked in
> the light of recent changes anyway. I am willing to do this if there's
> actually interest.

Are you still willing to rework the patches?

And as I said already earlier, it wouldn't hurt if this was taken to python-dev once more. If there's a good, working patch ready, it might make it easier to gain consensus.
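The proposed production for index_string can be written as a regular expression. This is only a sketch of the grammar rule as stated in this message, not code from any of the attached patches; the names `_INDEX_STRING` and `is_valid_index_string` are made up for illustration.

```python
import re

# index_string ::= <any source character except "]" or "{" or "}"> +
_INDEX_STRING = re.compile(r"[^\]{}]+")


def is_valid_index_string(text):
    """True if `text` is a non-empty run of characters other than ']', '{', '}'."""
    return _INDEX_STRING.fullmatch(text) is not None


for candidate in ["!", ":", "I love spam! it is very tasty.", "a}b", "{0}", ""]:
    print(repr(candidate), is_valid_index_string(candidate))
```

Under this rule "!" and ":" become ordinary index characters, while any brace inside the brackets is rejected up front.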
msg161382 - (view)	Author: Ben Wolfson (Ben.Wolfson)	Date: 2012-05-22 21:16
> Are you still willing to rework the patches?

Sure. Now that I've actually looked at unicode_format.h it looks like the biggest (relevant) difference might just be that the file isn't named string_format.h, so I suspect it will be pretty straightforward.

> And as I said already earlier, it wouldn't hurt if this was taken to
> python-dev once more. If there's a good, working patch ready, it might
> make it easier to gain consensus.

Maybe, but the last time it went to python-dev (in December) there was little discussion at all, and the patches that exist now worked on the codebase as it existed then.

Anyway, it seems as if progress is being made on PEP 420, so perhaps better to let Eric take a look before bringing it up again?
msg161403 - (view)	Author: Petri Lehtinen (petri.lehtinen) *	Date: 2012-05-23 12:20
Ben Wolfson wrote:
> Maybe, but the last time it went to python-dev (in December) there
> was little discussion at all, and the patches that exist now worked
> on the codebase as it existed then.

Maybe it's pointless to bring it up on python-dev then. I just thought that people might feel strongly about this.

> Anyway, it seems as if progress is being made on PEP 420, so perhaps
> better to let Eric take a look before bringing it up again?

Let's wait for Eric's comments, as he implemented format() in the first place.
msg161537 - (view)	Author: Ben Wolfson (Ben.Wolfson)	Date: 2012-05-25 00:05
Here's a patch that works against the current unicode_format.h and implements what Petri suggested.
msg161555 - (view)	Author: Petri Lehtinen (petri.lehtinen) *	Date: 2012-05-25 07:18
I added some comments on rietveld. These are only nit-picking about style and mostly reflect my personal taste, not show stoppers in any case.
msg163080 - (view)	Author: Ben Wolfson (Ben.Wolfson)	Date: 2012-06-17 18:08
I can certainly address those issues---I'll hold off on doing so, though, until it's clearer whether more substantive things come up, so I can just do it in a swoop.
msg166077 - (view)	Author: Ben Wolfson (Ben.Wolfson)	Date: 2012-07-21 21:39
Ping!
msg169470 - (view)	Author: Éric Araujo (eric.araujo) *	Date: 2012-08-30 17:05
You can bring this up to python-dev to get other developers’ opinion.
msg182323 - (view)	Author: Nick Coghlan (ncoghlan) *	Date: 2013-02-18 14:39
This actually came up on the core-mentorship list (someone was trying to translate old mod-formatting code that used a colon in the lookup names and discovered this odd behaviour).

My own preference is to let this quote from PEP 3101 dominate the behaviour:

"The rules for parsing an item key are very simple. If it starts with a digit, then it is treated as a number, otherwise it is used as a string."

That means Petri's suggested solution (allowing any character except a closing square bracket and braces in the item key) sounds good to me.
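The number-versus-string rule for item keys can be observed with a dictionary that has both an integer key and a string key. This is a minimal sketch; the helper name `lookup` is an assumption made for the example.

```python
def lookup(template, obj):
    """Format `template` with `obj`; return the result or the exception class name."""
    try:
        return template.format(obj)
    except (KeyError, IndexError, TypeError, ValueError) as exc:
        return type(exc).__name__


both = {1: "int key", "1": "str key"}

print(lookup("{0[1]}", both))                # a numeric key is looked up as the integer 1
print(lookup("{0[1]}", {"1": "str key"}))    # so a dict with only the string "1" fails
print(lookup("{0[name]}", {"name": "x"}))    # any other key is used as a string
print(lookup("{0[1]}", ["zero", "one"]))     # the integer form also indexes sequences
```

One consequence is that a string key made up only of digits cannot be reached through the item-key syntax.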
msg182445 - (view)	Author: Ben Wolfson (Ben.Wolfson)	Date: 2013-02-19 22:17
"My own preference is to let this quote from PEP 3101 dominate the behaviour: "The rules for parsing an item key are very simple. If it starts with a digit, then it is treated as a number, otherwise it is used as a string." That means Petri's suggested solution (allowing any character except a closing square bracket and braces in the item key) sounds good to me." But ... that isn't what the quotation from the PEP says, since it doesn't exclude braces. I also don't really see why the PEP should be given much authority in this issue, since it pays extremely cursory attention to this part of the format. In any case, judging by the filename and description (god knows I can't remember, having written it nine months ago), strformat-no-braces.diff implements that behavior. (Oh, now I see from an earlier comment of mine that that is, in fact, what it does.) Meanwhile, it was five months ago that Eric Smith said "It's on my list of things to look at. I have a project due next week, then I'll have some time." I understand that this is not the biggest deal, but the patch is also pretty compact and (I think) easily understood. Petri seemed to think it was mostly ok in May 2012, when, IIRC, several people on python-dev agreed that the current behavior should be changed. God only knows how unicode_format.h has changed in the interim. Peer review for academic papers moves substantially faster than this.
msg193129 - (view)	Author: Ben Wolfson (Ben.Wolfson)	Date: 2013-07-15 21:44
Ping.
msg204544 - (view)	Author: Benjamin Peterson (benjamin.peterson) *	Date: 2013-11-27 01:12
Should be generally patched up in 3.4. Try it out.

History
Date	User	Action	Args
2022-04-11 14:57:16	admin	set	github: 56223
2014-04-06 23:17:50	benjamin.peterson	set	status: open -> closed resolution: fixed
2013-11-27 01:12:04	benjamin.peterson	set	nosy: + benjamin.peterson messages: + msg204544
2013-10-20 23:04:21	Ben.Wolfson	set	versions: + Python 3.4
2013-07-15 21:44:15	Ben.Wolfson	set	messages: + msg193129
2013-02-19 22:17:27	Ben.Wolfson	set	messages: + msg182445
2013-02-18 14:39:04	ncoghlan	set	nosy: + ncoghlan messages: + msg182323
2012-09-29 20:57:13	barry	set	nosy: + barry
2012-09-07 03:52:09	belopolsky	set	nosy: + belopolsky
2012-08-30 17:05:49	eric.araujo	set	messages: + msg169470
2012-07-21 21:39:09	Ben.Wolfson	set	messages: + msg166077
2012-06-17 21:12:49	flox	set	nosy: + flox
2012-06-17 18:08:51	Ben.Wolfson	set	messages: + msg163080
2012-05-25 07:18:05	petri.lehtinen	set	messages: + msg161555
2012-05-25 00:05:16	Ben.Wolfson	set	files: + strformat-no-braces.diff messages: + msg161537
2012-05-23 12:20:01	petri.lehtinen	set	messages: + msg161403
2012-05-22 21:16:48	Ben.Wolfson	set	messages: + msg161382
2012-05-22 18:28:14	petri.lehtinen	set	messages: + msg161368 versions: - Python 3.1
2012-05-19 22:57:04	eric.smith	set	messages: + msg161169
2012-05-19 21:24:12	Ben.Wolfson	set	messages: + msg161163
2012-03-14 03:42:06	Ben.Wolfson	set	messages: + msg155718
2011-12-16 01:37:01	eric.smith	set	assignee: eric.smith
2011-11-30 15:49:07	Ben.Wolfson	set	messages: + msg148667
2011-11-30 07:53:10	petri.lehtinen	set	messages: + msg148642
2011-11-29 07:00:02	Ben.Wolfson	set	messages: + msg148542
2011-07-07 00:05:36	Ben.Wolfson	set	files: - strformat-just-identifiers-please.diff
2011-07-07 00:05:27	Ben.Wolfson	set	files: + strformat-just-identifiers-please.diff messages: + msg139963
2011-07-07 00:05:01	Ben.Wolfson	set	files: - strformat-as-documented.diff
2011-07-07 00:04:36	Ben.Wolfson	set	files: + strformat-as-documented.diff messages: + msg139962
2011-07-06 23:24:12	rhettinger	set	nosy: + rhettinger messages: + msg139959
2011-07-06 22:59:50	Ben.Wolfson	set	files: + strformat-just-identifiers-please.diff messages: + msg139958
2011-07-06 22:50:10	Ben.Wolfson	set	files: + strformat-as-documented.diff messages: + msg139955
2011-06-04 18:03:23	petri.lehtinen	set	messages: + msg137656
2011-06-04 01:03:07	eric.smith	set	messages: + msg137617
2011-06-04 00:36:30	Ben.Wolfson	set	messages: + msg137615
2011-06-03 22:52:04	eric.smith	set	messages: + msg137607
2011-06-03 22:20:11	eric.smith	set	messages: + msg137603
2011-06-03 22:17:15	Ben.Wolfson	set	messages: + msg137601
2011-06-03 21:56:58	eric.smith	set	messages: + msg137594
2011-06-03 21:08:42	Ben.Wolfson	set	messages: + msg137588
2011-06-03 19:47:48	eric.smith	set	messages: + msg137576
2011-06-03 18:25:22	petri.lehtinen	set	messages: + msg137568
2011-06-03 17:41:16	r.david.murray	set	messages: + msg137567
2011-06-03 17:09:00	Ben.Wolfson	set	messages: + msg137560
2011-06-03 15:14:39	r.david.murray	set	messages: + msg137524
2011-06-03 15:13:17	r.david.murray	set	nosy: + r.david.murray messages: + msg137523
2011-05-30 09:06:19	petri.lehtinen	set	nosy: + petri.lehtinen
2011-05-11 14:54:50	eric.araujo	set	keywords: + needs review stage: patch review versions: - Python 2.6, Python 3.4
2011-05-10 22:07:54	Ben.Wolfson	set	files: + strformat.diff keywords: + patch messages: + msg135747 versions: + Python 2.6, Python 3.4
2011-05-06 20:40:16	Ben.Wolfson	set	messages: + msg135368
2011-05-06 18:21:23	Ben.Wolfson	set	messages: + msg135355
2011-05-06 17:35:45	eric.smith	set	messages: + msg135344
2011-05-06 17:23:43	eric.smith	set	messages: + msg135339
2011-05-06 17:13:17	eric.araujo	set	nosy: + eric.araujo versions: + Python 2.7, Python 3.2, Python 3.3, - Python 2.6
2011-05-06 14:41:35	Ben.Wolfson	set	messages: + msg135299
2011-05-06 08:47:16	eric.smith	set	messages: + msg135267
2011-05-06 07:18:00	mark.dickinson	set	nosy: + mark.dickinson, eric.smith
2011-05-06 05:07:00	Ben.Wolfson	set	type: behavior
2011-05-06 01:45:59	Ben.Wolfson	create