Raw string parsing fails with backslash as last character #45612
Comments
If you have a raw string with a backslash as the last character, the
>>> print r'\'
SyntaxError: EOL while scanning single-quoted string
There seems to have been a fix added to Python 3.0 (see bpo-1720390), |
As stated in the docs... r"\" is not a valid string literal (even a raw string cannot |
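For readers who want to check the documented behaviour themselves, here is a minimal sketch for Python 3. The helper name `is_valid_literal` is hypothetical, and the source text is assembled with `chr(92)` (a backslash) only to keep the example's own quoting readable.

```python
def is_valid_literal(source):
    """Return True if `source` compiles as a Python expression."""
    try:
        compile(source, "<literal>", "eval")
    except SyntaxError:
        return False
    return True


BS = chr(92)  # a single backslash character

one_trailing = 'r"' + BS + '"'        # the source text  r"\"
two_trailing = 'r"' + BS + BS + '"'   # the source text  r"\\"

print(is_valid_literal(one_trailing))   # False: the literal is never terminated
print(is_valid_literal(two_trailing))   # True
print(len(eval(two_trailing)))          # 2: both backslashes stay in the string
```

An even number of trailing backslashes is accepted, but every one of them ends up in the resulting string.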
So basically raw strings are useless if you need to end a string with a |
There's more to allowing \ at the end of a raw string: if you do that, There are basically two main uses for raw strings: Windows path names |
(Not clear how to reopen this issue. Hopefully my change here does that.) OK, so as it currently stands, backslash at end of string is prohibited in the interests of allowing backslash to escape quotes that might be embedded within the string. But the embedded quote scenario doesn't work because the backslash remains in the string. So the current state of play is plain broken. Considering: |
I can certainly agree with the opinion that raw strings are working as documented, but I can also agree with the opinion that they contain traps for the unwary, and after getting trapped several times, I have chosen to put up with the double-backslash requirement of regular strings, and avoid the use of raw strings in my code. The double-backslash requirement of regular strings gets ugly for Windows pathnames and some regular expressions, but the traps of raw strings are more annoying than that. I'm quite sure it would be impossible to "fix" raw strings without causing deprecation churn for people to whom they are useful (if there are any such; hard for me to imagine, but I'm sure there are). I'm quite sure the only reasonable "fix" would be to invent a new type of "escape-free" or "exact" string (to not overuse the term raw, and make two types of raw string). With Python 3, and UTF-8 source files, there is little need for \-prefixed characters (and there is already a string syntax that permits them, when they are needed), so it seems like inventing a new string syntax e'string' which would not treat \ in any special manner whatsoever, would be useful for all the cases raw strings are presently useful for, and even more useful, because it would handle all the cases that are presently traps for the unwary that raw-strings have. The problem mentioned in this thread of escaping the outer quote character is much more appropriately handled by the triple-quote form. I don't know the Python history well enough to know if raw strings predated triple-quote; if they didn't, there would have been no need for raw strings to attempt to support such. |
I think perhaps the language in the language reference is a bit misleading. The purpose of the raw string algorithm is that any characters in the string be copied literally into the string object. That is, \" "escapes" the " not so that you can write r"\"", but so that the string object produced by that literal (or by the literal r'\"') contains the two characters \". So, it is certainly true that raw strings will *not* be changed to make \ not escape quotes. I can't remember where I read this, but the reason that a trailing \ is invalid has to do with the way the parser parses strings/raw strings. So any alternate "more raw" string type would have to contend with the same parser issues that lead to the existing restrictions on raw strings. Unless someone wants to do a deep dive into the parser and figure out why things are the way they are and how to fix it, I think this issue should be reclosed. Note that even if a solution can be found, if it significantly complicates the parser it will probably be rejected. And I suspect it would, otherwise things probably wouldn't work the way they currently do. However, perhaps someone can spot a clever solution that was overlooked in the original implementation. |
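The point about what the string object ends up containing can be checked directly. This is only a small illustration of the behaviour described above; the variable names are arbitrary.

```python
double_quoted = r"\""   # raw literal delimited by double quotes
single_quoted = r'\"'   # the same two characters, delimited by single quotes
regular = "\""          # regular literal: the backslash is consumed

print(len(double_quoted))              # 2
print(double_quoted == single_quoted)  # True
print(list(double_quoted))             # ['\\', '"']
print(len(regular))                    # 1
```

In the raw forms the backslash keeps the quote from ending the literal and is also kept in the result, which is exactly the "trap" discussed in this thread.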
@Glenn Linderman: I too am usually quick to assume that "innocent fixes" may have serious unforeseen impacts, but in this case I'm not convinced. What would matter is to enumerate the current behavior, and of that what would be changed. You seem to have had experience with other raw-string features/gotchas -- please share! :-) @david Murray: Excuse denseness on my part, but I'm not following the logic of your first paragraph. I think you are saying that current raw string has to do something special to be able to contain the sequence backslash-quote, and this has the side effect of precluding that sequence appearing last in a string. But surely a completely-escape-free string could also contain backslash-quote just fine (assuming the string is surrounded by the other kind of quote). So I'm thinking that the case you mention is not the driver here. It's conceivable there is some more complicated case where backslash-singlequote AND backslash-doublequote MUST appear literally in the same string. However, it seems a little bizarre to worry about that case, but not worry about the simpler case of wanting both a plain singlequote and a plain doublequote in the same string. Maybe there's some popular regular expression that calls for this complexity. I concur that inspection of the parser (and the history and intent of this design) would be fascinating. |
If I'm remembering the discussion I read correctly, what the parser does is to parse a regular string and a raw string in exactly the same way, but in the raw string case, it does not do the subsequent escape sequence replacement pass on the parsed string. This means that it follows the "escape the quote" rule when *parsing* the string, but does not do the subsequent post-processing that would remove the \. But because the quote-escape has a consequence at the parsing stage, it applies to both raw strings and regular strings. Which means that an odd number of backslashes cannot appear at the end of either type of string. And as far as I can see there is no way to fix that, since otherwise the parser can't identify the end of the string. Therefore if no escaping is done you do, as you say, limit yourself to not being able to put both ' and " characters inside a more-raw string. This would break regular expressions, since exactly this case does occur when using regular expressions extensively. And a trailing backslash would never appear in a regular expression. So, clearly raw strings are optimized for regular expression use, and not for Windows pathname use. The proposed 'windows raw string' literal would be optimized the other way. Adding such a literal is python-ideas territory. Note that windows paths can be spelled with / characters, so this specialized use case has an easy workaround, with the added advantage that a repr of such a string won't appear to have doubled \s (which I always find confusing when debugging programs involving windows path names that use the \ separator). |
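The two-stage behaviour described above can be sketched with a toy scanner. This is not CPython's tokenizer; `scan_string_body`, `toy_unescape`, and `toy_literal` are hypothetical names, and the unescape pass handles only backslash-quote and backslash-backslash. The sketch assumes single-line, single-character quotes.

```python
def scan_string_body(source, start=0):
    """Stage 1 (shared by both kinds of literal): find the closing quote.

    source[start] is the opening quote. A backslash always protects the
    character after it, so a protected quote does not end the string.
    """
    quote = source[start]
    i = start + 1
    while i < len(source):
        if source[i] == "\\":
            i += 2              # skip the backslash and whatever follows it
            continue
        if source[i] == quote:
            return source[start + 1:i]
        i += 1
    raise SyntaxError("EOL while scanning string literal")


def toy_unescape(body):
    """Stage 2 (regular strings only): drop the escaping backslash."""
    out = []
    i = 0
    while i < len(body):
        if body[i] == "\\" and i + 1 < len(body) and body[i + 1] in "\\'\"":
            out.append(body[i + 1])
            i += 2
        else:
            out.append(body[i])
            i += 1
    return "".join(out)


def toy_literal(source, raw):
    body = scan_string_body(source)
    return body if raw else toy_unescape(body)


src = '"a\\"b"'                      # the source characters: "a\"b"
print(toy_literal(src, raw=False))   # a"b
print(toy_literal(src, raw=True))    # a\"b
```

Feeding the scanner the characters `"abc\"` raises `SyntaxError` in both modes, because stage 1 consumes the final quote as the protected character and never finds a terminator.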
@graham: seems like the two primary gotchas are trailing \ and \" \' not removing the \. The one that tripped me up the most was the trailing \, but I did get hit with \" once. Probably if Python had been my first programming language that used \ escapes, it wouldn't be such a problem, but since it never will be, and because I'm forced to use the others from time to time, still, learning yet a different form of "not-quite raw" string just isn't worth my time and debug efforts. When I first saw it, it sounded more useful than doubling the \s like I do elsewhere, but after repeated trip-ups with trailing \, I decided it wasn't for me. @r David: Interesting description of the parsing/escaping. Sounds like that makes for a cheap parser, fewer cases to handle. But there is little that is hard about escape-free or exact string parsing: just look for the trailing " ' """ or ''' that matches the one at the beginning. The only thing difficult is if you want to escape the quote, but with the rich set of quotes available, it is extremely unlikely that you can't find one that you can use, except perhaps if you are writing a parser for parsing Python strings, in which case, the regular expression that matches any leading quote could be expressed as: '("|"""|' "'|''')" Granted that isn't the clearest syntax in the world, but it is also uncommon, and can be assigned to a nicely named variable such as matchLeadingQuotationRE in one place, and used wherever needed. Regarding the use of / rather than \, that is true if you are passing file names to Windows APIs, but not true if you are passing them to other programs that use / as option syntax and \ as path separator (most Windows command line utilities). |
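As a quick check of the leading-quote expression above: it relies on adjacent string literals being concatenated, so no backslash is needed. One caveat worth noting is that Python's `re` tries alternatives left to right, so the pattern as written matches a single `"` even when the input starts with `"""`. The `longest_first` variant below is an adjustment made for this sketch, not something proposed in the thread.

```python
import re

# The expression as written in the comment above.
as_written = '("|"""|' "'|''')"
print(as_written)        # ("|"""|'|''')

# Same alternatives, with the triple-quote forms listed first.
longest_first = '("""|' "'''|" '"|' "')"
print(longest_first)     # ("""|'''|"|')

sample = '"""docstring"""'
print(re.match(as_written, sample).group(1))      # "
print(re.match(longest_first, sample).group(1))   # """
```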
Well, it's still the case that the Python raw string syntax isn't going to change to not escape the quotes, because that would break far too many existing applications that depend on them being escaped. So a new string literal type would seem to be the only option. I'm going to close this issue again; if either of you want to pursue the new string literal type, please bring it up on python-ideas. Frankly, I doubt you would get much traction, but I could be wrong :) |
Thanks to all for your patient comments. I think I am resigned to raw-string forever being medium-rare-string :-). Perhaps it's obvious once you get over the initial shock of non-rawness, but workarounds for the disallowed trailing backslash include (note the final space character):
mydir = r"C:\somedir\ ".rstrip()
or...
mydir = r"C:\somedir\ "[:-1]
It might be worth mentioning one of these in the raw string docs to emphasize that there is this gotcha, that it's easy to fix, and promoting this as an idiom that becomes familiar in applications where it's needed. |
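A short check that both workarounds give the same result as the doubled-backslash spelling. The third variant, which relies on adjacent string literals being concatenated, is an extra option not mentioned above; the variable names are illustrative.

```python
expected = "C:\\somedir\\"               # regular string, backslashes doubled

via_rstrip = r"C:\somedir\ ".rstrip()    # strip the trailing space
via_slice = r"C:\somedir\ "[:-1]         # slice off the trailing space
via_concat = r"C:\somedir" "\\"          # raw part plus a regular "\\" literal

print(via_rstrip == expected)   # True
print(via_slice == expected)    # True
print(via_concat == expected)   # True
```

Note that `rstrip()` removes all trailing whitespace, so the slice form is the safer choice if the path itself could legitimately end in whitespace before the separator.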
I've opened bpo-11479 with a proposed patch to the tutorial along the lines suggested by Graham. |
On 3/12/2011 7:11 PM, R. David Murray wrote:
Which is good, for people that use the tutorial. I jump straight to the |
Well, the problem with the reference is that the language reference is intended as a specification document, not a tutorial, so such a discussion does not belong there. The library reference, which does contain platform-specific and tutorial-like information simply cross links to the reference docs for raw string syntax. So adding such a note to the library docs is a bit more significant of an undertaking. Anyone want to take a crack at it? |
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.