
classification
Title: Raw string parsing fails with backslash as last character
Type: behavior
Stage:
Components: Interpreter Core
Versions: Python 3.2, Python 3.3, Python 2.7
process
Status: closed
Resolution: not a bug
Dependencies:
Superseder:
Assigned To:
Nosy List: QuantumTim, facundobatista, georg.brandl, gwideman, r.david.murray, v+python
Priority: normal
Keywords:

Created on 2007-10-12 12:52 by QuantumTim, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Files
File name Uploaded Description
unnamed v+python, 2011-03-13 03:33
Messages (15)
msg56370 - (view) Author: Tim Gordon (QuantumTim) Date: 2007-10-12 12:52
If you have a raw string with a backslash as the last character, the 
parser thinks the following quote, actually used to mark the end of the 
string, is being quoted by the backslash.  For example, r'\' should be 
the string with one backslash, but...

>>> print r'\'
SyntaxError: EOL while scanning single-quoted string

There seems to have been a fix added to python 3.0 (see issue 1720390), 
but it doesn't look like it's been backported to any earlier version.
msg56372 - (view) Author: Facundo Batista (facundobatista) * (Python committer) Date: 2007-10-12 12:57
As stated in the docs...
  http://docs.python.org/dev/reference/lexical_analysis.html#string-literals

  r"\" is not a valid string literal (even a raw string cannot 
  end in an odd number of backslashes).  Specifically, a raw 
  string cannot end in a single backslash (since the backslash 
  would escape the following quote character).
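
The documented rule can be checked without breaking the file that does the checking. The following is only a sketch: the helper name literal_is_valid and the candidates dictionary are made up for illustration, and the literals under test are assembled from ordinary strings and passed to the built-in compile().

```python
# Each value is Python *source text* for a raw string literal.
# The text is built from ordinary strings so this file itself stays valid.
candidates = {
    "one trailing backslash": "r'" + "\\" + "'",
    "two trailing backslashes": "r'" + "\\\\" + "'",
    "three trailing backslashes": "r'" + "\\\\\\" + "'",
}


def literal_is_valid(source):
    """Return True if `source` compiles as a Python expression."""
    try:
        compile(source, "<example>", "eval")
    except SyntaxError:
        return False
    return True


results = {name: literal_is_valid(src) for name, src in candidates.items()}
for name, ok in results.items():
    print(name, "->", "valid" if ok else "SyntaxError")

# A raw string keeps both backslashes, so this value has length 2.
two_backslashes = eval(candidates["two trailing backslashes"])
print(len(two_backslashes))
```

Only the even count compiles, which matches the "odd number of backslashes" wording quoted above.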
msg56373 - (view) Author: Tim Gordon (QuantumTim) Date: 2007-10-12 13:32
So basically raw strings are useless if you need to end a string with a 
backslash, as there is no way to quote the backslash to make it not do 
this...  This surely can't be too hard to "fix" if one considers it a 
problem (which I do), and just because even the docs say it is the 
correct behaviour, doesn't mean it should be.  Perhaps this has been 
debated before (and if so, where?), but it does seem rather odd behaviour.
msg56377 - (view) Author: Georg Brandl (georg.brandl) * (Python committer) Date: 2007-10-12 16:15
There's more to allowing \ at the end of a raw string: if you do that,
the raw string will end at the first quote character which is the same
as the opening one, so you can't put such a quote character into a raw
string anymore. At the moment, you can, by escaping it with a backslash,
though the backslash is left in the string.

There are basically two main uses for raw strings: Windows path names
and regular expressions. The current situation is optimal for the
latter: you can put both quote characters in a raw string, and the
backslash needed to quote the "string quote" being retained is not a
problem.
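
A small sketch of both points, with an example pattern chosen purely for illustration: the backslash stays in the value of the raw string, and that is harmless in a regular expression because an escaped quote matches a plain quote.

```python
import re

# The backslash lets the quote appear inside the literal,
# but it is retained in the resulting string.
s = r'\''
print(len(s))
print(s == "\\'")

# A pattern that needs both quote characters, written as one raw string.
# Inside the pattern, \' simply matches a plain ' character.
quoted = re.compile(r'["\'](\w+)["\']')
found = quoted.findall("say 'hello' and \"world\"")
print(found)
```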
msg130442 - (view) Author: Graham Wideman (gwideman) Date: 2011-03-09 10:44
(Not clear how to reopen this issue. Hopefully my change here does that.)

OK, so as it currently stands, backslash at end of string is prohibited in the interests of allowing backslash to escape quotes that might be embedded within the string. 

But the embedded quote scenario doesn't work because the backslash remains in the string.  So the current state of play is plain broken.  

Considering:
(a) We already have the ability to use either single or double quotes around the string which gives that chance to use the other quote within the string. 
(b) The "principle of least surprise" for raw string would be to have raw mean "Never Escape Anything"
(c) backslash on end of string is a trap waiting to happen for Windows users.
...I think there is strong motivation to abandon the currently broken "backslash escapes quote" behavior and just let raw strings be totally raw.  Furthermore, it's hard to imagine that such a move would break anything.
msg130608 - (view) Author: Glenn Linderman (v+python) * Date: 2011-03-11 19:41
I can certainly agree with the opinion that raw strings are working as documented, but I can also agree with the opinion that they contain traps for the unwary, and after getting trapped several times, I have chosen to put up with the double-backslash requirement of regular strings, and avoid the use of raw strings in my code.  The double-backslash requirement of regular strings gets ugly for Windows pathnames and some regular expressions, but the traps of raw strings are more annoying than that.

I'm quite sure it would be impossible to "fix" raw strings without causing deprecation churn for people to whom they are useful (if there are any such; hard for me to imagine, but I'm sure there are).

I'm quite sure the only reasonable "fix" would be to invent a new type of "escape-free" or "exact" string (to not overuse the term raw, and make two types of raw string).  With Python 3, and UTF-8 source files, there is little need for \-prefixed characters (and there is already a string syntax that permits them, when they are needed), so it seems like inventing a new string syntax

e'string'
e"""string"""

which would not treat \ in any special manner whatsoever, would be useful for all the cases raw strings are presently useful for, and even more useful, because it would handle all the cases that are presently traps for the unwary that raw-strings have.

The problem mentioned in this thread of escaping the outer quote character is much more appropriately handled by the triple-quote form.  I don't know the Python history well enough to know if raw strings predated triple-quote; if they didn't, there would have been no need for raw strings to attempt to support such.
msg130612 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2011-03-11 21:09
I think perhaps the language in the language reference is a bit misleading.  The purpose of the raw string algorithm is that any characters in the string be copied literally into the string object.  That is, \" "escapes" the " not so that you can write r"\"", but so that the string object produced by that literal (or by the literal r'\"') contains the two characters \".

So, it is certainly true that raw string will *not* be changed to make \ not escape quotes.

I can't remember where I read this, but the reason that a trailing \ is invalid has to do with the way the parser parses strings/raw strings.  So any alternate "more raw" string type would have to contend with the same parser issues that lead to the existing restrictions on raw strings.

Unless someone wants to do a deep dive into the parser and figure out why things are the way they are and how to fix it, I think this issue should be reclosed.  Note that even if a solution can be found, if it significantly complicates the parser it will probably be rejected.  And I suspect it would, otherwise things probably wouldn't work the way they currently do.  However, perhaps someone can spot a clever solution that was overlooked in the original implementation.
msg130647 - (view) Author: Graham Wideman (gwideman) Date: 2011-03-12 00:56
@Glenn Linderman:  I too am usually quick to assume that "innocent fixes" may have serious unforeseen impacts, but in this case I'm not convinced.  What would matter is to enumerate the current behavior, and of that what would be changed.  You seem to have had experience with other raw-string features/gotchas -- please share! :-)

@David Murray: Excuse denseness on my part, but I'm not following the logic of your first paragraph.  I think you are saying that current raw string has to do something special to be able to contain the sequence backslash-quote, and this has the side effect of precluding that sequence appearing last in a string.  

But surely a completely-escape-free string could also contain backslash-quote just fine (assuming the string is surrounded by the other kind of quote).  So I'm thinking that the case you mention is not the driver here.  

It's conceivable there is some more complicated case where backslash-singlequote AND backslash-doublequote MUST appear literally in the same string.  However, it seems a little bizarre to worry about that case, but not worry about the simpler case of wanting both a plain singlequote and a plain doublequote in the same string.  Maybe there's some popular regular expression that calls for this complexity.

I concur that inspection of the parser (and the history and intent of this design) would be fascinating.
msg130648 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2011-03-12 01:20
If I'm remembering the discussion I read correctly, what the parser does is to parse a regular string and a raw string in exactly the same way, but in the raw string case, it does not do the subsequent escape sequence replacement pass on the parsed string.  This means that it follows the "escape the quote" rule when *parsing* the string, but does not do the subsequent post-processing that would remove the \.  But because the quote-escape has a consequence at the parsing stage, it applies to both raw strings and regular strings.  Which means that an odd number of backslashes cannot appear at the end of either type of string.  And as far as I can see there is no way to fix that, since otherwise the parser can't identify the end of the string.
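
The two-stage behaviour described in this message can be modelled with a toy scanner. This is only a sketch of the idea, not CPython's tokenizer: the function scan_string_body is hypothetical, it handles single-character quotes only, and its replacement pass knows just the escapes for a backslash and the two quote characters.

```python
def scan_string_body(source, raw):
    """Toy model: `source` begins with the opening quote character.

    Returns the string value, or raises ValueError when no closing
    quote is found.
    """
    quote = source[0]
    i = 1
    while i < len(source):
        ch = source[i]
        if ch == "\\":
            i += 2  # stage 1: a backslash always protects the next character
            continue
        if ch == quote:
            body = source[1:i]
            break
        i += 1
    else:
        raise ValueError("EOL while scanning string")

    if raw:
        return body  # raw strings skip the replacement pass entirely

    # stage 2 (regular strings only): replace the escapes this toy knows
    out = []
    j = 0
    while j < len(body):
        if body[j] == "\\" and j + 1 < len(body) and body[j + 1] in "\\'\"":
            out.append(body[j + 1])
            j += 2
        else:
            out.append(body[j])
            j += 1
    return "".join(out)


BS = "\\"
escaped_quote = "'" + BS + "''"   # the four characters  ' \ ' '
print(scan_string_body(escaped_quote, raw=True))    # backslash is kept
print(scan_string_body(escaped_quote, raw=False))   # backslash is removed
try:
    scan_string_body("'" + BS + "'", raw=True)      # trailing backslash
except ValueError as exc:
    print(exc)
```

Because stage 1 is shared, the trailing backslash fails in the same way whether or not the raw flag is set, which is the restriction this issue is about.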

Therefore if no escaping is done you do, as you say, limit yourself to not being able to put both ' and " characters inside a more-raw string.  This would break regular expressions, since exactly this case does occur when using regular expressions extensively.  And a trailing backslash would never appear in a regular expression.

So, clearly raw strings are optimized for regular expression use, and not for Windows pathname use.  The proposed 'windows raw string' literal would be optimized the other way.  Adding such a literal  is python-ideas territory.

Note that windows paths can be spelled with / characters, so this specialized use case has an easy workaround, with the added advantage that a repr of such a string won't appear to have doubled \s (which I always find confusing when debugging programs involving windows path names that use the \ separator).
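
A brief sketch of the forward-slash spelling. It uses the standard-library ntpath module (the Windows flavour of os.path) so the snippet behaves the same on any platform; the directory and file names are placeholders.

```python
import ntpath  # Windows path rules, importable on every platform

mydir = "C:/somedir/"
print(repr(mydir))  # the repr shows no doubled backslashes

joined = ntpath.join(mydir, "file.txt")
print(joined)

# normpath converts to backslashes when another program insists on them.
normalized = ntpath.normpath(joined)
print(normalized)
```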
msg130652 - (view) Author: Glenn Linderman (v+python) * Date: 2011-03-12 02:28
@Graham: seems like the two primary gotchas are trailing \ and \" \' not removing the \.  The one that tripped me up the most was the trailing \, but I did get hit with \" once.  Probably if Python had been my first programming language that used \ escapes, it wouldn't be such a problem, but since it never will be, and because I'm forced to use the others from time to time, still, learning yet a different form of "not-quite raw" string just isn't worth my time and debug efforts.  When I first saw it, it sounded more useful than doubling the \s like I do elsewhere, but after repeated trip-ups with trailing \, I decided it wasn't for me.

@R David: Interesting description of the parsing/escaping.  Sounds like that makes for a cheap parser, fewer cases to handle.  But there is little that is hard about escape-free or exact string parsing: just look for the trailing " ' """ or ''' that matches the one at the beginning.  The only thing difficult is if you want to escape the quote, but with the rich set of quotes available, it is extremely unlikely that you can't find one that you can use, except perhaps if you are writing a parser for parsing Python strings, in which case, the regular expression that matches any leading quote could be expressed as:

'("|"""|' "'|''')"

Granted that isn't the clearest syntax in the world, but it is also uncommon, and can be assigned to a nicely named variable such as matchLeadingQuotationRE in one place, and used wherever needed.
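
For illustration, the expression above relies on adjacent string literals being concatenated. The sketch below prints the resulting pattern and also tries a reordered variant; the variable names are made up, and the reordering is a suggestion of this example, not something proposed in the thread.

```python
import re

# Adjacent string literals are concatenated, so each half can use the
# quote style that does not clash with its contents.
match_leading_quotation = '("|"""|' "'|''')"
print(match_leading_quotation)

# Alternation tries branches left to right, so listing the triple
# quotes first lets them win over the single-character forms.
triple_first = re.compile('("""|"|' "'''|')")

as_written = re.match(match_leading_quotation, '"""doc"""').group(1)
reordered = triple_first.match('"""doc"""').group(1)
print(as_written)
print(reordered)
```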

Regarding the use of / rather than \ that is true if you are passing file names to Windows APIs, but not true if you are passing them to other programs that use / as option syntax and \ as path separator (most Windows command line utilities).
msg130653 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2011-03-12 02:39
Well, it's still the case that the Python raw string syntax isn't going to change to not escape the quotes, because that would break far too many existing applications that depend on them being escaped.  So a new string literal type would seem to be the only option.  I'm going to close this issue again; if either of you want to pursue the new string literal type, please bring it up on python-ideas.  Frankly, I doubt you would get much traction, but I could be wrong :)
msg130673 - (view) Author: Graham Wideman (gwideman) Date: 2011-03-12 12:36
Thanks to all for your patient comments. I think I am resigned to raw-string forever being medium-rare-string :-).

Perhaps it's obvious once you get over the initial shock of non-rawness, but workarounds for the disallowed trailing backslash  include (note the final space character):

mydir = r"C:\somedir\ ".rstrip()   or...

mydir = r"C:\somedir\ "[:-1]

It might be worth mentioning one of these in the raw string docs to emphasize that there is this gotcha, that it's easy to fix, and prompting this as an idiom that becomes familiar in applications where it's needed.
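
A quick check that the two idioms agree, as a sketch. The third variant, which concatenates a raw literal with a regular one, is an addition of this example and was not suggested in the thread.

```python
expected = "C:\\somedir\\"  # regular string, every backslash doubled

via_rstrip = r"C:\somedir\ ".rstrip()  # also strips any other trailing whitespace
via_slice = r"C:\somedir\ "[:-1]       # removes exactly one character
via_concat = r"C:\somedir" "\\"        # raw part followed by a regular part

print(via_rstrip == via_slice == via_concat == expected)
```

The slice removes exactly the one padding character, while rstrip() would also remove trailing whitespace that was meant to be part of the value.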
msg130721 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2011-03-13 03:11
I've opened issue 11479 with a proposed patch to the tutorial along the lines suggested by Graham.
msg130723 - (view) Author: Glenn Linderman (v+python) * Date: 2011-03-13 03:33
On 3/12/2011 7:11 PM, R. David Murray wrote:
> R. David Murray<rdmurray@bitdance.com>  added the comment:
>
> I've opened issue 11479 with a proposed patch to the tutorial along the lines suggested by Graham.

Which is good, for people that use the tutorial.  I jump straight to the 
reference guide, usually, because of so many years of experience with 
other languages.  But I was surprised you used .strip() instead of [:-1] 
which is shorter and I would expect it to be more efficient also.
msg130749 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2011-03-13 15:33
Well, the problem with the reference is that the language reference is intended as a specification document, not a tutorial, so such a discussion does not belong there.  The library reference, which does contain platform-specific and tutorial-like information, simply cross-links to the reference docs for raw string syntax.  So adding such a note to the library docs is a bit more significant of an undertaking.  Anyone want to take a crack at it?
History
Date User Action Args
2022-04-11 14:56:27 admin set github: 45612
2011-03-13 15:33:22 r.david.murray set nosy: georg.brandl, facundobatista, QuantumTim, v+python, r.david.murray, gwideman
    messages: + msg130749
2011-03-13 03:33:23 v+python set files: + unnamed
    messages: + msg130723
    nosy: georg.brandl, facundobatista, QuantumTim, v+python, r.david.murray, gwideman
2011-03-13 03:11:06 r.david.murray set nosy: georg.brandl, facundobatista, QuantumTim, v+python, r.david.murray, gwideman
    messages: + msg130721
2011-03-12 12:36:59 gwideman set nosy: georg.brandl, facundobatista, QuantumTim, v+python, r.david.murray, gwideman
    messages: + msg130673
2011-03-12 02:39:55 r.david.murray set status: open -> closed
    nosy: georg.brandl, facundobatista, QuantumTim, v+python, r.david.murray, gwideman
    messages: + msg130653
2011-03-12 02:28:37 v+python set nosy: georg.brandl, facundobatista, QuantumTim, v+python, r.david.murray, gwideman
    messages: + msg130652
2011-03-12 01:20:36 r.david.murray set nosy: georg.brandl, facundobatista, QuantumTim, v+python, r.david.murray, gwideman
    messages: + msg130648
2011-03-12 00:56:42 gwideman set nosy: georg.brandl, facundobatista, QuantumTim, v+python, r.david.murray, gwideman
    messages: + msg130647
2011-03-11 21:09:10 r.david.murray set nosy: + r.david.murray
    messages: + msg130612
    versions: - Python 2.6, Python 2.5, Python 3.1
2011-03-11 19:41:15 v+python set nosy: + v+python
    messages: + msg130608
2011-03-09 10:51:29 ezio.melotti link issue11451 superseder
2011-03-09 10:50:55 ezio.melotti set status: closed -> open
2011-03-09 10:44:21 gwideman set versions: + Python 2.6, Python 3.1, Python 2.7, Python 3.2, Python 3.3, - Python 2.4
    nosy: + gwideman
    messages: + msg130442
    type: behavior
2007-10-12 16:15:25 georg.brandl set nosy: + georg.brandl
    messages: + msg56377
2007-10-12 13:32:21 QuantumTim set messages: + msg56373
2007-10-12 12:57:58 facundobatista set status: open -> closed
    resolution: not a bug
    messages: + msg56372
    nosy: + facundobatista
2007-10-12 12:52:37 QuantumTim create