Raw string parsing fails with backslash as last character #45612

Closed
QuantumTim mannequin opened this issue Oct 12, 2007 · 15 comments
Labels
interpreter-core (Objects, Python, Grammar, and Parser dirs) type-bug An unexpected behavior, bug, or error

Comments

@QuantumTim
Mannequin

QuantumTim mannequin commented Oct 12, 2007

BPO 1271
Nosy @birkenfeld, @facundobatista, @bitdancer
Files
  • unnamed
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    GitHub fields:

    assignee = None
    closed_at = <Date 2011-03-12.02:39:55.654>
    created_at = <Date 2007-10-12.12:52:37.209>
    labels = ['interpreter-core', 'type-bug', 'invalid']
    title = 'Raw string parsing fails with backslash as last character'
    updated_at = <Date 2011-03-13.15:33:22.772>
    user = 'https://bugs.python.org/QuantumTim'

    bugs.python.org fields:

    activity = <Date 2011-03-13.15:33:22.772>
    actor = 'r.david.murray'
    assignee = 'none'
    closed = True
    closed_date = <Date 2011-03-12.02:39:55.654>
    closer = 'r.david.murray'
    components = ['Interpreter Core']
    creation = <Date 2007-10-12.12:52:37.209>
    creator = 'QuantumTim'
    dependencies = []
    files = ['21098']
    hgrepos = []
    issue_num = 1271
    keywords = []
    message_count = 15.0
    messages = ['56370', '56372', '56373', '56377', '130442', '130608', '130612', '130647', '130648', '130652', '130653', '130673', '130721', '130723', '130749']
    nosy_count = 6.0
    nosy_names = ['georg.brandl', 'facundobatista', 'QuantumTim', 'v+python', 'r.david.murray', 'gwideman']
    pr_nums = []
    priority = 'normal'
    resolution = 'not a bug'
    stage = None
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue1271'
    versions = ['Python 2.7', 'Python 3.2', 'Python 3.3']

    @QuantumTim
    Mannequin Author

    QuantumTim mannequin commented Oct 12, 2007

    If you have a raw string with a backslash as the last character, the
    parser thinks the following quote, actually used to mark the end of the
    string, is being quoted by the backslash. For example, r'\' should be
    the string with one backslash, but...

    >>> print r'\'
    SyntaxError: EOL while scanning single-quoted string

    There seems to have been a fix added to python 3.0 (see bpo-1720390),
    but it doesn't look like it's been backported to any earlier version.

    @QuantumTim QuantumTim mannequin added the interpreter-core (Objects, Python, Grammar, and Parser dirs) label Oct 12, 2007
    @facundobatista
    Member

    As stated in the docs...
    http://docs.python.org/dev/reference/lexical_analysis.html#string-literals

    r"\" is not a valid string literal (even a raw string cannot
    end in an odd number of backslashes). Specifically, a raw
    string cannot end in a single backslash (since the backslash
    would escape the following quote character).
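
A minimal sketch of the rule quoted above, assuming a current Python 3 interpreter. The helper name `compiles` is made up for illustration. The literals are compiled from source text so that the SyntaxError can be caught instead of stopping the script.

```python
def compiles(source: str) -> bool:
    """Return True if `source` is a valid Python expression (hypothetical helper)."""
    try:
        compile(source, "<demo>", "eval")
        return True
    except SyntaxError:
        return False


one_backslash = "r'\\'"      # source text: r'\'   (odd number of trailing backslashes)
two_backslashes = "r'\\\\'"  # source text: r'\\'  (even number of trailing backslashes)

print(compiles(one_backslash))     # False
print(compiles(two_backslashes))   # True
print(len(eval(two_backslashes)))  # 2: both backslashes stay in the raw string
```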

    @QuantumTim
    Mannequin Author

    QuantumTim mannequin commented Oct 12, 2007

    So basically raw strings are useless if you need to end a string with a
    backslash, as there is no way to quote the backslash to make it not do
    this... This surely can't be too hard to "fix" if one considers it a
    problem (which I do), and just because even the docs say it is the
    correct behaviour, doesn't mean it should be. Perhaps this has been
    debated before (and if so, where?), but it does seem rather odd behaviour.

    @birkenfeld
    Member

    There's more to allowing \ at the end of a raw string: if you do that,
    the raw string will end at the first quote character which is the same
    as the opening one, so you can't put such a quote character into a raw
    string anymore. At the moment, you can, by escaping it with a backslash,
    though the backslash is left in the string.
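
A short sketch of the behaviour described in this paragraph: the backslash lets the quote character appear inside the raw string, and the backslash itself is kept. The variable name is only illustrative.

```python
# The opening quote is ', yet \' inside the literal does not end it.
s = r'it\'s'

print(s)       # it\'s
print(len(s))  # 5: i, t, backslash, quote, s

# The same value written as a regular string needs the backslash doubled.
print(s == "it\\'s")  # True
```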

    There are basically two main uses for raw strings: Windows path names
    and regular expressions. The current situation is optimal for the
    latter: you can put both quote characters in a raw string, and the
    backslash needed to quote the "string quote" being retained is not a
    problem.
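
As a hedged illustration of the regular-expression case, the pattern below (an invented example, not taken from any real code base) puts both quote characters in one raw string. The retained backslash is harmless because the regex engine reads `\'` as a plain quote.

```python
import re

# Matches a word wrapped in either kind of quote. The literal is delimited by ',
# so the embedded ' must be written as \', and that backslash reaches the regex.
quoted_word = re.compile(r'["\'](\w+)["\']')

print(quoted_word.findall('say "hello" or \'bye\''))  # ['hello', 'bye']
```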

    @gwideman
    Mannequin

    gwideman mannequin commented Mar 9, 2011

    (Not clear how to reopen this issue. Hopefully my change here does that.)

    OK, so as it currently stands, backslash at end of string is prohibited in the interests of allowing backslash to escape quotes that might be embedded within the string.

    But the embedded quote scenario doesn't work because the backslash remains in the string. So the current state of play is plain broken.

    Considering:
    (a) We already have the ability to use either single or double quotes around the string, which gives the chance to use the other quote within the string.
    (b) The "principle of least surprise" for raw string would be to have raw mean "Never Escape Anything"
    (c) backslash on end of string is a trap waiting to happen for Windows users.
    ...I think there is strong motivation to abandon the currently broken "backslash escapes quote" behavior and just let raw strings be totally raw. Furthermore, it's hard to imagine that such a move would break anything.

    @gwideman gwideman mannequin added the type-bug An unexpected behavior, bug, or error label Mar 9, 2011
    @ezio-melotti ezio-melotti reopened this Mar 9, 2011
    @vpython
    Mannequin

    vpython mannequin commented Mar 11, 2011

    I can certainly agree with the opinion that raw strings are working as documented, but I can also agree with the opinion that they contain traps for the unwary, and after getting trapped several times, I have chosen to put up with the double-backslash requirement of regular strings, and avoid the use of raw strings in my code. The double-backslash requirement of regular strings gets ugly for Windows pathnames and some regular expressions, but the traps of raw strings are more annoying than that.

    I'm quite sure it would be impossible to "fix" raw strings without causing deprecation churn for people to whom they are useful (if there are any such; hard for me to imagine, but I'm sure there are).

    I'm quite sure the only reasonable "fix" would be to invent a new type of "escape-free" or "exact" string (to not overuse the term raw, and make two types of raw string). With Python 3, and UTF-8 source files, there is little need for \-prefixed characters (and there is already a string syntax that permits them, when they are needed), so it seems like inventing a new string syntax

    e'string'
    e"""string"""

    which would not treat \ in any special manner whatsoever, would be useful for all the cases raw strings are presently useful for, and even more useful, because it would handle all the cases that are presently traps for the unwary that raw-strings have.

    The problem mentioned in this thread of escaping the outer quote character is much more appropriately handled by the triple-quote form. I don't know the Python history well enough to know if raw strings predated triple-quote; if they didn't, there would have been no need for raw strings to attempt to support such.

    @bitdancer
    Member

    I think perhaps the language in the language reference is a bit misleading. The purpose of the raw string algorithm is that any characters in the string be copied literally into the string object. That is, \" "escapes" the " not so that you can write r"\"", but so that the string object produced by that literal (or by the literal r'\"') contains the two characters \".

    So, it is certainly true that raw string will *not* be changed to make \ not escape quotes.

    I can't remember where I read this, but the reason that a trailing \ is invalid has to do with the way the parser parses strings/raw strings. So any alternate "more raw" string type would have to contend with the same parser issues that lead to the existing restrictions on raw strings.

    Unless someone wants to do a deep dive into the parser and figure out why things are the way they are and how to fix it, I think this issue should be reclosed. Note that even if a solution can be found, if it significantly complicates the parser it will probably be rejected. And I suspect it would, otherwise things probably wouldn't work the way they currently do. However, perhaps someone can spot a clever solution that was overlooked in the original implementation.

    @gwideman
    Mannequin

    gwideman mannequin commented Mar 12, 2011

    @Glenn Linderman: I too am usually quick to assume that "innocent fixes" may have serious unforeseen impacts, but in this case I'm not convinced. What would matter is to enumerate the current behavior, and of that what would be changed. You seem to have had experience with other raw-string features/gotchas -- please share! :-)

    @david Murray: Excuse denseness on my part, but I'm not following the logic of your first paragraph. I think you are saying that current raw string has to do something special to be able to contain the sequence backslash-quote, and this has the side effect of precluding that sequence appearing last in a string.

    But surely a completely-escape-free string could also contain backslash-quote just fine (assuming the string is surrounded by the other kind of quote). So I'm thinking that the case you mention is not the driver here.

    It's conceivable there is some more complicated case where backslash-singlequote AND backslash-doublequote MUST appear literally in the same string. However, it seems a little bizarre to worry about that case, but not worry about the simpler case of wanting both a plain singlequote and a plain doublequote in the same string. Maybe there's some popular regular expression that calls for this complexity.

    I concur that inspection of the parser (and the history and intent of this design) would be fascinating.

    @bitdancer
    Member

    If I'm remembering the discussion I read correctly, what the parser does is to parse a regular string and a raw string in exactly the same way, but in the raw string case, it does not do the subsequent escape sequence replacement pass on the parsed string. This means that it follows the "escape the quote" rule when *parsing* the string, but does not do the subsequent post-processing that would remove the \. But because the quote-escape has a consequence at the parsing stage, it applies to both raw strings and regular strings. Which means that an odd number of backslashes cannot appear at the end of either type of string. And as far as I can see there is no way to fix that, since otherwise the parser can't identify the end of the string.
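
The scanning rule described above can be sketched as a toy function. This is an assumption-laden model for illustration, not CPython's tokenizer: `find_closing_quote` is a hypothetical name, and it only handles single-character quotes, not triple quotes.

```python
def find_closing_quote(source: str, start: int) -> int:
    """Return the index of the quote closing the literal that opens at `start`.

    Toy model: a backslash always skips the next character, for raw and
    regular strings alike. Returns -1 if the literal is never closed.
    """
    quote = source[start]
    i = start + 1
    while i < len(source):
        if source[i] == "\\":
            i += 2      # skip the backslash and whatever follows it
        elif source[i] == quote:
            return i    # an unescaped quote ends the literal
        else:
            i += 1
    return -1           # ran off the end: unterminated literal


print(find_closing_quote("'a\\'b'", 0))  # 5: the quote after the backslash is skipped
print(find_closing_quote("'\\\\'", 0))   # 3: two backslashes pair up, then the quote closes
print(find_closing_quote("'\\'", 0))     # -1: the lone backslash swallows the closing quote
```

Under this model the only difference between raw and regular strings would be a later pass that rewrites escape sequences, which is why the trailing-backslash restriction applies to both.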

    Therefore if no escaping is done you do, as you say, limit yourself to not being able to put both ' and " characters inside a more-raw string. This would break regular expressions, since exactly this case does occur when using regular expressions extensively. And a trailing backslash would never appear in a regular expression.

    So, clearly raw strings are optimized for regular expression use, and not for Windows pathname use. The proposed 'windows raw string' literal would be optimized the other way. Adding such a literal is python-ideas territory.

    Note that windows paths can be spelled with / characters, so this specialized use case has an easy workaround, with the added advantage that a repr of such a string won't appear to have doubled \s (which I always find confusing when debugging programs involving windows path names that use the \ separator).
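
A small sketch of the forward-slash workaround, using `pathlib.PureWindowsPath` so that it runs on any platform. The path `C:/somedir/file.txt` is made up for the example.

```python
from pathlib import PureWindowsPath

forward = PureWindowsPath("C:/somedir/file.txt")
backward = PureWindowsPath(r"C:\somedir\file.txt")

# Windows path semantics accept both separators, so the two paths compare equal.
print(forward == backward)  # True

# The repr of a str doubles each backslash, which is the debugging confusion mentioned above.
print(repr("C:\\somedir"))  # 'C:\\somedir'
print(repr("C:/somedir"))   # 'C:/somedir'
```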

    @vpython
    Mannequin

    vpython mannequin commented Mar 12, 2011

    @graham: seems like the two primary gotchas are trailing \ and \" \' not removing the \. The one that tripped me up the most was the trailing \, but I did get hit with \" once. Probably if Python had been my first programming language that used \ escapes, it wouldn't be such a problem, but since it never will be, and because I'm forced to use the others from time to time, still, learning yet a different form of "not-quite raw" string just isn't worth my time and debug efforts. When I first saw it, it sounded more useful than doubling the \s like I do elsewhere, but after repeated trip-ups with trailing \, I decided it wasn't for me.

    @r David: Interesting description of the parsing/escaping. Sounds like that makes for a cheap parser, fewer cases to handle. But there is little that is hard about escape-free or exact string parsing: just look for the trailing " ' """ or ''' that matches the one at the beginning. The only thing difficult is if you want to escape the quote, but with the rich set of quotes available, it is extremely unlikely that you can't find one that you can use, except perhaps if you are writing a parser for parsing Python strings, in which case, the regular expression that matches any leading quote could be expressed as:

    '("|"""|' "'|''')"

    Granted that isn't the clearest syntax in the world, but it is also uncommon, and can be assigned to a nicely named variable such as matchLeadingQuotationRE in one place, and used wherever needed.
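
A sketch of how the expression above behaves, relying on Python joining adjacent string literals. One caveat worth flagging: regex alternation tries alternatives left to right, so the pattern as written never matches a triple quote. The `longest_first` variant is an assumed adjustment, not part of the original comment.

```python
import re

# Two adjacent literals, one single-quoted and one double-quoted, are joined by the compiler.
as_written = '("|"""|' "'|''')"
print(as_written)  # ("|"""|'|''')

# Alternatives are tried in order, so the one-character quote wins over the triple quote.
print(re.match(as_written, '"""doc').group(1))  # "

# Assumed variant: list the longer alternatives first.
longest_first = '("""|"|' "'''|')"
print(re.match(longest_first, '"""doc').group(1))  # """
print(re.match(longest_first, "'x'").group(1))     # '
```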

    Regarding the use of / rather than \, that is true if you are passing file names to Windows APIs, but not true if you are passing them to other programs that use / as option syntax and \ as path separator (most Windows command line utilities).

    @bitdancer
    Member

    Well, it's still the case that the Python raw string syntax isn't going to change to not escape the quotes, because that would break far too many existing applications that depend on them being escaped. So a new string literal type would seem to be the only option. I'm going to close this issue again; if either of you want to pursue the new string literal type, please bring it up on python-ideas. Frankly, I doubt you would get much traction, but I could be wrong :)

    @gwideman
    Mannequin

    gwideman mannequin commented Mar 12, 2011

    Thanks to all for your patient comments. I think I am resigned to raw-string forever being medium-rare-string :-).

    Perhaps it's obvious once you get over the initial shock of non-rawness, but workarounds for the disallowed trailing backslash include (note the final space character):

    mydir = r"C:\somedir\ ".rstrip()   or...
    
    mydir = r"C:\somedir\ "[:-1]

    It might be worth mentioning one of these in the raw string docs to emphasize that there is this gotcha, that it's easy to fix, and prompting this as an idiom that becomes familiar in applications where it's needed.
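
The two workarounds above can be checked side by side. The third form, which joins a raw literal with a regular one, is an additional option that was not proposed in this thread. `C:\somedir` is just the example path used above.

```python
# Workarounds from the comment above (note the space before the closing quote).
with_rstrip = r"C:\somedir\ ".rstrip()
with_slice = r"C:\somedir\ "[:-1]

# Additional option: adjacent literals are joined, so the trailing backslash
# can come from a regular string.
with_concat = r"C:\somedir" "\\"

expected = "C:\\somedir\\"
print(with_rstrip == with_slice == with_concat == expected)  # True
print(with_concat)  # C:\somedir\
```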

    @bitdancer
    Member

    I've opened bpo-11479 with a proposed patch to the tutorial along the lines suggested by Graham.

    @vpython
    Mannequin

    vpython mannequin commented Mar 13, 2011

    On 3/12/2011 7:11 PM, R. David Murray wrote:

    R. David Murray<rdmurray@bitdance.com> added the comment:

    I've opened bpo-11479 with a proposed patch to the tutorial along the lines suggested by Graham.

    Which is good, for people that use the tutorial. I jump straight to the
    reference guide, usually, because of so many years of experience with
    other languages. But I was surprised you used .strip() instead of [:-1]
    which is shorter and I would expect it to be more efficient also.

    @bitdancer
    Member

    Well, the problem with the reference is that the language reference is intended as a specification document, not a tutorial, so such a discussion does not belong there. The library reference, which does contain platform-specific and tutorial-like information simply cross links to the reference docs for raw string syntax. So adding such a note to the library docs is a bit more significant of an undertaking. Anyone want to take a crack at it?

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022