
Title: ast.parse outputs ast.Strs which do not differentiate between the ASCII codepoint 012 (octal; a literal new line) and the ASCII codepoints 134 and 156 (octal; "\n")
Type: behavior
Stage: resolved
Components: Library (Lib)
Versions: Python 3.8, Python 2.7
Status: closed
Resolution: not a bug
Dependencies:
Superseder:
Assigned To:
Nosy List: eric.smith, hawkowl, mark.dickinson, mbussonn
Priority: normal
Keywords:

Created on 2019-05-14 02:02 by hawkowl, last changed 2022-04-11 14:59 by admin. This issue is now closed.

Messages (6)
msg342417 - (view) Author: Amber Brown (hawkowl) * Date: 2019-05-14 02:02
Reproducing case (a file containing only a triple-quoted string):

"""
Hello \n blah.
"""

And then in a REPL (2.7 or 3+):

>>> import ast
>>> f = ast.parse(open("", 'rb').read())
>>> f
<_ast.Module object at 0x7f609d0a4d68>
>>> f.body[0]
<_ast.Expr object at 0x7f609d0a4e10>
>>> f.body[0].value
<_ast.Str object at 0x7f609d02b780>
>>> f.body[0].value.s
'\nHello \n blah.\n'
>>> repr(f.body[0].value.s)
"'\\nHello \\n blah.\\n'"

Expected behaviour:
>>> repr(f.body[0].value.s)
"'\\nHello \\\\n blah.\\n'"
msg342422 - (view) Author: Matthias Bussonnier (mbussonn) * Date: 2019-05-14 02:54
I believe this happens even before the AST, in the tokenizer. The AST also does some normalisation of identifiers, though: for example, “ε” (U+03B5 GREEK SMALL LETTER EPSILON) and “ϵ” (U+03F5 GREEK LUNATE EPSILON SYMBOL) are normalised to the same identifier, which is problematic because they look different in the source but end up being the same name.

I'd be interested in an opt-in flag to disable this normalisation, which might be useful for some linting tools. (I have a prototype of this for the identifier normalisation in ast, but I have not looked at the tokenizer.)
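
For readers who want to see the identifier normalisation mentioned here, the following is a minimal sketch. It assumes Python 3 and that the parser applies NFKC normalisation to non-ASCII identifiers, as PEP 3131 describes; the variable names are illustrative only.

```python
import ast
import unicodedata

small_epsilon = "\u03b5"   # GREEK SMALL LETTER EPSILON
lunate_epsilon = "\u03f5"  # GREEK LUNATE EPSILON SYMBOL

# NFKC maps the lunate form onto the ordinary small epsilon.
nfkc_of_lunate = unicodedata.normalize("NFKC", lunate_epsilon)

# Parse an assignment whose target is spelled with the lunate epsilon,
# then read back the identifier stored in the AST.
tree = ast.parse(lunate_epsilon + " = 1")
identifier_in_ast = tree.body[0].targets[0].id

print(nfkc_of_lunate == small_epsilon)
print(identifier_in_ast == small_epsilon)
```

If both lines print True, the spelling used in the source is not recoverable from the AST node alone.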
msg342511 - (view) Author: Eric V. Smith (eric.smith) * (Python committer) Date: 2019-05-14 20:11
The existing behavior is what I'd expect.

Using python3:

>>> import ast
>>> s = open('', 'rb').read()
>>> s
b'"""\nHello \\n blah.\n"""\n'
>>> ast.dump(ast.parse(s))
"Module(body=[Expr(value=Str(s='\\nHello \\n blah.\\n'))])"
>>> eval(s)
'\nHello \n blah.\n'

As always with the AST, some information is lost. It's not designed to be able to round-trip back to the source text.
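
A small sketch of the information loss described above, assuming Python 3.8 or later (where string literals appear as ast.Constant nodes with a .value attribute rather than ast.Str with .s). The helper name docstring_value is hypothetical and exists only for this illustration.

```python
import ast

# Two different source texts: one spells the line break as an escape
# sequence (backslash followed by "n"), the other as a real line break.
escaped = '"""\nHello \\n blah.\n"""\n'
literal = '"""\nHello \n blah.\n"""\n'


def docstring_value(source):
    """Return the string value of the first expression statement."""
    node = ast.parse(source).body[0].value
    return node.value  # ast.Constant.value on Python 3.8+


same_value = docstring_value(escaped) == docstring_value(literal)
print(escaped == literal)  # the source texts differ
print(same_value)          # the parsed values do not
```

Both spellings evaluate to the same str object, so the AST cannot tell them apart.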
msg342514 - (view) Author: Amber Brown (hawkowl) * Date: 2019-05-14 20:26
There's a difference between round-tripping back to the source text and correctly representing the text in the source, though.

Since I'm using this module to perform static analysis of a Python module to retrieve class/function definitions and their docstrings to create API documentation, the string being the same as what it is in the file is important to me.
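
One possible way to get at the text as written in the file, not proposed in this thread, is ast.get_source_segment, which was added in Python 3.8. The sketch below assumes that version or later and uses a made-up function f as sample input.

```python
import ast

# Sample module source: a function whose docstring contains an escape.
source = 'def f():\n    """\n    Hello \\n blah.\n    """\n'

tree = ast.parse(source)
func = tree.body[0]
doc_node = func.body[0].value  # expression node holding the docstring

# The evaluated string: the escape has already become a newline.
value = doc_node.value

# The exact source text of the literal, quotes and backslash included.
raw = ast.get_source_segment(source, doc_node)

print(repr(value))
print(repr(raw))
```

The AST still supplies the structure (which function, which node), while the raw text comes from slicing the original source by the node's recorded position.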
msg342519 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2019-05-14 21:00
The AST _does_ correctly represent the Python string object in the source, though. After:

>>> s = """
... Hello \n world
... """

we have a Python object `s` of type `str`, which contains exactly three newlines, zero "n" characters, and zero backslashes. So:

>>> s == '\nHello \n world\n'
True

If the AST Str node value were '\nHello \\n world\n' as you suggest, that would represent a different string to `s`: one containing two newline characters, one "n" and one backslash.
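
The character counts in this comparison can be checked directly. This is only an illustrative sketch; the names s, suggested and count_chars are chosen here and are not from the original report.

```python
# The string object produced by the triple-quoted literal.
s = """
Hello \n world
"""

# The value the report expected instead: a backslash followed by "n".
suggested = "\nHello \\n world\n"


def count_chars(text):
    """Count newlines, letter n, and backslashes in text."""
    return {
        "newline": text.count("\n"),
        "n": text.count("n"),
        "backslash": text.count("\\"),
    }


print(count_chars(s))
print(count_chars(suggested))
```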

If you need to operate directly on the source as text, then the AST representation probably isn't what you want.
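
As one way of working on the source as text, the standard library's tokenize module yields each string literal exactly as written. This is a sketch under the assumption that the file holds an ordinary (non-f-string) literal; it is not something suggested in the thread itself.

```python
import io
import tokenize

# Same content as the reproducing file: backslash + "n" inside the literal.
source = '"""\nHello \\n blah.\n"""\n'

# Collect the raw text of every STRING token, quotes and escapes intact.
string_tokens = [
    tok.string
    for tok in tokenize.generate_tokens(io.StringIO(source).readline)
    if tok.type == tokenize.STRING
]

print(repr(string_tokens[0]))
```

The token text keeps the backslash, so a documentation tool could choose between the raw spelling and the evaluated value.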
msg342524 - (view) Author: Eric V. Smith (eric.smith) * (Python committer) Date: 2019-05-14 23:12
I agree with Mark: the string is being correctly interpreted by the AST parser, per Python's tokenizer rules.

You might want to look at lib2to3, which I think is also used by black. It's also possible that mypy or another static analyzer would be using some library you can leverage.
Date | User | Action | Args
2022-04-11 14:59:15 | admin | set | github: 81092
2019-05-14 23:12:53 | eric.smith | set | status: open -> closed; type: behavior; messages: + msg342524; resolution: not a bug; stage: resolved
2019-05-14 21:00:29 | mark.dickinson | set | nosy: + mark.dickinson; messages: + msg342519
2019-05-14 20:26:39 | hawkowl | set | messages: + msg342514
2019-05-14 20:11:38 | eric.smith | set | nosy: + eric.smith; messages: + msg342511
2019-05-14 02:54:30 | mbussonn | set | nosy: + mbussonn; messages: + msg342422
2019-05-14 02:02:23 | hawkowl | create |