Title: Doc strings no longer stored in body of AST
Type: behavior Stage: resolved
Components: Interpreter Core Versions: Python 3.7
Status: closed Resolution: wont fix
Dependencies: Superseder:
Assigned To: Nosy List: Mark.Shannon, benjamin.peterson, georg.brandl, inada.naoki, lukasz.langa, ned.deily, rhettinger, serhiy.storchaka, vstinner
Priority: Keywords: 3.7regression

Created on 2018-02-22 12:20 by Mark.Shannon, last changed 2018-03-18 20:46 by ned.deily. This issue is now closed.

Pull Requests
URL Status Linked Edit
PR 5927 closed inada.naoki, 2018-02-27 11:58
Messages (24)
msg312557 - (view) Author: Mark Shannon (Mark.Shannon) * Date: 2018-02-22 12:20
Python 3.7.0b1+ (heads/3.7:dfa1144, Feb 22 2018, 12:10:59) 

>>> m = ast.parse("def foo():\n   'help'")
>>> m.body[0].body

Correct behaviour (3.6 or earlier)

>>> m = ast.parse("def foo():\n   'help'")
>>> m.body[0].body
[<_ast.Expr object at 0x7fb9eeb1d4e0>]
msg312558 - (view) Author: INADA Naoki (inada.naoki) * (Python committer) Date: 2018-02-22 12:28
AST is changed slightly from Python 3.7.
ast.get_docstring() works for both of 3.6 and 3.7.

$ ./python
Python 3.8.0a0 (heads/master-dirty:451d1edaf4, Feb 22 2018, 21:11:54)
[GCC 7.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import ast
>>> m = ast.parse("def foo():\n   'help'")
>>> ast.get_docstring(m.body[0])
msg312559 - (view) Author: Mark Shannon (Mark.Shannon) * Date: 2018-02-22 12:34
This is an unnecessary and breaking change.
Changes like this should not be made unless necessary to fix a bug.
msg312561 - (view) Author: INADA Naoki (inada.naoki) * (Python committer) Date: 2018-02-22 12:39
msg312562 - (view) Author: Mark Shannon (Mark.Shannon) * Date: 2018-02-22 12:47
That issue has to do with bytecode generation, not the AST.

The AST should be an accurate representation of the Python source code.
Making a better representation of the source would be fine, but this makes it worse.

Doc-strings may be semantically distinct from other expressions, but they are syntactically the same.
msg312564 - (view) Author: INADA Naoki (inada.naoki) * (Python committer) Date: 2018-02-22 13:11
We implemented AST-level constant folding (#29469) based on the AST change (#29463).

I feel it's possible to revert AST change without reverting AST constant folding:

* Docstring is exposed from AST before calling AST optimizer.
* AST optimizer removes unused constant.

But I don't remember how difficult it is because since this change is made one year ago (GH-46)...
msg312565 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2018-02-22 13:15
This isn't a bug, but a feature. You no longer need to check and skip the first statement if it is a literal string. The body attribute now always represents a sequence of statements, and the docstring attribute represents a docstring.
msg312567 - (view) Author: Mark Shannon (Mark.Shannon) * Date: 2018-02-22 14:14
Stating that "this is a feature not a bug" does not make it so.
This breaks existing code and reduces the capabilities of the `ast` module.
For example, how does one get the location of the docstring now?

From a syntactic point of view.
def foo():
def foo():
barely differ.
The S in AST stands for Syntax not Semantics.
msg312580 - (view) Author: Ned Deily (ned.deily) * (Python committer) Date: 2018-02-22 18:42
It seems we have a difference of opinion here.  Serhiy closed this issue so, Mark, if you feel strongly enough to pursue it, you should reopen it and solicit other opinions.  The clock has just about run out to change the now current behavior for 3.7.0.
msg312588 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2018-02-22 19:24
It may be worth to include docstrings explicitly in the grammar.
msg312625 - (view) Author: Mark Shannon (Mark.Shannon) * Date: 2018-02-23 10:16
Serhiy, thanks for reopening this issue.

It seems to be that there are three reasonable choices:
1. Revert to 3.6 behaviour, with the addition of `docstring` attribute.
2. Change the docstring attribute to an AST node, possibly by modifying the grammar.
3. Do nothing.

I would prefer 1, as it requires no changes to 3rd party code and doesn't present an additional obstacle when porting from Python 2.

2 would be acceptable, as it allows tools to easily convert the body back to its 3.6 form (or vice-versa)

3 is a pain as it involves re-tokenizing the file to get the location of the doc-string.
msg312629 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2018-02-23 11:19
I prefer 2 or 3. There are benefits from representing a docstring as a separate attribute. This simplifies the code for walking the AST tree (no longer special cases for skipping docstring in modules, classes and functions) and the code for retrieving a docstring. It solves the problem that an expression resulting to a constant string (like "a"+"b" or f"a") shouldn't be interpreted as docstrings.

The position of a docstring can be useful for determining the position of fragments inside a docstring (for example for doctests). Several active developed third-party libraries (like pyflakes, see already updated their code for supporting 3.7. The position of nodes preceding or following a docstring could be used. This is not perfect, it doesn't work with empty lines before or after docstring, but it never was perfect due to escaped newlines in string literals and line continuations.
msg312630 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2018-02-23 11:29
Note, that getting the location of the dostring already was a pain since in case of multiline string literals CPython saves the location of the last line, and PyPy saves the location of the first line. Getting the location of the specific fragment of the docstring is even larger pain as shown in following examples:

    def f()
        >>> f()

    def f()
        >>> f()

    def f()
        ">>> f()"\
msg312856 - (view) Author: Ned Deily (ned.deily) * (Python committer) Date: 2018-02-25 20:41
Is this going to get resolved in time for 3.7.0b2?  If we need to change this behavior, I really do not want to delay this for b3 which is when the 3.7.0 ABI freezes; otherwise such a major behavior change would likely need to wait until 3.8.
msg312989 - (view) Author: INADA Naoki (inada.naoki) * (Python committer) Date: 2018-02-27 09:38
I tried this patch:

diff --git a/Python/ast.c b/Python/ast.c
index e2092f0f85..93be2bc839 100644
--- a/Python/ast.c
+++ b/Python/ast.c
@@ -3537,9 +3537,9 @@ docstring_from_stmts(asdl_seq *stmts)
         if (s->kind == Expr_kind && s->v.Expr.value->kind == Str_kind) {
             string doc = s->v.Expr.value->v.Str.s;
             /* not very efficient, but simple */
-            memmove(&asdl_seq_GET(stmts, 0), &asdl_seq_GET(stmts, 1),
-                    (stmts->size - 1) * sizeof(void*));
-            stmts->size--;
+            //memmove(&asdl_seq_GET(stmts, 0), &asdl_seq_GET(stmts, 1),
+            //        (stmts->size - 1) * sizeof(void*));
+            //stmts->size--;
             return doc;

But I got "SyntaxError: from __future__ imports must occur at the beginning of the file".

docstring is very special while it looks like just a single string literal statement...
msg312999 - (view) Author: INADA Naoki (inada.naoki) * (Python committer) Date: 2018-02-27 12:01
I'm implementing (2).  Please check GH-5927.
msg313000 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2018-02-27 12:17
Would be nice to discuss this design question on Python-list.
msg313001 - (view) Author: INADA Naoki (inada.naoki) * (Python committer) Date: 2018-02-27 12:45
Python-list? -dev? -ideas?
msg313002 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2018-02-27 12:51
Sorry, Python-Dev, of course.
msg313005 - (view) Author: Ned Deily (ned.deily) * (Python committer) Date: 2018-02-27 13:17
Since we are already past the 3.7.0b2 cutoff time and there does not seen to be a consensus that the current 3.7 behavior needs to change and the proposed change is quite large, I think we should not change anything now for b2.  You can have a discussion on python-dev and, if there is a consensus for change, we can look at it for b3.
msg313315 - (view) Author: INADA Naoki (inada.naoki) * (Python committer) Date: 2018-03-06 10:19
There are no enough +1 for merging GH-5927 yet on the ML discussion.

@Mark.Shannon, would you comment your opinion to the ML thread?

For now, I don't want to break tools supporting Python 3.7 a1~b2 already.
Although I will select (2) if there is time machine...
msg313698 - (view) Author: Łukasz Langa (lukasz.langa) * (Python committer) Date: 2018-03-12 20:43
Python's AST already doesn't include a lot of syntax that has no runtime effect:
- comments;
- whitespace and punctuation.

The situation is now made "worse" with constant folding.  You won't have full line information on the full expression, you don't even fully *see* expressions as syntactically provided by user code.  The docstring is just a special case of that.

As a counter example, it still includes function-level type annotations which are not evaluated. Clearly we don't have an explicit rule about this.

If you asked *me*, I would say Python provides a "lossy" AST and this is the one it itself uses.  It also provides a "lossless" AST and this one is in lib2to3.  It's unfortunate that the latter is not documented and kind of neglected.  This is something I want to address soonish.

Summing up, I'm in favor of option 3.  The AST preserves the docstring, it simply doesn't have information on where exactly it lives.  In exchange, the tree makes it easier to *find* the docstring and easier to find the first instruction (as Serhiy points out, the first statement is now always code to be executed).

Quoting literally the second sentence of the `ast` documentation:
> The abstract syntax itself might change with each Python release.

I am not against Option 2 since it's already implemented but I feel it's a pretty big change, pretty late in the release process for 3.7.  The amount of work required *just* to preserve the line number of the docstring (which can be reasonably guessed anyway) seems silly to me.
msg313997 - (view) Author: Mark Shannon (Mark.Shannon) * Date: 2018-03-17 12:23
I still think that option 2 would be best, but given that it is late in the release cycle, I will accept the status quo.

A couple of comments for the record:

Constant folding occurs after AST creation, so doesn't make the AST worse.

The parse tree created by lib2to3 is a concrete parse tree, not an AST. However, it might would sense (as Łukasz suggests) to use it as a basis for an AST designed for tooling and leave the AST generated by the C parser for bytecode generation.

Happy to close this issue now, unless anyone else has something to add.
msg314056 - (view) Author: Ned Deily (ned.deily) * (Python committer) Date: 2018-03-18 20:46
Thanks everyone for your inputs.  Let's go with the status quo -> closing this issue.
Date User Action Args
2018-03-18 20:46:22ned.deilysetstatus: open -> closed
priority: deferred blocker ->
messages: + msg314056

resolution: wont fix
stage: resolved
2018-03-17 12:23:31Mark.Shannonsetmessages: + msg313997
2018-03-12 20:43:58lukasz.langasetnosy: + lukasz.langa
messages: + msg313698
2018-03-06 10:19:00inada.naokisetmessages: + msg313315
2018-02-27 13:17:55ned.deilysetpriority: release blocker -> deferred blocker

messages: + msg313005
2018-02-27 12:51:25serhiy.storchakasetmessages: + msg313002
2018-02-27 12:45:51inada.naokisetmessages: + msg313001
2018-02-27 12:17:45serhiy.storchakasetmessages: + msg313000
2018-02-27 12:01:35inada.naokisetkeywords: - patch

messages: + msg312999
stage: patch review -> (no value)
2018-02-27 11:58:32inada.naokisetkeywords: + patch
stage: patch review
pull_requests: + pull_request5698
2018-02-27 09:38:04inada.naokisetmessages: + msg312989
2018-02-25 20:41:55ned.deilysetpriority: normal -> release blocker

messages: + msg312856
2018-02-23 15:35:19ppperrysettitle: Doc strings omitted from AST -> Doc strings no longer stored in body of AST
2018-02-23 11:29:44serhiy.storchakasetmessages: + msg312630
2018-02-23 11:19:43serhiy.storchakasetmessages: + msg312629
2018-02-23 10:16:55Mark.Shannonsetmessages: + msg312625
2018-02-22 19:24:41serhiy.storchakasetstatus: closed -> open
resolution: fixed -> (no value)
messages: + msg312588

stage: resolved -> (no value)
2018-02-22 18:42:44ned.deilysetnosy: + ned.deily
messages: + msg312580
2018-02-22 14:14:24Mark.Shannonsetmessages: + msg312567
2018-02-22 13:15:32serhiy.storchakasetstatus: open -> closed
resolution: fixed
messages: + msg312565

stage: resolved
2018-02-22 13:11:21inada.naokisetcomponents: + Interpreter Core, - Library (Lib)
2018-02-22 13:11:07inada.naokisetnosy: + georg.brandl, rhettinger, vstinner, benjamin.peterson, serhiy.storchaka
messages: + msg312564
2018-02-22 12:47:02Mark.Shannonsetmessages: + msg312562
2018-02-22 12:39:49inada.naokisetmessages: + msg312561
2018-02-22 12:34:51Mark.Shannonsetmessages: + msg312559
2018-02-22 12:28:52inada.naokisetnosy: + inada.naoki
messages: + msg312558
2018-02-22 12:20:07Mark.Shannoncreate