
classification
Title: Tweak doctest 'example' regex to allow a leading ellipsis in 'want' line
Type: enhancement
Components: Library (Lib)
Versions: Python 3.8
process
Status: open
Nosy List: bskinn, paul.moore
Priority: normal

Created on 2019-04-24 16:22 by bskinn, last changed 2022-04-11 14:59 by admin.

Messages (6)
msg340788 - Author: Brian Skinn (bskinn) - Date: 2019-04-24 16:22
doctest requires code examples have PS1 as ">>> " and PS2 as "... " -- that is, each is three printed characters, followed by a space:

```
$ cat ell_err.py
import doctest

class Foo:
    """Test docstring.

    >>>print("This is a test sentence.")
    ...a test...

    """

doctest.run_docstring_examples(
    Foo(),
    {},
    optionflags=doctest.ELLIPSIS,
)

$ python3.8 --version
Python 3.8.0a3
$ python3.8 ell_err.py
Traceback (most recent call last):
    ...
ValueError: line 3 of the docstring for NoName lacks blank after >>>: '    >>>print("This is a test sentence.")'


$ cat ell_print.py
import doctest

class Foo:
    """Test docstring.

    >>> print("This is a test sentence.")
    ...a test...

    """

doctest.run_docstring_examples(
    Foo(),
    {},
    optionflags=doctest.ELLIPSIS,
)

$ python3.8 ell_print.py
Traceback (most recent call last):
    ...
ValueError: line 4 of the docstring for NoName lacks blank after ...: '    ...a test...'

```

AFAICT, this behavior is consistent across 3.4.10, 3.5.7, 3.6.8, 3.7.3, and 3.8.0a3.


**However**, in `ell_print.py` above, that "PS2" line isn't actually meant to be a continuation of the 'source' portion of the example; it's meant to be the *output* (the 'want') of the example, with a leading ellipsis to be matched per `doctest.ELLIPSIS` rules.
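
The same failure can be reproduced without a separate script by handing the docstring text straight to `doctest.DocTestParser`. This is only a minimal sketch; the helper name `parse_error` is made up for illustration.

```python
import doctest

# Same docstring body as ell_print.py: a well-formed PS1 line followed by
# a 'want' line that begins with an ellipsis.
DOCSTRING = '''Test docstring.

    >>> print("This is a test sentence.")
    ...a test...

'''


def parse_error(text):
    """Return the parser's ValueError message for `text`, or None if it parses."""
    try:
        doctest.DocTestParser().get_examples(text, name="NoName")
    except ValueError as exc:
        return str(exc)
    return None


message = parse_error(DOCSTRING)
print(message)
```

On an unpatched doctest, the printed message should be the "lacks blank after ..." error quoted above, since the parser treats the ellipsis line as PS2.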

The regex currently used to look for the 'source' of an example is (https://github.com/python/cpython/blob/4f5a3493b534a95fbb01d593b1ffe320db6b395e/Lib/doctest.py#L583-L586):

```
(?P<source>
    (?:^(?P<indent> [ ]*) >>>    .*)    # PS1 line
    (?:\n           [ ]*  \.\.\. .*)*)  # PS2 lines
\n?
```

Since this pattern is compiled with re.VERBOSE (https://github.com/python/cpython/blob/4f5a3493b534a95fbb01d593b1ffe320db6b395e/Lib/doctest.py#L592), whitespace in the pattern is ignored, so the space that forms the fourth character of PS1/PS2 is not explicitly matched.
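
As a rough illustration of that point, here is a simplified PS1-only fragment compiled both ways. The names `LOOSE_PS1` and `STRICT_PS1` are invented for this sketch and are not part of doctest.

```python
import re

FLAGS = re.MULTILINE | re.VERBOSE

# Under re.VERBOSE the spaces after '>>>' are layout only and match nothing.
LOOSE_PS1 = re.compile(r'^(?P<indent> [ ]*) >>>    .*', FLAGS)
# A space inside a character class survives re.VERBOSE and must be matched.
STRICT_PS1 = re.compile(r'^(?P<indent> [ ]*) >>>[ ] .*', FLAGS)

line = '    >>>print("This is a test sentence.")'

loose_matches = LOOSE_PS1.match(line) is not None
strict_matches = STRICT_PS1.match(line) is not None
print(loose_matches, strict_matches)
```

The loose form accepts the prompt with no following space, which is why the later check in the parsing code has to reject it separately.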

I propose changing the regex to:

```
(?P<source>
    (?:^(?P<indent> [ ]*) >>>[ ]    .*)    # PS1 line
    (?:\n           [ ]*  \.\.\.[ ] .*)*)  # PS2 lines
\n?
```

This will then *explicitly* match the trailing space of PS1; it *shouldn't* break any existing doctests, because the parsing code lower down has already been requiring that space to be present in PS1, as shown for `ell_err.py` above.

This will also require an *explicit trailing space* to be present in order for a line starting with three periods to be interpreted as a PS2 line of 'source'; otherwise, it will be treated as part of the 'want'.  I made this change in my local user install of 3.8's doctest.py, and it works as I expect on `ell_print.py`, passing the test:

```
$ python3.8 ell_print.py
$
$ cat ell_wrongprint.py
import doctest

class Foo:
    """Test docstring.

    >>> print("This is a test sentence.")
    ...a foo test...

    """

doctest.run_docstring_examples(
    Foo(),
    {},
    optionflags=doctest.ELLIPSIS,
)

$ python3.8 ell_wrongprint.py
**********************************************************************
File "ell_wrongprint.py", line ?, in NoName
Failed example:
    print("This is a test sentence.")
Expected:
    ...a foo test...
Got:
    This is a test sentence.

```
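
For anyone who wants to try the change without editing the installed doctest.py, one possible approach is a parser subclass that overrides the pattern. Note the assumptions: `_EXAMPLE_RE` is a private attribute of `DocTestParser` and could change between versions, the class name `SpacedPromptParser` is hypothetical, and the 'want' half of the pattern is copied as I understand the current source rather than taken from any patch.

```python
import doctest
import re


class SpacedPromptParser(doctest.DocTestParser):
    """Hypothetical parser requiring a space after '>>>' and '...'."""

    # Assumption: overriding the private _EXAMPLE_RE attribute is enough.
    _EXAMPLE_RE = re.compile(r'''
        (?P<source>
            (?:^(?P<indent> [ ]*) >>>[ ]    .*)    # PS1 line
            (?:\n           [ ]*  \.\.\.[ ] .*)*)  # PS2 lines
        \n?
        (?P<want> (?:(?![ ]*$)    # Not a blank line
                     (?![ ]*>>>)  # Not a line starting with PS1
                     .+$\n?       # But any other line
                  )*)
        ''', re.MULTILINE | re.VERBOSE)


DOCSTRING = '''Test docstring.

    >>> print("This is a test sentence.")
    ...a test...

'''

example = SpacedPromptParser().get_examples(DOCSTRING, name="NoName")[0]
got = "This is a test sentence.\n"
checker = doctest.OutputChecker()
passes = checker.check_output(example.want, got, doctest.ELLIPSIS)
print(repr(example.want), passes)
```

With this override the ellipsis line lands in `example.want` instead of raising, and the usual ELLIPSIS comparison then decides whether the example passes.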

For completeness, the following piece of regex in the 'want' section (https://github.com/python/cpython/blob/4f5a3493b534a95fbb01d593b1ffe320db6b395e/Lib/doctest.py#L589):

```
    (?![ ]*>>>)  # Not a line starting with PS1
```

should probably also be changed to:

```
    (?![ ]*>>>[ ])  # Not a line starting with PS1
```


I would be happy to put together a PR for this; I would plan to take a ~TDD style approach, implementing a few tests first and then making the regex change.

msg340918 - Author: Brian Skinn (bskinn) - Date: 2019-04-26 14:10
Ahh, this *will* break some doctests: any with blank PS2 lines in the 'source' portion without the explicit trailing space:

1] >>> def foo():
2] ...    print("bar")
3] ...
4] ...    print("baz")
5] >>> foo()
6] bar
7] baz

If line 3 contains exactly "..." instead of starting with "... ", it will not be recognized as a PS2 line and the example will be parsed as:

'source'
>>> def foo():
...    print("bar")

'want'
...
...    print("baz")

IMO this isn't a *terribly* unreasonable tradeoff, though -- it would enable the specific ellipsis use-case as in the OP, at the cost of breaking some doctests, which shouldn't(?) be in any critical paths?
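
The breakage can be seen by running only the 'source' half of each pattern over the example above (with the bracketed line numbers removed). This is a sketch using simplified fragments; `CURRENT_SOURCE` and `PROPOSED_SOURCE` are names made up here.

```python
import re

FLAGS = re.MULTILINE | re.VERBOSE

CURRENT_SOURCE = re.compile(r'''
    (?P<source>
        (?:^(?P<indent> [ ]*) >>>    .*)    # PS1 line
        (?:\n           [ ]*  \.\.\. .*)*)  # PS2 lines
    ''', FLAGS)

PROPOSED_SOURCE = re.compile(r'''
    (?P<source>
        (?:^(?P<indent> [ ]*) >>>[ ]    .*)    # PS1 line
        (?:\n           [ ]*  \.\.\.[ ] .*)*)  # PS2 lines
    ''', FLAGS)

EXAMPLE = (
    '>>> def foo():\n'
    '...    print("bar")\n'
    '...\n'                     # bare PS2 line, no trailing space
    '...    print("baz")\n'
    '>>> foo()\n'
    'bar\n'
    'baz\n'
)

current_source = CURRENT_SOURCE.search(EXAMPLE).group('source')
proposed_source = PROPOSED_SOURCE.search(EXAMPLE).group('source')
print(len(current_source.splitlines()), len(proposed_source.splitlines()))
```

The current fragment captures all four lines of the function definition as 'source', while the proposed one stops before the bare "..." line.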

msg349289 - Author: Paul Moore (paul.moore) (Python committer) - Date: 2019-08-09 14:21
It shouldn't be hard to update the regex to accept either "... " followed by other text or "..." on a line on its own, surely?

msg349304 - Author: Brian Skinn (bskinn) - Date: 2019-08-09 18:06
Mm, agreed--that regex wouldn't be hard to write.

The problem is, AFAICT there's irresolvable syntactic ambiguity in a line starting with exactly three periods, if the doctest PS2 specification is not constrained to be exactly "... ". In such a case, "..." could mark either (1) an ellipsis standing in for an entire line of 'want', or (2) a PS2, marking a blank line in 'source'.

I don't really think aggressive lookahead would help much -- an arbitrary number of following lines could contain exactly "...", and the intended transition from 'source' to 'want' could lie at any one of them.  The nonrecursive nature of regex is unhelpful here, but I don't think one could even write a recursive-descent parser, or similar, that could be 100% reliable on a single comparison. It would have to test the string against all the various splits between 'source' and 'want' along those "..." lines, and see if any match. Hairy mess.
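
To make the "test all the splits" idea concrete, here is a toy enumeration of the candidate boundaries. It is purely illustrative: `candidate_splits` is a hypothetical helper, and it assumes the example's lines have already been de-indented and isolated.

```python
def candidate_splits(lines):
    """Yield every (source, want) split allowed if '...' may be either role.

    `lines` holds one example: a PS1 line followed by the lines up to the
    next blank line or PS1 line.
    """
    for cut in range(1, len(lines) + 1):
        # Everything between PS1 and the cut must be readable as PS2.
        if all(line.startswith("...") for line in lines[1:cut]):
            yield lines[:cut], lines[cut:]


example = [">>> f(", "...", "...", "done"]
splits = list(candidate_splits(example))
for source, want in splits:
    print(source, want)
```

Each bare "..." line adds another candidate boundary, and a checker would have to run the example against every one of them before it could report a failure.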

AFAICT, defining "... " as PS2, and "..." as 'ellipsis representing a whole line' is the cleanest solution from a logical point of view.

Of course, then it's *visually* confusing, because trailing space. ¯\_(ツ)_/¯

msg349308 - Author: Brian Skinn (bskinn) - Date: 2019-08-09 18:41
I suppose one alternative solution might be to tweak the ELLIPSIS feature of doctest, such that it would interpret a run of >=3 periods in a row (matching regex pattern of "[.]{3,}") as 'ellipsis'.

The regex for PS2 could then have a negative lookahead added, so that it matches *exactly* three periods (optionally followed by other content) and never the start of a longer run: '\.\.\.(?!\.)'

That way, a line like "... foo" would retain the current meaning of "'source' line, consisting of PS2 plus the identifier 'foo'", but the meaning of "arbitrary content followed by ' foo'" could be achieved by ".... foo", since the leading "...." would NOT match the negative lookahead for PS2.

In other situations, where "..." is *not* the leading non-whitespace content, the old behavior suffices: the PS2 regex won't match anyways, so it'll be left for ELLIPSIS to process.
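
A quick sketch of how the two patterns would divide up leading periods under this alternative. The names `PS2_PREFIX`, `ELLIPSIS_RUN`, and `is_ps2` are assumptions for illustration; nothing here exists in doctest.

```python
import re

# PS2 under the alternative: exactly three periods, never a fourth.
PS2_PREFIX = re.compile(r'[ ]*\.\.\.(?!\.)')
# ELLIPSIS marker under the alternative: a run of three or more periods.
ELLIPSIS_RUN = re.compile(r'[.]{3,}')


def is_ps2(line):
    """Return True if `line` would be read as a PS2 'source' line."""
    return PS2_PREFIX.match(line) is not None


for line in ("... foo", ".... foo", "..."):
    print(repr(line), is_ps2(line))
```

Here "... foo" and a bare "..." stay PS2, while ".... foo" falls through to the 'want' and its four periods would be consumed as one ellipsis.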

msg349327 - Author: Brian Skinn (bskinn) - Date: 2019-08-10 03:21
On reflection, it would probably be better to limit the ELLIPSIS to 3 or 4 periods ('[.]{3,4}'); otherwise, it would be impossible to express an ellipsis followed by a period in a 'want'.
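
One way to read that concern, sketched with `re.split` standing in for however doctest would locate the markers (the names `UNBOUNDED` and `BOUNDED` are made up for this sketch):

```python
import re

UNBOUNDED = re.compile(r'[.]{3,}')
BOUNDED = re.compile(r'[.]{3,4}')

# Intended meaning: an ellipsis followed by a literal period.
want = "and so on....."

unbounded_parts = UNBOUNDED.split(want)
bounded_parts = BOUNDED.split(want)
print(unbounded_parts, bounded_parts)
```

With the unbounded pattern all five periods are swallowed by the marker, whereas the bounded pattern takes four and leaves the last one as literal text.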

History
Date                 User        Action  Args
2022-04-11 14:59:14  admin       set     github: 80895
2019-08-10 03:21:57  bskinn      set     messages: + msg349327
2019-08-09 18:41:15  bskinn      set     messages: + msg349308
2019-08-09 18:06:25  bskinn      set     messages: + msg349304
2019-08-09 14:21:22  paul.moore  set     nosy: + paul.moore; messages: + msg349289
2019-04-26 14:10:22  bskinn      set     messages: + msg340918
2019-04-24 16:22:38  bskinn      create