This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python Developer Guide.

classification
Title: correct and clarify str.splitlines() documentation
Type:
Stage: resolved
Components: Documentation
Versions: Python 3.2, Python 3.3, Python 2.7
process
Status: closed
Resolution: fixed
Dependencies:
Superseder:
Assigned To: docs@python
Nosy List: chris.jerdonek, docs@python, jcea, ncoghlan, pitrou, python-dev, r.david.murray, terry.reedy
Priority: normal
Keywords: easy, patch

Created on 2012-08-04 02:39 by chris.jerdonek, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Files
File name, uploaded by, upload date
issue-splitlines-docs-1.patch, chris.jerdonek, 2012-08-04 02:39
issue-15554-2.patch, chris.jerdonek, 2012-08-05 18:20
issue-15554-3.patch, chris.jerdonek, 2012-08-06 05:01
issue-15554-4.patch, chris.jerdonek, 2012-08-06 19:50
Messages (12)
msg167394 - (view) Author: Chris Jerdonek (chris.jerdonek) * (Python committer) Date: 2012-08-04 02:39
The documentation for str.splitlines()--

http://docs.python.org/dev/library/stdtypes.html#str.splitlines

includes a statement that is not quite correct:

"Unlike split(), if the string ends with line boundary characters the returned list does not have an empty last element."

For example,

>>> '\n'.splitlines()
['']
>>> '\n\n'.splitlines()
['', '']
>>> '\r\n'.splitlines()
['']
>>> '\n\r\n'.splitlines()
['', '']
>>> '\r'.splitlines()
['']
>>> 'a\n\n'.splitlines()
['a', '']

Also, the note about split() only applies when split() is passed a separator.  For example--

>>> 'a\n'.split('\n')
['a', '']
>>> 'a\n'.split()
['a']

Finally, the function's behavior on the empty string is another difference worth mentioning that is not covered by the existing note.

I am attaching a patch that addresses these points.  Notice also that the patch phrases it not as whether the list *has* an empty last element, but whether an *additional* last element should be added, which is the more important point.
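
To make the three points above concrete, here is a minimal sketch, assuming Python 3 and sample strings whose only line break is '\n'. The helper name `compare` is hypothetical and exists only for this illustration; it returns both results side by side so the last element of each can be compared.

```python
def compare(text):
    # Hypothetical helper: pair the splitlines() result with the split('\n') result.
    return text.splitlines(), text.split("\n")


for sample in ["", "a", "a\n", "a\n\n"]:
    by_lines, by_sep = compare(sample)
    print(repr(sample), by_lines, by_sep)
```

For the samples that are empty or end with a line break, split('\n') returns one more element than splitlines(). The splitlines() result can still end with an empty string, as 'a\n\n' shows.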
msg167450 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2012-08-04 22:13
Sigh. ;)

At this point in my Python programming I intuitively understand what splitlines does, but every time we try to explain it in detail it gets messier and messier.  I wasn't really happy with the addition of that sentence about split in the first place.

I don't understand what your splitlines examples are trying to say, they all look clear to me based on the fact that we are splitting *lines*.  

I don't find your proposed language in the patch to be clearer.  The existing sentence describes the concrete behavior, while your version is sort-of describing (ascribing?) some syntax to the line separators ("does not delimit").  The problem is that there *is* a syntax here, that of universal-newline-delimited-text, but that is too big a topic to explain in the splitlines doc.  There's another issue for creating a central description of universal-newline parsing, perhaps this entry could link to that discussion (and that discussion could perhaps mention splitlines).

The split behavior without a specified separator might actually be a bug (if so, it is not a fixable one), but in any case you are right that that clarification should be added if the existing sentence is kept.
msg167508 - (view) Author: Chris Jerdonek (chris.jerdonek) * (Python committer) Date: 2012-08-05 17:32
> I wasn't really happy with the addition of that sentence about split in the first place.

I think the instinct to put that sentence in there is a good one.  It is a key, perhaps subtle difference.

> I don't understand what your splitlines examples are trying to say, they all look clear to me based on the fact that we are splitting *lines*.  

I perhaps included too many examples and so clouded my point. :)  I just needed one.  The examples were simply to show why the existing language is not correct.  The current language says, "if the string ends with line boundary characters the returned list does not have an empty last element."

However, the examples are of strings that do end with line boundary characters but that *do* have an empty last element.

The point is that splitlines() does not count a terminal line break as an additional line, while split('\n') (for example) does.  But this is different from whether the returned list *has* an empty last element, which is what the current language says.

The returned list can have empty last elements because of line breaks at the end.  It's just that the one at the *very* end doesn't count towards that -- unlike the case for split():

>>> 'a'.splitlines()
['a']
>>> 'a\n'.splitlines()
['a']
>>> 'a\n\n'.splitlines()
['a', '']
>>> 'a\n\n\n'.splitlines()
['a', '', '']
>>> 'a\n\n\n'.split('\n')  # counts terminal line break as an extra line
['a', '', '', '']

I'm open to improving the language.  Maybe "does not count a terminal line break as an additional line" instead of the original "a terminal line break does not delimit an additional empty line"?
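
One way to read "does not count a terminal line break as an additional line" is as a rule for deriving the splitlines() result from the split('\n') result. The sketch below assumes Python 3 and text whose only line boundary is '\n' (the real method recognizes other boundaries such as '\r' and '\r\n'). The function name `splitlines_via_split` is hypothetical, and this is not how CPython implements the method.

```python
def splitlines_via_split(text):
    r"""Emulate str.splitlines() for text whose only line break is '\n'."""
    parts = text.split("\n")
    # split('\n') always returns at least one element. When the text is empty
    # or ends with '\n', the final element is an extra empty string that
    # splitlines() would not produce, so drop it.
    if parts[-1] == "":
        parts.pop()
    return parts


for sample in ["", "a", "a\n", "a\n\n", "a\n\n\n"]:
    assert splitlines_via_split(sample) == sample.splitlines()
```

Only the one trailing empty element is removed, which is why 'a\n\n' still yields ['a', ''].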

> There's another issue for creating a central description of universal-newline parsing, perhaps this entry could link to that discussion (and that discussion could perhaps mention splitlines).

I created that issue (issue 15543), and a patch is in the works along the lines you suggest. ;)

> The split behavior without a specified separator might actually be a bug (if so, it is not a fixable one), but in any case you are right that that clarification should be added if the existing sentence is kept.

Perhaps, but at least split() documents the behavior. :)

"runs of consecutive whitespace are regarded as a single separator, and the result will contain no empty strings at the start or end if the string has leading or trailing whitespace."

(from http://docs.python.org/dev/library/stdtypes.html#str.split )
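
A short illustration of the quoted split() behavior, assuming Python 3; the sample string is made up for this example. With no separator, runs of whitespace collapse and no empty strings appear at either end. With an explicit '\n' separator, every line break delimits an element.

```python
text = " a \n\n b\n"

# No separator: whitespace runs count as one separator, ends are trimmed.
without_sep = text.split()

# Explicit separator: each '\n' delimits, so empty strings are kept.
with_sep = text.split("\n")

print(without_sep)
print(with_sep)
```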
msg167509 - (view) Author: Chris Jerdonek (chris.jerdonek) * (Python committer) Date: 2012-08-05 18:20
Attaching patch with simplified wording in response to R. David Murray's feedback.

In particular, "a terminal line break does not delimit an additional empty line" -> "a terminal line break does not result in an extra line."
msg167531 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2012-08-06 01:59
Ah, now I see what you are talking about.  Yes, your revision in the comment is clearer; but, unless I read it wrong, in the patch it now sounds like you are saying that ''.splitlines() does not return the same result as ''.split() when in fact it does.

I would also prefer that the "differences" discussion come in the separate paragraph after the specification of the behavior of the function, rather than the way you have it split up in the patch.  I would include the mention of the lack-of-extra-line as part of the differences discussion: as I said I think that behavior follows logically from the fact that the function is splitting lines and so doesn't belong in the basic function description.
msg167537 - (view) Author: Chris Jerdonek (chris.jerdonek) * (Python committer) Date: 2012-08-06 05:01
> in the patch it now sounds like you are saying that ''.splitlines() does not return the same result as ''.split() when in fact it does.

The two differences occur only when split() is passed a separator.  split() uses a different algorithm when no separator is specified.  For example, for the empty string case:

>>> ''.splitlines()
[]
>>> ''.split()
[]
>>> ''.split('\n')
['']

That is why I used the phrase "Unlike split() when passed a separator" in the patch:

+   Unlike :meth:`~str.split` when passed a separator, this method returns
+   an empty list for the empty string, and a terminal line break does not

I will change the language in the patch to parallel split()'s documentation more closely, to emphasize and make this distinction clearer: "when passed a separator" -> "when a delimiter string *sep* is given".

> I would also prefer that the "differences" discussion come in the separate paragraph after the specification of the behavior of the function,

Good point.  I agree with you.  That occurred to me while drafting the patch, but I was hesitant to change the existing structure too much.

In the updated patch I am attaching, I have also made that change.  Thanks a lot for reviewing!
msg167558 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2012-08-06 13:06
Ah, I read too quickly before.  But that expression "when a delimiter string *sep* is given" is hard to wrap one's head around in this context.  I think the problem really is that 'split' has such radically different behavior when given an argument as opposed to when it isn't.  I consider that a design flaw in split, and always have.  So, I suppose we can't do any better here because of that.

Please move the keepends discussion back up into the initial paragraph, and then I think we'll be good to go.
msg167569 - (view) Author: Chris Jerdonek (chris.jerdonek) * (Python committer) Date: 2012-08-06 18:10
> I think the problem really is that 'split' has such radically different behavior when given an argument as opposed to when it isn't.

Yep, the split() documentation is much more involved because of that.

> Please move the keepends discussion back up into the initial paragraph, and then I think we'll be good to go.

Sounds good.  Would you also like me to move the example before the paragraph about differences, or should I leave the example at the end?

Mention of the example may flow better after the keepends discussion, because the example is more about keepends than about the differences with split().  But it can go either way.
msg167570 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2012-08-06 18:40
Good point.  Difference paragraph after the example would be best, I think.
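
For reference, a sketch of the kind of keepends example under discussion, assuming Python 3 (where keepends can be passed by keyword). The sample string is chosen here for illustration and mixes '\n', '\r' and '\r\n' line breaks.

```python
text = "ab c\n\nde fg\rkl\r\n"

# Default: line breaks are stripped from the returned lines.
stripped = text.splitlines()

# keepends=True: each line keeps its own terminating line break.
kept = text.splitlines(keepends=True)

print(stripped)
print(kept)

# With the line breaks kept, joining the pieces restores the original string.
assert "".join(kept) == text
```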
msg167572 - (view) Author: Chris Jerdonek (chris.jerdonek) * (Python committer) Date: 2012-08-06 19:50
Here you go.  Thanks again.
msg167574 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2012-08-06 20:09
New changeset 768b188262e7 by R David Murray in branch '3.2':
#15554: clarify splitlines/split differences.
http://hg.python.org/cpython/rev/768b188262e7

New changeset 0d6eea2330d0 by R David Murray in branch 'default':
Merge #15554: clarify splitlines/split differences.
http://hg.python.org/cpython/rev/0d6eea2330d0

New changeset e057a7d18fa2 by R David Murray in branch '2.7':
#15554: clarify splitlines/split differences.
http://hg.python.org/cpython/rev/e057a7d18fa2
msg167575 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2012-08-06 20:10
Thanks for sticking with it.
History
Date User Action Args
2022-04-11 14:57:33 admin set github: 59759
2012-08-06 20:10:30 r.david.murray set status: open -> closed; resolution: fixed; messages: + msg167575; stage: patch review -> resolved
2012-08-06 20:09:50 python-dev set nosy: + python-dev; messages: + msg167574
2012-08-06 19:50:41 chris.jerdonek set files: + issue-15554-4.patch; messages: + msg167572
2012-08-06 18:40:32 r.david.murray set messages: + msg167570
2012-08-06 18:10:36 chris.jerdonek set messages: + msg167569
2012-08-06 13:06:21 r.david.murray set messages: + msg167558
2012-08-06 13:06:02 r.david.murray set messages: - msg167557
2012-08-06 13:05:18 r.david.murray set messages: + msg167557
2012-08-06 05:01:43 chris.jerdonek set files: + issue-15554-3.patch; messages: + msg167537
2012-08-06 02:13:38 terry.reedy set nosy: + terry.reedy
2012-08-06 01:59:01 r.david.murray set messages: + msg167531
2012-08-05 18:20:25 chris.jerdonek set files: + issue-15554-2.patch; messages: + msg167509
2012-08-05 17:32:16 chris.jerdonek set messages: + msg167508
2012-08-04 22:13:55 r.david.murray set nosy: + ncoghlan, r.david.murray; messages: + msg167450
2012-08-04 02:39:37 chris.jerdonek create