classification
Title: split(None, maxsplit) does not strip whitespace correctly
Type: behavior
Stage:
Components: Documentation
Versions: Python 3.0, Python 2.4, Python 2.6, Python 2.5
process
Status: closed
Resolution: fixed
Dependencies:
Superseder:
Assigned To:
Nosy List: brett.cannon, effbot, fdrake, georg.brandl, jafo, nirs
Priority: low
Keywords:

Created on 2007-09-07 01:18 by nirs, last changed 2007-10-08 07:50 by georg.brandl. This issue is now closed.

Messages (11)
msg55720 - (view) Author: Nir Soffer (nirs) * Date: 2007-09-07 01:18
The string object's .split documentation says (http://docs.python.org/lib/string-methods.html):

    "If sep is not specified or is None, a different splitting algorithm 
is applied. First, whitespace characters (spaces, tabs, newlines, 
returns, and formfeeds) are stripped from both ends."

If the maxsplit argument is set and is smaller than the number of 
possible parts, trailing whitespace is not removed.

Examples:

>>> 'k: v\n'.split(None, 1)
['k:', 'v\n']

Expected: ['k:', 'v']

>>> u'k: v\n'.split(None, 1)
[u'k:', u'v\n']

Expected: [u'k:', u'v']

With larger values of maxsplit, it works correctly:

>>> 'k: v\n'.split(None, 2)
['k:', 'v']
>>> u'k: v\n'.split(None, 2)
[u'k:', u'v']

This looks like an implementation bug, because it does not make sense 
that the stripping depends on the maxsplit argument, and it will be hard 
to explain such behavior.

Maybe the stripping should be removed in Python 3? It does not make sense 
to strip a string behind your back when you want to split it, and the 
caller can easily strip the string if needed.
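
As an illustration of the caller-side stripping mentioned above, here is a minimal sketch in Python 3 syntax. The helper name split_kv is hypothetical and only serves the example; it assumes the caller wants no leading or trailing whitespace in any part.

```python
def split_kv(line):
    """Hypothetical helper: strip the line first, then split once on whitespace."""
    return line.strip().split(None, 1)


raw = 'k: v\n'

# Splitting directly leaves the trailing newline in the last element.
direct = raw.split(None, 1)

# Stripping before splitting removes it regardless of maxsplit.
stripped_first = split_kv(raw)

print(direct)
print(stripped_first)
```

With this approach the result no longer depends on how maxsplit compares to the number of words in the line.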
msg55806 - (view) Author: Fredrik Lundh (effbot) * (Python committer) Date: 2007-09-10 22:13
Looks like a *documentation* bug to me; at the implementation level,
None just means "no empty parts, treat runs of whitespace as separators".
msg55807 - (view) Author: Nir Soffer (nirs) * Date: 2007-09-10 22:32
I did not look into the source, but obviously there is stripping of 
leading and trailing whitespace. 

When you specify a separator you get:
>>> '  '.split(' ')
['', '', '']

>>> '  a  b  '.split('  ')
['', 'a', 'b', '']

So one would expect to get this without stripping:
>>> '  a  b  '.split()
['', 'a', 'b', '']

But you get this:
>>> '  a  b  '.split()
['a', 'b']

So the documentation is correct.
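
The contrast between an explicit separator and sep=None can be reproduced with a short sketch like the one below (Python 3 syntax; the variable names are illustrative only). The last line shows one way to get the sep=None result from an explicit single-space separator, which only holds when spaces are the only whitespace in the input.

```python
text = '  a  b  '

# Explicit two-space separator: empty strings appear at both ends.
with_sep = text.split('  ')

# sep=None: runs of whitespace separate words and no empty strings appear.
without_sep = text.split()

# Dropping empty parts from a single-space split gives the same words here.
filtered = [part for part in text.split(' ') if part]

print(with_sep)
print(without_sep)
print(filtered)
```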
msg55809 - (view) Author: Fredrik Lundh (effbot) * (Python committer) Date: 2007-09-10 22:41
But wasn't your complaint that the implementation didn't match the
documentation?

As I said, the *implementation* treats "runs of whitespace" as
separators, except for whitespace at the beginning or end (or in other
words, it never returns empty strings).  That matches the documentation,
except for the "first" in "first, whitespace characters are stripped
from both ends".   As far as I can tell, the documentation has never
matched the implementation here.
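
The claim that sep=None never yields empty strings can be spot-checked with a sketch like this one (Python 3 syntax). The sample inputs are arbitrary choices for illustration, so this is a sanity check rather than a proof.

```python
samples = ['', '   ', ' a b', 'a b ', ' a  b  c ', 'k: v\n']

# Collect every (input, maxsplit, result) combination for a few maxsplit values.
# A maxsplit of -1 means "no limit".
results = [
    (s, maxsplit, s.split(None, maxsplit))
    for s in samples
    for maxsplit in range(-1, 4)
]

# True if no result list contains an empty string.
never_empty = all('' not in parts for _, _, parts in results)
print(never_empty)
```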
msg55819 - (view) Author: Nir Soffer (nirs) * Date: 2007-09-11 11:12
There is a problem only when maxsplit is smaller than the available 
splits. In other cases, the docs and the behavior match.
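
A small sweep over maxsplit values makes the boundary visible. This is a sketch in Python 3 syntax using the 'k: v\n' example from the original report; the dictionary name is illustrative.

```python
s = 'k: v\n'

# Map each maxsplit value to the resulting list of parts.
by_maxsplit = {n: s.split(None, n) for n in range(4)}

for n, parts in by_maxsplit.items():
    print(n, parts)
```

The trailing newline survives only while maxsplit is smaller than the number of words in the string.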
msg55962 - (view) Author: Sean Reifschneider (jafo) * (Python committer) Date: 2007-09-17 11:05
I believe this is just a place where the documentation could be cleared
up.  It seems to me the confusion comes from the documentation saying
(paraphrased): "white space is removed from both ends".

Perhaps it should say something like "runs of 1 or more whitespace
characters are collapsed (up to the maximum split), and then split on"
or simply "split on runs of 1 or more whitespace characters".  In other
words, 3 spaces together would be treated as a single split-point
instead of 3 0-length fields separated by spaces.

So, in the first example provided by "nirs" in this issue, "both ends"
refers to both the left and right side of "k:".  Since maxsplit is 1,
the second part (v) is left untouched.  This is the intended operation.

This is a documentation bug, not a library bug.

Fred: Thoughts on wording?
msg56021 - (view) Author: Brett Cannon (brett.cannon) * (Python committer) Date: 2007-09-19 02:22
The algorithm is actually kind of odd::

  >>> " a b".split(None, 0)
  ['a b']
  >>> "a b ".split(None, 0)
  ['a b ']
  >>> "a b ".split(None, 1)
  ['a', 'b ']

So trailing whitespace on the original string is stripped only if the
number of splits is great enough to lead to a possible split past the
last element.  But leading whitespace is always removed.

Basically the algorithm stops looking for whitespace once it has
encountered maxsplit instances of contiguous whitespace plus leading
whitespace.
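
The behaviour described above can be modelled in pure Python. The function below is only a sketch written for this illustration, not the CPython source; the name split_whitespace is hypothetical, and it assumes str.isspace is an adequate stand-in for the whitespace test used by str.split.

```python
def split_whitespace(s, maxsplit=-1):
    """Sketch of str.split(None, maxsplit) following the description above."""
    parts = []
    i, n = 0, len(s)
    # A negative maxsplit means "no limit"; n is a safe upper bound on words.
    remaining = maxsplit if maxsplit >= 0 else n
    while remaining > 0:
        # Skip the run of whitespace in front of the next word.
        while i < n and s[i].isspace():
            i += 1
        if i == n:
            break
        start = i
        # Consume the word itself.
        while i < n and not s[i].isspace():
            i += 1
        parts.append(s[start:i])
        remaining -= 1
    # Skip whitespace once more, then keep whatever is left untouched,
    # including any trailing whitespace.
    while i < n and s[i].isspace():
        i += 1
    if i < n:
        parts.append(s[i:])
    return parts


print(split_whitespace(' a b', 0))
print(split_whitespace('a b ', 0))
print(split_whitespace('a b ', 1))
```

In this model the remainder is only scanned for leading whitespace, which is why trailing whitespace survives whenever the split limit is reached before the end of the string.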
msg56024 - (view) Author: Sean Reifschneider (jafo) * (Python committer) Date: 2007-09-19 02:42
In looking at the current documentation:

http://docs.python.org/dev/library/string.html#string.split

I don't see the wording the original poster mentions.  The current
documentation of the separator is clear and reasonable.  I'm going to
call this closed; unless someone can suggest specific wording changes to
the document, let's call this done.
msg56026 - (view) Author: Nir Soffer (nirs) * Date: 2007-09-19 04:12
I quoted str.split docs:

- http://docs.python.org/lib/string-methods.html
- http://docs.python.org/dev/library/stdtypes.html
- http://docs.python.org/dev/3.0/library/stdtypes.html

Does the string.split doc explain this?

>>> ' a b '.split(None, 1)
['a', 'b ']
>>> ' a b '.split(None, 2)
['a', 'b']

The .split method docs are clearer and describe this in a very simple way. 

This is a better description of the current behavior:

    "If sep is not specified or is None, a different splitting algorithm 
is applied. First, whitespace characters (spaces, tabs, newlines, 
returns, and formfeeds) are stripped from the start of the string. Then, 
words are separated by arbitrary length strings of whitespace 
characters. Consecutive whitespace delimiters are treated as a single 
delimiter ("' 1 \t 2 \n 3 '.split()" returns "['1', '2', '3']").

    If maxsplit is nonzero, at most maxsplit number of splits occur, and 
the remainder of the string is returned as the final element of the 
list, unless it is empty. Splitting an empty string or a string 
consisting of just whitespace returns an empty list."
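
Each claim in the proposed wording can be checked against the interpreter with a sketch like this (Python 3 syntax; the variable names are illustrative only).

```python
# Consecutive whitespace of any kind acts as a single delimiter.
doc_example = ' 1 \t 2 \n 3 '.split()

# An empty string or a whitespace-only string gives an empty list.
empty_cases = [''.split(), ' \t\n '.split()]

# With maxsplit reached, the remainder keeps its trailing whitespace.
limited = ' a b '.split(None, 1)

# If nothing but whitespace remains, no final element is added.
no_remainder = 'a '.split(None, 1)

print(doc_example, empty_cases, limited, no_remainder)
```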
msg56257 - (view) Author: Brett Cannon (brett.cannon) * (Python committer) Date: 2007-10-07 20:05
Re-opening, as jafo was referring to the string module's function
implementation, which is deprecated.  The real issue is that the
built-in types docs are bad.
msg56272 - (view) Author: Georg Brandl (georg.brandl) * (Python committer) Date: 2007-10-08 07:50
This should now be fixed in r58368.
History
Date User Action Args
2007-10-08 07:50:36 georg.brandl set status: open -> closed
    nosy: + georg.brandl
    resolution: fixed
    messages: + msg56272
2007-10-07 20:05:53 brett.cannon link issue1240 superseder
2007-10-07 20:05:30 brett.cannon set status: closed -> open
    assignee: fdrake ->
    messages: + msg56257
    resolution: not a bug -> (no value)
    versions: + Python 2.6
2007-09-19 04:12:51 nirs set messages: + msg56026
2007-09-19 02:42:36 jafo set status: open -> closed
    resolution: not a bug
    messages: + msg56024
2007-09-19 02:22:39 brett.cannon set nosy: + brett.cannon
    messages: + msg56021
2007-09-17 11:05:20 jafo set priority: low
    assignee: fdrake
    messages: + msg55962
    components: + Documentation, - Library (Lib)
    nosy: + fdrake, jafo
2007-09-11 11:12:44 nirs set messages: + msg55819
2007-09-10 22:41:05 effbot set messages: + msg55809
2007-09-10 22:32:47 nirs set messages: + msg55807
2007-09-10 22:13:35 effbot set nosy: + effbot
    messages: + msg55806
2007-09-07 17:31:54 gvanrossum set messages: - msg55726
2007-09-07 17:31:49 gvanrossum set messages: - msg55721
2007-09-07 02:04:13 nirs set type: behavior
    messages: + msg55726
2007-09-07 01:19:23 nirs set messages: + msg55721
    title: split(None, maxplit) does not strip whitespace correctly -> split(None, maxsplit) does not strip whitespace correctly
2007-09-07 01:18:40 nirs create