classification
Title: whitespace in strip()/lstrip()/rstrip()
Type: enhancement Stage: needs patch
Components: Documentation Versions: Python 3.8, Python 3.7, Python 2.7
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: docs@python Nosy List: Dimitri Papadopoulos Orfanos, docs@python, edmundselliot@gmail.com, ezio.melotti, joel.johnson, jwilk, nitishch, oulenz
Priority: normal Keywords: easy

Created on 2015-10-18 12:15 by Dimitri Papadopoulos Orfanos, last changed 2018-08-28 18:25 by edmundselliot@gmail.com.

Files
File name Uploaded Description
whitespace_regex.py edmundselliot@gmail.com, 2018-08-28 18:25
Messages (7)
msg253152 - Author: Dimitri Papadopoulos Orfanos (Dimitri Papadopoulos Orfanos) Date: 2015-10-18 12:15
The documentation of strip() / lstrip() / rstrip() should define "whitespace" more precisely.

The Python 3 documentation refers to "ASCII whitespace" for bytes.strip() / bytes.lstrip() / bytes.rstrip() and "whitespace" for str.strip() / str.lstrip() / str.rstrip(). I suggest the following improvements:
* add a link from "ASCII whitespace" to string.whitespace or bytes.isspace(),
* define plain "whitespace" more precisely (possibly with a link to str.isspace()).

The Python 2 documentation refers to plain "whitespace". As far as I know strip() removes ASCII whitespaces only. If so, please:
* add a link to string.whitespace or str.isspace(),
* improve the string.whitespace documentation and explain that it is locale-dependent (see documentation of str.isspace()).
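
To illustrate the Python 3 distinction described above, here is a minimal sketch. The sample strings are made up for illustration; the only behaviour assumed is that str.strip() follows str.isspace() while bytes.strip() removes ASCII whitespace only.

```python
# Sample inputs are illustrative only, not taken from the documentation.
text = "\u00a0\u2003 foo \t\n"   # NO-BREAK SPACE, EM SPACE, then ASCII whitespace
data = b"\xa0 foo \t\n"          # byte 0xA0 is not ASCII whitespace

# str.strip() removes every character for which str.isspace() is true.
str_result = text.strip()

# bytes.strip() stops at the first byte that is not ASCII whitespace,
# so the leading 0xA0 (and the space after it) survive.
bytes_result = data.strip()

print(repr(str_result))
print(repr(bytes_result))
```

The two results differ even though both inputs "look" like they start with a non-breaking space, which is why a precise definition in the docs would help.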
msg257449 - Author: Dimitri Papadopoulos Orfanos (Dimitri Papadopoulos Orfanos) Date: 2016-01-04 09:42
In Python 2, as far as I can understand, string.whitespace and str.isspace() are different:
* str.isspace() is built upon the C isspace() function and is therefore locale-dependent. Python heavily relies on isspace() to detect "whitespace" characters.
* string.whitespace is a list of "ASCII whitespace characters" carved in stone. As far as I can see string.whitespace is defined but not used anywhere in Python source code.

See source code:
* Modules/stringobject.c around line 3319:
  [...]
  string_isspace(PyStringObject *self)
  {
  [...]
      e = p + PyString_GET_SIZE(self);
      for (; p < e; p++) {
          if (!isspace(*p))
              return PyBool_FromLong(0);
      }
      return PyBool_FromLong(1);
  [...]
* Lib/string.py near line 23:
  whitespace = ' \t\n\r\v\f'

Functions strip()/lstrip()/rstrip() use str.isspace() and have nothing to do with string.whitespace:

* Modules/stringobject.c around line 1861:
  [...]
  do_strip(PyStringObject *self, int striptype)
  {
  [...]
      i = 0;
      if (striptype != RIGHTSTRIP) {
          while (i < len && isspace(Py_CHARMASK(s[i]))) {
              i++;
          }
      }
  [...]

Therefore I suggest the documentation of Python 2.7 points to str.isspace() wherever the term "whitespace" is used in the documentation - including this specific case of strip()/lstrip()/rstrip().
msg257450 - Author: Dimitri Papadopoulos Orfanos (Dimitri Papadopoulos Orfanos) Date: 2016-01-04 10:06
In Python 3 the situation is similar:
* The Py_UNICODE_ISSPACE macro is used internally to define str.isspace() and wherever Python needs to detect "whitespace" characters in strings.
* There is an equivalent function Py_ISSPACE for bytes/bytearray.
* The bytearray.strip() implementation relies on a hardcoded list of ASCII whitespace characters instead of Py_ISSPACE.
* string.whitespace is a list of "ASCII whitespace characters" carved in stone. As far as I can see string.whitespace is defined but not used anywhere in Python source code.

Therefore I suggest the documentation of Python 3 points to str.isspace() wherever the term "whitespace" is used in any documentation related to strings - including this specific case of strip()/lstrip()/rstrip().
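
The gap between string.whitespace and str.isspace() in Python 3 can be checked from the interpreter. This is only a sketch: the variable names are hypothetical, and the exact set of extra characters depends on the Unicode database shipped with the running interpreter, so no count is claimed here.

```python
import string
import sys

# The fixed ASCII list from Lib/string.py.
ascii_ws = set(string.whitespace)

# Every code point that str.isspace() accepts in this interpreter.
unicode_ws = {chr(cp) for cp in range(sys.maxunicode + 1) if chr(cp).isspace()}

# Characters that str.strip() would remove but string.whitespace does not list.
only_unicode = sorted(unicode_ws - ascii_ws)

print(ascii_ws <= unicode_ws)
print([f"U+{ord(c):04X}" for c in only_unicode])
```

The first line printed shows whether string.whitespace is a subset of what str.isspace() accepts; the second lists the additional code points.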
msg314668 - Author: Joel Johnson (joel.johnson) Date: 2018-03-29 19:59
I have started working on this and will submit a pull request by the end of the week.

The term "whitespace" appears in several contextual situations throughout the documentation. While all situations would benefit from the definition of "whitespace" contained in the str.isspace() documentation, not all of the situations would benefit from a link to str.isspace() whose primary goal is to document the str.isspace() function and not to provide a global definition of what a whitespace character is.

Therefore I suggest that the Python 3 documentation add a new glossary entry for "whitespace" (containing the definition currently in the str.isspace() documentation) and link to it wherever the term "whitespace" is used in any documentation related to strings - including this specific case of strip()/lstrip()/rstrip().
msg314681 - Author: Dimitri Papadopoulos Orfanos (Dimitri Papadopoulos Orfanos) Date: 2018-03-30 06:39
I agree on avoiding a link to str.isspace() and defining "whitespace" instead.

However, please note that there are many de facto definitions of "whitespace". All of them must be documented, or at least the conceptual classes of "whitespace" must be, and the documentation should clarify which class each of the following belongs to:

* Unicode whitespace is by far the most common: str.isspace(), strip()/lstrip()/rstrip(), Py_UNICODE_ISSPACE.

* Py_ISSPACE targets bytes/bytearray but is never used!

* bytearray.strip() does not use Py_ISSPACE but a hardcoded list of ASCII whitespaces instead.

* finally string.whitespace is probably equivalent to the list used by bytearray.strip().

Beyond the docs, I think Python 3 should rationalize bytearray.strip() / Py_ISSPACE / string.whitespace, probably having bytearray.strip() rely on Py_ISSPACE, and Py_ISSPACE rely on string.whitespace unless string.whitespace is obsoleted.
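
Whether the ASCII lists really coincide can be checked empirically from Python, without reading the C source. The sketch below uses hypothetical variable names and only observes behaviour of the running interpreter; it says nothing about which macro or table the implementation uses internally.

```python
import string

# Byte values listed in string.whitespace.
expected = {ord(c) for c in string.whitespace}

# Byte values that bytes.strip() / bytearray.strip() remove when called
# without arguments: a one-byte object strips to empty iff the byte is "whitespace".
stripped_by_bytes = {b for b in range(256) if bytes([b]).strip() == b""}
stripped_by_bytearray = {b for b in range(256) if bytearray([b]).strip() == bytearray()}

# Byte values accepted by bytes.isspace().
accepted_by_isspace = {b for b in range(256) if bytes([b]).isspace()}

print(sorted(expected))
print(stripped_by_bytes == expected)
print(stripped_by_bytearray == expected)
print(accepted_by_isspace == expected)
```

If all three comparisons agree, a single documented definition of "ASCII whitespace" could cover bytes.strip(), bytearray.strip(), bytes.isspace() and string.whitespace.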
msg315193 - Author: Oliver Urs Lenz (oulenz) Date: 2018-04-11 14:49
Slightly tangential, but it would be great if the documentation of lstrip() and rstrip() could include an equivalent definition in terms of re.sub(), e.g.:

foo.lstrip() == re.sub(r'(?u)\A\s*', '', foo)
foo.rstrip() == re.sub(r'(?u)\s*\Z', '', foo)

(Or whatever else is correct.)
msg324272 - Author: Elliot Edmunds (edmundselliot@gmail.com) Date: 2018-08-28 18:25
Not sure how helpful it would be to have the re.sub expressions for lstrip and rstrip, but I think it would look like:

l_stripped = re.sub(r'^\s*', '', foo)
r_stripped = re.sub(r'\s*$', '', foo)
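
Both pairs of expressions suggested in this thread can be compared against lstrip()/rstrip() on a few inputs. This is a sketch for Python 3 str objects only; the helper names and the sample strings are hypothetical, and agreement on these samples is not a proof of equivalence for every input.

```python
import re

# The two pairs of patterns proposed in the messages above.
PATTERNS = {
    r"\A / \Z": (r"\A\s*", r"\s*\Z"),
    r"^ / $": (r"^\s*", r"\s*$"),
}

# Illustrative inputs: empty, no whitespace, ASCII and non-ASCII whitespace,
# inner newlines, whitespace only.
SAMPLES = ["", "foo", "  foo  ", "\t\nfoo bar\r\n", "\u00a0foo\u2003", "a \nb", "   "]


def find_mismatches(left_pattern, right_pattern, samples):
    """Return the samples where the regexes disagree with lstrip()/rstrip()."""
    bad = []
    for s in samples:
        left = re.sub(left_pattern, "", s)
        right = re.sub(right_pattern, "", s)
        if left != s.lstrip() or right != s.rstrip():
            bad.append(s)
    return bad


results = {name: find_mismatches(lp, rp, SAMPLES) for name, (lp, rp) in PATTERNS.items()}
print(results)
```

An empty list for a pattern pair means it agreed with the string methods on every sample tried.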
History
Date User Action Args
2018-08-28 18:25:07 edmundselliot@gmail.com set files: + whitespace_regex.py
    nosy: + edmundselliot@gmail.com
    messages: + msg324272
2018-04-11 14:49:18 oulenz set nosy: + oulenz
    messages: + msg315193
2018-03-31 22:32:24 jwilk set nosy: + jwilk
2018-03-31 20:57:20 nitishch set nosy: + nitishch
2018-03-30 06:39:52 Dimitri Papadopoulos Orfanos set messages: + msg314681
2018-03-29 19:59:00 joel.johnson set nosy: + joel.johnson
    messages: + msg314668
2018-03-19 17:51:21 cheryl.sabella set versions: + Python 3.7, Python 3.8, - Python 3.5, Python 3.6
2016-01-04 10:06:11 Dimitri Papadopoulos Orfanos set messages: + msg257450
2016-01-04 09:42:13 Dimitri Papadopoulos Orfanos set messages: + msg257449
2016-01-04 03:41:04 ezio.melotti set keywords: + easy
    nosy: + ezio.melotti
    stage: needs patch
    versions: + Python 3.6
2015-10-18 12:15:36 Dimitri Papadopoulos Orfanos create