classification
Title: re module doesn't describe string boundaries for \b
Type: enhancement Stage: resolved
Components: Documentation Versions: Python 3.2, Python 3.3, Python 2.7
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: ezio.melotti Nosy List: Ron.Ridley, docs@python, eric.araujo, ezio.melotti, poolie, python-dev, ralph.corderoy
Priority: normal Keywords: easy, patch

Created on 2010-12-16 01:05 by ralph.corderoy, last changed 2012-02-29 09:51 by ezio.melotti. This issue is now closed.

Files
File name Uploaded Description
20110822-1604-re-docs.diff poolie, 2011-08-22 06:05
issue10713.diff ezio.melotti, 2012-02-27 12:24 Patch against 3.2.
Messages (8)
msg124097 - (view) Author: Ralph Corderoy (ralph.corderoy) Date: 2010-12-16 01:05
The re module defines \b in a regexp to need \w one side and \W the other.  What about when the end of the string or line is involved?  perlre(1) says that's treated as a \W.  Python should precisely document that case too.
msg135466 - (view) Author: Éric Araujo (eric.araujo) * (Python committer) Date: 2011-05-07 15:09
Thanks for the report.  Would you be interested in experimenting and/or reading the code to find the answer and propose a doc patch?
msg135524 - (view) Author: Ralph Corderoy (ralph.corderoy) Date: 2011-05-08 14:27
Examining the source of Ubuntu's python2.6 2.6.6-5ubuntu1 package
suggests that anything beyond the limits of the string is considered \W,
like Perl.

    Modules/_sre.c:
       336  LOCAL(int)
       337  SRE_AT(SRE_STATE* state, SRE_CHAR* ptr, SRE_CODE at)
       338  {
       339      /* check if pointer is at given position */
       340
       341      Py_ssize_t thisp, thatp;
       ...
       365      case SRE_AT_BOUNDARY:
       366          if (state->beginning == state->end)
       367              return 0;
       368          thatp = ((void*) ptr > state->beginning) ?
       369              SRE_IS_WORD((int) ptr[-1]) : 0;
       370          thisp = ((void*) ptr < state->end) ?
       371              SRE_IS_WORD((int) ptr[0]) : 0;
       372          return thisp != thatp;

SRE_IS_WORD() returns 16 for the 63 \w characters, 0 otherwise.

This is borne out by tests.

Note, 366 above confirms it's never true for an empty string.  The
documentation states that \B "is just the opposite of \b" yet
re.match(r'\b', '') returns None and so does \B so \B isn't the opposite
of \b in all cases.
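
One way to check this reading of SRE_AT_BOUNDARY from Python is to list the offsets at which r'\b' matches. This is only a minimal sketch, not part of any patch on this issue; boundary_positions is a hypothetical helper name introduced here. The result of r'\B' on an empty string is deliberately left out, since it appears to differ between Python versions.

```python
import re

def boundary_positions(text):
    # Offsets in text where the zero-width r'\b' assertion succeeds.
    return [m.start() for m in re.finditer(r'\b', text)]

# The start and end of the string count as boundaries when the
# adjacent character is a word character.
print(boundary_positions('foo bar'))  # [0, 3, 4, 7]

# With a non-word character at each edge, the edges are not boundaries,
# which is consistent with treating "outside the string" as \W.
print(boundary_positions(' foo '))    # [1, 4]

# An empty string has no boundary at all.
print(boundary_positions(''))         # []
```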
msg142679 - (view) Author: Martin Pool (poolie) Date: 2011-08-22 06:05
> Note, 366 above confirms it's never true for an empty string.  The
> documentation states that \B "is just the opposite of \b" yet
> re.match(r'\b', '') returns None and so does \B so \B isn't the opposite
> of \b in all cases.

This is also a bit strange if you follow the Perl line of reasoning of imagining there are non-word characters outside the string.  And, indeed, in Perl, 

  "" =~ /\B/

is true.

So this patch adds some tests for \b behaviour and some docs.  I think possibly \B should actually change, but that would be a bigger (perhaps impossible?) change.
msg154470 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2012-02-27 12:24
This is a new patch based on Martin's work.
I don't think it's necessary to explain what happens while using r'\b' or r'\B' on an empty string in the doc -- that's not a common case and it might end up confusing users.  I think however that a couple of examples might help them figure out what they are useful for.
Mentioning that they work with the beginning/end of the string too is a reasonable request, so I tweaked the doc to point that out.
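
For illustration, examples of the kind meant here could look like the following. This is a hedged sketch; the candidate strings and variable names are chosen for this note and are not quoted from the attached diff.

```python
import re

# r'\bfoo\b' finds 'foo' only as a whole word, including at the very
# beginning or end of the string.
candidates = ['foo', 'foo.', '(foo)', 'bar foo baz', 'foobar', 'foo3']
whole_word = [s for s in candidates if re.search(r'\bfoo\b', s)]
print(whole_word)   # ['foo', 'foo.', '(foo)', 'bar foo baz']

# r'py\B' finds 'py' only when another word character follows it.
candidates = ['python', 'py3', 'py2', 'py', 'py.', 'py!']
inside_word = [s for s in candidates if re.search(r'py\B', s)]
print(inside_word)  # ['python', 'py3', 'py2']
```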
msg154479 - (view) Author: Éric Araujo (eric.araujo) * (Python committer) Date: 2012-02-27 13:28
Like it.
msg154607 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2012-02-29 09:50
New changeset fc89e09ca2fc by Ezio Melotti in branch '2.7':
#10713: Improve documentation for \b and \B and add a few tests.  Initial patch and tests by Martin Pool.
http://hg.python.org/cpython/rev/fc89e09ca2fc

New changeset cde7fa40b289 by Ezio Melotti in branch '3.2':
#10713: Improve documentation for \b and \B and add a few tests.  Initial patch and tests by Martin Pool.
http://hg.python.org/cpython/rev/cde7fa40b289

New changeset b78ca038e468 by Ezio Melotti in branch 'default':
#10713: merge with 3.2.
http://hg.python.org/cpython/rev/b78ca038e468
msg154608 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2012-02-29 09:51
Fixed, thanks for the patch!
History
Date User Action Args
2012-02-29 09:51:34 ezio.melotti set status: open -> closed
messages: + msg154608
assignee: docs@python -> ezio.melotti
resolution: fixed
stage: patch review -> resolved
2012-02-29 09:50:09 python-dev set nosy: + python-dev
messages: + msg154607
2012-02-27 13:28:43 eric.araujo set messages: + msg154479
2012-02-27 12:24:21 ezio.melotti set files: + issue10713.diff
versions: - Python 3.1
messages: + msg154470
type: enhancement
stage: needs patch -> patch review
2011-08-22 06:05:57 poolie set files: + 20110822-1604-re-docs.diff
nosy: + poolie
messages: + msg142679
keywords: + patch
2011-05-12 17:29:03 Ron.Ridley set nosy: + Ron.Ridley
2011-05-08 14:27:09 ralph.corderoy set messages: + msg135524
2011-05-07 15:10:31 ezio.melotti set nosy: + ezio.melotti
2011-05-07 15:09:45 eric.araujo set versions: + Python 3.1, Python 2.7, Python 3.2, Python 3.3
nosy: + eric.araujo
messages: + msg135466
keywords: + easy
stage: needs patch
2010-12-16 01:05:33 ralph.corderoy create