classification
Title: re.group() should never return a bytearray
Type: behavior Stage: resolved
Components: Library (Lib), Regular Expressions Versions: Python 3.4
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: serhiy.storchaka Nosy List: Arfrever, ezio.melotti, gvanrossum, mrabarnett, python-dev, serhiy.storchaka
Priority: normal Keywords: needs review, patch

Created on 2013-07-15 23:37 by gvanrossum, last changed 2013-10-16 09:52 by serhiy.storchaka. This issue is now closed.

Files
File name Uploaded Description Edit
re_group_type.patch serhiy.storchaka, 2013-10-01 19:26 review
re_group_type_2.patch serhiy.storchaka, 2013-10-01 21:31 review
Messages (11)
msg193136 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2013-07-15 23:37
I discovered that the Python 3 version of
the re module's Match object behaves subtly different from the Python
2 version when the target string (i.e. the haystack, not the needle)
is a buffer object.

In Python 2, the type of the return value of group() is always either
a Unicode string or an 8-bit string, and the type is determined by
looking at the target string -- if the target is unicode, group()
returns a unicode string, otherwise, group() returns an 8-bit string.
In particular, if the target is a buffer object, group() returns an
8-bit string. I think this is the appropriate behavior: otherwise
using regular expression matching to extract a small substring from a
large target string would unnecessarily keep the large target string
alive as long as the substring is alive.

But in Python 3, the behavior of group() has changed so that its
return type always matches that of the target string. I think this is
bad -- apart from the lifetime concern, it means that if your target
happens to be a bytearray, the return value isn't even hashable!

Proper behavior should be that .group() returned a bytes object if the input was binary data and a str object if the input was unicode data (str) regardless of specific types containing the input target data.

Probably not much, if anything, would be depending on getting a bytearray out of that. Fix this in 3.4? 3.3 and earlier users are stuck with an extra bytes() call and data copy in these cases.

[Further discussion at http://mail.python.org/pipermail/python-dev/2013-July/127332.html]
msg193289 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2013-07-18 13:05
I'm not sure it's worth changing it.
As I see it, match/search are supposed to work with str or bytes and they return str/bytes accordingly.  The fact that they work with other bytes-like objects seems to me an undocumented implementation detail people should not rely on.
If they are passing bytes-like object, both the current behavior (return same type) or the new proposed behavior (always return bytes) seem reasonable expectations.

IIUC the advantage of changing the behavior is that it won't keep the target string alive anymore, but on the other hand is not backward compatible and makes things more difficult for people who want the same type back.
If people always want bytes back regardless of the input, they can convert the input or output to bytes explicitly.
msg193292 - (view) Author: Matthew Barnett (mrabarnett) * Date: 2013-07-18 13:33
There's also the fact that the match object keeps a reference to the target string anyway:

>>> import re
>>> t = memoryview(b"a")
>>> t
<memory at 0x0100F110>
>>> m = re.match(b"a", t)
>>> m.string
<memory at 0x0100F110>

On that subject, buried in the source code (_sre.c) is the comment:
"""
/* FIXME: implement setattr("string", None) as a special case (to
   detach the associated string, if any */
"""

In the regex module I added a method "detach_string" to perform that function.
msg193295 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2013-07-18 13:44
> match/search are supposed to work with str or bytes and
> they return str/bytes accordingly.

s/they return/calling m.group() returns/
msg193302 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2013-07-18 15:02
> Ezio Melotti added the comment:
[...]
> IIUC the advantage of changing the behavior is that it won't keep the target string alive anymore, but on the other hand is not backward compatible and makes things more difficult for people who want the same type back.

Everyone seems to be afraid of backward compatibility here. I will
take full responsibility, so let's just discuss what's the better API,
regardless of what we did (and in 99% of the cases it's the same
anyway).

"People who want the same type back" -- there is no evidence that
anyone wants this. "People who want a bytes object" -- this is
definitely a valid use case.

> If people always want bytes back regardless of the input, they can convert the input or output to bytes explicitly.

But this requires an extra copy if the input is a bytearray. I suspect
this might be the most commonly used non-bytes non-str target in
Python 3 programs, and we are striving to support bytearray as input
in as many places as possible where plain bytes is accepted. But
generally getting bytearray as output requires a different API, e.g.
recv_into().

I think a very reasonable general rule is that for functions that take
either str or bytes and adjust their output to the input type, if
their input is one of the bytes alternatives (bytearray, memoryview,
array.array('b'), maybe others) the output is always a bytes object.

The reason is that while the buffer API makes it easy to access the
underlying bytes from C, it doesn't give you a way to create a new
object of the same type (except by slicing, which doesn't always
apply, e.g. os.listdir()). So for creating return values that match a
memoryview (or bytearray, etc.) input, the only reasonable thing is to
return a bytes object.

(FWIW os.listdir() violates this too -- os.listdir(b'.') returns a
list of bytes objects, while os.listdir(bytearray(b'.')) returns a
list of str objects. This seems caused by revesed logic -- it probably
tests "if the type is bytes" rather than "if the type isn't str" for
the output type, even though it does the right thing with the
input...)
msg194530 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-08-06 07:42
Here is a patch with an implementation and tests. Feel free to add a documentation changes if needed.
msg197558 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-09-13 08:22
Oh, seems I again did not attach a patch. Now I understand why there were no any feedback so long time.
msg198801 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-10-01 19:25
Fixed a typo.

Could anyone please make a review?
msg198809 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-10-01 21:31
Updated patch addressed Antoine's comments.
msg200045 - (view) Author: Roundup Robot (python-dev) Date: 2013-10-16 09:47
New changeset add40e9f7cbe by Serhiy Storchaka in branch 'default':
Issue #18468: The re.split, re.findall, and re.sub functions and the group()
http://hg.python.org/cpython/rev/add40e9f7cbe
msg200047 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-10-16 09:52
Thank you Antoine for your review.
History
Date User Action Args
2013-10-16 09:52:24serhiy.storchakasetstatus: open -> closed
resolution: fixed
messages: + msg200047

stage: patch review -> resolved
2013-10-16 09:47:19python-devsetnosy: + python-dev
messages: + msg200045
2013-10-16 09:34:21serhiy.storchakasetassignee: serhiy.storchaka
2013-10-01 21:31:17serhiy.storchakasetfiles: + re_group_type_2.patch

messages: + msg198809
2013-10-01 19:26:56serhiy.storchakasetfiles: + re_group_type.patch
2013-10-01 19:25:31serhiy.storchakasetmessages: + msg198801
2013-10-01 19:24:20serhiy.storchakasetfiles: - re_group_type.patch
2013-09-13 08:22:15serhiy.storchakasetkeywords: + needs review, patch
files: + re_group_type.patch
messages: + msg197558
2013-08-06 07:42:57serhiy.storchakasetmessages: + msg194530
stage: needs patch -> patch review
2013-07-25 06:31:23serhiy.storchakasetnosy: + serhiy.storchaka
2013-07-24 20:44:46Arfreversetnosy: + Arfrever
2013-07-18 15:02:17gvanrossumsetmessages: + msg193302
2013-07-18 13:44:45ezio.melottisetmessages: + msg193295
2013-07-18 13:33:39mrabarnettsetmessages: + msg193292
2013-07-18 13:05:15ezio.melottisetnosy: + ezio.melotti, mrabarnett
messages: + msg193289
components: + Regular Expressions
2013-07-15 23:37:12gvanrossumcreate