Issue 18468: re.group() should never return a bytearray

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/62668

classification

Title:	re.group() should never return a bytearray
Type:	behavior	Stage:	resolved
Components:	Library (Lib), Regular Expressions	Versions:	Python 3.4

process

Status:	closed	Resolution:	fixed
Dependencies:		Superseder:
Assigned To:	serhiy.storchaka	Nosy List:	Arfrever, ezio.melotti, gvanrossum, mrabarnett, python-dev, serhiy.storchaka
Priority:	normal	Keywords:	needs review, patch

Created on 2013-07-15 23:37 by gvanrossum, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Files
File name	Uploaded	Description	Edit
re_group_type.patch	serhiy.storchaka, 2013-10-01 19:26		review
re_group_type_2.patch	serhiy.storchaka, 2013-10-01 21:31		review

Messages (11)
msg193136 - (view)	Author: Guido van Rossum (gvanrossum) *	Date: 2013-07-15 23:37
I discovered that the Python 3 version of the re module's Match object behaves subtly different from the Python 2 version when the target string (i.e. the haystack, not the needle) is a buffer object. In Python 2, the type of the return value of group() is always either a Unicode string or an 8-bit string, and the type is determined by looking at the target string -- if the target is unicode, group() returns a unicode string, otherwise, group() returns an 8-bit string. In particular, if the target is a buffer object, group() returns an 8-bit string. I think this is the appropriate behavior: otherwise using regular expression matching to extract a small substring from a large target string would unnecessarily keep the large target string alive as long as the substring is alive. But in Python 3, the behavior of group() has changed so that its return type always matches that of the target string. I think this is bad -- apart from the lifetime concern, it means that if your target happens to be a bytearray, the return value isn't even hashable! Proper behavior should be that .group() returned a bytes object if the input was binary data and a str object if the input was unicode data (str) regardless of specific types containing the input target data. Probably not much, if anything, would be depending on getting a bytearray out of that. Fix this in 3.4? 3.3 and earlier users are stuck with an extra bytes() call and data copy in these cases. [Further discussion at http://mail.python.org/pipermail/python-dev/2013-July/127332.html]
msg193289 - (view)	Author: Ezio Melotti (ezio.melotti) *	Date: 2013-07-18 13:05
I'm not sure it's worth changing it. As I see it, match/search are supposed to work with str or bytes and they return str/bytes accordingly. The fact that they work with other bytes-like objects seems to me an undocumented implementation detail people should not rely on. If they are passing bytes-like object, both the current behavior (return same type) or the new proposed behavior (always return bytes) seem reasonable expectations. IIUC the advantage of changing the behavior is that it won't keep the target string alive anymore, but on the other hand is not backward compatible and makes things more difficult for people who want the same type back. If people always want bytes back regardless of the input, they can convert the input or output to bytes explicitly.
msg193292 - (view)	Author: Matthew Barnett (mrabarnett) *	Date: 2013-07-18 13:33
There's also the fact that the match object keeps a reference to the target string anyway: >>> import re >>> t = memoryview(b"a") >>> t <memory at 0x0100F110> >>> m = re.match(b"a", t) >>> m.string <memory at 0x0100F110> On that subject, buried in the source code (_sre.c) is the comment: """ /* FIXME: implement setattr("string", None) as a special case (to detach the associated string, if any */ """ In the regex module I added a method "detach_string" to perform that function.
msg193295 - (view)	Author: Ezio Melotti (ezio.melotti) *	Date: 2013-07-18 13:44
> match/search are supposed to work with str or bytes and > they return str/bytes accordingly. s/they return/calling m.group() returns/
msg193302 - (view)	Author: Guido van Rossum (gvanrossum) *	Date: 2013-07-18 15:02
> Ezio Melotti added the comment: [...] > IIUC the advantage of changing the behavior is that it won't keep the target string alive anymore, but on the other hand is not backward compatible and makes things more difficult for people who want the same type back. Everyone seems to be afraid of backward compatibility here. I will take full responsibility, so let's just discuss what's the better API, regardless of what we did (and in 99% of the cases it's the same anyway). "People who want the same type back" -- there is no evidence that anyone wants this. "People who want a bytes object" -- this is definitely a valid use case. > If people always want bytes back regardless of the input, they can convert the input or output to bytes explicitly. But this requires an extra copy if the input is a bytearray. I suspect this might be the most commonly used non-bytes non-str target in Python 3 programs, and we are striving to support bytearray as input in as many places as possible where plain bytes is accepted. But generally getting bytearray as output requires a different API, e.g. recv_into(). I think a very reasonable general rule is that for functions that take either str or bytes and adjust their output to the input type, if their input is one of the bytes alternatives (bytearray, memoryview, array.array('b'), maybe others) the output is always a bytes object. The reason is that while the buffer API makes it easy to access the underlying bytes from C, it doesn't give you a way to create a new object of the same type (except by slicing, which doesn't always apply, e.g. os.listdir()). So for creating return values that match a memoryview (or bytearray, etc.) input, the only reasonable thing is to return a bytes object. (FWIW os.listdir() violates this too -- os.listdir(b'.') returns a list of bytes objects, while os.listdir(bytearray(b'.')) returns a list of str objects. This seems caused by revesed logic -- it probably tests "if the type is bytes" rather than "if the type isn't str" for the output type, even though it does the right thing with the input...)
msg194530 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2013-08-06 07:42
Here is a patch with an implementation and tests. Feel free to add a documentation changes if needed.
msg197558 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2013-09-13 08:22
Oh, seems I again did not attach a patch. Now I understand why there were no any feedback so long time.
msg198801 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2013-10-01 19:25
Fixed a typo. Could anyone please make a review?
msg198809 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2013-10-01 21:31
Updated patch addressed Antoine's comments.
msg200045 - (view)	Author: Roundup Robot (python-dev)	Date: 2013-10-16 09:47
New changeset add40e9f7cbe by Serhiy Storchaka in branch 'default': Issue #18468: The re.split, re.findall, and re.sub functions and the group() http://hg.python.org/cpython/rev/add40e9f7cbe
msg200047 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2013-10-16 09:52
Thank you Antoine for your review.

History
Date	User	Action	Args
2022-04-11 14:57:48	admin	set	github: 62668
2013-10-16 09:52:24	serhiy.storchaka	set	status: open -> closed resolution: fixed messages: + msg200047 stage: patch review -> resolved
2013-10-16 09:47:19	python-dev	set	nosy: + python-dev messages: + msg200045
2013-10-16 09:34:21	serhiy.storchaka	set	assignee: serhiy.storchaka
2013-10-01 21:31:17	serhiy.storchaka	set	files: + re_group_type_2.patch messages: + msg198809
2013-10-01 19:26:56	serhiy.storchaka	set	files: + re_group_type.patch
2013-10-01 19:25:31	serhiy.storchaka	set	messages: + msg198801
2013-10-01 19:24:20	serhiy.storchaka	set	files: - re_group_type.patch
2013-09-13 08:22:15	serhiy.storchaka	set	keywords: + needs review, patch files: + re_group_type.patch messages: + msg197558
2013-08-06 07:42:57	serhiy.storchaka	set	messages: + msg194530 stage: needs patch -> patch review
2013-07-25 06:31:23	serhiy.storchaka	set	nosy: + serhiy.storchaka
2013-07-24 20:44:46	Arfrever	set	nosy: + Arfrever
2013-07-18 15:02:17	gvanrossum	set	messages: + msg193302
2013-07-18 13:44:45	ezio.melotti	set	messages: + msg193295
2013-07-18 13:33:39	mrabarnett	set	messages: + msg193292
2013-07-18 13:05:15	ezio.melotti	set	nosy: + ezio.melotti, mrabarnett messages: + msg193289 components: + Regular Expressions
2013-07-15 23:37:12	gvanrossum	create