Issue 39949: truncating match in regular expression match objects repr


This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/84130

classification

Title:	truncating match in regular expression match objects repr
Type:	enhancement	Stage:	patch review
Components:	Regular Expressions	Versions:	Python 3.9

process

Status:	open	Resolution:
Dependencies:		Superseder:
Assigned To:		Nosy List:	Seth.Troisi, eric.smith, ezio.melotti, matpi, mrabarnett, rhettinger, serhiy.storchaka
Priority:	normal	Keywords:	patch

Created on 2020-03-12 21:49 by Seth.Troisi, last changed 2022-04-11 14:59 by admin.

Pull Requests
URL	Status	Linked	Edit
PR 20922	closed	Seth.Troisi, 2020-06-16 20:43

Messages (17)
msg364052 - (view)	Author: Seth Troisi (Seth.Troisi) *	Date: 2020-03-12 21:49
Following on https://bugs.python.org/issue17087

Today I was mystified by why a regex wasn't working.

>>> import re
>>> re.match(r'.{10}', 'A'*49+'B')
<_sre.SRE_Match object; span=(0, 10), match='AAAAAAAAAA'>
>>> re.match(r'.{49}', 'A'*49+'B')
<_sre.SRE_Match object; span=(0, 49), match='AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA>
>>> re.match(r'.{50}', 'A'*49+'B')
<_sre.SRE_Match object; span=(0, 50), match='AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA>

I was confused about why the B wasn't matching in the third example. It is matching; it just doesn't fit on the line in the interactive debugger, so it doesn't show.

My suggestion is to truncate the match (in the repr) and append '...' when its right quote wouldn't otherwise show.

With short matches (or exactly enough space) there would be no change:

>>> import string
>>> re.match(r'.{48}', string.ascii_letters)
<_sre.SRE_Match object; span=(0, 48), match='abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUV'>

When not all of the match can be displayed, the current output (first line below) would become the second line:

>>> re.match(r'.{49}', string.ascii_letters)
<_sre.SRE_Match object; span=(0, 49), match='abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVW>
<_sre.SRE_Match object; span=(0, 49), match='abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRS'...>

I'm happy to help out by writing tests or an implementation if folks think this is a good idea.

I couldn't think of other examples (urllib maybe?) in Python of how this is handled, but I could look for some if that would help.
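The two display styles in the message above can be emulated in pure Python. This is only a sketch: it assumes the repr budget is 50 characters (the `%.50R` format mentioned later in this thread), the names `current_style`, `proposed_style`, `LIMIT` and `KEEP` are hypothetical, and it assumes the matched text contains no characters that need escaping. It is not the C implementation in `_sre`.

```python
import string

LIMIT = 50  # assumed budget for the match text, mirroring "%.50R"
KEEP = 45   # characters kept so that quote + text + quote + "..." fits in LIMIT


def current_style(text):
    """Emulate the existing behaviour: cut the repr, losing the closing quote."""
    return repr(text)[:LIMIT]


def proposed_style(text):
    """Emulate the suggestion: close the quote and append '...' when cut.

    Assumes `text` has no characters that repr() would escape.
    """
    full = repr(text)
    if len(full) <= LIMIT:
        return full  # fits exactly or with room to spare: unchanged
    return repr(text[:KEEP]) + "..."


short = string.ascii_letters[:48]  # repr is exactly 50 characters
long_ = string.ascii_letters[:49]  # repr is 51 characters

print(proposed_style(short))
print(current_style(long_))
print(proposed_style(long_))
```

With these assumptions, the 48-character match is shown unchanged, while the 49-character match gains a closing quote and a trailing `...` within the same 50-character budget.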
msg364053 - (view)	Author: Eric V. Smith (eric.smith) *	Date: 2020-03-12 22:10
I think the missing closing quote is supposed to be your visual clue that it's truncated. Although I'll grant you that it's pretty subtle.
msg371618 - (view)	Author: Raymond Hettinger (rhettinger) *	Date: 2020-06-16 07:11
+1 for adding an ellipsis. It's a conventional way to indicate that the displayed data is truncated. Concur with Eric that missing close quote is too subtle (and odd, and unexpected).
msg371626 - (view)	Author: Seth Troisi (Seth.Troisi) *	Date: 2020-06-16 09:49
I didn't propose a patch before because I was unsure of the decision. Now that there is a +1 from Raymond, I'll work on a patch and some documentation. Expect a patch within the week.
msg371641 - (view)	Author: Eric V. Smith (eric.smith) *	Date: 2020-06-16 12:30
There was a discussion in issue40984 that the repr must be eval-able. I don't feel very strongly about this, mainly because I don't think anyone ever does eval(repr(some_regex)). I'd be slightly sympathetic to wanting the eval to fail if the repr had to truncate its output, instead of succeeding because the string was still a valid, but different, regex.
msg371645 - (view)	Author: Quentin Wenger (matpi)	Date: 2020-06-16 12:42
For a bit of background, the other issue is about the repr of compiled patterns, not match objects. Please see my argument there about the conformance to repr's doc - merely adding an ellipsis would _not_ solve this case. I have however nothing against the pattern being truncated/ellipsed when inside the repr of a match object.
msg371648 - (view)	Author: Eric V. Smith (eric.smith) *	Date: 2020-06-16 12:52
Ah, I see. I missed that this issue was only about match objects. I apologize for the confusion. That being the case, I'll re-open the other issue.
msg371650 - (view)	Author: Quentin Wenger (matpi)	Date: 2020-06-16 12:56
@eric.smith thanks, no problem.

If I can give any advice on this present issue, I would suggest putting the ellipsis _inside_ the quote, to make clear that the pattern is being truncated, not the match. So instead of
```
<_sre.SRE_Match object; span=(0, 49), match='abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRS'...>
```
as suggested by @Seth.Troisi, I'd suggest
```
<_sre.SRE_Match object; span=(0, 49), match='abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRS...'>
```
msg371698 - (view)	Author: Seth Troisi (Seth.Troisi) *	Date: 2020-06-16 21:19
@matpi The current behavior is for the right quote not to appear. I kept this behavior, but I'm happy to consider changing that. See the linked patch for examples.
msg371699 - (view)	Author: Quentin Wenger (matpi)	Date: 2020-06-16 21:29
Oh ok, I was misled by the example in your first message, where you did have both the quote and ellipsis.

I don't have a strong opinion.
- having the quote is a bit more "clean"
- but not having it makes clear that the pattern is truncated (per se, three dots is a valid pattern)

The best would be to find a precedent in the stdlib, but I currently cannot think of any either.
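The ambiguity raised in the second point can be reproduced with a match that genuinely ends in three dots. This is a small sketch, and the variable names are only illustrative: nothing is truncated here, yet the repr ends in `...'` just as an in-quote ellipsis marker would.

```python
import re

# A short match whose text really ends in three dots; it is far below any
# length limit, so the repr shows it in full.
m = re.match(r".*", "abc...")
shown = repr(m)
print(shown)

# The matched text itself contains the dots.
print(m.group(0) == "abc...")
```

A reader looking only at the repr could not tell this case apart from a truncated match if the ellipsis were placed inside the closing quote.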
msg371700 - (view)	Author: Quentin Wenger (matpi)	Date: 2020-06-16 21:53
File objects are an example of an angle-bracket repr with string parameters in the repr, but no truncation is performed (see https://github.com/python/cpython/blob/master/Modules/_io/textio.c#L2912).

Various truncations with the same (lack of?) clarity are done in the stdlib, see e.g. https://github.com/python/cpython/blob/04fc4f2a46b2fd083639deb872c3a3037fdb47d6/Objects/longobject.c#L2475.
msg371843 - (view)	Author: Seth Troisi (Seth.Troisi) *	Date: 2020-06-19 00:05
I was thinking about how to add the end quote and found these weird cases:

>>> "asdf'asdf'asdf"
"asdf'asdf'asdf"
>>> "asdf\"asdf\"asdf"
'asdf"asdf"asdf'
>>> "asdf\"asdf'asdf"
'asdf"asdf\'asdf'

This means that len(s) + 2 (or 3 for bytes) != len(repr(s)), e.g.

>>> s = "\"''''''"
>>> s
'"\'\'\'\'\'\''
>>> len(s)
7
>>> len(repr(s))
15

This can lead to a weird partial trailing character:

>>> re.match(".*", "a"*48 + "'\"")
<_sre.SRE_Match object; span=(0, 50), match='aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa\>

This means I'll need to rethink len(group0) >= 48 as the condition for truncation (as a string of length 30 can be truncated by %.50R).

Maybe it makes sense to write group0 to a temp string, check whether that's truncated, and extract the quote character from that

OR

PyUnicode_FromFormat('%R', group0[:50]) # avoids trailing escape character ('\') but might be longer than 50 characters
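The two orderings discussed above (repr then cut, versus cut then repr) can be compared in pure Python. This is a sketch under the assumption that `%.50R` behaves like `repr(text)[:50]`; the helper names `cut_repr` and `repr_of_cut` are hypothetical and the code is not the `_sre` implementation.

```python
LIMIT = 50  # assumed character budget, mirroring "%.50R"


def cut_repr(text):
    """Repr first, then cut: may end on a lone backslash."""
    return repr(text)[:LIMIT]


def repr_of_cut(text):
    """Cut first, then repr: never splits an escape, but may exceed LIMIT."""
    return repr(text[:LIMIT])


# 48 plain characters followed by both quote characters, as in the message.
s = "a" * 48 + "'\""

print(len(s), len(repr(s)))  # the repr is longer than len(s) + 2
print(cut_repr(s))           # ends with a dangling backslash
print(repr_of_cut(s))        # complete literal, but longer than LIMIT
```

The second form always yields a well-formed literal, at the cost of a variable display width, which is the trade-off noted in the message.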
msg371864 - (view)	Author: Quentin Wenger (matpi)	Date: 2020-06-19 09:55
An additional difficulty exists for bytes regexes, because there non-ASCII characters are repr'ed using escape sequences. So there's a risk of cutting one in the middle.
```
>>> import re
>>> re.match(b".*", b"\xce")
<re.Match object; span=(0, 1), match=b'\xce'>
```
msg371867 - (view)	Author: Quentin Wenger (matpi)	Date: 2020-06-19 10:00
And ASCII escapes should also not be forgotten.
```
>>> re.match(b".", b"\t")
<re.Match object; span=(0, 1), match=b'\t'>
>>> re.match(".", "\t")
<re.Match object; span=(0, 1), match='\t'>
```
msg371868 - (view)	Author: Quentin Wenger (matpi)	Date: 2020-06-19 10:02
(but those are one-character escapes, so that should be fine - either the escape is complete or the backslash is trailing and can be "peeled of")
msg371869 - (view)	Author: Quentin Wenger (matpi)	Date: 2020-06-19 10:02
*off
msg371878 - (view)	Author: Quentin Wenger (matpi)	Date: 2020-06-19 11:11
Other pathological case: literal backslashes
```
>>> re.match(".*", r"\\\\\\")
<re.Match object; span=(0, 6), match='\\\\\\\\\\\\'>
```
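One way to sidestep all of the escape-splitting cases above (hex escapes in bytes, one-character escapes, literal backslashes) is to shorten the data rather than the repr, choosing the longest prefix whose complete repr still fits. The following is only a sketch of that idea: `safe_prefix_repr` is a hypothetical helper, the 50-character budget and the trailing `...` marker are assumptions, and nothing here reflects what the linked patch actually does.

```python
import ast


def safe_prefix_repr(data, limit=50):
    """Return repr(data) if it fits in `limit` characters; otherwise the repr
    of the longest prefix that fits together with a trailing '...'.

    Works for both str and bytes, and never cuts inside an escape sequence,
    because only complete reprs are ever produced.
    """
    full = repr(data)
    if len(full) <= limit:
        return full
    # Try ever shorter prefixes until the complete literal plus marker fits.
    for k in range(min(len(data), limit), -1, -1):
        candidate = repr(data[:k])
        if len(candidate) + 3 <= limit:
            return candidate + "..."
    return "..."  # budget too small for even an empty literal


raw = b"a" + b"\xce" * 20
print(repr(raw)[:50])         # naive cut: stops in the middle of an escape
print(safe_prefix_repr(raw))  # complete literal of a prefix, then '...'

slashes = "\\" * 60
print(safe_prefix_repr(slashes))
```

Because the part before the marker is always a complete literal, it can be parsed back with `ast.literal_eval` and compared against the original data, which gives a simple check that nothing was cut mid-escape.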

History
Date	User	Action	Args
2022-04-11 14:59:28	admin	set	github: 84130
2020-06-19 11:11:33	matpi	set	messages: + msg371878
2020-06-19 10:02:37	matpi	set	messages: + msg371869
2020-06-19 10:02:04	matpi	set	messages: + msg371868
2020-06-19 10:00:30	matpi	set	messages: + msg371867
2020-06-19 09:55:33	matpi	set	messages: + msg371864
2020-06-19 00:05:58	Seth.Troisi	set	messages: + msg371843
2020-06-16 21:53:36	matpi	set	messages: + msg371700
2020-06-16 21:29:20	matpi	set	messages: + msg371699
2020-06-16 21:19:38	Seth.Troisi	set	messages: + msg371698
2020-06-16 20:43:12	Seth.Troisi	set	keywords: + patch stage: needs patch -> patch review pull_requests: + pull_request20100
2020-06-16 12:56:43	matpi	set	messages: + msg371650
2020-06-16 12:54:54	eric.smith	unlink	issue40984 superseder
2020-06-16 12:52:20	eric.smith	set	messages: + msg371648
2020-06-16 12:42:13	matpi	set	nosy: + matpi messages: + msg371645
2020-06-16 12:30:13	eric.smith	set	messages: + msg371641
2020-06-16 12:28:14	eric.smith	link	issue40984 superseder
2020-06-16 09:49:49	Seth.Troisi	set	messages: + msg371626
2020-06-16 08:16:09	eric.smith	set	status: closed -> open nosy: + ezio.melotti, mrabarnett components: + Regular Expressions, - Library (Lib) resolution: not a bug -> stage: resolved -> needs patch
2020-06-16 07:11:03	rhettinger	set	nosy: + rhettinger messages: + msg371618
2020-03-24 20:57:35	Seth.Troisi	set	status: open -> closed resolution: not a bug stage: resolved
2020-03-12 22:10:27	eric.smith	set	nosy: + eric.smith messages: + msg364053 versions: + Python 3.9
2020-03-12 21:49:27	Seth.Troisi	create