classification
Title: truncating match in regular expression match objects repr
Type: enhancement Stage: patch review
Components: Regular Expressions Versions: Python 3.9
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: Seth.Troisi, eric.smith, ezio.melotti, matpi, mrabarnett, rhettinger, serhiy.storchaka
Priority: normal Keywords: patch

Created on 2020-03-12 21:49 by Seth.Troisi, last changed 2020-06-19 11:11 by matpi.

Pull Requests
URL Status Linked Edit
PR 20922 closed Seth.Troisi, 2020-06-16 20:43
Messages (17)
msg364052 - (view) Author: Seth Troisi (Seth.Troisi) * Date: 2020-03-12 21:49
Following on https://bugs.python.org/issue17087

Today I was mystified by why a regex wasn't working.

    >>> import re
    >>> re.match(r'.{10}', 'A'*49+'B')
    <_sre.SRE_Match object; span=(0, 10), match='AAAAAAAAAA'>

    >>> re.match(r'.{49}', 'A'*49+'B')
    <_sre.SRE_Match object; span=(0, 49), match='AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA>

    >>> re.match(r'.{50}', 'A'*49+'B')
    <_sre.SRE_Match object; span=(0, 50), match='AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA>

I was confused about why the B wasn't matching in the third example. It is matching; the repr
just doesn't fit on the line in the interactive debugger, so the B isn't shown.
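
A quick check confirms this: the match object does contain the trailing B, and only its repr hides it. This is a minimal sketch using only the standard `re` API.

```python
import re

m = re.match(r'.{50}', 'A' * 49 + 'B')

# The matched text itself is complete: 50 characters, ending in 'B'.
matched = m.group(0)
print(len(matched))   # 50
print(matched[-1])    # B

# Only the repr is clipped; printing it shows what the REPL displays.
print(repr(m))
```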


My suggestion would be to truncate the match (in the repr) and append '...' when its right quote wouldn't be shown.


With short matches (or exactly enough space) there would be no change:

    >>> import string
    >>> re.match(r'.{48}', string.ascii_letters)
    <_sre.SRE_Match object; span=(0, 48), match='abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUV'>

When not all of the match can be displayed (current output first, proposed output second):

    >>> re.match(r'.{49}', string.ascii_letters)
    <_sre.SRE_Match object; span=(0, 49), match='abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVW>
    <_sre.SRE_Match object; span=(0, 49), match='abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRS'...>
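
One way to sketch the proposed formatting in pure Python, purely as an illustration: the helper name `match_field` and the `width=50` budget are assumptions chosen to reproduce the outputs above, not the actual C implementation in `_sre`.

```python
import string

def match_field(text, width=50):
    """Format the match=... part of the repr within `width` characters."""
    full = repr(text)
    if len(full) <= width:
        return full  # short enough: no change
    # A repr is at least len(text) + 2, and the ellipsis needs 3 more.
    keep = max(0, min(len(text), width - 5))
    # Escapes can make the repr longer, so shrink until it fits.
    while keep > 0 and len(repr(text[:keep])) + 3 > width:
        keep -= 1
    return repr(text[:keep]) + "..."

print(match_field(string.ascii_letters[:48]))
print(match_field(string.ascii_letters[:49]))
```

Slicing the text before calling `repr` keeps the closing quote and never cuts an escape sequence in half, at the cost of showing fewer characters when many of them need escaping.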


I'm happy to help out by writing tests or an implementation if folks think this is a good idea.

I couldn't think of other examples in Python (urllib, maybe?) of how this is handled, but I could look for some if that would help.
msg364053 - (view) Author: Eric V. Smith (eric.smith) * (Python committer) Date: 2020-03-12 22:10
I think the missing closing quote is supposed to be your visual clue that it's truncated. Although I'll grant you that it's pretty subtle.
msg371618 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2020-06-16 07:11
+1 for adding an ellipsis.  It's a conventional way to indicate that the displayed data is truncated.

Concur with Eric that missing close quote is too subtle (and odd, and unexpected).
msg371626 - (view) Author: Seth Troisi (Seth.Troisi) * Date: 2020-06-16 09:49
I didn't propose a patch before because I was unsure of the decision. Now that there is a +1 from Raymond, I'll work on a patch and some documentation. Expect a patch within the week.
msg371641 - (view) Author: Eric V. Smith (eric.smith) * (Python committer) Date: 2020-06-16 12:30
There was a discussion in issue40984 that the repr must be eval-able. I don't feel very strongly about this, mainly because I don't think anyone ever does eval(repr(some_regex)). I'd be slightly sympathetic to wanting the eval to fail if the repr had to truncate its output, instead of succeeding because the string was still a valid, but different, regex.
msg371645 - (view) Author: Quentin Wenger (matpi) Date: 2020-06-16 12:42
For a bit of background, the other issue is about the repr of compiled patterns, not match objects.
Please see my argument there about the conformance to repr's doc - merely adding an ellipsis would _not_ solve this case.

I have however nothing against the pattern being truncated/ellipsed when inside the repr of a match object.
msg371648 - (view) Author: Eric V. Smith (eric.smith) * (Python committer) Date: 2020-06-16 12:52
Ah, I see. I missed that this issue was only about match objects. I apologize for the confusion.

That being the case, I'll re-open the other issue.
msg371650 - (view) Author: Quentin Wenger (matpi) Date: 2020-06-16 12:56
@eric.smith thanks, no problem.

If I can give any advice on this present issue, I would suggest putting the ellipsis _inside_ the quote, to make clear that the pattern is being truncated, not the match. So instead of

```
<_sre.SRE_Match object; span=(0, 49), match='abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRS'...>
```

as suggested by @Seth.Troisi, I'd suggest

```
<_sre.SRE_Match object; span=(0, 49), match='abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRS...'>
```
msg371698 - (view) Author: Seth Troisi (Seth.Troisi) * Date: 2020-06-16 21:19
@matpi

The current behavior is for the right quote not to appear. I kept this behavior, but I'm happy to consider changing that.

See the linked patch for examples
msg371699 - (view) Author: Quentin Wenger (matpi) Date: 2020-06-16 21:29
Oh ok, I was misled by the example in your first message, where you did have both the quote and ellipsis.

I don't have a strong opinion.
- having the quote is a bit more "clean"
- but not having it makes clear that the pattern is truncated (per se, three dots is a valid pattern)
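
The ambiguity in the second point can be seen with a match whose text really ends in three dots: its untruncated repr already ends in `...'`, which is what a truncated repr would look like if the ellipsis were placed inside the quote. A small illustration:

```python
import re

# The matched text really ends in three dots; nothing is truncated here.
m = re.match(r".*", "abc...")
print(m.group(0))   # abc...
print(repr(m))      # the match field reads match='abc...'
```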

The best would be to find a precedent in the stdlib, but I currently cannot think of any either.
msg371700 - (view) Author: Quentin Wenger (matpi) Date: 2020-06-16 21:53
File objects are an example of an angle-bracket repr with string parameters in the repr, but no truncation is performed (see https://github.com/python/cpython/blob/master/Modules/_io/textio.c#L2912).

Various truncations with the same (lack of?) clarity are done in the stdlib, see e.g. https://github.com/python/cpython/blob/04fc4f2a46b2fd083639deb872c3a3037fdb47d6/Objects/longobject.c#L2475.
msg371843 - (view) Author: Seth Troisi (Seth.Troisi) * Date: 2020-06-19 00:05
I was thinking about how to add the end quote and found these weird cases:
  >>> "asdf'asdf'asdf"
  "asdf'asdf'asdf"
  >>> "asdf\"asdf\"asdf"
  'asdf"asdf"asdf'
  >>> "asdf\"asdf'asdf"
  'asdf"asdf\'asdf'

This means that len(s) +2 (or 3 for bytes) != len(repr(s))
e.g.

>>> s = "\"''''''"
>>> s
'"\'\'\'\'\'\''
>>> len(s)
7
>>> len(repr(s))
15
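
The size of the gap depends on how many characters need escaping, as this small comparison shows (the sample strings are the ones from the sessions above, plus a hypothetical bytes value):

```python
samples = ["asdf'asdf'asdf", 'asdf"asdf"asdf', "asdf\"asdf'asdf", "\"''''''", b"abc"]

for s in samples:
    # Overhead = quotes (and the b prefix) plus one backslash per escaped character.
    overhead = len(repr(s)) - len(s)
    print(repr(s), len(s), len(repr(s)), overhead)

overheads = [len(repr(s)) - len(s) for s in samples]
```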

This can lead to a weird partial trailing character:
  >>> re.match(".*", "a"*48 + "'\"")
  <_sre.SRE_Match object; span=(0, 50), match='aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa\>
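
The stray backslash can be reproduced in pure Python by clipping the full repr to 50 characters, which is assumed here to mimic what the `%.50R` format does:

```python
text = "a" * 48 + "'\""

full = repr(text)      # both quote kinds present, so the ' is escaped as \'
clipped = full[:50]    # assumption: stands in for the %.50R precision

print(len(full))       # 53
print(clipped[-1])     # the lone backslash left over from the cut \' escape
```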


This means I'll need to rethink len(group0) >= 48 as the condition for truncation (as a string of length 30 can be truncated by %.50R).

Maybe it makes sense to write group0 to a temp string and then check if that's truncated and extract the quote character from that
OR
PyUnicode_FromFormat('%R', group0[:50]) # avoids trailing escape character ('\') but might be longer than 50 characters
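
A rough Python equivalent of the second option, slicing the matched text before taking its repr. The variable `group0` mirrors the name used above; the sample text is made up to show the worst case.

```python
import ast

group0 = "'\"" * 30          # hypothetical match text: 60 chars, alternating ' and "
shown = repr(group0[:50])    # slice first, then repr

# No escape is cut in half, so the result is still a valid string literal...
assert ast.literal_eval(shown) == group0[:50]

# ...but it is longer than 50 characters because each ' needs a backslash.
print(len(shown))            # 77
```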
msg371864 - (view) Author: Quentin Wenger (matpi) Date: 2020-06-19 09:55
An additional difficulty exists for bytes regexes, because non-ASCII characters there are repr'ed using escape sequences. So there's a risk of cutting one in the middle.

```
>>> import re
>>> re.match(b".*", b"\xce")
<re.Match object; span=(0, 1), match=b'\xce'>
```
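
To illustrate the risk, the sketch below clips a bytes repr at 50 characters (again an assumption standing in for `%.50R`); the sample data is hypothetical and sized so that the cut falls inside the `\xce` escape.

```python
data = b"a" * 46 + b"\xce"

full = repr(data)      # b'aaaa...a\xce'
clipped = full[:50]    # b' + 46 a's + the first two characters of \xce

print(len(full))       # 53
print(clipped[-2:])    # \x
```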
msg371867 - (view) Author: Quentin Wenger (matpi) Date: 2020-06-19 10:00
And ascii escapes should also not be forgotten.

```
>>> re.match(b".*", b"\t")
<re.Match object; span=(0, 1), match=b'\t'>
>>> re.match(".*", "\t")
<re.Match object; span=(0, 1), match='\t'>
```
msg371868 - (view) Author: Quentin Wenger (matpi) Date: 2020-06-19 10:02
(but those are one-character escapes, so that should be fine - either the escape is complete or the backslash is trailing and can be "peeled of")
msg371869 - (view) Author: Quentin Wenger (matpi) Date: 2020-06-19 10:02
*off
msg371878 - (view) Author: Quentin Wenger (matpi) Date: 2020-06-19 11:11
Another pathological case: literal backslashes

```
>>> re.match(".*", r"\\\\\\")
<re.Match object; span=(0, 6), match='\\\\\\\\\\\\'>
```
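
With literal backslashes in play, a trailing backslash in a clipped repr is only a dangling escape when the run of trailing backslashes has odd length. A minimal sketch of that check; the helper name `peel_dangling_backslash` is hypothetical, and it only covers backslash pairs, not longer escapes such as `\xce`:

```python
def peel_dangling_backslash(clipped):
    """Drop a final backslash only if it starts an unfinished escape."""
    trailing = len(clipped) - len(clipped.rstrip("\\"))
    # Even run: every backslash is paired (\\), so nothing is dangling.
    return clipped[:-1] if trailing % 2 else clipped

full = repr(r"\\\\\\")    # six literal backslashes become twelve in the repr
print(peel_dangling_backslash(full[:8]))   # cut mid-pair: one backslash removed
print(peel_dangling_backslash(full[:7]))   # cut between pairs: unchanged
```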
History
Date User Action Args
2020-06-19 11:11:33 matpi set messages: + msg371878
2020-06-19 10:02:37 matpi set messages: + msg371869
2020-06-19 10:02:04 matpi set messages: + msg371868
2020-06-19 10:00:30 matpi set messages: + msg371867
2020-06-19 09:55:33 matpi set messages: + msg371864
2020-06-19 00:05:58 Seth.Troisi set messages: + msg371843
2020-06-16 21:53:36 matpi set messages: + msg371700
2020-06-16 21:29:20 matpi set messages: + msg371699
2020-06-16 21:19:38 Seth.Troisi set messages: + msg371698
2020-06-16 20:43:12 Seth.Troisi set keywords: + patch
stage: needs patch -> patch review
pull_requests: + pull_request20100
2020-06-16 12:56:43 matpi set messages: + msg371650
2020-06-16 12:54:54 eric.smith unlink issue40984 superseder
2020-06-16 12:52:20 eric.smith set messages: + msg371648
2020-06-16 12:42:13 matpi set nosy: + matpi
messages: + msg371645
2020-06-16 12:30:13 eric.smith set messages: + msg371641
2020-06-16 12:28:14 eric.smith link issue40984 superseder
2020-06-16 09:49:49 Seth.Troisi set messages: + msg371626
2020-06-16 08:16:09 eric.smith set status: closed -> open
nosy: + ezio.melotti, mrabarnett
components: + Regular Expressions, - Library (Lib)
resolution: not a bug ->
stage: resolved -> needs patch
2020-06-16 07:11:03 rhettinger set nosy: + rhettinger
messages: + msg371618
2020-03-24 20:57:35 Seth.Troisi set status: open -> closed
resolution: not a bug
stage: resolved
2020-03-12 22:10:27 eric.smith set nosy: + eric.smith
messages: + msg364053
versions: + Python 3.9
2020-03-12 21:49:27 Seth.Troisi create