msg364052 - (view) |
Author: Seth Troisi (Seth.Troisi) * |
Date: 2020-03-12 21:49 |
Following on https://bugs.python.org/issue17087
Today I was mystified by why a regex wasn't working.
>>> import re
>>> re.match(r'.{10}', 'A'*49+'B')
<_sre.SRE_Match object; span=(0, 10), match='AAAAAAAAAA'>
>>> re.match(r'.{49}', 'A'*49+'B')
<_sre.SRE_Match object; span=(0, 49), match='AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA>
>>> re.match(r'.{50}', 'A'*49+'B')
<_sre.SRE_Match object; span=(0, 50), match='AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA>
I became confused about why the B wasn't matching in the third example. It is matching; it's just that in the interactive debugger the repr doesn't fit on the line, so the B doesn't show.
My suggestion would be to truncate match (in the repr) and append '...' when its right quote wouldn't show.
With short matches (or exactly enough space) there would be no change:
>>> import string
>>> re.match(r'.{48}', string.ascii_letters)
<_sre.SRE_Match object; span=(0, 48), match='abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUV'>
When not all of match can be displayed, the repr would change from the first line below (current) to the second (proposed):
>>> re.match(r'.{49}', string.ascii_letters)
<_sre.SRE_Match object; span=(0, 49), match='abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVW>
<_sre.SRE_Match object; span=(0, 49), match='abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRS'...>
I'm happy to help out by writing tests or the implementation if folks think this is a good idea.
I couldn't think of other examples (urllib maybe?) in Python of how this is handled, but I could potentially look for some if that would help.
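A minimal Python sketch of the proposed format, assuming the 50-character budget implied by the examples above. The names `match_repr` and `LIMIT` are hypothetical and only illustrate the idea; the real repr is implemented in C, and this sketch ignores escape sequences.

```python
import string

LIMIT = 50  # assumed budget for the repr of the matched text


def match_repr(text, limit=LIMIT):
    """Return repr(text), shortened to `limit` characters with a trailing '...'."""
    full = repr(text)
    if len(full) <= limit:
        return full  # short matches are unchanged
    quote = full[0]
    # Leave room for the closing quote and the three dots.
    body = full[1:limit - 4]
    return f"{quote}{body}{quote}..."


print(match_repr(string.ascii_letters[:48]))  # fits exactly, printed in full
print(match_repr(string.ascii_letters[:49]))
# 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRS'...
```

The closing quote is kept before the dots here, as in the proposed output above; whether the quote should appear at all is a separate design choice.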
|
msg364053 - (view) |
Author: Eric V. Smith (eric.smith) *  |
Date: 2020-03-12 22:10 |
I think the missing closing quote is supposed to be your visual clue that it's truncated. Although I'll grant you that it's pretty subtle.
|
msg371618 - (view) |
Author: Raymond Hettinger (rhettinger) *  |
Date: 2020-06-16 07:11 |
+1 for adding an ellipsis. It's a conventional way to indicate that the displayed data is truncated.
Concur with Eric that missing close quote is too subtle (and odd, and unexpected).
|
msg371626 - (view) |
Author: Seth Troisi (Seth.Troisi) * |
Date: 2020-06-16 09:49 |
I didn't propose a patch before because I was unsure of the decision. Now that there is a +1 from Raymond, I'll work on a patch and some documentation. Expect a patch within the week.
|
msg371641 - (view) |
Author: Eric V. Smith (eric.smith) *  |
Date: 2020-06-16 12:30 |
There was a discussion in issue40984 that the repr must be eval-able. I don't feel very strongly about this, mainly because I don't think anyone ever does eval(repr(some_regex)). I'd be slightly sympathetic to wanting the eval to fail if the repr had to truncate its output, instead of succeeding because the string was still a valid, but different, regex.
|
msg371645 - (view) |
Author: Quentin Wenger (matpi) |
Date: 2020-06-16 12:42 |
For a bit of background, the other issue is about the repr of compiled patterns, not match objects.
Please see my argument there about the conformance to repr's doc - merely adding an ellipsis would _not_ solve this case.
However, I have nothing against the pattern being truncated/ellipsed when inside the repr of a match object.
|
msg371648 - (view) |
Author: Eric V. Smith (eric.smith) *  |
Date: 2020-06-16 12:52 |
Ah, I see. I missed that this issue was only about match objects. I apologize for the confusion.
That being the case, I'll re-open the other issue.
|
msg371650 - (view) |
Author: Quentin Wenger (matpi) |
Date: 2020-06-16 12:56 |
@eric.smith thanks, no problem.
If I can give any advice on the present issue, I would suggest putting the ellipsis _inside_ the quote, to make clear that the pattern is being truncated, not the match. So instead of
```
<_sre.SRE_Match object; span=(0, 49), match='abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRS'...>
```
as suggested by @Seth.Troisi, I'd suggest
```
<_sre.SRE_Match object; span=(0, 49), match='abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRS...'>
```
|
msg371698 - (view) |
Author: Seth Troisi (Seth.Troisi) * |
Date: 2020-06-16 21:19 |
@matpi
The current behavior is for the right quote not to appear. I kept this behavior, but I'm happy to consider changing that.
See the linked patch for examples.
|
msg371699 - (view) |
Author: Quentin Wenger (matpi) |
Date: 2020-06-16 21:29 |
Oh ok, I was misled by the example in your first message, where you did have both the quote and ellipsis.
I don't have a strong opinion.
- having the quote is a bit more "clean"
- but not having it makes clear that the pattern is truncated (per se, three dots is a valid pattern)
The best would be to find a precedent in the stdlib, but I currently cannot think of any either.
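The ambiguity in the second bullet can be illustrated with a small Python sketch, as I read it. `ellipsis_inside` is a hypothetical helper, not anything from the patch, and the 50-character limit is assumed from the examples in this thread.

```python
def ellipsis_inside(text, limit=50):
    """Hypothetical variant: the dots go inside the closing quote."""
    full = repr(text)
    if len(full) <= limit:
        return full
    # Keep the opening quote and as much text as fits before "...'".
    return full[:limit - 4] + "..." + full[0]


truncated = ellipsis_inside("a" * 60)         # really truncated
literal = ellipsis_inside("a" * 45 + "...")   # ends in three real dots, fits as is
print(truncated == literal)  # True
```

With the dots inside the quotes, a truncated match and a match that genuinely ends in three dots produce the same text, which is the argument for a marker that cannot occur in an ordinary repr.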
|
msg371700 - (view) |
Author: Quentin Wenger (matpi) |
Date: 2020-06-16 21:53 |
File objects are an example of an angle-bracket repr with string parameters in the repr, but no truncation is performed (see https://github.com/python/cpython/blob/master/Modules/_io/textio.c#L2912).
Various truncations with the same (lack of?) clarity are done in the stdlib, see e.g. https://github.com/python/cpython/blob/04fc4f2a46b2fd083639deb872c3a3037fdb47d6/Objects/longobject.c#L2475.
|
msg371843 - (view) |
Author: Seth Troisi (Seth.Troisi) * |
Date: 2020-06-19 00:05 |
I was thinking about how to add the end quote and found these weird cases:
>>> "asdf'asdf'asdf"
"asdf'asdf'asdf"
>>> "asdf\"asdf\"asdf"
'asdf"asdf"asdf'
>>> "asdf\"asdf'asdf"
'asdf"asdf\'asdf'
This means that len(s) + 2 (or 3 for bytes) != len(repr(s)), e.g.
>>> s = "\"''''''"
>>> s
'"\'\'\'\'\'\''
>>> len(s)
7
>>> len(repr(s))
15
This can lead to a weird partial trailing character:
>>> re.match(".*", "a"*48 + "'\"")
<_sre.SRE_Match object; span=(0, 50), match='aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa\>
This means I'll need to rethink len(group0) >= 48 as the condition for truncation, since even a 30-character string can be truncated by %.50R.
Maybe it makes sense to write group0 to a temp string, then check whether that's truncated and extract the quote character from it
OR
PyUnicode_FromFormat('%R', group0[:50]) # avoids trailing escape character ('\') but might be longer than 50 characters
|
msg371864 - (view) |
Author: Quentin Wenger (matpi) |
Date: 2020-06-19 09:55 |
An additional difficulty exists for bytes regexes, because there non-ASCII characters are repr'ed using escape sequences. So there's a risk of cutting one in the middle.
```
>>> import re
>>> re.match(b".*", b"\xce")
<re.Match object; span=(0, 1), match=b'\xce'>
```
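A short Python sketch of that risk, assuming the repr is simply cut at 50 characters (the names `data`, `full` and `cut` are only for illustration):

```python
data = b"a" + b"\xce" * 20   # one ASCII byte, then twenty non-ASCII bytes

full = repr(data)            # each \xce takes four characters in the repr
cut = full[:50]              # naive cut at the 50-character budget

print(len(full))   # 84
print(cut[-3:])    # \xc  (the last escape lost its final hex digit)
```

The leading `a` shifts the four-character escapes so that the cut no longer falls on an escape boundary.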
|
msg371867 - (view) |
Author: Quentin Wenger (matpi) |
Date: 2020-06-19 10:00 |
And ascii escapes should also not be forgotten.
```
>>> re.match(b".*", b"\t")
<re.Match object; span=(0, 1), match=b'\t'>
>>> re.match(".*", "\t")
<re.Match object; span=(0, 1), match='\t'>
```
|
msg371868 - (view) |
Author: Quentin Wenger (matpi) |
Date: 2020-06-19 10:02 |
(but those are one-character escapes, so that should be fine - either the escape is complete or the backslash is trailing and can be "peeled of")
|
msg371869 - (view) |
Author: Quentin Wenger (matpi) |
Date: 2020-06-19 10:02 |
*off
|
msg371878 - (view) |
Author: Quentin Wenger (matpi) |
Date: 2020-06-19 11:11 |
Another pathological case: literal backslashes
```
>>> re.match(".*", r"\\\\\\")
<re.Match object; span=(0, 6), match='\\\\\\\\\\\\'>
```
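One way to handle quotes, escapes and literal backslashes uniformly is to walk the repr token by token and stop before the budget is exceeded. The following Python sketch is only an illustration under assumptions: `truncate_repr` and `TOKEN` are hypothetical names, the 50-character limit is taken from this thread, and the closing quote is omitted as in the current behavior. It is not the C patch.

```python
import re

# One token of a str/bytes repr body: an escape sequence or a single character.
TOKEN = re.compile(r"\\(?:x[0-9a-f]{2}|u[0-9a-f]{4}|U[0-9a-f]{8}|.)|.", re.DOTALL)


def truncate_repr(obj, limit=50):
    """Shorten repr(obj) for str/bytes without splitting an escape sequence."""
    full = repr(obj)
    if len(full) <= limit:
        return full
    quote = full[-1]
    # Prefix is the opening quote, preceded by "b" for bytes.
    prefix = full[:full.index(quote) + 1]
    body = full[len(prefix):-1]
    out = prefix
    for tok in TOKEN.finditer(body):
        # Stop if this token plus the three dots would exceed the budget.
        if len(out) + len(tok.group()) + 3 > limit:
            break
        out += tok.group()
    return out + "..."


print(truncate_repr("\\" * 30))              # backslashes are kept in pairs
print(truncate_repr(b"a" + b"\xce" * 20))    # ends with a complete \xce
```

Scanning from the left resolves backslash parity automatically, so there is no need to guess whether a trailing backslash starts an escape or ends an escaped backslash.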
|
|
Date | User | Action | Args |
2022-04-11 14:59:28 | admin | set | github: 84130 |
2020-06-19 11:11:33 | matpi | set | messages: + msg371878 |
2020-06-19 10:02:37 | matpi | set | messages: + msg371869 |
2020-06-19 10:02:04 | matpi | set | messages: + msg371868 |
2020-06-19 10:00:30 | matpi | set | messages: + msg371867 |
2020-06-19 09:55:33 | matpi | set | messages: + msg371864 |
2020-06-19 00:05:58 | Seth.Troisi | set | messages: + msg371843 |
2020-06-16 21:53:36 | matpi | set | messages: + msg371700 |
2020-06-16 21:29:20 | matpi | set | messages: + msg371699 |
2020-06-16 21:19:38 | Seth.Troisi | set | messages: + msg371698 |
2020-06-16 20:43:12 | Seth.Troisi | set | keywords: + patch; stage: needs patch -> patch review; pull_requests: + pull_request20100 |
2020-06-16 12:56:43 | matpi | set | messages: + msg371650 |
2020-06-16 12:54:54 | eric.smith | unlink | issue40984 superseder |
2020-06-16 12:52:20 | eric.smith | set | messages: + msg371648 |
2020-06-16 12:42:13 | matpi | set | nosy: + matpi; messages: + msg371645 |
2020-06-16 12:30:13 | eric.smith | set | messages: + msg371641 |
2020-06-16 12:28:14 | eric.smith | link | issue40984 superseder |
2020-06-16 09:49:49 | Seth.Troisi | set | messages: + msg371626 |
2020-06-16 08:16:09 | eric.smith | set | status: closed -> open; nosy: + ezio.melotti, mrabarnett; components: + Regular Expressions, - Library (Lib); resolution: not a bug -> ; stage: resolved -> needs patch |
2020-06-16 07:11:03 | rhettinger | set | nosy: + rhettinger; messages: + msg371618 |
2020-03-24 20:57:35 | Seth.Troisi | set | status: open -> closed; resolution: not a bug; stage: resolved |
2020-03-12 22:10:27 | eric.smith | set | nosy: + eric.smith; messages: + msg364053; versions: + Python 3.9 |
2020-03-12 21:49:27 | Seth.Troisi | create | |