msg303596 - (view) |
Author: Inada Naoki (methane) *  |
Date: 2017-10-03 11:15 |
Currently, strings.Template uses re.IGNORECASE without re.ASCII:
idpattern = r'[_a-z][_a-z0-9]*'
flags = _re.IGNORECASE
[a-z] matches against 'ı' (0x131, DOTLESS I) and 'ſ' (0x17f, LONG S).
It is not intentional, and it makes re.compile slower.
|
msg303618 - (view) |
Author: Barry A. Warsaw (barry) *  |
Date: 2017-10-03 13:51 |
Technically it *is* an API change since `flags` is a part of the public API. The documentation says:
$identifier names a substitution placeholder matching a mapping key of
"identifier". By default, "identifier" is restricted to any case-
insensitive ASCII alphanumeric string (including underscores) that starts
with an underscore or ASCII letter. The first non-identifier character
after the $ character terminates this placeholder specification.
This means if someone does subclass string.Template and changes the pattern to accept Unicode identifiers, then with this change they will also have to modify flags, whereas before they didn't.
It really wasn't ever the intention to allow non-ASCII identifiers, so this is probably safe in practice. OTOH, making the change for performance reasons might be questionable, given that the regular expressions are compiled by the Template's metaclass, so unlikely to contribute significantly to overall performance wins.
|
msg303627 - (view) |
Author: Inada Naoki (methane) *  |
Date: 2017-10-03 16:31 |
> This means if someone does subclass string.Template and changes the pattern to accept Unicode identifiers, then with this change they will also have to modify flags, whereas before they didn't.
Thank you for pointing it out.
I removed re.A flag after compilation.
> OTOH, making the change for performance reasons might be questionable, given that the regular expressions are compiled by the Template's metaclass, so unlikely to contribute significantly to overall performance wins.
original: import time: 2310 | 9589 | string
patched: import time: 1420 | 8702 | string
We can save about 900 us.
|
msg303670 - (view) |
Author: Raymond Hettinger (rhettinger) *  |
Date: 2017-10-04 05:34 |
When optimizing, please don't make API changes.
|
msg303671 - (view) |
Author: Serhiy Storchaka (serhiy.storchaka) *  |
Date: 2017-10-04 05:55 |
Yet one way -- make re.ASCII a local flag. Than we could just change the idpattern attribute to r'(?a:[_a-z][_a-z0-9]*)', without touching the flags attribute.
|
msg303673 - (view) |
Author: Inada Naoki (methane) *  |
Date: 2017-10-04 06:24 |
> When optimizing, please don't make API changes.
This is not only optimization, but bugfix.
Document of string.Template says:
> By default, "identifier" is restricted to any case-insensitive ASCII alphanumeric string (including underscores) that starts with an underscore or ASCII letter.
So, missing re.ASCII flag is bug because non-ASCII alphabet can be matched.
|
msg303674 - (view) |
Author: Inada Naoki (methane) *  |
Date: 2017-10-04 06:29 |
> Yet one way -- make re.ASCII a local flag. Than we could just change the idpattern attribute to r'(?a:[_a-z][_a-z0-9]*)', without touching the flags attribute.
https://docs.python.org/3.6/library/re.html#regular-expression-syntax says
`(?imsx-imsx:...)`
Anyway, Template.idpattern is documented public API too.
|
msg303675 - (view) |
Author: Inada Naoki (methane) *  |
Date: 2017-10-04 06:37 |
Current pull request override `Template.flags = re.I` after class creation for backward compatibility, without any API change.
But I'm not sure it's right approach.
How many people who subclass string.Template expect non-ASCII match?
If this change is bugfix for 99% of subclasses, why should we keep pitfall?
|
msg303692 - (view) |
Author: Barry A. Warsaw (barry) *  |
Date: 2017-10-04 13:57 |
On Oct 4, 2017, at 02:29, INADA Naoki <report@bugs.python.org> wrote:
>
> INADA Naoki <songofacandy@gmail.com> added the comment:
>
>> Yet one way -- make re.ASCII a local flag. Than we could just change the idpattern attribute to r'(?a:[_a-z][_a-z0-9]*)', without touching the flags attribute.
>
> https://docs.python.org/3.6/library/re.html#regular-expression-syntax says
> `(?imsx-imsx:...)`
>
> Anyway, Template.idpattern is documented public API too.
Too bad, because I like that approach. How hard would it be to add support for ‘a’ as a local flag? (I’m kind of surprised that it isn’t already supported - it seems like it would be useful.)
I would like this better than hacking Template.flags after the fact. It seems like the right way to align the code with the documentation (i.e. fix a bug).
|
msg303694 - (view) |
Author: Serhiy Storchaka (serhiy.storchaka) *  |
Date: 2017-10-04 14:05 |
See issue31690. But this solution can be used only in 3.7.
|
msg303696 - (view) |
Author: Barry A. Warsaw (barry) *  |
Date: 2017-10-04 14:08 |
On Oct 4, 2017, at 10:05, Serhiy Storchaka <report@bugs.python.org> wrote:
>
> See issue31690. But this solution can be used only in 3.7.
That’s fine. I don’t think this is important enough to backport.
|
msg303844 - (view) |
Author: Serhiy Storchaka (serhiy.storchaka) *  |
Date: 2017-10-06 19:04 |
Another solution (works in 3.6 too): set idpattern to r'(?-i:[_a-zA-Z][_a-zA-Z0-9]*)'.
This looks pretty weird, setting the re.IGNORECASE flag and then unsetting it. But it works, and don't break the user code that changes idpattern without changing flags.
|
msg304052 - (view) |
Author: Barry A. Warsaw (barry) *  |
Date: 2017-10-10 15:25 |
On Oct 6, 2017, at 15:04, Serhiy Storchaka <report@bugs.python.org> wrote:
> Serhiy Storchaka <storchaka+cpython@gmail.com> added the comment:
>
> Another solution (works in 3.6 too): set idpattern to r'(?-i:[_a-zA-Z][_a-zA-Z0-9]*)'.
>
> This looks pretty weird, setting the re.IGNORECASE flag and then unsetting it. But it works, and don't break the user code that changes idpattern without changing flags.
Oh, I think I like that :)
|
msg304053 - (view) |
Author: Barry A. Warsaw (barry) *  |
Date: 2017-10-10 15:25 |
With some comments to clarify of course.
|
msg304313 - (view) |
Author: Inada Naoki (methane) *  |
Date: 2017-10-13 07:02 |
New changeset b22273ec5d1992b0cbe078b887427ae9977dfb78 by INADA Naoki in branch 'master':
bpo-31672: Fix string.Template accidentally matched non-ASCII identifiers (GH-3872)
https://github.com/python/cpython/commit/b22273ec5d1992b0cbe078b887427ae9977dfb78
|
msg304376 - (view) |
Author: Inada Naoki (methane) *  |
Date: 2017-10-14 05:22 |
New changeset 7060380d577690a40ebc201c0725076349e977cd by INADA Naoki in branch '3.6':
bpo-31672: Fix string.Template accidentally matched non-ASCII identifiers (GH-3872)
https://github.com/python/cpython/commit/7060380d577690a40ebc201c0725076349e977cd
|
msg304392 - (view) |
Author: Inada Naoki (methane) *  |
Date: 2017-10-14 12:22 |
New changeset 073150db39408c1800e4b9e895ad0b0e195f1056 by INADA Naoki in branch 'master':
bpo-31672: doc: Remove one sentence from library/string.rst (GH-3990)
https://github.com/python/cpython/commit/073150db39408c1800e4b9e895ad0b0e195f1056
|
msg306350 - (view) |
Author: Serhiy Storchaka (serhiy.storchaka) *  |
Date: 2017-11-16 10:40 |
Is something left to do here?
In 3.7 r'(?-i:[_a-zA-Z][_a-zA-Z0-9]*)' can be replaced with r'(?a:[_a-z][_a-z0-9]*)' if it will add clarity.
|
msg306607 - (view) |
Author: Barry A. Warsaw (barry) *  |
Date: 2017-11-21 01:17 |
@inada.naoki - I think changing this to ?a will make things more clear. I submitted a PR for that, and then once it lands, we can close this issue.
|
msg306651 - (view) |
Author: Barry A. Warsaw (barry) *  |
Date: 2017-11-21 15:28 |
New changeset e256b408889eba867e1d90e5e1a0904843256255 by Barry Warsaw in branch 'master':
bpo-31672 - Add one last minor clarification for idpattern (#4483)
https://github.com/python/cpython/commit/e256b408889eba867e1d90e5e1a0904843256255
|
msg309468 - (view) |
Author: Serhiy Storchaka (serhiy.storchaka) *  |
Date: 2018-01-04 14:39 |
PR 5099 reverts in 3.7 a subtle behavior change made by the fix for this issue. Before merging the fix setting flags to 0 in a Template subclass made the default pattern matching only lower case letters. Currently setting flags to 0 doesn't have any effect.
|
msg309472 - (view) |
Author: Serhiy Storchaka (serhiy.storchaka) *  |
Date: 2018-01-04 17:20 |
New changeset 87be28f4a1c5b76926c71a3d9f92503f9eb82d51 by Serhiy Storchaka in branch 'master':
bpo-31672: Restore the former behavior when override flags in Template. (#5099)
https://github.com/python/cpython/commit/87be28f4a1c5b76926c71a3d9f92503f9eb82d51
|
msg309475 - (view) |
Author: Serhiy Storchaka (serhiy.storchaka) *  |
Date: 2018-01-04 17:40 |
Results.
There are 4 cases:
1) default idpattern and flags
2) overridden only idpattern
3) overridden only flags
4) overridden both idpattern and flags
The case 1 was the one that was broken when this issue was opened. The initial Inada's version fixed the case 1, but broke the case 2. His final version (applied also to 3.6) fixed the case 1, but broke the case 3. This is a win, because cases 1 and 2 look much more common than the case 3. And finally PR 5099 has fixed also the case 3. The case 4 obviously is not affected by any changes of default values.
Now all four cases are correct in 3.7 and the only broken case in 3.6 is the uncommon case 3.
|
|
Date |
User |
Action |
Args |
2022-04-11 14:58:53 | admin | set | github: 75853 |
2018-01-04 17:40:09 | serhiy.storchaka | set | status: open -> closed resolution: fixed messages:
+ msg309475
stage: resolved |
2018-01-04 17:20:14 | serhiy.storchaka | set | messages:
+ msg309472 |
2018-01-04 14:39:54 | serhiy.storchaka | set | status: closed -> open resolution: fixed -> (no value) messages:
+ msg309468
stage: resolved -> (no value) |
2018-01-04 14:35:12 | serhiy.storchaka | set | pull_requests:
+ pull_request4967 |
2017-11-21 15:28:31 | barry | set | status: open -> closed resolution: fixed stage: patch review -> resolved |
2017-11-21 15:28:16 | barry | set | messages:
+ msg306651 |
2017-11-21 01:17:16 | barry | set | messages:
+ msg306607 |
2017-11-21 01:12:45 | barry | set | pull_requests:
+ pull_request4420 |
2017-11-16 10:40:13 | serhiy.storchaka | set | messages:
+ msg306350 |
2017-10-14 12:22:45 | methane | set | messages:
+ msg304392 |
2017-10-14 05:45:03 | methane | set | pull_requests:
+ pull_request3966 |
2017-10-14 05:22:05 | methane | set | messages:
+ msg304376 |
2017-10-13 07:33:11 | methane | set | pull_requests:
+ pull_request3958 |
2017-10-13 07:02:27 | methane | set | messages:
+ msg304313 |
2017-10-10 15:25:52 | barry | set | messages:
+ msg304053 |
2017-10-10 15:25:28 | barry | set | messages:
+ msg304052 |
2017-10-06 19:04:41 | serhiy.storchaka | set | messages:
+ msg303844 |
2017-10-04 14:08:51 | barry | set | messages:
+ msg303696 |
2017-10-04 14:05:39 | serhiy.storchaka | set | messages:
+ msg303694 |
2017-10-04 13:57:11 | barry | set | messages:
+ msg303692 |
2017-10-04 06:37:11 | methane | set | messages:
+ msg303675 |
2017-10-04 06:29:15 | methane | set | messages:
+ msg303674 |
2017-10-04 06:24:18 | methane | set | messages:
+ msg303673 |
2017-10-04 05:55:22 | serhiy.storchaka | set | messages:
+ msg303671 |
2017-10-04 05:34:38 | rhettinger | set | nosy:
+ rhettinger messages:
+ msg303670
|
2017-10-03 16:32:21 | methane | set | type: behavior stage: patch review |
2017-10-03 16:31:54 | methane | set | messages:
+ msg303627 stage: patch review -> (no value) |
2017-10-03 16:27:54 | methane | set | keywords:
+ patch stage: patch review pull_requests:
+ pull_request3851 |
2017-10-03 13:51:50 | barry | set | messages:
+ msg303618 |
2017-10-03 12:54:43 | serhiy.storchaka | set | nosy:
+ barry, serhiy.storchaka
|
2017-10-03 12:50:19 | methane | link | issue31669 superseder |
2017-10-03 11:15:40 | methane | create | |