Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

string.Template should allow inspection of identifiers #90465

Closed
benkehoe mannequin opened this issue Jan 8, 2022 · 12 comments
Closed

string.Template should allow inspection of identifiers #90465

benkehoe mannequin opened this issue Jan 8, 2022 · 12 comments
Labels
3.11 only security fixes stdlib Python modules in the Lib dir type-feature A feature request or enhancement

Comments

@benkehoe
Copy link
Mannequin

benkehoe mannequin commented Jan 8, 2022

BPO 46307
Nosy @warsaw, @serhiy-storchaka, @miss-islington, @benkehoe
PRs
  • bpo-46307: Add string.Template.get_identifiers() method #30493
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = None
    closed_at = <Date 2022-01-12.22:12:21.559>
    created_at = <Date 2022-01-08.20:12:29.789>
    labels = ['type-feature', 'library', '3.11']
    title = 'string.Template should allow inspection of identifiers'
    updated_at = <Date 2022-01-12.22:12:21.559>
    user = 'https://github.com/benkehoe'

    bugs.python.org fields:

    activity = <Date 2022-01-12.22:12:21.559>
    actor = 'AlexWaygood'
    assignee = 'none'
    closed = True
    closed_date = <Date 2022-01-12.22:12:21.559>
    closer = 'AlexWaygood'
    components = ['Library (Lib)']
    creation = <Date 2022-01-08.20:12:29.789>
    creator = 'ben11kehoe'
    dependencies = []
    files = []
    hgrepos = []
    issue_num = 46307
    keywords = ['patch']
    message_count = 11.0
    messages = ['410112', '410127', '410128', '410129', '410131', '410144', '410148', '410172', '410174', '410175', '410322']
    nosy_count = 4.0
    nosy_names = ['barry', 'serhiy.storchaka', 'miss-islington', 'ben11kehoe']
    pr_nums = ['30493']
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'enhancement'
    url = 'https://bugs.python.org/issue46307'
    versions = ['Python 3.11']

    @benkehoe
    Copy link
    Mannequin Author

    benkehoe mannequin commented Jan 8, 2022

    Currently, the only thing that can be done with a string.Template instance and a mapping is either attempt to substitute with substitute() and catch a KeyError if some identifier has not been provided in the mapping, or substitute with safe_substitute() and not know whether all identifiers were provided.

    I propose adding a method that returns the identifiers in the template. Because the template string and pattern are exposed, this is already possible as a separate function:

    def get_identifiers(template):
        return list(
            set(
                filter(
                    lambda v: v is not None,
                    (mo.group('named') or mo.group('braced') 
                     for mo in template.pattern.finditer(template.template))
                )
            )
        )

    However, this function is not easy for a user of string.Template to construct without learning how the template pattern works (which is documented but intended to be learned only when subclassing or modifying id patterns).

    As a method on string.Template, this would enable use cases like more comprehensive error handling (e.g., finding all missing mapping keys at once) or interactive prompting.

    @benkehoe benkehoe mannequin added 3.11 only security fixes stdlib Python modules in the Lib dir type-feature A feature request or enhancement labels Jan 8, 2022
    @warsaw
    Copy link
    Member

    warsaw commented Jan 9, 2022

    I've never personally needed this, but I could see where it could come in handy.

    I wonder if __iter__() would be the right API for that? I wonder then if we should also implement __contains__()?

    Would you be interested in creating a PR for the feature?

    @benkehoe
    Copy link
    Mannequin Author

    benkehoe mannequin commented Jan 9, 2022

    Happy to make a PR! In my mind I had been thinking it would be the get_identifiers() method with the implementation above, returning a list.

    As for __iter__, I'm less clear on what that would look like:

    t = string.Template(...)
    for identifier in t:
      # what would I do here?
      # would it include repeats if they appear more than once in the template?

    I guess there are two ways to think about it: one is "what identifiers are in this template?" which I think should return a list with no repeats, which I can then iterate over or check if a value is in it. The other is, "what are the contents of the template?" in the style of string.Formatter.parse().

    Given that string.Template is supposed to be the "simple, no-frills" thing in comparison to string.Formatter, I see less use for the latter option.

    @warsaw
    Copy link
    Member

    warsaw commented Jan 9, 2022

    I think you’re right that the iterator API isn’t very helpful. I also agree that you probably really want to answer the “why identifiers are in this template?” question. As for repeats, there’s two ways to think about it. You could return all the identifiers in the order in which they’re found in the template (and you can unique-ify them if you want by passing that list to set()). But maybe you don’t really need that either.

    get_identifiers() works for me!

    On Jan 8, 2022, at 18:51, Ben Kehoe <report@bugs.python.org> wrote:

    Ben Kehoe <ben@kehoe.io> added the comment:

    Happy to make a PR! In my mind I had been thinking it would be the get_identifiers() method with the implementation above, returning a list.

    As for __iter__, I'm less clear on what that would look like:

    t = string.Template(...)
    for identifier in t:

    what would I do here?

    would it include repeats if they appear more than once in the template?

    I guess there are two ways to think about it: one is "what identifiers are in this template?" which I think should return a list with no repeats, which I can then iterate over or check if a value is in it. The other is, "what are the contents of the template?" in the style of string.Formatter.parse().

    Given that string.Template is supposed to be the "simple, no-frills" thing in comparison to string.Formatter, I see less use for the latter option.

    @benkehoe
    Copy link
    Mannequin Author

    benkehoe mannequin commented Jan 9, 2022

    I opened a PR. By default, it raises an exception if there's an invalid identifier; there's a keyword argument raise_on_invalid to control that.

    The implementation I have adds them to a set first, which means the order is not guaranteed. I'm of two minds about this: if there's a big template, you want to gather the identifiers in a set so uniqueness is checked immediately and O(1) and without duplication. On the other hand, if templates are never very big, you could build a list (in order) and check it, O(N) style, or build a duplicate list and set in parallel. Or build a big list and check it at the end.

    I kind of think ordering doesn't matter? What would someone do with that information?

    @benkehoe
    Copy link
    Mannequin Author

    benkehoe mannequin commented Jan 9, 2022

    Having slept on it, I realized that if I was presenting interactive prompts for a template, I would expect the prompts to be in order that the identifiers appear in the template. Accordingly, I've updated the PR to maintain ordering.

    @serhiy-storchaka
    Copy link
    Member

    What are the use cases for this feature?

    @benkehoe
    Copy link
    Mannequin Author

    benkehoe mannequin commented Jan 9, 2022

    The point is to be able to programmatically determine what is needed for a
    successful substitute() call. A basic use case for this is better error
    messages; calling substitute() with an incomplete mapping will tell you
    only the first missing identifier it encounters; if you know all the
    identifiers you can raise an error about all the missing identifiers.
    Another error handling use case is checking whether the template is valid,
    without needing to provide a complete mapping. A use case unrelated to
    error handling that I’ve encountered a few times is interactive prompting
    for template values, which you can only do if you can get a list of the
    identifiers in the template.

    @serhiy-storchaka
    Copy link
    Member

    The simplest way of collecting template names is to use a defaultdict:

    >>> d = collections.defaultdict(str)
    >>> string.Template('$a $b').substitute(d)
    ' '
    >>> d.keys()
    dict_keys(['a', 'b'])

    You can use a custom mapping if you need special handling of absent keys.

    @benkehoe
    Copy link
    Mannequin Author

    benkehoe mannequin commented Jan 9, 2022

    That doesn’t really seem like a Pythonic way of extracting that
    information? Nor does it seem like it would be an obvious trick for the
    average developer to come up with. A method that provides the information
    directly seems useful.

    @miss-islington
    Copy link
    Contributor

    New changeset dce642f by Ben Kehoe in branch 'main':
    bpo-46307: Add string.Template.get_identifiers() method (GH-30493)
    dce642f

    @jgiacobbi
    Copy link

    I have a use case that has been made more complicated by the decision not to list identifiers with duplicates. I am substituting identifiers with capture groups to match output strings. Since re does not allow repeated named capture groups, I have to use unnamed capture groups and match them to their identifiers after the fact. So that leaves three cases:

    1. No repeats, i.e. len(groups) == len(ids)
    2. Repeats in the template but not the groups, so deduplicating the capture groups matches the template since they are both in order
    3. The worst case, repeats in the capture group contents that could fit into more than one identifier and are not actually duplicates.

    If the identifiers were listed naively then every case would be case 1.

    This is the code I have to work around this problem:

        def parse_groups(self, groups: tuple[str]) -> dict[str, str]:
            ids = self.template.get_identifiers()
            if len(ids) == len(groups):
                return dict(zip(ids, groups))
    
            # fastest way to dedup and preserve order in 3.7+
            unique = list(dict.fromkeys(groups))
            if len(ids) == len(unique):
                return dict(zip(ids, unique))
    
            known = set()
            matched = {}
    
            found = iter(groups)
    
            # ids and groups are in the order found in the string being matched
            # duplicates can be skipped, but sometimes a duplicate isn't a duplicate,
            # ex. (\d{2})/(\d{2}) - 15/15 - hour/minute
            for id in ids:
                try:
                    while value := next(found):
                        if re.match(self.patterns[id], value):
                            matched[id] = value
                            known.add(value)
                            break
                        elif value in known:
                            break  # skip duplicates
                except StopIteration:
                    break
    
            if len(ids) == len(matched):
                return matched
    

    The pull request to change this behavior is simple and obvious, but I haven't made it because I have questions

    1. Am I just wrong? Is there an easy way to do this that I am overlooking?
    2. Is there a good reason not to leave it up to the consumer if they want to convert to a set or an ordered set? Doing it for them destroys information.
    3. If both use cases are valuable then why not support both?

    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    3.11 only security fixes stdlib Python modules in the Lib dir type-feature A feature request or enhancement
    Projects
    None yet
    Development

    No branches or pull requests

    5 participants