Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

inspect.getsource returns incorrect source for classes when class definition is part of multiline strings #79294

Closed
tirkarthi opened this issue Oct 30, 2018 · 15 comments
Labels
3.9 only security fixes stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error

Comments

@tirkarthi
Copy link
Member

BPO 35113
Nosy @ambv, @serhiy-storchaka, @1st1, @pablogsal, @Windsooon, @tirkarthi, @cdonovick, @akulakov
PRs
  • bpo-35113: Fix inspect.getsource to return correct source for inner classes #10307
  • bpo-35113: Attempt to optimize inspect.getsource() for classes #19587
  • bpo-35113: inspect: clean up a duplicate import and comment #27073
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = None
    closed_at = <Date 2020-04-29.03:59:29.423>
    created_at = <Date 2018-10-30.12:32:57.221>
    labels = ['type-bug', 'library', '3.9']
    title = 'inspect.getsource returns incorrect source for classes when class definition is part of multiline strings'
    updated_at = <Date 2021-07-13.13:43:05.880>
    user = 'https://github.com/tirkarthi'

    bugs.python.org fields:

    activity = <Date 2021-07-13.13:43:05.880>
    actor = 'lukasz.langa'
    assignee = 'none'
    closed = True
    closed_date = <Date 2020-04-29.03:59:29.423>
    closer = 'xtreak'
    components = ['Library (Lib)']
    creation = <Date 2018-10-30.12:32:57.221>
    creator = 'xtreak'
    dependencies = []
    files = []
    hgrepos = []
    issue_num = 35113
    keywords = ['patch']
    message_count = 15.0
    messages = ['328909', '329097', '329112', '329180', '329181', '329182', '329183', '329187', '329202', '329203', '329204', '329205', '366727', '367620', '397399']
    nosy_count = 8.0
    nosy_names = ['lukasz.langa', 'serhiy.storchaka', 'yselivanov', 'pablogsal', 'Windson Yang', 'xtreak', 'donovick', 'andrei.avk']
    pr_nums = ['10307', '19587', '27073']
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue35113'
    versions = ['Python 3.9']

    @tirkarthi
    Copy link
    Member Author

    inspect.getsource uses inspect.findsource that uses a regex to check for class declaration and starts matching against the regex from the start of the program. When there is a match it checks for the first character to be 'c' to return the line number from which getblock is used to get the class definition. If there are multiple matches and there is no line with 'c' as the first character in the line then they are sorted based on the number of whitespaces preceding class keyword. This poses the following problem :

    1. When a class definition like string occurs as part of a multiline string before the actual definition then this causes the function to return the multiline string instead of actual definition. This occurs both while using the multiline string also as comment and also using them in variable definition in other places.

    2. When the class is defined inside another class and there is a similar multiline string inside another class which is indented along the same indentation level of the class then they are sorted by whitespace where they are equal and then followed by line number. Since the class definition occurs after the multiline string it causes the multiline string to be taken up as the source of the class.

    This was last changed in 89f507f (Dec 2006) which is also a merge commit. I searched for issues and relevant test cases but I couldn't find any in the test suite or issue tracker regarding the same. Hence I am filing a new issue.

    # backups/bpo35101.py

    import inspect
    
    class Bar:
        a = """
    class MultilineStringVariable:
        ...
    """
    
    class MultilineStringVariable:
    
        def foo(self):
            pass

    '''
    class MultilineStringComment:
    pass
    '''

    class MultilineStringComment:
    
        def foo(self):
            pass
    
    class NestedClass:
        a = '''
        class Spam:
            ...
        '''
    
    class Nested:
    
        class Spam:
            pass
    
    print(inspect.getsource(MultilineStringVariable))
    print(inspect.getsource(MultilineStringComment))
    print(inspect.getsource(Nested.Spam))

    # Incorrect results

    $ ./python.exe ../backups/bpo35101.py
    class MultilineStringVariable:
        ...
    class MultilineStringComment:
        pass
    
        class Spam:
            ...

    @tirkarthi tirkarthi added 3.7 (EOL) end of life 3.8 only security fixes stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error labels Oct 30, 2018
    @Windsooon
    Copy link
    Mannequin

    Windsooon mannequin commented Nov 2, 2018

    Yes, the code located at

    if isclass(object):

    I think to use a comment status var would be a solution, Do you have any idea to fix it?

    @tirkarthi
    Copy link
    Member Author

    I think parsing with regex especially on multiline cases like here is little tricky handling all cases. I find AST module to be a good fit. I tried using a callback where every class definition is taken up and then from that line to rest of the program getblock can be used to get the relevant class definition. This takes care of the cases I have listed and returns the correct source where we can filter out the class we need. Hence findsource can be modified to use ast instead of regex given that regex was introduced in 2006 and changing class syntax in future might need updating this regex. A sample program I tried but this is the first time I am using ast module and there might be better ways to filter the required classes. I will try raising a PR after some discussion to see if any better solution exists and more tests.

    # foo_ast.py

    import ast
    import inspect
    
    
    class MyClassVisitor(ast.NodeVisitor):
    
        def __init__(self, lines, *args, **kwargs):
            self.lines = lines
            super().__init__(*args, **kwargs)
    
        def visit_ClassDef(self, node):
            print('Found class "%s"' % node.name)
            print("Source")
            print(''.join(inspect.getblock(self.lines[node.lineno-1:])))
    
    with open('../backups/bpo35101_1.py') as f:
        source = f.read()
        lines = source.splitlines(True)
        tree = ast.parse(source)
        MyClassVisitor(lines).visit(tree)
    $ ./python.exe ../backups/foo_ast.py

    Found class "Bar"
    Source
    class Bar:
    a = """
    class MultilineStringVariable:
    ...
    """

    Found class "MultilineStringVariable"
    Source
    class MultilineStringVariable:

        def foo(self):
            pass

    Found class "MultilineStringComment"
    Source
    class MultilineStringComment:

        def foo(self):
            pass

    Found class "NestedClass"
    Source
    class NestedClass:
    a = '''
    class Spam:
    ...
    '''

    Found class "Nested"
    Source
    class Nested:

        class Spam:
            pass

    @tirkarthi
    Copy link
    Member Author

    I tried ast module and it passes my test cases but I found another case since object.__name__ returns unqualified name thus matching against nested class names might conflict with the ones on the top level. Example is below where there is Spam at the module level and NestedA.Spam which is defined inside another class. getsource(NestedA.Spam) returns the source for Spam which is also unexpected. It seems ast module also doesn't call visit_ClassDef on nested class definitions thus my approach also couldn't detect NestedA.Spam and returns source for Spam. I wish class objects also had some kind of code attribute similar to functions so that line number is attached. I don't know the internals so I might be wrong here. I am adding Yury.

    import inspect
    
    class Egg:
    
        def func(self):
            pass
    
    class NestedA:
    
        class Egg:
            pass
    
    
    print(inspect.getsource(NestedA.Egg))
    print(inspect.getsource(Egg))

    # Output

    class Egg:
    
        def func(self):
            pass
    
    class Egg:
    
        def func(self):
            pass

    @serhiy-storchaka
    Copy link
    Member

    Use __qualname__.

    @tirkarthi
    Copy link
    Member Author

    Thanks Serhiy, object.__qualname__ can be used to get qualified name of the object but ast nodes don't have qualified name for the node where I can match against it. node.name returns unqualified name and hence using ast module I can't differentiate between Spam and NestedA.Spam since both return 'Spam' for node.name . I am also using a recursive decorator found at https://stackoverflow.com/a/14661325/2610955 which solves problem iterating through nested classes. But for matching module level class object and nested class level object when I parse a syntax tree for Spam and NestedA.Spam both nodes for class definition return 'Spam'. Is there a way to get qualified names in ast module or do I need to store it somewhere? My current approach for inspect.findsource is as below :

    class ClassVisitor(ast.NodeVisitor):
    
        def recursive(func):
            """ decorator to make visitor work recursive """
            def wrapper(self, node):
                func(self, node)
                for child in ast.iter_child_nodes(node):
                    self.visit(child)
            return wrapper
    
        def __init__(self, source, name, *args, **kwargs):
            self.source = source
            self.line_number = None
            self.name = name
            super().__init__(*args, **kwargs)
    
        @recursive
        def visit_ClassDef(self, node):
            # Need to match qualified name to differentiate between Spam and NestedA.Spam
            if node.name == self.name:
                # decrement by one since lines list starts with indexing by zero
                self.line_number = node.lineno - 1
    
    name = object.__name__ # can use object.__qualname__ but node.name is not qualified
    source = ''.join(lines)
    tree = ast.parse(source)
    class_visitor = ClassVisitor(source, name)
    class_visitor.visit(tree)
    
    if class_visitor.line_number is not None:
        return lines, class_visitor.line_number
    else:
        raise OSError('could not find class definition')

    @serhiy-storchaka
    Copy link
    Member

    Use a stack of names. Every time when you enter a class and function nodes, push the current name on the stack, and pop it out after handling child nodes. Compare the qualified name with '.'.join(self.stack). Don't use the recursive decorator, it is not suitable for this, because you need to execute some code before and after handling child nodes. Just add a loop for child nodes in your handler.

    Yet one thing that you should take to account: decorators. In you should return the line number of the first decorator if they are used. This is easy.

    This is an interesting issue. I would take it if you do not already working on it. But I think it should be interesting and useful to you. Good practice. Once you get some working code, create a PR. I will suggest next steps.

    @tirkarthi
    Copy link
    Member Author

    Thanks Serhiy, I am going with the same approach too using a Stack. I am working on it and I have fixed the set initial cases I reported with tests. I just stumbled upon the nested definitions and inner classes. I will raise a PR by end of today with my progress so that more cases along with tests can be covered during PR review.

    @serhiy-storchaka
    Copy link
    Member

    Well, now the hardest problem remains.

    Sometimes there are more than one place where the class with the same __qualname__ is defined. See for example multiprocessing/heap.py:

    if sys.platform == 'win32':
        class Arena(object):
            ...
    else:
        class Arena(object):
            ...

    It is hard to determine the correct place. But the current code return the first line, while PR 10307 currently returns the last line.

    If the current behavior is more desirable, the code needs to stop searching after finding the first candidate. And the simplest way is to use exceptions:

    class ClassVisitor(ast.NodeVisitor):
        def visit_ClassDef(self, node):
            ...
            if found:
                raise StopIterator(line_number)
            ...
    ...
    try:
        class_visitor.visit(tree)
    except StopIterator as e:
        line_number = e.value
    else:
        raise OSError('could not find class definition')

    @tirkarthi
    Copy link
    Member Author

    Thanks Serhiy for the details. I think there are also less tests for inspect.findsource with respect to classes though it's more robust than the regex approach. Thus there might be different effects which are correct now or selects an alternate candidate as you have mentioned. I think this is worth adding a test case once the behavior is finalized.

    On a general note I am curious is there any patch to add line number directly to the class as an attribute? I think inspect module uses object.co_firstlineno where object is a code object and works for function object. Is there any technical difficulty in this? I searched the mailing list and PEPs but couldn't come across anything. Also there are couple of issues around showing source from REPL where it's not saved to the file bpo-33826, bpo-12920.

    There is one more frame pointer issue that is not working with class objects with a PR that fixes some more issues in findsource with respect to classes : bpo-35101.

    Thanks

    @serhiy-storchaka
    Copy link
    Member

    As for the line number for class, I have not heard anything about this. I think there is no large technical difficulty, but there is a little need. The first line number of the code object is needed first at all for tracing and debugging. Function has attached code object for executing, thus the line number is available for function. But the code object that creates a class is no longer needed after the class is created. Attaching the line number to the class looks like attaching the line number of the "[]" expression to the list created by it.

    @serhiy-storchaka
    Copy link
    Member

    Yury, could you please take a look?

    @tirkarthi
    Copy link
    Member Author

    New changeset 696136b by Karthikeyan Singaravelan in branch 'master':
    bpo-35113: Fix inspect.getsource to return correct source for inner classes (bpo-10307)
    696136b

    @tirkarthi
    Copy link
    Member Author

    Closing this as fixed with the enhancement to show decorator for classes too for 3.9. Thank you all for the help on this.

    @tirkarthi tirkarthi added 3.9 only security fixes and removed 3.7 (EOL) end of life 3.8 only security fixes labels Apr 29, 2020
    @ambv
    Copy link
    Contributor

    ambv commented Jul 13, 2021

    New changeset d4a5f0b by andrei kulakov in branch 'main':
    bpo-35113: clean up duplicate import and comment (bpo-27073)
    d4a5f0b

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    3.9 only security fixes stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error
    Projects
    None yet
    Development

    No branches or pull requests

    3 participants