
Author sping
Recipients StyXman, sping
Date 2022-01-26.21:32:55
Message-id <1643232775.76.0.358495635944.issue38487@roundup.psfhosted.org>
In-reply-to
Content
Hi StyXman,

I had a closer look at the files you shared, thanks for those, very helpful!

What I found is that expat_test.py uses a single scalar variable
(_DictSAXHandler.parser) to keep track of the current parser, while it would
need a stack to allow recursion.  In a way, the current approach is equivalent
to pushing onto the stack as expected but never popping again.
Once I make the code use a stack, the loop goes away.  I'm pasting the patch
inline below (indented by two spaces throughout).
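
To illustrate the scalar-versus-stack point in isolation, here is a minimal,
self-contained sketch.  It is not the expat_test.py code: the in-memory
documents (DOCS, MAIN) and the names StackHandler and parse_with_stack are
made up for the example, and the handler only records element names.

```python
import xml.parsers.expat as expat

# Hypothetical in-memory documents standing in for files on disk.
DOCS = {
    "a.xml": b"<a>&b;</a>",  # references entity b from inside entity a
    "b.xml": b"<b/>",
}
MAIN = (
    b"<!DOCTYPE root ["
    b'<!ENTITY a SYSTEM "a.xml">'
    b'<!ENTITY b SYSTEM "b.xml">'
    b"]><root>&a;&b;</root>"
)


class StackHandler:
    def __init__(self):
        self._parsers = []  # innermost active parser is last
        self.seen = []      # element names in document order

    def start_element(self, name, attrs):
        self.seen.append(name)

    def external_entity_ref(self, context, base, system_id, public_id):
        # Always create the child from the parser that is active right now.
        child = self._parsers[-1].ExternalEntityParserCreate(context)
        child.StartElementHandler = self.start_element
        child.ExternalEntityRefHandler = self.external_entity_ref
        self._parsers.append(child)
        try:
            child.Parse(DOCS[system_id], True)
        finally:
            # Hand control back to the enclosing parser.
            self._parsers.pop()
        return 1  # tell expat the entity was handled


def parse_with_stack(data):
    handler = StackHandler()
    parser = expat.ParserCreate()
    parser.StartElementHandler = handler.start_element
    parser.ExternalEntityRefHandler = handler.external_entity_ref
    handler._parsers.append(parser)
    parser.Parse(data, True)
    return handler.seen
```

In MAIN, the second &b; is referenced from the root document after the nested
a.xml/b.xml parsers have finished.  With a single scalar, the handler would
still point at the finished b.xml parser at that moment; with the stack, the
top entry is the root parser again.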

While debugging, I used these commands to compare internal libexpat behavior
between pyexpat and xmlwf; they may be of interest:

  EXPAT_ACCOUNTING_DEBUG=2 python expat_test.py |& sed 's,0x[0-9a-f]\+,XXX,' | tee pyexpat.txt

  EXPAT_ACCOUNTING_DEBUG=2 xmlwf -x test1.xml |& sed 's,0x[0-9a-f]\+,XXX,' | tee xmlwf.txt

  diff -u xmlwf.txt pyexpat.txt

Here's how I quick-fixed expat_test.py to make things work:

  # diff -u expat_test.py_ORIG expat_test.py
  --- expat_test.py_ORIG  2022-01-26 21:15:27.506458671 +0100
  +++ expat_test.py       2022-01-26 22:15:08.741384932 +0100
  @@ -7,11 +7,21 @@
   
       parser.ExternalEntityRefHandler = handler.externalEntityRef
   
  -    # store the parser in the handler so we can recurse
  -    handler.parser = parser
  -
   
   class _DictSAXHandler(object):
  +    def __init__(self):
  +        self._parsers = []
  +        
  +    def push_parser(self, parser):
  +        self._parsers.append(parser)
  +    
  +    def pop_parser(self):
  +        self._parsers.pop()
  +
  +    @property
  +    def parser(self):
  +        return self._parsers[-1]
  +
       def externalEntityRef(self, context, base, sysId, pubId):
           print(context, base, sysId, pubId)
           external_parser = self.parser.ExternalEntityParserCreate(context)
  @@ -19,7 +29,9 @@
           setup_parser(external_parser, self)
           f = open(sysId, 'rb')
           print(f)
  +        self.push_parser(external_parser)
           external_parser.ParseFile(f)
  +        self.pop_parser()
           print(f)
   
           # all OK
  @@ -36,12 +48,13 @@
           namespace_separator
       )
       setup_parser(parser, handler)
  +    handler.push_parser(parser)
   
       if hasattr(xml_input, 'read'):
           parser.ParseFile(xml_input)
       else:
           parser.Parse(xml_input, True)
  -    return handler.item
  +    # return handler.item  # there is no .item
   
   
   parse(open('test1.xml', 'rb'))
   
What do you think?

PS: Please note that processing external entities has security implications
    (see https://en.wikipedia.org/wiki/XML_external_entity_attack).
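
    One possible mitigation, sketched here under assumptions, is to only open
    system ids that resolve to a location inside a directory you trust.  The
    function name resolve_entity_path and the parameter allowed_dir are
    hypothetical; this is an illustration, not a complete defense.

```python
from pathlib import Path


def resolve_entity_path(sys_id, allowed_dir):
    """Return the resolved path for sys_id, or raise ValueError if it
    would fall outside allowed_dir (hypothetical helper)."""
    base = Path(allowed_dir).resolve()
    # Joining with an absolute sys_id discards base, so that case is
    # rejected by the containment check below as well.
    candidate = (base / sys_id).resolve()
    if base not in candidate.parents:
        raise ValueError("external entity outside allowed directory: %r" % sys_id)
    return candidate
```

    In a handler like externalEntityRef, the idea would be to call this before
    open() and to return 0 (or raise) when it rejects the system id.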

Best, Sebastian