classification
Title: Strange XPath search behavior of xml.etree.ElementTree.Element.find
Type: behavior Stage:
Components: Library (Lib) Versions: Python 3.8
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: christian.heimes, robpats, scoder
Priority: normal Keywords:

Created on 2021-01-11 17:12 by robpats, last changed 2021-01-15 01:53 by robpats.

Messages (3)
msg384851 - Author: (robpats) Date: 2021-01-11 17:12
Python 3.6.8 / 3.7.9 / 3.8.7

>>> import xml.etree.ElementTree
>>> e = xml.etree.ElementTree.fromstring('<html><div class="row"/><hr/><div/><hr/><div class="row"/><button/></html>')
>>> list(e)
[<Element 'div' at 0x00000000024CD220>, <Element 'hr' at 0x00000000024CD2C0>, <Element 'div' at 0x00000000024F90E0>, <Element 'hr' at 0x00000000024F9130>, <Element 'div' at 0x00000000024F9180>, <Element 'button' at 0x00000000024F91D0>]
>>> e.find("./div[1]")
<Element 'div' at 0x00000000024CD220>
>>> e.find("./div[2]")
<Element 'div' at 0x00000000024F90E0>
>>> e.find("./div[3]")
<Element 'div' at 0x00000000024F9180>
>>> e.find("./hr[1]")
<Element 'hr' at 0x00000000024CD2C0>
>>> e.find("./hr[2]")
<Element 'hr' at 0x00000000024F9130>



# The following differs from the XPath implementation in Firefox
# https://developer.mozilla.org/en-US/docs/Web/XPath/Snippets

>>> list(e.iterfind("./*"))
[<Element 'div' at 0x00000000024CD220>, <Element 'hr' at 0x00000000024CD2C0>, <Element 'div' at 0x00000000024F90E0>, <Element 'hr' at 0x00000000024F9130>, <Element 'div' at 0x00000000024F9180>, <Element 'button' at 0x00000000024F91D0>]
>>> e.find("./*[1]")
<Element 'div' at 0x00000000024CD220>
>>> e.find("./*[2]")
<Element 'div' at 0x00000000024F90E0>   <-- should be 'hr', same as e.find("./div[2]") instead of e[2]
>>> e.find("./*[3]")
<Element 'div' at 0x00000000024F9180>   <-- same as e.find("./div[3]") instead of e[3]
>>> e.find("./*[4]")


>>> list(e.iterfind("./*[@class='row']"))
[<Element 'div' at 0x00000000024CD220>, <Element 'div' at 0x00000000024F9180>]
>>> e.find("./*[@class='row'][1]")
<Element 'div' at 0x00000000024CD220>
>>> e.find("./*[@class='row'][2]")
>>> e.find("./*[@class='row'][3]")
<Element 'div' at 0x00000000024F9180>   <--- cannot find element at [2] but found at [3]
msg385011 - Author: Christian Heimes (christian.heimes) * (Python committer) Date: 2021-01-13 10:32
etree's find() method supports only a limited subset of XPath, see https://docs.python.org/3/library/xml.etree.elementtree.html#supported-xpath-syntax. e.find("./*[2]") seems to trigger undefined behavior: the documentation of the limited XPath syntax states that "position predicates must be preceded by a tag name".

lxml behaves the same. Its find() method returns the same value as ElementTree, while its xpath() method returns the value you expected:

>>> import lxml.etree
>>> e = lxml.etree.fromstring('<html><div class="row"/><hr/><div/><hr/><div class="row"/><button/></html>')
>>> e.find("./*[2]")
<Element div at 0x7fe4d777b6c0>
>>> e.xpath("./*[2]")
[<Element hr at 0x7fe4d777b2c0>]
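
When only the standard library is available, the full-XPath result can be approximated by indexing in Python instead of using a position predicate after a wildcard or another predicate. The following is a minimal sketch, not part of ElementTree: nth_child and nth_match are hypothetical helper names introduced here for illustration, and both assume 1-based positions as in XPath.

```python
import xml.etree.ElementTree as ET
from itertools import islice

root = ET.fromstring(
    '<html><div class="row"/><hr/><div/><hr/><div class="row"/><button/></html>'
)

def nth_child(parent, position):
    """Return the child at the 1-based position (like XPath ./*[n]), or None."""
    children = list(parent)
    if 1 <= position <= len(children):
        return children[position - 1]
    return None

def nth_match(parent, path, position):
    """Return the 1-based position-th element yielded by iterfind(path), or None."""
    if position < 1:
        return None
    return next(islice(parent.iterfind(path), position - 1, None), None)

print(nth_child(root, 2).tag)                     # hr
print(nth_match(root, "./*[@class='row']", 2).tag)  # div (the last 'row' div)
```

The path passed to nth_match contains no position predicate, so it stays within the documented subset; the position is applied afterwards in Python.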
msg385094 - Author: (robpats) Date: 2021-01-15 01:53
Thanks for the pointer. I didn't notice this paragraph.
xml.etree.ElementTree.Element.find currently returns None if the XPath expression is invalid or unsupported. I think it should also return None if a position predicate is not preceded by a tag name. It would be even better to emit a warning or raise an exception to indicate such errors.
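
Until something like that exists in the library, a caller-side check could look roughly like the sketch below. It is only a heuristic under stated assumptions: strict_find is a hypothetical wrapper, not an ElementTree API, and the regular expression merely looks for a position predicate ([N], [last()] or [last()-N]) that directly follows "*" or another predicate. It does not parse the path, so a matching sequence inside a quoted attribute value would be rejected as well.

```python
import re
import xml.etree.ElementTree as ET

# Heuristic: a position predicate directly after '*' or after a closing ']'.
_UNSUPPORTED_POSITION = re.compile(
    r"[*\]]\s*\[\s*(?:\d+|last\(\)(?:\s*-\s*\d+)?)\s*\]"
)

def strict_find(element, path, namespaces=None):
    """Like element.find(), but raise ValueError for position predicates
    that are not preceded by a tag name (outside the documented subset)."""
    if _UNSUPPORTED_POSITION.search(path):
        raise ValueError(
            "position predicate must be preceded by a tag name: %r" % path
        )
    return element.find(path, namespaces)

root = ET.fromstring(
    '<html><div class="row"/><hr/><div/><hr/><div class="row"/><button/></html>'
)

print(strict_find(root, "./div[2]") is root[2])  # True

try:
    strict_find(root, "./*[2]")
except ValueError as exc:
    print("rejected:", exc)
```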
History
Date                 User              Action  Args
2021-01-15 01:53:49  robpats           set     messages: + msg385094
2021-01-13 10:32:02  christian.heimes  set     nosy: + christian.heimes, scoder; messages: + msg385011
2021-01-11 17:12:45  robpats           create