This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

classification
Title: Faster ElementTree iterating
Type: performance Stage: resolved
Components: Extension Modules, XML Versions: Python 3.6
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: serhiy.storchaka Nosy List: brett.cannon, martin.panter, python-dev, scoder, serhiy.storchaka
Priority: normal Keywords: patch

Created on 2015-12-15 19:17 by serhiy.storchaka, last changed 2022-04-11 14:58 by admin. This issue is now closed.

Files
File name Uploaded Description Edit
etree_iter.patch serhiy.storchaka, 2015-12-15 19:17 review
etree_iter2.patch serhiy.storchaka, 2015-12-16 02:03 review
Messages (6)
msg256472 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2015-12-15 19:17
Proposed patch refactors and optimize the code for iterating ElementTree. Refactoring and minor optimization makes the code about 30 lines smaller and 2-3% faster. Using a continuous array instead of linked list makes the code up to 40% faster (lxml benchmark was used to measurement).

In addition the patch fixes possible issue with owning references.
msg256495 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2015-12-16 00:47
Looks technically correct as far as my knowledge of the malloc routines goes. What was the problem with references that you fixed? Maybe with parent_stack_push_new() failure?

The main reference counting bug that sticks out to me is with the text and tail elements, because element_get_text() etc return a borrowed reference. It might be good to fix this while you are at it:

>>> element = Element("tag")
>>> class Text:
...     def __bool__(self):
...         element.text = "changed"
...         return True
... 
>>> element.text = Text()
>>> i = element.itertext()
>>> next(i)
Segmentation fault (core dumped)
[Exit 139]
msg256502 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2015-12-16 02:02
Good catch Martin. I missed this case. Following patch fixes it.

Other bugs look similar. They presumably can be triggered when remove the element itself during executing the __next__() method. I have no a reproducer though.

I don't know wherever it is worth to fix these bugs in maintained releases.
msg256507 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2015-12-16 03:40
The new changes look good.

I see these kind of reference counting bugs as fairly low priority. I suspect there are plenty more in this module, at least Issue 24103 and Issue 24104.
msg256674 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2015-12-18 12:07
Thank you for your review Martin. I have opened a separate issue25902 for refcount bugs in iteration. Its patch is purposed for all versions, while this issue is for developer version only.
msg256795 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2015-12-21 10:44
New changeset 5a5d5268afd5 by Serhiy Storchaka in branch 'default':
Issue #25873: Optimized iterating ElementTree.
https://hg.python.org/cpython/rev/5a5d5268afd5
History
Date User Action Args
2022-04-11 14:58:25adminsetgithub: 70061
2015-12-21 11:51:17serhiy.storchakasetstatus: open -> closed
assignee: serhiy.storchaka
resolution: fixed
stage: patch review -> resolved
2015-12-21 10:44:27python-devsetnosy: + python-dev
messages: + msg256795
2015-12-18 12:07:02serhiy.storchakasetmessages: + msg256674
2015-12-16 03:40:55martin.pantersetmessages: + msg256507
2015-12-16 02:03:45serhiy.storchakasetfiles: + etree_iter2.patch
2015-12-16 02:03:29serhiy.storchakasetfiles: - etree_iter2.patch
2015-12-16 02:02:52serhiy.storchakasetfiles: + etree_iter2.patch

messages: + msg256502
2015-12-16 00:47:37martin.pantersetnosy: + martin.panter
messages: + msg256495
components: + XML
2015-12-15 19:17:03serhiy.storchakacreate