New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
xml.etree.ElementTree.Element is no longer pickleable #60280
Comments
The xml.etree.ElementTree.Element class is no longer pickleable in Python 3.3.0rc3 . This is a regression from Python 3.2 where the Element class was pickleable, while the xml.etree.cElementTree.Element was not. So this is probably related to switching the ElementTree implementation in Python 3.3. I've looked at the "what's new" documentation in Python 3.3, and there was nothing that indicated that the ElemenTree switchover would give this kind of issues. I've run into this issues because I use the multiprocessing module to pass parsed XML documents between processes, and when I can't pickle Element objects. This stops working. You can reproduce the issue with the following code: print(pickle.dumps(Element("foo"))) |
In 3.2 repr(xml.etree.ElementTree.Element) is "<class 'xml.etree.ElementTree.Element'>". |
Will look into it, thanks for the report. |
This is because, in Python 3.3, xml.etree.ElementTree.Element is overridden by the C accelerator version, _elementtree.Element. If you remove the statement, 1708 from _elementtree import * the behavior from Python 3.2 comes back. |
OTOH, xml.etree.cElementTree.Element in Python 3.2 and earlier has never been pickleable, either. Python 3.2.3+ (3.2:24499eebbc2f, Oct 10 2012, 13:54:45)
[GCC 4.6.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from xml.etree.cElementTree import Element
>>> repr(Element)
'<built-in function Element>'
>>> import pickle
>>> pickle.dumps(Element('foo'))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
_pickle.PicklingError: Can't pickle <class 'Element'>: attribute lookup builtins.Element failed |
Attached patch for your consideration. I've tested pickling/unpickling and comparing the resulting object attribute by attribute (.tag, .attrib, .text, .tail for equality; and recursively .getchildren()), and 'make test' --- all seems to work. If the approach is sound, I can submit a revised patch that also updates documentation, regression tests, and potententially updates other type objects in the same manner as this patch updates _elementtree.Element . |
TreeBuilder was also pickleable in 3.2 but now isn't so; updating summary accordingly. Attaching a checkpoint patch. It addresses both Element and TreeBuilder, and adds tests. The added tests fail if test___all__ is included in the test run, so the patch should not be applied as-is. Thanks to Ezio Melotti for preliminary review and suggestions on IRC. |
Reattaching without the unrelated Python-ast.c change. |
Thanks for working on it, Daniel. Unfortunately I won't have time to look at it in the near future, but I will definitely look at the patch once I get some free time to hack on Python again. |
@eli, thanks. In the meantime, attaching a patch addressing Ezio's review; the caveats from msg 177134 still apply. |
Haven't had a chance to look at the test___all__ -related failures yet. Still planning to do so once I have some time. |
Re the problem from msg 177134 ('./python -mtest test___all__ test_xml_etree' failing), it is a a preexisting problem, in the sense that applying just the Lib/test/ part of this patch suffices to trigger it. Adding 'xml.etree' to the blacklist in test___all__.py:50 suffices to address the problem. At a guess that is because when test___all__ imports the 'xml.etree' module, it doesn't filter out the '_elementtree' symbol like test_xml_etree.py:test_main() does. New patch attached; differences to previous version:
|
Daniel, is your patch made vs. the 3.3 branch? I'll need to apply there first, and then merge up to default (3.4). [Also, removing the 3.2 tag here. 3.2 won't be fixed to make _elementtree pickleable - it never was]. |
I wrote the patch against default (3.4), but it applies cleanly to 3.3. |
Also, could you explain what makes test___all__ start failing with this patch? What are you adding that makes that happen? P.S. I suspect the root reason is the bad way etree tests are structured in general. See bpo-15083 |
Any attempt to pickle an Element object in test_xml_etree.py causes that test to fail if test___all__ had been run before it; see attached transcript. It's against 3.4, with no changes other than the testsuite changes shown within. I don't know what the root cause is. As noted before, adding 'xml.etree' to test___all__.AllTest.test_all.blacklist is a workaround, so I assume the root cause lies within the test framework --- a bad interaction between the import magics in test___all__ and test_xml_etree. |
I think I understand what's going on there. Pickle, to make sure an object can be picked, looks at the module it comes from and tries to import its class, to make sure they're the same. What happens since 3.3 is that for the Python Element class, it imports ElementTree. The harness in test_xml_etree carefully sets up to ignore _elementtree so the correct class is pulled. However, if test___all__ runs before it it just does a brute "from ElementTree import *" which places the C version in sys.modules, and this is what pickle finds. So in the meantime, until bpo-15083 is resolved I think it's safe to put ET in the ignore list of test___all__. |
New changeset 71508fc738bb by Eli Bendersky in branch '3.3': New changeset 5a38f4d7833c by Eli Bendersky in branch 'default': |
I've added some (currently pyET specific) pickling tests. Daniel, could you re-generate the patch? Note that for the C version of pickling you can now enable the pickle test. Point to consider - can elements pickled in C be unpickled in Python and vice versa? It should probably work (see test_decimal's pickling tests). |
Attached. Differences to previous version:
|
BTW, maybe I'm just confused, but I'm a little surprised that C pickling -> Python unpickling works: the C __reduce__ returns a 3-element tuple, which according to pickle docs[1] must either be a dictionary or be passed to __setstate__; it's not a dictionary and the Python implementation doesn't have a __setstate__. http://docs.python.org/3/library/pickle#pickle.object.\_\_reduce__ |
A couple of high-level questions about the C code (this isn't a detailed review yet):
|
Eli Bendersky wrote on Tue, Jan 01, 2013 at 15:54:00 +0000:
With unmodified tip of 3.4: >>> import pickle, xml.etree.ElementTree as ET
>>> pickle.dumps(ET.Element('foo'))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
_pickle.PicklingError: Can't pickle <class 'Element'>: attribute lookup builtins.Element failed I added the "_elementtree" to the tp_name in order to bypass the above
That makes sense. But going forward it might be even better to define |
Eli Bendersky wrote on Tue, Jan 01, 2013 at 00:32:51 +0000:
Maybe I was simply imitating what some other extension module was doing ;) By using __reduce__ with the type as first return value, the
It seems there might be room for code reuse --- both functions set |
Also, the class inheritance in the tests should be amended to follow bpo-16835. |
On Tue, Jan 1, 2013 at 2:56 PM, Daniel Shahaf <report@bugs.python.org>wrote: >
> Daniel Shahaf added the comment:
>
> Eli Bendersky wrote on Tue, Jan 01, 2013 at 15:54:00 +0000:
> > Why did you change the class name, by the way, I don't think it's
> > a valid change at least for 3.3 in terms of backwards compatibility.
> >
>
> With unmodified tip of 3.4:
>
> >>> import pickle, xml.etree.ElementTree as ET
> >>> pickle.dumps(ET.Element('foo'))
> Traceback (most recent call last):
> File "<stdin>", line 1, in <module>
> _pickle.PicklingError: Can't pickle <class 'Element'>: attribute
> lookup builtins.Element failed
>
> I added the "_elementtree" to the tp_name in order to bypass the above
> error. Module-qualified names were in use elsewhere (including by
> _elementtree._element_iterator) so it seemed reasonable. I'll defer to
> you about compatibility implications of this change. I asked on pydev, but this is a key point to resolve. Can If this change is required (even if we choose to name it Danial, could you investigate if such a change is absolutely required to |
Eli Bendersky wrote on Thu, Jan 03, 2013 at 14:44:02 +0000:
Is there a requirement that it loads a particular module? Would etree
Yes, I'll look into that. |
Daniel Shahaf <report@bugs.python.org> wrote:
Yes: If you send an "_elementtree.Element" pickle to another machine that |
On Jan 4, 2013 2:09 AM, "Stefan Krah" <report@bugs.python.org> wrote:
Yes this is a good point. Another thing to consider is that if both report
|
Eli Bendersky wrote on Thu, Jan 03, 2013 at 14:44:02 +0000:
There are a few options:
I haven't tried to implement either of these approaches. |
I think that changing the _elementtree's Element's name to match the Python version, and then making sure the same serialized object is returned - is a worthy option to try. Then it will hopefully "just work". When pickle deserializes a user-defined object that says it's a xml.etree.ElementTree.Element, it will try to import Element from xml.etree.ElementTree and should get the accelerated class. On machines without acceleration it will get the Python class. |
In the meantime, unrelated question: should TreeBuilder be pickleable too, and interchangeably (pickle|unpickle)able between the Python and C implementations? Asking because the Python TreeBuilder has a _factory member which it initializes to 'Element', and I'm not quite sure how cross-pickling should handle that. |
New iteration. Open issues:
Differences to previous version:
|
Daniel, thanks for this patch, it looks very good. I had some comments in the code review. As for TreeBuilder, let's split it into a separate issue - I think it's much less critical. We can discuss it later once we're done with Element. If anything, it's more important to handle ElementTree itself first - it should be very easy (much simpler than Element). With the factory method I'm not sure how the Py pickling happens. I need to dig a bit in pickling for that, because I don't know how method reference pickling works. |
P = import_fresh_module('xml.etree.ElementTree', blocked=['_elementtree'])
tb = P.TreeBuilder(element_factory=lambda a, b: [a, b])
print(pickle.dumps(tb)) Gives: _pickle.PicklingError: Can't pickle <class 'function'>: attribute lookup builtins.function failed |
Eli Bendersky wrote on Tue, Jan 08, 2013 at 15:00:42 +0000:
Is that with or without the patch? |
On Tue, Jan 8, 2013 at 2:51 PM, Daniel Shahaf <report@bugs.python.org>wrote:
Without. Pickle can't handle functions that are not top-level, so bound |
Dissociating TreeBuilder from this issue per recent comments. |
v8 removes TreeBuilder handling (code + tests) and removes memcpy per review. (Outstanding issues in the review can be fixed post-v8, if needed.) |
New changeset 8d6dadfecf22 by Eli Bendersky in branch '3.3': New changeset 4c268b7c86e6 by Eli Bendersky in branch 'default': |
Fixed in 3.3 and default. Thanks for the good work, Daniel. A separate issue can be opened for TreeBuilder. |
New changeset c46054b49b6c by Eli Bendersky in branch '3.3': |
The fix introduced some refleaks: One seems to be in __getstate__:
diff --git a/Modules/_elementtree.c b/Modules/_elementtree.c
--- a/Modules/_elementtree.c
+++ b/Modules/_elementtree.c
@@ -859,8 +859,10 @@
PICKLED_ATTRIB, self->extra->attrib,
PICKLED_TEXT, self->text,
PICKLED_TAIL, self->tail);
- if (instancedict)
+ if (instancedict) {
+ Py_DECREF(children);
return instancedict;
+ }
else {
for (i = 0; i < PyList_GET_SIZE(children); i++)
Py_DECREF(PyList_GET_ITEM(children, i)); I'm still looking for the other. |
This is what I found out. Original leaky code (test_xml_etree_c leaked [56, 56] references, sum=112):
>>> from xml.etree import ElementTree as ET; e = ET.Element('foo'); SAMPLE_XML = "<body><tag class='a'>text</tag><tag class='b' /><section><tag class='b' id='inner'>subtext</tag></section></body>"; e2 = ET.XML(SAMPLE_XML); SAMPLE_XML = "<body><tag class='a'>text</tag><tag class='b' /><tag class='b' /><tag class='b' /><section><tag class='b' id='inner'>subtext</tag></section></body>"; e3 = ET.XML(SAMPLE_XML)
[76773 refs]
>>> ### e has no children and leaks 1 ref:
>>> e.__getstate__()
{'tag': 'foo', 'attrib': {}, 'text': None, 'tail': None, '_children': []}
[76791 refs]
>>> e.__getstate__()
{'tag': 'foo', 'attrib': {}, 'text': None, 'tail': None, '_children': []}
[76792 refs]
>>> e.__getstate__()
{'tag': 'foo', 'attrib': {}, 'text': None, 'tail': None, '_children': []}
[76793 refs]
>>> ### e2 has 3 children and leaks 4 refs:
>>> e2.__getstate__()
{'tag': 'body', 'attrib': {}, 'text': None, 'tail': None, '_children': [<Element 'tag' at 0xb735cef4>, <Element 'tag' at 0xb7368034>, <Element 'section' at 0xb73688f4>]}
[76798 refs]
>>> e2.__getstate__()
{'tag': 'body', 'attrib': {}, 'text': None, 'tail': None, '_children': [<Element 'tag' at 0xb735cef4>, <Element 'tag' at 0xb7368034>, <Element 'section' at 0xb73688f4>]}
[76802 refs] The leaked refs seems to be 1 for the children list + 1 for each children. With that patch applied we go down to test_xml_etree_c leaked [6, 6] references, sum=12.
The remaining leaks seem to be in __setstate__.
Patched code:
>>> from xml.etree import ElementTree as ET; e = ET.Element('foo'); SAMPLE_XML = "<body><tag class='a'>text</tag><tag class='b' /><section><tag class='b' id='inner'>subtext</tag></section></body>"; e2 = ET.XML(SAMPLE_XML); SAMPLE_XML = "<body><tag class='a'>text</tag><tag class='b' /><tag class='b' /><tag class='b' /><section><tag class='b' id='inner'>subtext</tag></section></body>"; e3 = ET.XML(SAMPLE_XML)
[76773 refs]
>>> ### no more leaks for getstate:
>>> p = e.__getstate__()
[76787 refs]
>>> p = e.__getstate__()
[76787 refs]
>>> ### also no leaks when there are no child:
>>> e.__setstate__(p)
[76788 refs]
>>> e.__setstate__(p)
[76788 refs]
>>> ### no more leaks for getstate with children:
>>> p2 = e2.__getstate__()
[76807 refs]
>>> p2 = e2.__getstate__()
[76807 refs]
>>> ### one ref leaked for every child in __setstate__:
>>> e2.__setstate__(p2)
[76810 refs]
>>> e2.__setstate__(p2)
[76813 refs]
>>> e2.__setstate__(p2)
[76816 refs] I'm not working on this anymore now, so someone more familiar with the code can take a look, see if my patch is correct, and fix the remaining leaks.
|
Ezio Melotti wrote on Fri, Jan 11, 2013 at 08:44:43 +0000:
> >>> ### one ref leaked for every child in __setstate__:
> >>> e2.__setstate__(p2)
> [76810 refs]
> >>> e2.__setstate__(p2)
> [76813 refs]
> >>> e2.__setstate__(p2)
> [76816 refs]
>
> I'm not working on this anymore now, so someone more familiar with the
> code can take a look, see if my patch is correct, and fix the
> remaining leaks.
> Thank for narrowing it down so much. Attached patch incorporates your |
New changeset 5b36768b9a11 by Eli Bendersky in branch '3.3': New changeset 848738d3c40f by Eli Bendersky in branch 'default': |
New changeset 4501813ea676 by Eli Bendersky in branch '3.3': New changeset 7313096e0bad by Eli Bendersky in branch 'default': |
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
Show more details
GitHub fields:
bugs.python.org fields:
The text was updated successfully, but these errors were encountered: