classification
Title: Data truncation in expat parser
Type: behavior Stage: needs patch
Components: Documentation Versions: Python 3.1, Python 3.2, Python 2.7
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: docs@python Nosy List: Maciek.J, docs@python, eric.araujo, fdrake, r.david.murray
Priority: normal Keywords: patch

Created on 2010-10-20 01:43 by Maciek.J, last changed 2011-08-19 12:58 by fdrake.

Files
File name | Uploaded | Description
pyxml_error.zip | Maciek.J, 2010-10-20 01:43 |
xml-parse-revisions.py | r.david.murray, 2010-10-20 11:54 |
pyexpat.rst.patch | Maciek.J, 2010-10-22 00:45 | Patch for docs
pyexpat.rst.patch | Maciek.J, 2010-11-13 23:31 |
pyexpat.rst.patch | eric.araujo, 2011-08-19 12:27 |
Messages (10)
msg119184 - (view) Author: Maciek J (Maciek.J) Date: 2010-10-20 01:43
Not sure if this is a Python problem or an expat problem, but I get truncated data while parsing XML documents.

This particular project parses an XML file from a Wikipedia dump.

The attached files are:
* xml-parse-revisions.py - parser script
* revision-test.xml - input XML
* revision-test.xml.sql - output XML
* revision_create.sql - not really needed for this test case, but attached for completeness

Notice that some "timestamp" values in the output file are too short. Also note that if you add whitespace to the input XML, different timestamps get truncated.

My Python is 2.6.6.
msg119202 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2010-10-20 11:54
For other reviewers, I'm reposting just his python program as a text file.

Maciek: I myself don't know enough about expat to comment, but is it possible you have an issue similar to issue 10026?
msg119229 - (view) Author: Maciek J (Maciek.J) Date: 2010-10-20 18:05
Hm... It turns out that there is a "buffer_text" attribute:
http://docs.python.org/library/pyexpat.html#xml.parsers.expat.xmlparser.buffer_text
And setting this attribute to "True" seems to solve the problem.
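
For readers who want to see the effect, here is a minimal sketch (not the attached script). The sample element and the helper name collect_text are made up for illustration. It parses the same input twice and records each string passed to CharacterDataHandler, once without and once with buffer_text.

```python
import xml.parsers.expat


def collect_text(xml_bytes, buffer_text):
    """Return every string passed to CharacterDataHandler, in order."""
    chunks = []
    parser = xml.parsers.expat.ParserCreate()
    parser.buffer_text = buffer_text
    parser.CharacterDataHandler = chunks.append
    parser.Parse(xml_bytes, True)  # True: this is the final piece of input
    return chunks


# Hypothetical sample input; the entity reference is one place where
# Expat commonly splits character data.
SAMPLE = b"<comment>fixed typo &amp; added link</comment>"

unbuffered = collect_text(SAMPLE, False)
buffered = collect_text(SAMPLE, True)

print("buffer_text=False:", unbuffered)
print("buffer_text=True: ", buffered)
```

With buffer_text left at False the text may arrive in several pieces; with it set to True, text this short and uninterrupted by markup arrives in one call when the document is passed in a single Parse() call.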

It solves my problem, but the docs are still very confusing. I see two things that should be fixed:
1. The CharacterDataHandler description should explicitly note that data may be chunked even if it is short(!).
2. The description of the buffer_text attribute should note that data may also be arbitrarily chunked if it is set to False. My data _was_not_ chunked at newline characters (as the description suggests). It was chunked in the middle of a sentence (there was no whitespace in it!).
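
To illustrate both points, here is a sketch of a handler set that does not depend on where the chunks are broken. The function name extract_timestamps and the sample document are hypothetical, and it uses Python 3 syntax (nonlocal). It collects every chunk seen inside a timestamp element and joins them only when the end tag arrives.

```python
import xml.parsers.expat


def extract_timestamps(pieces):
    """Parse XML given as a list of byte strings; return all <timestamp> texts."""
    timestamps = []
    current = None  # list of chunks while inside <timestamp>, otherwise None

    def start(name, attrs):
        nonlocal current
        if name == "timestamp":
            current = []

    def chars(data):
        # May be called several times for one element's text.
        if current is not None:
            current.append(data)

    def end(name):
        nonlocal current
        if name == "timestamp":
            timestamps.append("".join(current))
            current = None

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = start
    parser.CharacterDataHandler = chars
    parser.EndElementHandler = end
    for piece in pieces:
        parser.Parse(piece, False)
    parser.Parse(b"", True)  # signal end of input
    return timestamps


# Hypothetical input, deliberately fed in two pieces that split the timestamp.
doc = b"<page><revision><timestamp>2010-10-20T01:43:19Z</timestamp></revision></page>"
split_at = doc.index(b"T01")
result = extract_timestamps([doc[:split_at], doc[split_at:]])
print(result)
```

Because the text is assembled at the end tag, the result is the same whether the handler is called once or many times for a given element.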
msg119342 - (view) Author: Éric Araujo (eric.araujo) * (Python committer) Date: 2010-10-21 22:02
Would you like to turn your suggestions (+ hinting at buffer_text someplace) into a patch for Doc/library/pyexpat.rst?
msg119357 - (view) Author: Maciek J (Maciek.J) Date: 2010-10-22 00:45
I'm not familiar with the rst format, but I hope this works.
msg121005 - (view) Author: Éric Araujo (eric.araujo) * (Python committer) Date: 2010-11-12 01:42
Thanks for the patch.  There are a few typos (pices, recive) and markup glitches, which you can fix if you’d like to learn more about the markup, or else leave to someone else.  Those glitches are: bad indentation, missing blank line to make a new paragraph, text in backquotes without a :role: (or double backquotes for False).  From a checkout, run “make html” to see the result.
msg121006 - (view) Author: Éric Araujo (eric.araujo) * (Python committer) Date: 2010-11-12 01:43
Also, s/receive few calls/receive more than one call/ (clearer IMO).
msg121161 - (view) Author: Maciek J (Maciek.J) Date: 2010-11-13 23:31
Couldn't compile to html at the moment, but it should be fine anyway.

Note that I didn't want to start a new paragraph (I'm guessing you meant the sentence at line 13 of the patch), as there was no new paragraph in the previous version.
msg142429 - (view) Author: Éric Araujo (eric.araujo) * (Python committer) Date: 2011-08-19 12:27
I was about to commit an edited version of your patch (attached) but then I thought we should check whether this isn’t really a bug.  I just don’t see why expat would chunk without paying heed to the newlines if it is supposed to chunk at newlines.
msg142439 - (view) Author: Fred L. Drake, Jr. (fdrake) (Python committer) Date: 2011-08-19 12:58
Chunking of the data is expected with Expat.  There are no promises about *where* chunks are broken; the underlying behavior will break at line endings, but is not limited to that.

Setting buffer_text informs the Python wrapper that it's allowed to combine the chunks reported by the Expat library; this was made optional since it could affect working applications (changing the default with the move to Python 3 may have been acceptable, though).
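
As a rough pure-Python analogy of that combining step (an illustration only, not the actual pyexpat implementation; the class name CoalescingText is hypothetical), a wrapper can hold chunks back and hand them on as one string whenever some other event arrives:

```python
import xml.parsers.expat


class CoalescingText:
    """Hypothetical helper: merge adjacent text chunks before passing them on."""

    def __init__(self, parser, on_text):
        self._pending = []
        self._on_text = on_text
        parser.CharacterDataHandler = self._pending.append
        parser.StartElementHandler = self._start
        parser.EndElementHandler = self._end

    def _flush(self):
        # Deliver everything collected so far as a single string.
        if self._pending:
            self._on_text("".join(self._pending))
            self._pending.clear()

    def _start(self, name, attrs):
        self._flush()

    def _end(self, name):
        self._flush()


received = []
parser = xml.parsers.expat.ParserCreate()
CoalescingText(parser, received.append)
parser.Parse(b"<p>line one\nline two &amp; more<b>bold</b>tail</p>", True)
print(received)
```

However Expat breaks up the text between two tags, the callback in this sketch sees it as one string per run of text.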
History
Date User Action Args
2011-08-19 12:58:55 | fdrake | set | nosy: + fdrake; messages: + msg142439
2011-08-19 12:27:36 | eric.araujo | set | files: + pyexpat.rst.patch; messages: + msg142429
2010-11-13 23:31:27 | Maciek.J | set | files: + pyexpat.rst.patch; messages: + msg121161
2010-11-12 01:43:00 | eric.araujo | set | messages: + msg121006
2010-11-12 01:42:08 | eric.araujo | set | messages: + msg121005
2010-10-22 00:45:25 | Maciek.J | set | files: + pyexpat.rst.patch; keywords: + patch; messages: + msg119357
2010-10-21 22:02:20 | eric.araujo | set | nosy: + eric.araujo; messages: + msg119342
2010-10-20 18:45:40 | r.david.murray | set | nosy: + docs@python; versions: + Python 3.1, Python 2.7, Python 3.2, - Python 2.6; assignee: docs@python; components: + Documentation, - XML; type: behavior; stage: needs patch
2010-10-20 18:05:27 | Maciek.J | set | messages: + msg119229
2010-10-20 11:54:32 | r.david.murray | set | files: + xml-parse-revisions.py; nosy: + r.david.murray; messages: + msg119202
2010-10-20 01:43:19 | Maciek.J | create |