
Author scoder
Recipients amaury.forgeotdarc, doerwalter, lemburg, scoder, serhiy.storchaka
Date 2013-09-14.05:26:44
Message-id <1379136405.27.0.0528394382778.issue18059@psf.upfronthosting.co.za>
In-reply-to
Content
I don't think I have my head deep enough in the encodings implementation to say that this is the correct/best way to do it, but the patch looks mostly reasonable to me and would be a helpful addition.

I have two comments on the pyexpat_encoding_convert() function:

1) I can't see a safeguard against reading beyond the data buffer. What if s already points to the last byte and we try to read two or three bytes to decode them? I wouldn't be surprised if this kind of input can be crafted.

2) Creating a throw-away Unicode object through a named decoder looks like a huge overhead for decoding two bytes. It might be considered an optimisation to change that, but if you are really trying to parse a longer XML document with lots of Japanese text in it (i.e. if you actually *need* this feature), it will most likely end up being way too slow to make any real use of it.
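For point 1, a minimal sketch of the kind of check meant here, assuming the end of the input buffer is known at the call site. The function name, its parameters and its return convention are hypothetical and not taken from the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of the patch: combine `width` bytes
   starting at s into one value, refusing to read at or past `end`.
   Returns the combined byte value, or -1 if the sequence is truncated. */
static long
read_multibyte_checked(const unsigned char *s, const unsigned char *end,
                       size_t width)
{
    long value = 0;
    size_t i;

    assert(s != NULL && end != NULL && s <= end);
    assert(width >= 1 && width <= 3);

    /* The safeguard: bail out before touching any byte beyond the buffer. */
    if ((size_t)(end - s) < width)
        return -1;

    for (i = 0; i < width; i++)
        value = (value << 8) | s[i];
    return value;
}
```

With a three-byte buffer, asking for two bytes at the last position is rejected instead of reading one byte past the end.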

I think that both points should be addressed before this gets added.
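For point 2, one possible direction, sketched under the assumption that the mapping for two-byte sequences can be filled in once when the encoding is set up, so that each later conversion is a plain array lookup with no temporary object. The type and function names are hypothetical, and whether the memory cost of such a table is acceptable would need measuring.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical cache, not from the patch: the code point for each
   (lead byte, trail byte) pair; 0 marks an unmapped pair. */
typedef struct {
    unsigned int cp[256][256];
} two_byte_table;

/* Record one mapping; meant to be called only while building the table. */
static void
two_byte_table_set(two_byte_table *t, unsigned char lead,
                   unsigned char trail, unsigned int codepoint)
{
    assert(t != NULL && codepoint != 0);
    t->cp[lead][trail] = codepoint;
}

/* Look up the two bytes at s. The caller must guarantee that both
   bytes are readable. Returns the code point, or -1 if unmapped. */
static long
two_byte_table_lookup(const two_byte_table *t, const unsigned char *s)
{
    unsigned int codepoint;

    assert(t != NULL && s != NULL);
    codepoint = t->cp[s[0]][s[1]];
    return codepoint ? (long)codepoint : -1;
}
```

As an example, Shift_JIS encodes U+3042 as the bytes 0x82 0xA0, so after `two_byte_table_set(&t, 0x82, 0xA0, 0x3042)` a lookup on those two bytes yields 0x3042 without calling a decoder.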
History
Date User Action Args
2013-09-14 05:26:45 scoder set recipients: + scoder, lemburg, doerwalter, amaury.forgeotdarc, serhiy.storchaka
2013-09-14 05:26:45 scoder set messageid: <1379136405.27.0.0528394382778.issue18059@psf.upfronthosting.co.za>
2013-09-14 05:26:45 scoder link issue18059 messages
2013-09-14 05:26:44 scoder create