Message 100907 - Python tracker


Author	effbot
Recipients	effbot, flox, georg.brandl, gvanrossum, r.david.murray, scoder
Date	2010-03-12.09:00:45
Message-id	<1268384448.06.0.254820444753.issue8047@psf.upfronthosting.co.za>
In-reply-to

Content

"'None' has always been the documented default for the encoding parameter"

That's probably mostly by accident, at least in the original ET, but the 1.3 draft docs at effbot.org/elementtree do spell it out explicitly for the 'write' method:

   Output encoding. If omitted or set to None, defaults to US-ASCII.

Not sure I'd consider this text binding in itself, though (even if I'd argue that it's preferable to have the same interpretation of encoding everywhere).
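
For illustration, a minimal sketch of the documented 'write' default quoted above. It assumes the xml.etree.ElementTree module as it behaves in current Python 3 releases, which is not necessarily the state of the code when this message was written. The io.BytesIO objects stand in for files opened in binary mode, and the sample document is made up.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample document containing one non-ASCII character.
tree = ET.ElementTree(ET.fromstring("<doc>caf\u00e9</doc>"))

# Encoding omitted: the output falls back to US-ASCII.
default_out = io.BytesIO()
tree.write(default_out)

# Encoding explicitly set to None: same interpretation as omitting it.
none_out = io.BytesIO()
tree.write(none_out, encoding=None)

# Both calls produce identical bytes; the non-ASCII character is
# emitted as a numeric character reference, so the output is pure ASCII.
print(default_out.getvalue())
print(default_out.getvalue() == none_out.getvalue())
```

Because the output contains only ASCII bytes, it stays well-formed XML without an encoding declaration.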

"writing out the Unicode serialisation will result in an incorrect XML serialisation"

I think Guido meant the ElementTree.write method; is that broken too?

The file.write(et.tostring()) issue is probably my most pressing concern here; that's a common use case (e.g. when using "iterparse" to cut pieces from a big document), and the defaults were chosen to increase the chance that this automatically does the right thing for non-ASCII data even if the programmer never tests it. In 3.X, that construct is suddenly dependent on the interpreter's default encoding.
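
A minimal sketch of the file.write(et.tostring()) pattern, assuming the behaviour of xml.etree.ElementTree in current Python 3 releases (bytes by default, str only when encoding="unicode" is requested). The sample element and the in-memory buffers are made up for illustration; they stand in for a fragment cut out with "iterparse" and for real files.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical fragment, standing in for a piece cut from a big document.
root = ET.fromstring("<doc><item>caf\u00e9</item></doc>")
item = root.find("item")

# Default call: a bytes result in US-ASCII, with non-ASCII characters
# written as character references. Safe to write to a binary file.
as_bytes = ET.tostring(item)
binary_file = io.BytesIO()          # stand-in for open(path, "wb")
binary_file.write(as_bytes)

# Explicit text serialisation: a str result. Writing this to a text-mode
# file makes the bytes on disk depend on that file's encoding.
as_text = ET.tostring(item, encoding="unicode")
text_file = io.StringIO()           # stand-in for open(path, "w")
text_file.write(as_text)

print(type(as_bytes).__name__, as_bytes)
print(type(as_text).__name__, as_text)
```

The bytes form carries no dependency on any default encoding, which is the property the original defaults were meant to guarantee.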

I think I'd prefer old "tostring" behaviour and a separate "tounicode" function, and I'm still not convinced that the latter is required for the XML use case (which implies that maybe it should live in lxml.html for the HTML case, even if it ends up calling the same internal implementation).

Or should that be "tobytes" and "tounicode" to eliminate all ambiguity?

History
Date	User	Action	Args
2010-03-12 09:00:48	effbot	set	recipients: + effbot, gvanrossum, georg.brandl, scoder, r.david.murray, flox
2010-03-12 09:00:48	effbot	set	messageid: <1268384448.06.0.254820444753.issue8047@psf.upfronthosting.co.za>
2010-03-12 09:00:46	effbot	link	issue8047 messages
2010-03-12 09:00:45	effbot	create