
Author scoder
Recipients effbot, flox, georg.brandl, gvanrossum, r.david.murray, scoder
Date 2010-03-12.06:48:58
Hi Guido, your comment was long overdue in this discussion.

Guido van Rossum, 12.03.2010 01:35:
> My thinking was that since an XML document looks like text, it should
> probably be considered text, at least by default.  (There may have
> been some unittests that appeared to require this -- of course this
> was probably just the confusion between byte strings and 8-bit text
> strings inherent in Python 2.)

Well, well, XML...

It does look like text, but it's encoded text that is defined as a stream of bytes, and that's the only safe way of dealing with it.

There certainly *is* a use case for treating the serialised result as text; that's why lxml has this feature. A minor one is debug output (which certainly doesn't merit being the default), but another is dealing with HTML, where encoding information is less well defined and *much* less often seen in the wild. So users tend to be happy when they get their real-world HTML input fixed up into proper Unicode, and happier still when they see that lxml can parse it correctly and even serialise the result back into a Unicode string directly, which they can post-process as text if they need to.

However, the main part here is the input, i.e. getting HTML data properly decoded into Unicode. The output part is a lot less important, and it's often easier to let lxml.html do the correct serialisation into bytes with proper encoding meta information, rather than dealing with it yourself.

Those are the two use cases I see for lxml. Their impact on ElementTree is relatively low as it doesn't support *parsing* from a Unicode string, so the most important HTML feature isn't there in the first place. The lack of major use cases in ElementTree is one of the reasons I'm so opposed to making this feature the backwards incompatible default for the output side.
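[For comparison, a minimal sketch of the two output modes using today's stdlib ElementTree, where the text-mode spelling eventually became encoding='unicode'; the exact spelling was still under discussion at the time of this message:]

```python
import xml.etree.ElementTree as ET

root = ET.Element("root")
root.text = "h\xe9"  # non-ASCII content

# Default: a byte string, kept ASCII-safe via character references
data = ET.tostring(root)
assert isinstance(data, bytes)
assert b"&#233;" in data

# Text mode: a str with non-ASCII characters kept as-is
# and no encoding declaration
text = ET.tostring(root, encoding="unicode")
assert isinstance(text, str)
assert "\xe9" in text
```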

> Regarding backwards compatibility, there are now two backwards
> compatibility problems: with 2.x, and with 3.1.  It seems we cannot
> easily be backwards compatible with both (though if someone figures
> out a way that would be best of course).
> If I were to propose an API for returning a Unicode string, I would
> probably add a new method (e.g. tounicode()) rather than using a
> "magical" argument (tostring(encoding=str)), but given that that
> exists in another supposedly-compatible implementation I'm not
> against it.

Actually, lxml.etree originally had a tounicode() function for this purpose, and I deprecated it in favour of tostring(encoding=unicode) to avoid having a separate interface for this, while staying just as explicit as before.  I'm aware that this wasn't an all-win decision, but I found passing the unicode type to be explicit enough, and separate enough from an encoding /name/ to make it clear what happens. It's certainly less beautiful in Py3, where you write "tostring(encoding=str)".

I still haven't removed the function from the API, but it's been deprecated for years. Reactivating it in lxml.etree, and duplicating it in ET, would save lxml.etree from having to break user code (as "tostring(encoding=str)" could simply continue to work, but disappear from the docs). It wouldn't save ET-Py3 from breaking backwards compatibility with itself, though.

> Maybe tostring(encoding=None) could also be made to work? That would
> at least make it *possible* to write code that receives a text object
> and that works in 3.1 and 3.2 both.  In 2.x I think neither of these
> should work, and there probably isn't a need -- apps needing full
> compatibility will just have to refrain from calling tostring()
> without arguments.

It could be made to work, and it doesn't even read that badly. I can't imagine anyone using this explicitly to request the default behaviour, although you never know how people put together their keyword argument dicts programmatically. None has always been the documented default for the encoding parameter, so I'm sure there's at least a tiny bit of code out there that uses it to say "I'm not overriding the default here".

Actually, encoding has been a keyword-only parameter in lxml.etree for ages, which was fine with the original default and conformed with the official ET documentation. So it would be easy to switch here, although not beautiful in the implementation. Same for ElementTree, where the current default None in the signature could simply be replaced by the 'real' default 'us-ascii'. Within the Py3 series, this change would not preserve backwards compatibility either.
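[A hypothetical sketch of that signature change, spelling out the 'real' default so that encoding=None explicitly requests the byte-string behaviour; the wrapper name and dispatch are illustrative only:]

```python
import xml.etree.ElementTree as ET

def tostring(element, encoding=None):
    # Hypothetical dispatch: None maps to the 'real' default,
    # so passing encoding=None keeps returning a byte string.
    if encoding is None:
        encoding = "us-ascii"
    return ET.tostring(element, encoding=encoding)

root = ET.Element("a")
assert tostring(root) == tostring(root, encoding=None)
assert isinstance(tostring(root), bytes)
```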

So, as a solution, I do prefer separating this feature out into a separate function, so that we can simplify the interface of tostring() into always returning a byte string serialisation, as it always was in ET. The rather distinct use case of serialising to an unencoded text string can well be handled by a tounicode() function.
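[A separate function along these lines could be as small as the following sketch; tounicode is lxml's old name, implemented here on top of the stdlib's eventual encoding='unicode' spelling:]

```python
import xml.etree.ElementTree as ET

def tounicode(element):
    """Serialise to an unencoded text string, without an XML
    declaration.  Hypothetical helper mirroring lxml's deprecated
    tounicode(); tostring() itself keeps returning bytes.
    """
    return ET.tostring(element, encoding="unicode")

root = ET.Element("greeting")
root.text = "h\xe4llo"
assert isinstance(tounicode(root), str)   # text serialisation
assert isinstance(ET.tostring(root), bytes)  # bytes, as always
```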

> ISTM that the behavior of write() is just fine -- the contents of the
> file will be correct after all.

Not according to the Py3.2 dev docs of open():

    'encoding' is the name of the encoding used to decode or encode the file. This should only be used in text mode. The default encoding is platform dependent (whatever locale.getpreferredencoding() returns)

So if a user's "preferred encoding" is not UTF-8 compatible, then writing out the Unicode serialisation will result in an incorrect XML serialisation, as an XML byte stream without an encoding declaration is assumed to be UTF-8 by the specification.
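[A small demonstration of that pitfall, using stdlib ElementTree with encoding='unicode' as the text-mode spelling; 'latin-1' stands in for a non-UTF-8 preferred encoding:]

```python
import xml.etree.ElementTree as ET

root = ET.Element("doc")
root.text = "h\xe4\xdflich"  # non-ASCII text
text = ET.tostring(root, encoding="unicode")  # no XML declaration

good = text.encode("utf-8")   # what the spec assumes
bad = text.encode("latin-1")  # e.g. a Latin-1 locale default

assert ET.fromstring(good).text == root.text  # parses fine
try:
    ET.fromstring(bad)  # not valid UTF-8 -> parse error
except ET.ParseError:
    pass
else:
    raise AssertionError("expected a parse error")
```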
