Message 93737 - Python tracker


Author	vinay.sajip
Recipients	rszefler, vinay.sajip
Date	2009-10-08.09:03:30
Message-id	<747090.49681.qm@web25804.mail.ukl.yahoo.com>
In-reply-to	<1254989357.9.0.704232203528.issue7077@psf.upfronthosting.co.za>

Content

> Robert Szefler added the comment:
>
> Fine with me, though problems would arise. Default encoding for example.
> If encoding selection is mandatory it would break compatibility. Using
> default locale is not such a good idea - local machine's locale would
> generally not need to have any correlation to the remote logger's.

I'm not planning to make encoding selection mandatory: I would provide a parameter encoding=None so backward compatibility is preserved.

On 2.x: During output, a check will be made for Unicode. If the data is not Unicode, it is output as is. Otherwise it's encoded using either the specified encoding (if not None) or some default - for example, locale.getpreferredencoding().
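
As a rough illustration of the check described above (not the actual logging code), the output step might look like the sketch below. The function name prepare_for_output is hypothetical, and the sketch is written for Python 3, where str plays the role that the unicode type plays on 2.x and bytes plays the role of the 2.x str.

```python
import locale


def prepare_for_output(data, encoding=None):
    """Return the bytes to write out (hypothetical helper, not a logging API)."""
    if isinstance(data, str):
        # Unicode text: use the caller's encoding if one was given,
        # otherwise fall back to a default. The locale's preferred
        # encoding is only one candidate for that default.
        chosen = encoding if encoding is not None else locale.getpreferredencoding()
        return data.encode(chosen)
    # Not Unicode: the data is output as is.
    return data


as_is = prepare_for_output(b"already bytes")
encoded = prepare_for_output("привет", encoding="cp1251")
```

Because encoding defaults to None, existing callers that never pass it keep working unchanged.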

I understand what you're saying about the locales of two different machines not being the same - but there's no way around this, because if a socket receives bytes representing text, the receiver needs to know what encoding was used so that it can reconstruct the Unicode correctly. So that information at least needs to be known at the receiving end, rather than guessed. While 'utf-8' might be a reasonable choice, I'm not sure it should be enforced. So the code sending the bytes can specify e.g. 'cp1251' and the other end has to know so it can decode correctly. I've posted on python-dev for advice about what encoding to use if none is specified.
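
To make the point about the receiving end concrete, here is a small sketch of what happens when the receiver guesses. The sample text and the wrong guesses ('latin-1', 'utf-8') are chosen purely for illustration.

```python
text = "привет"
payload = text.encode("cp1251")  # the sender's choice of encoding

# A receiver that knows the encoding reconstructs the text correctly.
recovered = payload.decode("cp1251")

# A receiver that guesses another single-byte encoding gets no error,
# just different (wrong) characters.
garbled = payload.decode("latin-1")

# A receiver that guesses UTF-8 cannot decode these bytes at all.
try:
    payload.decode("utf-8")
    utf8_failed = False
except UnicodeDecodeError:
    utf8_failed = True
```

The silent case is the more dangerous one: nothing signals that the decoded text is wrong.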

On 3.x: We will always be passing Unicode in so we will always need to convert to bytes using some encoding. Again, if not specified, a suitable default encoding needs to be chosen.
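
A minimal sketch of the 3.x behaviour, assuming a hypothetical EncodingSender class and an in-memory io.BytesIO standing in for the socket. Neither is part of the logging package, and the fallback to locale.getpreferredencoding() is just the example default mentioned earlier, not a settled choice.

```python
import io
import locale


class EncodingSender:
    """Hypothetical sender that writes text to a binary stream as bytes."""

    def __init__(self, stream, encoding=None):
        # encoding=None keeps the parameter optional for existing callers.
        self.stream = stream
        self.encoding = encoding

    def send(self, text):
        # On 3.x the input is always expected to be Unicode (str).
        if not isinstance(text, str):
            raise TypeError("expected str, got %s" % type(text).__name__)
        # Placeholder default: the locale's preferred encoding.
        chosen = self.encoding or locale.getpreferredencoding()
        self.stream.write(text.encode(chosen))


buffer = io.BytesIO()
EncodingSender(buffer, encoding="utf-8").send("héllo")
```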

> Maybe the best solution would be to coerce the text to ASCII per default
> (such as not to break current semantics) but fix the exception thrown
> (throw an Unicode*Error) and allow an optional encoding parameter to
> handle non-ASCII characters?

I'm not exactly sure what you mean, but I think I've covered it in my comments above. To summarise:

On 2.x, encoding is not mandatory but if Unicode is passed in, either a specified encoding or a suitable default encoding will be used to encode the Unicode into str.

On 3.x, encoding is not mandatory and Unicode should always be passed in, which will be encoded to bytes using either a specified encoding or a suitable default encoding.

History
Date	User	Action	Args
2009-10-08 09:03:33	vinay.sajip	set	recipients: + vinay.sajip, rszefler
2009-10-08 09:03:32	vinay.sajip	link	issue7077 messages
2009-10-08 09:03:31	vinay.sajip	create