Message 265881 - Python tracker

Author	josh.r
Recipients	benspiller, docs@python, ezio.melotti, josh.r, serhiy.storchaka, steven.daprano, terry.reedy
Date	2016-05-19.18:37:25
Message-id	<1463683045.87.0.664883378198.issue26369@psf.upfronthosting.co.za>

Content

Agree with Steven; the whole reason Python 3 changed from unicode and str to str and bytes was that having Py2 str be text at some times and binary data at others is confusing. The existing behavior can't change in Py2 in any meaningful way without breaking existing code, introducing special cases for text->text encodings (which Python 3 supports only through the codecs module), or behaving in non-obvious ways in corner cases. Silently treating str.encode("utf-8") as "decode as UTF-8 and throw away the result to verify that it's already UTF-8 bytes" is not particularly intuitive either.
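To make the implicit step concrete, here is a minimal sketch that emulates what Python 2's str.encode does to a byte string: an implicit decode with the default encoding (ASCII on a stock Python 2), followed by the requested encode. The helper name py2_style_str_encode and its default_encoding parameter are made up for illustration, and this is not how CPython implements it.

```python
def py2_style_str_encode(data, encoding, default_encoding="ascii"):
    """Emulate Python 2's str.encode(encoding) on a byte string (illustrative only)."""
    # Step 1: the implicit decode nobody asked for, using the default encoding.
    text = data.decode(default_encoding)
    # Step 2: the encode that was actually requested.
    return text.encode(encoding)


# Pure-ASCII bytes survive the round trip unchanged.
ascii_result = py2_style_str_encode(b"abc", "utf-8")

# Bytes that are already valid UTF-8 (here U+00E9) fail in step 1, so a call
# that looks like an encode surfaces a UnicodeDecodeError.
try:
    py2_style_str_encode(b"\xc3\xa9", "utf-8")
    error_name = None
except UnicodeDecodeError as exc:
    error_name = type(exc).__name__
```

The second call is the confusing case from this issue: the data is already UTF-8, yet the error comes from the hidden ASCII decode.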

It does seem like a doc fix would be useful though; right now, we have only "String methods" documented, with no distinction between str and unicode. It might be helpful to explicitly deprecate str.encode on str objects, and unicode.decode, with a note that while it's meaningful to use these methods in Python 2 for text<->text encoding/decoding, the corresponding methods (bytes.encode and str.decode) don't exist at all in Python 3.
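For reference, a short sketch of the Python 3 side of this, assuming a recent Python 3: the "wrong direction" methods are absent, and transformations that are not text -> bytes go through the codecs module functions instead. The variable names are only for illustration.

```python
import codecs

# In Python 3, bytes has no encode method and str has no decode method.
bytes_has_encode = hasattr(bytes, "encode")
str_has_decode = hasattr(str, "decode")

# text -> text transformation via the codecs module.
rot13_text = codecs.encode("Hello", "rot_13")

# bytes -> bytes transformation via the codecs module, and back again.
hex_bytes = codecs.encode(b"abc", "hex_codec")
round_trip = codecs.decode(hex_bytes, "hex_codec")
```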

Otherwise, yes, if you want consistent text/binary types, that's what Python 3 is for. Python 2 has tons of flaws when it comes to handling Unicode (e.g. the csv module), and fixing any single problem (creating backward compatibility headaches in the process) is not worth the trouble.

If you're concerned about excessive boilerplate, just write a function (or a type) that allows you to perform the tests/conversions you care about as a single call. For example, the following seems like it achieves your objectives (one-line usage; handles str by verifying that it's legal in the provided encoding in strict mode, dropping/replacing characters in ignore/replace mode, etc.):

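import sys

def basestringencode(s, encoding=sys.getdefaultencoding(), errors="strict"):
    # Python 2 only: str is the byte string type, unicode is the text type.
    if isinstance(s, str):
        # Decode with the provided rules, so illegal bytes in a str raise an
        # exception, get replaced, get ignored, etc. per the arguments
        s = s.decode(encoding, errors)
    # By this point s is unicode, so encode never triggers an implicit decode
    return s.encode(encoding, errors)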

If you don't want to see UnicodeDecodeError, you can either pass 'ignore' for errors, or wrap the s.decode step in a try/except and raise a different exception type.
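A minimal sketch of the try/except variant, written so it runs on both Python 2 and Python 3 (bytes is an alias for str on Python 2). The names encode_checked and NotValidInEncodingError are hypothetical, and the default of "utf-8" is an assumption made for the example rather than sys.getdefaultencoding().

```python
class NotValidInEncodingError(Exception):
    """Hypothetical exception type raised in place of UnicodeDecodeError."""


def encode_checked(s, encoding="utf-8", errors="strict"):
    # Byte strings are validated by decoding them with the caller's rules.
    if isinstance(s, bytes):
        try:
            s = s.decode(encoding, errors)
        except UnicodeDecodeError as exc:
            # Report the failure as a different, caller-chosen exception type.
            raise NotValidInEncodingError(
                "byte string is not valid %s: %s" % (encoding, exc))
    # s is text here, so this is a plain text -> bytes encode.
    return s.encode(encoding, errors)


valid = encode_checked(b"abc")
from_text = encode_checked(u"\xe9")
ignored = encode_checked(b"\xff", errors="ignore")

try:
    encode_checked(b"\xff")
    raised = False
except NotValidInEncodingError:
    raised = True
```

With errors="ignore" the invalid byte is dropped silently, so the custom exception only appears in strict mode.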

The biggest change I could see happening code-wise would be a textual change to the UnicodeDecodeError that str.encode raises, so that str.encode specifically replaces the default error message (but not the type, for backward compatibility reasons) with something like "str.encode cannot perform implicit decode with sys.getdefaultencoding(); use .encode only with unicode objects".

History
Date	User	Action	Args
2016-05-19 18:37:25	josh.r	set	recipients: + josh.r, terry.reedy, ezio.melotti, steven.daprano, docs@python, serhiy.storchaka, benspiller
2016-05-19 18:37:25	josh.r	set	messageid: <1463683045.87.0.664883378198.issue26369@psf.upfronthosting.co.za>
2016-05-19 18:37:25	josh.r	link	issue26369 messages
2016-05-19 18:37:25	josh.r	create