classification
Title: str/unicode encoding kwarg causes exceptions
Type: enhancement
Stage:
Components: Interpreter Core
Versions: Python 3.6, Python 3.5
process
Status: closed
Resolution: rejected
Dependencies:
Superseder:
Assigned To:
Nosy List: berker.peksag, eric.smith, ezio.melotti, lemburg, mahmoud, martin.panter, r.david.murray, serhiy.storchaka, vstinner
Priority: normal
Keywords:

Created on 2015-04-21 04:57 by mahmoud, last changed 2022-04-11 14:58 by admin. This issue is now closed.

Messages (13)
msg241699 - Author: Mahmoud Hashemi (mahmoud) Date: 2015-04-21 04:57
The encoding keyword argument to the Python 3 str() and Python 2 unicode() constructors is excessively constraining to the practical use of these core types.

Looking at common usage, both these constructors' primary mode is to convert various objects into text:

>>> str(2)
'2'

But adding an encoding yields:

>>> str(2, encoding='utf8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: coercing to str: need bytes, bytearray or buffer-like object, int found

While the error message is fine for an experienced developer, I would like to raise the question: is it necessary at all? Even harmlessly getting a str from a str is punished, but leaving off encoding is fine again:

>>> str('hi', encoding='utf8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: decoding str is not supported
>>> str('hi')
'hi'

Merging and simplifying the two modes of these constructors would yield much more predictable results for experienced and beginning Pythonists alike. Basically, the encoding argument should be ignored if the argument is already a unicode/str instance, or if it is a non-string object. It should only be consulted if the primary argument is a bytestring. Bytestrings already have a .decode() method, so another, more obscure version of it isn't necessary.

Furthermore, despite the core nature and widespread usage of these types, changing this behavior should break very little existing code and understanding. unicode() and str() will simply behave as expected more often, returning text versions of the arguments passed to them. 

Appendix: To demonstrate the expected behavior of the proposed unicode/str, here is a code snippet we've employed to sanely and safely get a text version of an arbitrary object:

def to_unicode(obj, encoding='utf8', errors='strict'):
    # the encoding default should look at sys's value
    try:
        return unicode(obj)
    except UnicodeDecodeError:
        return unicode(obj, encoding=encoding, errors=errors)

After many years of writing Python and teaching it to developers of all experience levels, I firmly believe that this is the right interaction pattern for Python's core text type. I'm also happy to expand on this issue, turn it into a PEP, or submit a patch if there is interest.
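
For readers on Python 3, the behaviour proposed above can be approximated with a small helper. This is only a sketch: the name to_text and the set of types treated as bytes-like are assumptions made for illustration, not part of the original post or of any patch.

```python
def to_text(obj, encoding="utf-8", errors="strict"):
    """Return a text version of obj (hypothetical helper, illustration only).

    The encoding is consulted only for bytes-like input, mirroring the
    behaviour the proposal asks of str() itself.
    """
    if isinstance(obj, str):
        # Already text: there is nothing to decode.
        return obj
    if isinstance(obj, (bytes, bytearray, memoryview)):
        # Bytes-like input: the one case where the encoding matters.
        return str(obj, encoding=encoding, errors=errors)
    # Any other object: fall back to the ordinary one-argument str().
    return str(obj)


print(to_text(2))
print(to_text("hi"))
print(to_text(b"caf\xc3\xa9"))
```

Unlike the try/except version, this sketch never calls str() on bytes without an encoding, so it does not depend on a UnicodeDecodeError being raised.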
msg241712 - Author: Eric V. Smith (eric.smith) (Python committer) Date: 2015-04-21 13:21
As this is an enhancement request, I've changed the versions.

I'm opposed to this change. If I pass an encoding along with a type for which it makes no sense, I'd prefer an error instead of silently ignoring the encoding.

I think your helper function is an appropriate solution to your problem.
msg241730 - Author: Mahmoud Hashemi (mahmoud) Date: 2015-04-21 18:31
Python already has one approach that fails to decode non-bytestrings: the .decode() method. 

This is about removing unicode barriers to entry and making the str constructor in Python 3 as succinctly useful as possible. There are several problems the helper does not solve:

1) Usage-wise, str/unicode is used to turn values into text. From a high-level perspective, the content does not change, only the representation format. Should this fundamental operation really require type inspection and explicit try/except blocks every single time? Or should it just work? sorted() does not raise an exception if the values are already sorted, so why does str() raise an exception when the value is already a str?*

2) By and large, among developers, keyword arguments are viewed as "optional" arguments that have defaults which can be overridden. However, that is not the case here; str is not simply str(obj, encoding=sys.getdefaultencoding()). Explicitly passing the keyword argument breaks the call.

3) The helper does not help promote Python adoption when it must be copied and pasted into new developers' projects. It does not help break down the misconception that unicode is a punishing concept to be around in Python.

* This question is posed here rhetorically, but I have gotten variations on it from multiple Python developers in training.
msg241759 - Author: Martin Panter (martin.panter) (Python committer) Date: 2015-04-22 00:15
I don’t think changes to Python 2 are considered here, unless they are bug fixes, and this does not sound like a bug fix.

For Python 3, it sounds like you are proposing that str() accept encoding arguments even when not decoding from bytes. It sounds like this would mask the error if you called str(buffer, "ascii"), and the buffer happened to be an integer or a list, etc, by accident. Also, this woul

It seems str() is designed to have two separate modes:

1. str(object) is basically equivalent to format(object), with a warning if “object” happens to be a byte string or array

2. str(object, encoding, ...) is normally equivalent to object.decode(encoding, ...), or if that is not supported, codecs.decode(object, encoding, ...)

Your proposal sounds like it would make it easier to confuse these two modes. What should str(b"123", encoding=None) do? Why should the behaviour of str(file, encoding) vary depending on whether an ordinary file object or a memory-mapped file is passed?
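
The two modes listed above can be seen side by side in a short Python 3 session. This is a minimal sketch; the sample bytes are an arbitrary choice, and running the interpreter with the -b option would additionally emit a BytesWarning for the first call.

```python
data = b"caf\xc3\xa9"

# Mode 1: no encoding given, so str() returns the informal representation,
# the same text format() produces with an empty format spec.
mode_one = str(data)
print(mode_one == format(data))

# Mode 2: an encoding is given, so str() decodes the bytes,
# just as data.decode() would.
mode_two = str(data, "utf-8")
print(mode_two == data.decode("utf-8"))
```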

IMO in a perfect Python 4 world, str() would only have a single personality (perhaps always returning an empty string, or a more strict conversion). Making formatted string representations of arbitrary objects would be left to the format() and repr() functions, and decoding bytes to text would be left to the existing decode() methods and functions, or maybe a separate str.from_bytes() constructor, mirroring int.from_bytes().
msg241761 - Author: Mahmoud Hashemi (mahmoud) Date: 2015-04-22 00:57
Martin, it sounds that way because that is what is being proposed: "Merging and simplifying the two modes". Given the existence of .decode() on bytestrings, the only objects that generally need decoding in Python 2 and 3, the existence of str/unicode's second mode constitutes a design bug.

Without a doubt, Python has frequently preferred convenient idioms over EAFP. Look at dict.get for an excellent example of defaults being used instead of forcing users to catch KeyErrors. That conversation could have gone a different way, but Python is better off having stuck to its pragmatic roots.

In answer to your questions, Martin, 1) I'd expect str(b"123", encoding=None) to do the same thing as str(b"123")  and 2) I'd expect str(obj) behavior to continue to depend on whether the object passed is string-like. Python is a duck-typed, dynamic language, and dynamic languages are most powerful when their core types reflect usability. Consistency is one of the foremost factors of usability, and having to frequently switch between two call patterns of the str constructor feels inconsistent and unusable.
msg241762 - Author: R. David Murray (r.david.murray) (Python committer) Date: 2015-04-22 01:23
It does feel like the encoding argument is left over from the translation of the unicode constructor into the str constructor.  I wouldn't be opposed to deprecating it, myself, though we'd probably never remove it.  I would be opposed to making it work on non-bytes-like objects.
msg241763 - Author: Martin Panter (martin.panter) (Python committer) Date: 2015-04-22 01:47
Okay, I was trying to confirm your proposal in Python 3 terms, because in Python 2, str has a different meaning and I was confused.

I agree that the existence of the decoding mode is a design bug, so how would you feel about deprecating it, at least in the documentation? I.e. in Python 3, deprecate usage like str(buffer, "utf-8") in favour of buffer.decode("utf-8") or using the codecs module directly. If this was done, it would clearly remove the need for an encoding parameter to str() in all cases. I would be in favour of deprecating the complementary bytes() and bytearray() encoding modes as well.

Do you have an example use case in Python 3 that would benefit from always allowing an encoding parameter? I can understand that your to_unicode() function could be useful in Python 2. But in Python 3, byte strings tend to hold raw data that is not necessarily textual at all. There are some places (warts in my opinion) such as the binascii module where ASCII-encoded byte strings are common, but I still don’t think this proposal would be very helpful with that.
msg241764 - Author: Eric V. Smith (eric.smith) (Python committer) Date: 2015-04-22 03:16
I agree with deprecating (in the documentation) but never removing the encoding argument to str() in Python 3. .decode() is the better way to convert a bytes-like object to a str.

Every change proposed here would be an enhancement in 2.7, and we are not implementing enhancements there.
msg241777 - Author: Serhiy Storchaka (serhiy.storchaka) (Python committer) Date: 2015-04-22 06:44
Please don't deprecate the encoding parameter in str. It has a use case. The str constructor works with any bytes-like object, even those that don't have a decode method. It raises a more appropriate TypeError instead of an AttributeError, so often you don't need to wrap the error.
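
A short Python 3 sketch of this use case, assuming memoryview as the example of a bytes-like object without a decode method; the helper name error_name is hypothetical and exists only to report which exception each call raises.

```python
import codecs

view = memoryview(b"spam")

# memoryview is bytes-like but has no decode() method of its own.
print(hasattr(view, "decode"))

# str() with an encoding still decodes it, as does codecs.decode().
from_str = str(view, "ascii")
from_codecs = codecs.decode(view, "ascii")
print(from_str, from_codecs)


def error_name(func):
    """Return the name of the exception func raises, or None (illustration only)."""
    try:
        func()
    except Exception as exc:
        return type(exc).__name__
    return None


# Passing a non-bytes-like object: str() reports a TypeError,
# while reaching for a decode() method reports an AttributeError.
str_error = error_name(lambda: str(2, "ascii"))
decode_error = error_name(lambda: (2).decode("ascii"))
print(str_error, decode_error)
```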
msg241783 - Author: Martin Panter (martin.panter) (Python committer) Date: 2015-04-22 07:40
I thought it might be okay to use codecs.decode() instead for those cases, though it doesn’t check for text encodings. And support for arbitrary bytes-like objects doesn’t seem to be documented (though it seems to work in reality).
msg241795 - Author: R. David Murray (r.david.murray) (Python committer) Date: 2015-04-22 12:49
Sounds like we should close this as rejected, then.  Serhiy's point is a good one.  Maybe not the way we'd design the api from scratch, but it's what we've got and it serves a purpose.
msg241797 - Author: Marc-Andre Lemburg (lemburg) (Python committer) Date: 2015-04-22 13:08
I agree with closing this as "won't fix".

It is true that the encoding keyword argument is only useful when passing in byte strings (and that's also where it originated in Python 2: the default string type is a byte string), but even in Python 3, this is still one of the main uses of the str() constructor.

Note that it's not uncommon to have arguments only be useful for certain types of input objects. See e.g. the int() constructor's base argument for a similar example.
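
The int() comparison can be checked directly. A minimal sketch with arbitrary sample values: the base argument is accepted for string input and rejected for input that is already a number.

```python
# base applies when the input is a string (or bytes) of digits.
parsed = int("ff", 16)
print(parsed)  # 255

# With an integer input, an explicit base is rejected rather than ignored.
try:
    int(255, 16)
except TypeError as exc:
    base_error = type(exc).__name__
    print(base_error, exc)
```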
msg241847 - Author: Mahmoud Hashemi (mahmoud) Date: 2015-04-23 06:49
I would urge you all to take a stronger look at usability, rather than parroting the current state of the design and docs. Python gained renown over the years for its ability to stay flexible while maturing. Focusing on purity and ignoring the needs of practical programmers is exactly how PEP #461 ended up coming into play so late.

The inflexible arguments of str make a common task, turning data into text, an order of magnitude harder than it needs to be.
History
Date | User | Action | Args
2022-04-11 14:58:15 | admin | set | github: 68207
2015-04-23 06:49:57 | mahmoud | set | messages: + msg241847
2015-04-22 13:17:59 | benjamin.peterson | set | status: open -> closed; resolution: rejected
2015-04-22 13:08:47 | lemburg | set | nosy: + lemburg; messages: + msg241797
2015-04-22 12:49:13 | r.david.murray | set | messages: + msg241795
2015-04-22 07:40:09 | martin.panter | set | messages: + msg241783
2015-04-22 06:44:08 | serhiy.storchaka | set | nosy: + serhiy.storchaka; messages: + msg241777
2015-04-22 03:29:34 | berker.peksag | set | nosy: + berker.peksag
2015-04-22 03:16:58 | eric.smith | set | messages: + msg241764; versions: + Python 3.6, - Python 2.7, Python 3.4
2015-04-22 01:47:43 | martin.panter | set | messages: + msg241763
2015-04-22 01:23:34 | r.david.murray | set | nosy: + r.david.murray; messages: + msg241762
2015-04-22 00:57:42 | mahmoud | set | messages: + msg241761
2015-04-22 00:15:54 | martin.panter | set | nosy: + martin.panter; messages: + msg241759
2015-04-21 18:31:12 | mahmoud | set | messages: + msg241730; versions: + Python 2.7
2015-04-21 13:21:24 | eric.smith | set | versions: - Python 2.7, Python 3.6; nosy: + eric.smith; messages: + msg241712; components: + Interpreter Core, - Unicode; type: behavior -> enhancement
2015-04-21 04:57:04 | mahmoud | create |