classification
Title: unicode.decode and str.encode are unnecessarily confusing for non-ascii
Type: behavior Stage:
Components: Documentation Versions: Python 2.7
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: docs@python Nosy List: benspiller, docs@python, ezio.melotti, josh.r, lemburg, serhiy.storchaka, steven.daprano, terry.reedy
Priority: normal Keywords:

Created on 2016-02-16 11:58 by benspiller, last changed 2016-05-20 09:34 by benspiller.

Messages (15)
msg260359 - (view) Author: Ben Spiller (benspiller) * Date: 2016-02-16 11:58
It's well known that lots of people struggle writing correct programs using non-ascii strings in python 2.x, but I think one of the main reasons for this could be very easily fixed with a small addition to the documentation for str.encode and unicode.decode, which is currently quite vague. 

The decode/encode methods really make most sense when called on a unicode string i.e. unicode.encode() to produce a byte string, or on a byte string e.g. str.decode() to produce a unicode object from a byte string. 

However, the additional presence of the opposite methods str.encode() and unicode.decode() is quite confusing, and a frequent source of errors - e.g. calling str.encode('utf-8') first DECODES the str object (which might already be in utf8) to a unicode string **using the default encoding of "ascii"** (!) before ENCODING to a utf-8 byte str as requested, which of course will fail at the first stage with the classic error "UnicodeDecodeError: 'ascii' codec can't decode byte" if there are any non-ascii chars present. It's unfortunate that this initial decode/encode stage ignores both the "encoding" argument (used only for the subsequent encode/decode) and the "errors" argument (commonly used when the programmer is happy with a best-effort conversion e.g. for logging purposes).

Anyway, given this behaviour, a lot of time would be saved by a simple sentence in the doc for str.encode()/unicode.decode() essentially warning people that those methods aren't that useful and they probably really intended to use str.decode()/unicode.encode() - the current doc gives absolutely no clue about this extra stage, which ignores the input arguments and uses 'ascii' and 'strict'. It might also be worth stating in the documentation that the pattern (u.encode(encoding) if isinstance(u, unicode) else u) can be helpful for cases where you unavoidably have to deal with both kinds of input, since calling str.encode is such a bad idea.

In an ideal world I'd love to see the implementation of str.encode/unicode.decode changed to be more useful (i.e. instead of using ascii, it would be more logical and useful to use the passed-in encoding to perform the initial decode/encode, and the passed-in 'errors' value). I wasn't sure if that change would be accepted, so for now I'm proposing better documentation of the existing behaviour as a second-best.
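To make the safe direction concrete, here is a small sketch (my illustration, not part of the original report); the u'' and b'' literals make it valid under both Python 2.6+ and Python 3.3+:

```python
# The unambiguous direction: encode unicode objects, decode byte strings.
text = u'caf\xe9'            # a unicode string (str in Python 3)
data = text.encode('utf-8')  # unicode -> bytes, no implicit decode stage
assert data == b'caf\xc3\xa9'
assert data.decode('utf-8') == text
```

Going the other way (calling encode on an already-encoded byte string) is what triggers the implicit ascii decode described above.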
msg260361 - (view) Author: Steven D'Aprano (steven.daprano) * (Python committer) Date: 2016-02-16 13:01
Perhaps you could suggest a specific change to the docstrings for str.encode and unicode.decode?

(BTW, I presume you are aware that the equivalent of (bytes)str.encode and unicode.decode are gone in Python 3?)
msg260542 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2016-02-20 00:29
I was unaware of the seemingly useless behavior you quote.

The intended use for str.encode is for same-type transcoding, like this:

>>> 'abc'.encode('base64')
'YWJj\n'
>>> 'YWJj\n'.decode('base64')
'abc'

Here is a similar use for unicode.decode.

>>> u'abc'.encode('base64')
'YWJj\n'
>>> u'YWJj\n'.decode('base64')
'abc'

Any doc change should make the intended use clear if not already.

(Note that the above give lookup errors in 3.x
>>> 'abc'.encode('base64')
...
LookupError: 'base64' is not a text encoding; use codecs.encode() to handle arbitrary codecs)
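For reference, the codecs-module route that the Python 3 LookupError suggests does work for binary transcodings; a quick sketch (my example, not from the thread):

```python
import codecs

# Python 3: bytes-to-bytes transcodings go through the codecs module,
# not the str/bytes encode/decode methods.
encoded = codecs.encode(b'abc', 'base64')
assert encoded == b'YWJj\n'
assert codecs.decode(encoded, 'base64') == b'abc'
```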
msg265389 - (view) Author: Ben Spiller (benspiller) * Date: 2016-05-12 10:14
Thanks that's really helpful

Having thought about it some more, I think it would be much better, if possible, to actually 'fix' the behaviour of the unicode<->str standard codecs (i.e. not base64) rather than just documenting around it. The current behaviour is not only confusing but leads to bugs that are very easy to miss, since the methods work correctly when given 7-bit ascii characters.

I had a poke around in the python source but couldn't quite identify where it's happening - presumably there is somewhere in the str.encode('utf-8') implementation that first "decodes" the string and does so using the ascii codec. If it could be made to use the same encoding that was passed in (e.g. utf8) then this would end up being a no-op and there would be no unpleasant bugs that only appear when the input includes non-ascii characters. 

It would also allow X.encode('utf-8') to be called successfully whether X is already a str or is a unicode object, which would save callers having to explicitly check what kind of string they've been passed. 

Is anyone able to look into the code to see where this would need to be fixed and how difficult it would be to do? I have a feeling that once the line is located it might be quite a straightforward fix.

Many thanks
msg265390 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016-05-12 10:25
Note that with the -3 option Python 2.7 already warns about incompatibilities. 

>>> 'abc'.encode('base64')
__main__:1: DeprecationWarning: 'base64' is not a text encoding; use codecs.encode() to handle arbitrary codecs
'YWJj\n'
>>> 'YWJj\n'.decode('base64')
__main__:1: DeprecationWarning: 'base64' is not a text encoding; use codecs.decode() to handle arbitrary codecs
'abc'
>>> u'abc'.decode('ascii')
__main__:1: DeprecationWarning: decoding Unicode is not supported in 3.x
u'abc'
msg265392 - (view) Author: Ben Spiller (benspiller) * Date: 2016-05-12 11:02
Yes, the situation is much better in Python 3; this issue is specific to 2.x, but like many people we're sadly not able to move to 3 for the time being.

Since making this mistake is quite common and there's some sensible behaviour that would make it disappear (resulting in ascii and non-ascii strings being treated the same way by these methods), I'd much prefer if we could actually fix it in Python 2.7.
msg265394 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016-05-12 11:20
What do you propose? Note that str.encode() doesn't raise an exception. ASCII unicode strings and 8-bit strings are interchangeable. ASCII unicode strings can be packed in str for less memory consumption (see xmlrpclib or ElementTree), and a lot of str constants are used in unicode contexts (like os.sep or the empty string). Breaking str.encode() will break valid existing code.
msg265395 - (view) Author: Ben Spiller (benspiller) * Date: 2016-05-12 11:42
I'm proposing that str.encode() should _not_ throw a 'decode' exception for non-ascii characters and should effectively be a no-op, to match what it already does for ascii characters - which therefore shouldn't break behaviour anyone will be depending on. This could be achieved by passing the encoding parameter through to the implicit decode() call (which appears to be where the exception comes from), rather than (arbitrarily and surprisingly) using "ascii" (which of course sometimes works and sometimes doesn't, depending on the input string).

Does that make sense?

If someone can find the place in the code (sorry I tried and failed!) where str.encode('utf-8') is resulting in an implicit call to the equivalent of decode('ascii') (as suggested by the exception message) I think it'll be easier to discuss the details
msg265399 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016-05-12 12:31
If str.encode() raises a decoding exception, this is a programming bug. It would be bad to hide it.

FYI, the default encoding is not hardcoded 'ascii'. Google "Changing default encoding in Python". Maybe this will help in your program.
msg265873 - (view) Author: Ben Spiller (benspiller) * Date: 2016-05-19 16:23
btw If anyone can find the place in the code (sorry I tried and failed!) where str.encode('utf-8', errors=X) results in an implicit call to the equivalent of decode(defaultencoding, errors='strict') (as suggested by the exception message) I think it'll be easier to discuss the details of fixing.

Thanks for your reply - yes I'm aware that theoretically you _could_ globally change python's default encoding from ascii, but the prevailing view I've heard from python developers seems to be that changing it is not a good idea and may cause lots of library code to break. Also it's probably not a good idea for individual libraries or modules to be changing global state that affects the entire python invocation, and it would be nice to find a less fragile and more out-of-the-box solution to this. You may well be using different encodings (not just utf-8) to be used in different parts of your program - so changing the globally-defined default encoding doesn't seem right, especially for a method like str.encode method that already takes an 'encoding' argument (used currently only for the encoding aspect, not the decoding aspect). 

I do think there's a strong case to be made for changing the str.encode (and also unicode.decode) behaviour so that str.encode('utf-8') behaves the same whether it's given ascii or non-ascii characters, and also similar to unicode.encode('utf-8'). Let me try to persuade you... :)

First, to address the point you made:

> If str.encode() raises a decoding exception, this is a programming bug. It would be bad to hide it.

I totally agree with the general principle of not hiding programming bugs. However, if calling str.encode for codecs like utf8 (let's ignore base64 for now, which is a very different beast) were *consistently* treated as a 'programming bug' by python and always resulted in an exception, that would be ok (suboptimal usability imho, but still ok), since programmers would quickly spot the problem and fix it. But that's not what happens - it *silently works* (is a no-op) as long as you happen to be using ASCII characters, so this so-called 'programming bug' will go unnoticed by most programmers (and authors of third party library code you might be relying on!)... but the moment a non-ascii character gets introduced you'll suddenly get an exception, maybe in some library code you rely on but can't fix. For this reason I don't think treating this as a programming bug is helping anyone write more robust python code - quite the reverse. Plus I think the behaviour of being a no-op is almost always 'what you would have wanted it to do' anyway, whereas the behaviour of throwing an exception almost never is.

I think we'd agree that changing str.encode(utf8) to throw an exception in *all* cases wouldn't be a realistic option, since it would certainly break backwards compatibility in painful ways for many existing apps and library code.

So, if we want to make the behaviour of this important built-in type a bit more consistent and less error-prone/fragile for this case then I think the only option is making str.encode be a no-op for non-ascii characters (at least, non-ascii characters that are valid in the specified encoding), just as it is for ascii characters. 

Here's why I think ditching the current behaviour would be a good idea:
- calling str.encode() and getting a DecodeError is confusing ("I asked you to encode this string, what are you decoding for?")
- calling str.encode('utf-8') and getting an exception about "ascii" is confusing as the only encoding I mentioned in the method call was utf-8
- calling encode(..., errors=ignore) and getting an exception is confusing and feels like a bug; I've explicitly specified that I do NOT want exceptions from calling this method, yet (because neither 'errors' nor 'encoding' argument gets passed to the implicit - and undocumented - decode operation), I get unexpected behaviour that is far more likely to break my program than a no-op
- the somewhat surprising behaviour we're talking about is not explicitly documented anywhere
- having str.encode throw on non-ascii but not ascii makes it very likely that code will be written and shipped (including library code you may have no control over) that *appears* to work under normal testing but has *hidden* bugs that surface only once non-ascii characters are used. 
- in every situation I can think of, having str.encode(encoding, errors=ignore) honour the encoding and errors arguments even for the implicit-decode operation is more useful than having it ignore those arguments and throw an exception
- a quick google shows lots of people in the Python community (from newbies to experts) are seeing this exception and being confused by it, therefore a lot of people's lives might be improved if we can somehow make the situation better :)
- even with the best of intentions (and with code written by senior python programmers who understand unicode issues well) it's very difficult in practice to write non-trivial python programs that always consistently use the 'unicode' string type throughout (especially when legacy code or third party libraries are involved), so most 'real' code needs to cope with a mix of str and unicode types in practice. So when you need to write a 'basestring' out to a file, you'd like to be able to simply call the s.encode(myencoding, errors=whatever) method that exists on both str and unicode types and have it 'work' whether it's a str already in that encoding or a unicode object that needs to be converted. This is a common use case and the behaviour I'm suggesting would really help with it. The alternative is that every python programmer who cares about non-ascii characters has to write an unpleasant and un-pythonic if clause to give different behaviour based on __type__, in every place they need a byte str:
	if isinstance(s, unicode): 
		f.write(s.encode('utf-8', errors='ignore'))
	else:
		f.write(s)
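That if clause can of course be packaged into a reusable helper; here is a hypothetical sketch (the name to_bytes is mine, not a stdlib function), written so it behaves the same under Python 2 and 3 because Python 2's str is an alias for bytes:

```python
def to_bytes(s, encoding='utf-8', errors='strict'):
    # Hypothetical helper wrapping the isinstance check above:
    # byte strings pass through untouched, text is encoded.
    if isinstance(s, bytes):
        return s
    return s.encode(encoding, errors)

payload = to_bytes(u'caf\xe9')   # encodes: b'caf\xc3\xa9'
raw = to_bytes(b'\xff\xfe')      # passes through unchanged
```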

NB: Although I've used the example of str.encode above, unicode.decode has the exact same issues (and potential solution), and of course this isn't specific to utf-8 but applies to all codecs that convert between str and unicode (i.e. most of them, except base64 and the like).

I hope you'll consider this proposal - it's probably not a very big change, is very unlikely to break any existing/working code, and has the potential to help reduce fragility and difficult-to-resolve bugs in an area of Python that seems to cause pain and confusion to lots of people.

Thanks for considering!
msg265878 - (view) Author: Steven D'Aprano (steven.daprano) * (Python committer) Date: 2016-05-19 17:32
Ben, I'm sorry to see you have spent such a long time writing up reasons for changing this behaviour. I fear this is a total waste of your time, and ours to read it. Python 2.7 is under feature freeze, and changing the behaviour of str.encode and unicode.decode is a new feature. So it could only happen in 2.8, but there will never be a Python 2.8.

If you want more sensible behaviour, then upgrade to Python 3. If you want to improve the docs, then suggest some documentation improvements. But arguing for a change in behaviour of Python 2.7 str.encode and unicode.decode is, I fear, a waste of everyone's time.

If you still wish to champion this change, feel free to raise the issue on the Python-Dev mailing list where the senior developers, including Guido, hang out. I doubt it will do any good, but there is at least the theoretical possibility that if you convince them that this change will encourage people to migrate to Python 3 then you might get your wish.

Just don't hold your breath.
msg265879 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016-05-19 17:54
> btw If anyone can find the place in the code (sorry I tried and failed!) where str.encode('utf-8', error=X) is resulting in an implicit call to the equivalent of decode(defaultencoding, errors=strict) (as suggested by the exception message) I think it'll be easier to discuss the details of fixing.

There is no single place. Search lines "str = PyUnicode_FromObject(str);" in Modules/_codecsmodule.c.

> But that's not what happens - it *silently works* (is a no-op) as long as you happen to be using ASCII characters so this so-called 'programming bug' will go unnoticed by most programmers (and authors of third party library code you might be relying on!)... but the moment a non-ascii character get introduced suddenly you'll get an exception, maybe in some library code you rely on but can't fix.

The problem is that encoding an ASCII str to UTF-8 is a legal operation in some circumstances and a programming bug in others. There is no way to distinguish these two cases automatically.

As non-English speaker I am familiar with the problems you described. This is a bug in the design of Python 2, and the only solution is using Python 3.

You can experiment with your idea, but I'm afraid the patch will be more difficult than you expect and will break the tests. I want to warn that even if your experiment is quite successful, there is not much chance it will be accepted into 2.7. This is more a new feature than a bug fix. Programs that depend on this feature would be incompatible with previous bugfix releases. It is unlikely to help the migration to Python 3, but rather would encourage writing code that is incompatible with Python 3.
msg265881 - (view) Author: Josh Rosenberg (josh.r) * (Python triager) Date: 2016-05-19 18:37
Agree with Steven; the whole reason Python 3 changed from unicode and str to str and bytes was because having Py2 str be text sometimes, and binary data at other times is confusing. The existing behavior can't change in Py2 in any meaningful way without breaking existing code, introducing special cases for text->text encodings (where Python 3 supports them using the codecs module only), behaving in non-obvious ways in corner cases, etc. Silently treating str.encode("utf-8") to mean "decode as UTF-8 and throw away the result to verify that it's already UTF-8 bytes" is not particularly intuitive either.
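The type split Josh describes means the confusing direction simply does not exist in Python 3; a quick illustration (my sketch, not from the thread):

```python
# In Python 3 the asymmetric methods are gone entirely:
# str has no .decode, and bytes has no .encode.
assert not hasattr('abc', 'decode')
assert not hasattr(b'abc', 'encode')

# Only the meaningful directions remain.
assert 'abc'.encode('utf-8') == b'abc'
assert b'abc'.decode('utf-8') == 'abc'
```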

It does seem like a doc fix would be useful though; right now, we have only "String methods" documented, with no distinction between str and unicode. It might be helpful to explicitly deprecate str.encode on str objects, and unicode.decode, with a note that while it's meaningful to use these methods in Python 2 for text<->text encoding/decoding, the methods don't exist at all in Python 3.

Otherwise, yes, if you want consistent text/binary types, that's what Python 3 is for. Python 2 has tons of flaws when it comes to handling unicode (e.g. csv module), and fixing any given single problem (creating backward compatibility headaches in the process) is not worth the trouble.

If you're concerned about excessive boilerplate, just write a function (or a type) that allows you to perform the tests/conversions you care about as a single call. For example, the following seems like it achieves your objectives (one line usage, handles str by verifying that it's legal in provided encoding in strict mode, dropping/replacing characters in ignore/replace mode, etc.):

import sys

def basestringencode(s, encoding=sys.getdefaultencoding(), errors="strict"):
    if isinstance(s, str):
        # Decode with provided rules, so a str with illegal characters
        # raises exception, replaces, ignores, etc. per arguments
        s = s.decode(encoding, errors)
    return s.encode(encoding, errors)

If you don't want to see UnicodeDecodeError, you either pass 'ignore' for errors, or wrap the s.decode step in a try/except and raise a different exception type.

The biggest change I could see happening code wise would be a textual change to the UnicodeDecodeError error str.encode raises, so str.encode specifically replaces the default error message (but not type, for back compat reasons) with something like "str.encode cannot perform implicit decode with sys.getdefaultencoding(); use .encode only with unicode objects"
msg265913 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2016-05-20 08:01
Ben, the methods on strings and Unicode objects in Python 2.x are direct interfaces to the underlying codecs. The codecs can handle any number of input and output types, so there are some which only work on 8-bit strings (bytes) and others which take Unicode as input.

As a result, you sometimes see errors due to the conversion of an 8-bit string to Unicode (in the case, where the codec expects a Unicode input).

As an example, take the UTF-8 codec. This expects a Unicode input when encoding, so when you pass in an 8-bit string, Python will convert it to Unicode using the default encoding (which is normally set to 'ascii') and then apply the codec operation.

When the 8-bit string is plain ASCII this works great. If not, chances are high that you'll run into a Unicode error.

Now, in Python 2.x you can change the default encoding to either make this work by assuming that all your 8-bit strings are UTF-8 (set it to 'utf-8' in sitecustomize.py), or you can disable the automatic conversion altogether by setting the default encoding to 'unknown', which is a codec specifically created for this purpose. The latter will also raise an exception when attempting to convert an 8-bit string to Unicode - similar to what Python 3 does, except that the error type is different.
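The default in play can be inspected at runtime, and decoding explicitly sidesteps it entirely; a small sketch (my example; the value is 'ascii' on a stock Python 2 and 'utf-8' on Python 3):

```python
import sys

# The implicit str<->unicode conversions consult this setting
# ('ascii' on a stock Python 2 install, 'utf-8' on Python 3).
default = sys.getdefaultencoding()

# Passing an explicit codec never consults the default encoding,
# so this works regardless of how the interpreter is configured:
raw = b'caf\xc3\xa9'
text = raw.decode('utf-8')
```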

Hope that helps.
msg265921 - (view) Author: Ben Spiller (benspiller) * Date: 2016-05-20 09:34
Thanks for considering this, anyway. I'll admit I'm disappointed we couldn't fix this on the 2.7 train, as to me fixing a method that takes an errors='ignore' argument and then throws an exception anyway seems a little more like a bug than a feature (and changing it would likely not affect behaviour in any existing non-broken programs), but if that's the decision then fine. Of course I'm aware (as I mentioned earlier on the thread) that the radically different unicode handling in python 3 solves this entirely and only wish it was practical to move our existing (enormous) codebase and customers over to it, but we're stuck with Python 2.7 - I believe lots of people are in the same situation unfortunately. 

As Josh suggested, perhaps we can at least add something to the doc for the str/unicode encode and decode methods so users are aware of the behaviour without trial and error. I'll update the component of this bug to reflect it's now considered a doc issue. 

Based on the input from Terry, and what seems to be the key information that would have been helpful to me and to those hitting the same issues for the first time, I'd propose the following text (feel free to adjust as you see fit):

For encode:
"For most encodings, the return type is a byte str regardless of whether it is called on a str or unicode object. For example, call encode on a unicode object with "utf-8" to return a byte str object, or call encode on a str object with "base64" to return a base64-encoded str object.

It is _not_ recommended to call this method on "str" objects when using codecs such as utf-8 that convert between str and unicode objects, as any characters not supported by python's default encoding (usually 7-bit ascii) will result in a UnicodeDecodeError exception, even if errors='ignore' was specified. For such conversions the str.decode and unicode.encode methods should be used. If you need to produce an encoded version of a string that could be either a str or unicode object, only call the encode() method after checking it is a unicode object rather than a str object, using isinstance(s, unicode)."

and for decode:
"The return type may be either str or unicode, depending on which encoding is used and whether the method is called on a str or unicode object. For example, call decode on a str object with "utf-8" to return a unicode object, or call decode on a unicode or str object with "base64" to return a base64-decoded str object.

It is _not_ recommended to call this method on "unicode" objects when using codecs such as utf-8 that convert between str and unicode objects, as any characters not supported by python's default encoding (usually 7-bit ascii) will result in a UnicodeEncodeError exception, even if errors='ignore' was specified. For such conversions the str.decode and unicode.encode methods should be used. If you need to produce a decoded version of a string that could be either a str or unicode object, only call the decode() method after checking it is a str object rather than a unicode object, using isinstance(s, str)."
History
Date User Action Args
2016-05-20 09:34:40benspillersetmessages: + msg265921
components: + Documentation, - Interpreter Core
2016-05-20 08:01:11lemburgsetnosy: + lemburg
messages: + msg265913
2016-05-19 18:37:25josh.rsetnosy: + josh.r
messages: + msg265881
2016-05-19 17:54:55serhiy.storchakasetmessages: + msg265879
2016-05-19 17:32:48steven.dapranosetmessages: + msg265878
2016-05-19 16:23:59benspillersetmessages: + msg265873
2016-05-12 12:31:43serhiy.storchakasetmessages: + msg265399
2016-05-12 11:42:06benspillersetmessages: + msg265395
2016-05-12 11:20:00serhiy.storchakasetmessages: + msg265394
2016-05-12 11:02:53benspillersetmessages: + msg265392
2016-05-12 10:25:50serhiy.storchakasetnosy: + serhiy.storchaka
messages: + msg265390
2016-05-12 10:14:07benspillersetmessages: + msg265389
components: + Interpreter Core, - Documentation
title: doc for unicode.decode and str.encode is unnecessarily confusing -> unicode.decode and str.encode are unnecessarily confusing for non-ascii
2016-02-20 00:29:23terry.reedysetnosy: + terry.reedy
messages: + msg260542
2016-02-16 13:32:03ezio.melottisetnosy: + ezio.melotti
2016-02-16 13:01:38steven.dapranosetnosy: + steven.daprano
messages: + msg260361
2016-02-16 11:58:58benspillercreate