This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

Author benspiller
Recipients benspiller, docs@python, ezio.melotti, serhiy.storchaka, steven.daprano, terry.reedy
Date 2016-05-19.16:23:58
Message-id <1463675039.37.0.0219371069809.issue26369@psf.upfronthosting.co.za>
In-reply-to
Content
Btw, if anyone can find the place in the code (sorry, I tried and failed!) where str.encode('utf-8', errors=X) results in an implicit call to the equivalent of decode(defaultencoding, errors='strict') (as suggested by the exception message), I think it'll be easier to discuss the details of fixing it.

Thanks for your reply - yes, I'm aware that in theory you _could_ globally change Python's default encoding from ascii, but the prevailing view I've heard from Python developers is that changing it is not a good idea and may cause lots of library code to break. It's also probably not a good idea for individual libraries or modules to change global state that affects the entire Python invocation, and it would be nice to find a less fragile, more out-of-the-box solution. You may well be using different encodings (not just utf-8) in different parts of your program - so changing the globally-defined default encoding doesn't seem right, especially for a method like str.encode that already takes an 'encoding' argument (currently used only for the encoding step, not the implicit decoding step).

I do think there's a strong case for changing the str.encode (and also unicode.decode) behaviour so that str.encode('utf-8') behaves the same whether it's given ascii or non-ascii characters, and also similarly to unicode.encode('utf-8'). Let me try to persuade you... :)

First, to address the point you made:

> If str.encode() raises a decoding exception, this is a programming bug. It would be bad to hide it.

I totally agree with the general principle of not hiding programming bugs. However, if calling str.encode for codecs like utf-8 (let's ignore base64 for now, which is a very different beast) were *consistently* treated as a 'programming bug' by Python and always resulted in an exception, that would be OK (suboptimal usability IMHO, but still OK), since programmers would quickly spot the problem and fix it. But that's not what happens - it *silently works* (as a no-op) as long as you happen to be using ASCII characters, so this so-called 'programming bug' will go unnoticed by most programmers (and by authors of third-party library code you might be relying on!)... but the moment a non-ascii character gets introduced, suddenly you'll get an exception, maybe in some library code you rely on but can't fix. For this reason I don't think treating this as a programming bug helps anyone write more robust Python code - quite the reverse. Plus, I think the no-op behaviour is almost always 'what you would have wanted it to do' anyway, whereas throwing an exception almost never is.
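To make the mechanism concrete, here is a sketch of what Python 2's str.encode effectively does, modelled in Python 3 terms (where bytes/str are explicit). The function name py2_str_encode is mine, purely illustrative; the key point it demonstrates is that the implicit decode always uses the default encoding with errors='strict', regardless of the arguments passed:

```python
def py2_str_encode(raw, encoding, errors='strict'):
    """Py3 model of Py2's str.encode: the bytes are first decoded with the
    *default* encoding ('ascii' in a stock Py2), and neither 'encoding' nor
    'errors' is passed to that implicit decode step."""
    text = raw.decode('ascii')  # implicit decode; always effectively strict
    return text.encode(encoding, errors)

# ASCII-only input "silently works" as a round-trip:
py2_str_encode(b'hello', 'utf-8')  # -> b'hello'

# Non-ASCII input raises UnicodeDecodeError from the implicit decode,
# even though the caller asked for errors='ignore':
try:
    py2_str_encode(b'caf\xe9', 'utf-8', errors='ignore')
except UnicodeDecodeError:
    pass  # this is the confusing failure mode described above
```

This is exactly the asymmetry being argued about: the same call is a harmless no-op for ASCII input and an exception for anything else.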

I think we'd agree that changing str.encode(utf8) to throw an exception in *all* cases wouldn't be a realistic option, since it would certainly break backwards compatibility in painful ways for many existing apps and library code.

So, if we want to make the behaviour of this important built-in type a bit more consistent and less error-prone/fragile for this case then I think the only option is making str.encode be a no-op for non-ascii characters (at least, non-ascii characters that are valid in the specified encoding), just as it is for ascii characters. 

Here's why I think ditching the current behaviour would be a good idea:
- calling str.encode() and getting a DecodeError is confusing ("I asked you to encode this string, what are you decoding for?")
- calling str.encode('utf-8') and getting an exception about "ascii" is confusing as the only encoding I mentioned in the method call was utf-8
- calling encode(..., errors='ignore') and getting an exception is confusing and feels like a bug; I've explicitly specified that I do NOT want exceptions from calling this method, yet (because neither the 'errors' nor the 'encoding' argument gets passed to the implicit - and undocumented - decode operation) I get unexpected behaviour that is far more likely to break my program than a no-op
- the somewhat surprising behaviour we're talking about is not explicitly documented anywhere
- having str.encode throw on non-ascii but not ascii makes it very likely that code will be written and shipped (including library code you may have no control over) that *appears* to work under normal testing but has *hidden* bugs that surface only once non-ascii characters are used. 
- in every situation I can think of, having str.encode(encoding, errors=ignore) honour the encoding and errors arguments even for the implicit-decode operation is more useful than having it ignore those arguments and throw an exception
- a quick Google search shows lots of people in the Python community (from newbies to experts) are seeing this exception and being confused by it, so a lot of people's lives might be improved if we can somehow make the situation better :)
- even with the best of intentions (and with code written by senior Python programmers who understand unicode issues well), it's very difficult in practice to write non-trivial Python programs that consistently use the 'unicode' string type throughout (especially when legacy code or third-party libraries are involved), so most 'real' code has to cope with a mix of str and unicode types. So when you need to write a 'basestring' out to a file, you'd like to be able to simply call the s.encode(myencoding, errors=whatever) method that exists on both str and unicode types and have it 'work' whether s is a str already in that encoding or a unicode object that needs to be converted. This is a common use case and the behaviour I'm suggesting would really help with it. The alternative is that every Python programmer who cares about non-ascii characters has to write an unpleasant and un-pythonic if statement that switches behaviour based on the type, in every place they need a byte str:
	if isinstance(s, unicode):
		f.write(s.encode('utf-8', 'ignore'))  # nb: Py2 encode() takes no keyword arguments
	else:
		f.write(s)
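The branching above can of course be wrapped in a small helper, which is essentially the behaviour I'm proposing str.encode itself should have. Here is a sketch in Python 3 terms (bytes/str standing in for Py2's str/unicode); the name to_bytes is hypothetical, not an existing API:

```python
def to_bytes(s, encoding='utf-8', errors='strict'):
    """Return s as bytes: already-encoded input passes through unchanged
    (assumed to already be in the target encoding), text is encoded.
    This is the 'no-op for already-encoded input' behaviour argued for above."""
    if isinstance(s, bytes):
        return s
    return s.encode(encoding, errors)

# Works the same for both string types:
to_bytes(u'caf\u00e9')         # -> b'caf\xc3\xa9'
to_bytes(b'caf\xc3\xa9')       # -> b'caf\xc3\xa9' (no-op)
```

With a helper like this, every call site can treat the two string types uniformly instead of repeating the isinstance check.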

nb: Although I've used the example of str.encode above, unicode.decode has the exact same issues (and potential solution), and of course this isn't specific to utf-8 but to all codecs that convert between str and unicode (i.e. most of them except base64). 
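For completeness, the unicode.decode mirror case can be modelled the same way (again in Python 3 terms; the function name py2_unicode_decode is mine, purely illustrative): an implicit strict encode with the default encoding happens first, ignoring both arguments, so non-ascii text raises UnicodeEncodeError from a call to decode():

```python
def py2_unicode_decode(text, encoding, errors='strict'):
    """Py3 model of Py2's unicode.decode: an implicit encode with the
    *default* encoding runs first, ignoring 'encoding' and 'errors'."""
    raw = text.encode('ascii')  # implicit encode; always effectively strict
    return raw.decode(encoding, errors)

py2_unicode_decode(u'hello', 'utf-8')  # -> 'hello' (silently works)

try:
    py2_unicode_decode(u'caf\u00e9', 'utf-8', errors='ignore')
except UnicodeEncodeError:
    pass  # an *Encode*Error from decode() - the mirror of the str.encode case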

I hope you'll consider this proposal - it's probably not a very big change, is very unlikely to break any existing/working code, and has the potential to help reduce fragility and difficult-to-resolve bugs in an area of Python that seems to cause pain and confusion to lots of people.

Thanks for considering!