
Author benspiller
Recipients benspiller, docs@python, ezio.melotti, steven.daprano, terry.reedy
Date 2016-05-12.10:14:07
Message-id <1463048047.49.0.297094367952.issue26369@psf.upfronthosting.co.za>
In-reply-to
Content
Thanks, that's really helpful.

Having thought about it some more, I think it would be much better, if possible, to actually 'fix' the behaviour of the standard unicode<->str codecs (i.e. not base64) rather than just documenting around it. The current behaviour is not only confusing but also leads to bugs that are very easy to miss, since the methods work correctly when given 7-bit ASCII characters.

I had a poke around in the Python source but couldn't quite identify where it's happening. Presumably somewhere in the str.encode('utf-8') implementation the string is first "decoded", and that decode uses the ascii codec. If it could be made to use the same encoding that was passed in (e.g. utf-8), then this would end up being a no-op, and there would be no unpleasant bugs that only appear when the input includes non-ASCII characters.
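To make the suspected behaviour concrete, here is a rough model of it rather than the actual CPython code. The function names py2_str_encode and proposed_str_encode are made up for illustration, and Python 2's str is represented as bytes so the sketch also runs on Python 3:

```python
def py2_str_encode(data, encoding):
    # Assumed model of Python 2's str.encode(): the byte string is first
    # implicitly decoded with the default (ascii) codec, then encoded.
    return data.decode("ascii").encode(encoding)


def proposed_str_encode(data, encoding):
    # Assumed model of the suggestion: decode with the same codec that was
    # passed in, so valid input round-trips unchanged.
    return data.decode(encoding).encode(encoding)


ascii_bytes = b"hello"
utf8_bytes = b"caf\xc3\xa9"  # UTF-8 bytes for "cafe" with an acute accent

# 7-bit ASCII input works in both models, which is why the bug is easy to miss.
ascii_ok = py2_str_encode(ascii_bytes, "utf-8") == ascii_bytes

# Non-ASCII input fails in the current model...
try:
    py2_str_encode(utf8_bytes, "utf-8")
    current_fails = False
except UnicodeDecodeError:
    current_fails = True

# ...but would be a no-op under the proposed behaviour.
proposed_ok = proposed_str_encode(utf8_bytes, "utf-8") == utf8_bytes
```

This only illustrates the symptom; where the implicit decode actually lives in the source is still the open question.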

It would also allow X.encode('utf-8') to be called successfully whether X is already a str or is a unicode object, which would save callers having to explicitly check what kind of string they've been passed. 
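For reference, this is the kind of check callers currently end up writing. It is only a sketch, and to_utf8 is a hypothetical helper name, not something in the standard library:

```python
def to_utf8(value):
    # Hypothetical helper: return UTF-8 bytes for either kind of string.
    if isinstance(value, bytes):
        # Already a byte string (Python 2 'str'); assumed to be UTF-8 and
        # returned unchanged, since calling .encode() on it risks the
        # implicit ascii decode.
        return value
    # A unicode object: encoding it explicitly is safe.
    return value.encode("utf-8")


from_bytes = to_utf8(b"caf\xc3\xa9")
from_text = to_utf8(u"caf\u00e9")
```

With the change suggested above, both calls could simply be X.encode('utf-8') with no type check.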

Is anyone able to look into the code to see where this would need to be fixed and how difficult it would be to do? I have a feeling that once the line is located, it might be quite a straightforward fix.

Many thanks