Message 265389 - Python tracker

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

Author	benspiller
Recipients	benspiller, docs@python, ezio.melotti, steven.daprano, terry.reedy
Date	2016-05-12.10:14:07
SpamBayes Score	-1.0
Marked as misclassified	Yes
Message-id	<1463048047.49.0.297094367952.issue26369@psf.upfronthosting.co.za>
In-reply-to

Content
Thanks that's really helpful Having thought about it some more, I think if possible it'd be really so much better to actually 'fix' the behaviour for the unicode<->str standard codecs (i.e. not base64) rather than just documenting around it. The current behaviour is not only confusing but leads to bugs that are very easy to miss since the methods work correctly when given 7-bit ascii characters. I had a poke around in the python source but couldn't quite identify where it's happening - presumably there is somewhere in the str.encode('utf-8') implementation that first "decodes" the string and does so using the ascii codec. If it could be made to use the same encoding that was passed in (e.g. utf8) then this would end up being a no-op and there would be no unpleasant bugs that only appear when the input includes non-ascii characters. It would also allow X.encode('utf-8') to be called successfully whether X is already a str or is a unicode object, which would save callers having to explicitly check what kind of string they've been passed. Is anyone able to look into the code to see where this would need to be fixed and how difficult it would be to do? I have a feeling that once the line is located it might be quite a straightforward fix Many thanks

Thanks that's really helpful

Having thought about it some more, I think if possible it'd be really so much better to actually 'fix' the behaviour for the unicode<->str standard codecs (i.e. not base64) rather than just documenting around it. The current behaviour is not only confusing but leads to bugs that are very easy to miss since the methods work correctly when given 7-bit ascii characters. 

I had a poke around in the python source but couldn't quite identify where it's happening - presumably there is somewhere in the str.encode('utf-8') implementation that first "decodes" the string and does so using the ascii codec. If it could be made to use the same encoding that was passed in (e.g. utf8) then this would end up being a no-op and there would be no unpleasant bugs that only appear when the input includes non-ascii characters. 

It would also allow X.encode('utf-8') to be called successfully whether X is already a str or is a unicode object, which would save callers having to explicitly check what kind of string they've been passed. 

Is anyone able to look into the code to see where this would need to be fixed and how difficult it would be to do? I have a feeling that once the line is located it might be quite a straightforward fix

Many thanks

History
Date	User	Action	Args
2016-05-12 10:14:07	benspiller	set	recipients: + benspiller, terry.reedy, ezio.melotti, steven.daprano, docs@python
2016-05-12 10:14:07	benspiller	set	messageid: <1463048047.49.0.297094367952.issue26369@psf.upfronthosting.co.za>
2016-05-12 10:14:07	benspiller	link	issue26369 messages
2016-05-12 10:14:07	benspiller	create