classification
Title: base64 does not properly handle unicode strings
Type:
Stage:
Components: Unicode
Versions: Python 2.6

process
Status: closed
Resolution: not a bug
Dependencies:
Superseder:
Assigned To:
Nosy List: mbecker, terry.reedy, vstinner
Priority: normal
Keywords:

Created on 2008-11-15 11:11 by mbecker, last changed 2008-11-26 16:50 by mbecker. This issue is now closed.

Messages (6)
msg75911 - Author: Michael Becker (mbecker) Date: 2008-11-15 11:11
See below. unicode string causes exception. Explicitly converting it to
a regular string addresses the issue. I only noticed this because my
input string changed to unicode after updating python to 2.6 and django
to 1.0.

>>> import base64
>>> a=u'aHR0cDovL3NvdXJjZWZvcmdlLm5ldC90cmFja2VyMi8_ZnVuYz1kZXRhaWwmYWlkPTIyNTg5MzUmZ3JvdXBfaWQ9MTI2OTQmYXRpZD0xMTI2OTQ='
>>> b=base64.urlsafe_b64decode(a)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.6/base64.py", line 112, in urlsafe_b64decode
    return b64decode(s, '-_')
  File "/usr/local/lib/python2.6/base64.py", line 71, in b64decode
    s = _translate(s, {altchars[0]: '+', altchars[1]: '/'})
  File "/usr/local/lib/python2.6/base64.py", line 36, in _translate
    return s.translate(''.join(translation))
TypeError: character mapping must return integer, None or unicode
>>> b=base64.urlsafe_b64decode(str(a))
>>> b
'http://sourceforge.net/tracker2/?func=detail&aid=2258935&group_id=12694&atid=112694'
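
For readers on Python 3, where str is the text type, the workaround above can be made explicit. The following is a minimal sketch, not part of the original report: decode_urlsafe is a hypothetical helper name, and the token is rebuilt from the URL shown in the report rather than copied from it.

```python
import base64

def decode_urlsafe(token):
    """Decode URL-safe Base64 given as text or bytes (hypothetical helper)."""
    if isinstance(token, str):
        # The Base64 alphabet is pure ASCII, so encode strictly and let
        # UnicodeEncodeError surface if the caller passed anything else.
        token = token.encode("ascii")
    return base64.urlsafe_b64decode(token)

# The URL from the report, round-tripped through URL-safe Base64.
url = b"http://sourceforge.net/tracker2/?func=detail&aid=2258935&group_id=12694&atid=112694"
token = base64.urlsafe_b64encode(url).decode("ascii")  # text, as a web framework might pass it

print(decode_urlsafe(token) == url)
```

Encoding with "ascii" rather than a wider charset is a deliberate choice in this sketch: a token containing non-ASCII characters cannot be valid Base64, so failing early seems preferable to decoding garbage.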
msg76218 - Author: STINNER Victor (vstinner) * (Python committer) Date: 2008-11-21 23:07
It's not a bug. base64 is a codec that encodes *bytes*, not characters.
You have to encode your unicode string to bytes using a charset.
Example (utf-8):
>>> from base64 import b64encode, b64decode
>>> b64encode(u'a\xe9'.encode("utf-8"))
'YcOp'
>>> unicode(b64decode('YcOp'), "utf-8")
u'a\xe9'
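
The same round trip can be sketched for Python 3, where unicode() no longer exists and text is converted with str.encode and bytes.decode. This is an illustrative translation of the session above, not something posted in the thread; the variable names are arbitrary.

```python
from base64 import b64encode, b64decode

text = "a\xe9"                       # two characters: 'a' and e-acute
raw = text.encode("utf-8")           # characters -> bytes (b'a\xc3\xa9')
encoded = b64encode(raw)             # bytes -> Base64 bytes
decoded = b64decode(encoded).decode("utf-8")  # Base64 -> bytes -> characters

print(encoded)          # b'YcOp'
print(decoded == text)  # True
```

The charset used on the way in has to be the one used on the way out; base64 itself never sees characters, only the bytes produced by the chosen encoding.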
msg76223 - Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2008-11-21 23:30
"This module provides data encoding and decoding as specified in RFC
3548. This standard defines the Base16, Base32, and Base64 algorithms
for encoding and decoding arbitrary binary strings into text strings
that can be safely sent by email, used as parts of URLs, or included as
part of an HTTP POST request. "

In other words, arbitrary 8-bit byte strings <=> 'safe' byte strings.
You have to encode unicode to bytes first, as you did.  str() works
because you only have ascii chars and str uses the ascii encoder by
default.  The bytes() constructor has no default and 'ascii' must be
supplied.
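
The point about the ascii encoder and the bytes() constructor can be sketched as follows. This is written for Python 3, so it only approximates the 2.x behaviour of str(); the two helper functions are hypothetical names introduced for the illustration.

```python
def to_ascii_bytes(text):
    """Return text as ASCII bytes, or None if it holds non-ASCII characters."""
    try:
        return text.encode("ascii")
    except UnicodeEncodeError:
        return None

def bytes_requires_encoding(text):
    """Report whether bytes(text) is rejected when no encoding is given."""
    try:
        bytes(text)
    except TypeError:
        return True
    return False

print(to_ascii_bytes("aHR0cDovLw=="))       # ASCII-only text converts cleanly
print(to_ascii_bytes("a\xe9"))              # None: e-acute is outside ASCII
print(bytes_requires_encoding("abc"))       # True: no default encoding
print(bytes("abc", "ascii"))                # the encoding must be spelled out
```

A Base64 string converts cleanly because its alphabet is a subset of ASCII; the conversion only becomes a problem for the original, unencoded text.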

The error message is correct, even if it reads backwards: unicode.translate
requires a unicode mapping, whereas b64decode supplies a bytes mapping
because it expects bytes.
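
The two kinds of translation table can be compared side by side. The sketch below uses Python 3 spellings (bytes.maketrans and str.maketrans) and is only an illustration of the distinction described above; it does not reproduce the 2.6 internals of base64._translate.

```python
# Byte-oriented table: a 256-byte lookup string, one entry per byte value.
byte_table = bytes.maketrans(b"-_", b"+/")
print(len(byte_table))                      # 256
print(b"ab-_".translate(byte_table))        # b'ab+/'

# Text-oriented table: a mapping from code point to code point.
text_table = str.maketrans("-_", "+/")
print(text_table)                           # {45: 43, 95: 47}
print("ab-_".translate(text_table))         # 'ab+/'
```

In the traceback from the report, a byte-oriented table built for 8-bit strings was handed to the translate method of a unicode object, which looks entries up by code point and expects an integer, None or unicode back, hence the wording of the TypeError.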

3.0 added an earlier type check, so the same code gives
TypeError: expected bytes, not str
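
For comparison, here is a sketch of how later Python 3 releases behave, as far as I am aware (3.3 and newer; this is not what 3.0 did, and the exact error messages differ between versions). try_call is a hypothetical helper that returns either the result or the class of the exception raised.

```python
import base64

def try_call(func, arg):
    """Return func(arg), or the class of the TypeError/ValueError it raised."""
    try:
        return func(arg)
    except (TypeError, ValueError) as exc:
        return type(exc)

encode_text = try_call(base64.b64encode, "abc")     # text in: rejected
encode_bytes = try_call(base64.b64encode, b"abc")   # bytes in: accepted
decode_ascii = try_call(base64.b64decode, "YWJj")   # ASCII-only text: accepted
decode_other = try_call(base64.b64decode, "\xe9\xe9\xe9\xe9")  # non-ASCII text: rejected

print(encode_text, encode_bytes, decode_ascii, decode_other)
```

Encoding still insists on bytes, while decoding tolerates ASCII-only text because Base64 output is ASCII by construction.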

I believe there was an explicit decision to leave low-level
wire-protocol byte functions as bytes/bytearray only.

The 3.0 manual needs updating in this respect, but I will start another
issue for that.
msg76441 - Author: Michael Becker (mbecker) Date: 2008-11-25 23:31
Terry,
Thanks for your response. My main concern was that the behavior changed
when updating from 2.5 to 2.6. The new behavior was not intuitive. Also
2.6, I thought, was supposed to be backward compatible.  Based on this
issue, I would assume this statement is not true when strings are passed
to any method that converts them to bytes. Maybe this was documented in
the 2.6 documentation somewhere and I simply missed it. Should I have
run the 2to3 converter on my 2.5 code prior to updating to 2.6? Please
let me know the new issue number so I can track the progress.
Thanks!
msg76469 - Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2008-11-26 15:20
2.6 is, as far as I know, intended to be backwards compatible except for
where it fixes bugs.  Upgrading to 2.6 does (should) not change strings
(type str) to unicode.  Only importing the appropriate __future__ or
upgrading to 3.0 will do that.  I have no idea what Django does.

The 3 lines of code you posted give exactly the same traceback in my
copy of 2.5 as the one you posted.
msg76472 - Author: Michael Becker (mbecker) Date: 2008-11-26 16:50
Terry,
I had a feeling Django had something to do with this. I'll have a closer
look there. For reference, in my django code, I did not explicitly
declare the string as a unicode string. Django must be importing
unicode_literals from __future__ as you suggested.

Just out of curiosity, would the 2to3 tool have resolved this issue come
3.0? Would it have changed the type to bytes? Or would this issue need
to be caught in unit tests?

Thanks!
History
Date                 User         Action  Args
2008-11-26 16:50:23  mbecker      set     messages: + msg76472
2008-11-26 15:20:58  terry.reedy  set     messages: + msg76469
2008-11-25 23:31:40  mbecker      set     messages: + msg76441
2008-11-21 23:31:27  terry.reedy  set     resolution: fixed -> not a bug
2008-11-21 23:30:46  terry.reedy  set     nosy: + terry.reedy; messages: + msg76223
2008-11-21 23:07:03  vstinner     set     status: open -> closed; resolution: fixed; messages: + msg76218; nosy: + vstinner
2008-11-15 11:11:24  mbecker      create