classification
Title: binascii module, inconsistent behavior: some functions accept unicode string input
Type: behavior Stage: resolved
Components: Library (Lib) Versions: Python 3.2
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: flox Nosy List: BreamoreBoy, amaury.forgeotdarc, barry, beazley, eric.araujo, flox, lemburg, pitrou, vstinner
Priority: normal Keywords: patch

Created on 2008-12-29 18:00 by beazley, last changed 2010-07-27 21:28 by flox. This issue is now closed.

Files
File name Uploaded Description Edit
case_binascii.py flox, 2009-12-14 19:03 Test case and output on Python 3.2
issue4770_binascii_py3k_v3.diff flox, 2010-01-16 18:27 Patch with documentation, apply to py3k
Messages (17)
msg78470 - (view) Author: David M. Beazley (beazley) Date: 2008-12-29 18:00
See Issue 4869 for a related bug.

Most of the functions in binascii are meant to go from binary data to 
textual representations (hex digits, base64, binhex, etc.).  There are 
numerous problems:

1.  Misleading error messages.  For example:

>>> binascii.b2a_base64("Some text")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: b2a_base64() argument 1 must be string or buffer, not str
>>> binascii.crc32("Some text")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: crc32() argument 1 must be string or buffer, not str
>>>

Huh?  Didn't I just pass a string?   Error message should say "argument 
1 must be bytes or buffer, not str".

This problem shows up with most of the other encoding functions too 
(i.e., b2a_uu).

2. Expected behavior with encoding/decoding.

The functions in this module are going from binary to text. To be 
consistent, I think the result of encoding operations such as b2a_uu(),
b2a_base64(), should be strings, not bytes.

Similarly, decoding operations are going from text back to bytes.  I 
think the input arguments should be allowed to be str (in addition to 
bytes if you want).
msg78473 - (view) Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) Date: 2008-12-29 18:36
Item 1 was most probably fixed recently with r67929.
Concerning item 2, I think it was decided that binascii is a bytes-only
module.

I suggest to close this issue as "out of date".
msg78475 - (view) Author: David M. Beazley (beazley) Date: 2008-12-29 18:54
Given the low-level nature of this module, I can understand the 
motivation to make it all bytes.

However, I'm going to respectfully disagree with that and claim that 
making binascii all bytes really goes against the whole spirit of what 
Python 3.0 has tried to do for Unicode.  For example, throughout Python, 
you now have a clean separation between binary data (bytes) and text 
data (str).   Well, it's cleanly separated everywhere except in the 
binascii module (and base64 module) which, ironically, is all about 
converting between binary data and text.

As it stands now, it's a huge wart IMHO.
msg78581 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2008-12-31 00:23
See also issue #4769 (want base64.b64decode(str)).
msg96394 - (view) Author: Florent Xicluna (flox) * (Python committer) Date: 2009-12-14 19:03
I agree, we need consistency between all functions of this package.

I've run a small script to check what happens for all 16 functions of
the binascii package when they receive unicode input...

See attached script (and sample output).

IMHO 4 functions should be fixed to raise the same TypeError:
 - a2b_hex
 - a2b_qp
 - unhexlify
 - rledecode_hqx 

Documentation says that the functions in `binary` module « convert
between binary and various ASCII-encoded binary representations »
So... it's all binary.

Implicit encoding should not happen, never.
msg96399 - (view) Author: Florent Xicluna (flox) * (Python committer) Date: 2009-12-14 20:23
This patch removes implicit encoding in binascii functions:
 - a2b_hex (=unhexlify)
 - a2b_qp
 - rledecode_hqx 

 * Tests module "test_binascii" is reviewed and simplified.
 * Fixes for "email", "pickle" and "quopri" modules to encode input.

All tests pass.
msg96401 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2009-12-14 21:05
flox wrote:
> 
> flox <laxyf@yahoo.fr> added the comment:
> 
> This patch removes implicit encoding in binascii functions:
>  - a2b_hex (=unhexlify)
>  - a2b_qp
>  - rledecode_hqx 
> 
>  * Tests module "test_binascii" is reviewed and simplified.
>  * Fixes for "email", "pickle" and "quopri" modules to encode input.
> 
> All tests pass.

Are you sure that this patch is correct (which RFC says
that quoted printable should use our raw-unicode-escape
codec ?):

Index: Lib/email/message.py
===================================================================
--- Lib/email/message.py	(revision 76839)
+++ Lib/email/message.py	(working copy)
@@ -198,6 +198,8 @@
             return None
         cte = self.get('content-transfer-encoding', '').lower()
         if cte == 'quoted-printable':
+            if isinstance(payload, str):
+                payload = payload.encode('raw-unicode-escape')
             return utils._qdecode(payload)
         elif cte == 'base64':
             try:

The patch also needs to fix the documentation and add a NEWS
entry.

Making these APIs strict in the sense that they don't accept
str-instances anymore needs to be mentioned very clearly.

We may even have to go through a deprecation process for them,
since they can easily cause perfectly working code to suddenly
fail.
msg96402 - (view) Author: Florent Xicluna (flox) * (Python committer) Date: 2009-12-14 21:25
> Are you sure that this patch is correct (which RFC says
> that quoted printable should use our raw-unicode-escape
> codec ?):

I am not sure of anything. It is an "educated guess" at the most.
Since 'base64' and 'x-uuencode' both use 'raw-unicode-escape'...

See longer excerpt below.


Index: Lib/email/message.py
===================================================================
--- Lib/email/message.py        (revision 76839)
+++ Lib/email/message.py        (working copy)
@@ -189,24 +189,26 @@
         elif not isinstance(self._payload, list):
             raise TypeError('Expected list, got %s' % type(self._payload))
         else:
             payload = self._payload[i]
         if not decode:
             return payload
         # Decoded payloads always return bytes.  XXX split this part
out into
         # a new method called .get_decoded_payload().
         if self.is_multipart():
             return None
         cte = self.get('content-transfer-encoding', '').lower()
         if cte == 'quoted-printable':
+            if isinstance(payload, str):
+                payload = payload.encode('raw-unicode-escape')
             return utils._qdecode(payload)
         elif cte == 'base64':
             try:
                 if isinstance(payload, str):
                     payload = payload.encode('raw-unicode-escape')
                 return base64.b64decode(payload)
                 #return utils._bdecode(payload)
             except binascii.Error:
                 # Incorrect padding
                 pass
         elif cte in ('x-uuencode', 'uuencode', 'uue', 'x-uue'):
             in_file = BytesIO(payload.encode('raw-unicode-escape'))
msg96403 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2009-12-14 21:40
flox wrote:
> 
> flox <laxyf@yahoo.fr> added the comment:
> 
>> Are you sure that this patch is correct (which RFC says
>> that quoted printable should use our raw-unicode-escape
>> codec ?):
> 
> I am not sure of anything. It is an "educated guess" at the most.
> Since 'base64' and 'x-uuencode' both use 'raw-unicode-escape'...

Quoted printable as well as the other two transfer encodings should
be encodings that "fit" into the 7-bit ASCII default originally
assumed for email messages, so 'ascii' appears to be the more
natural choice.

The choice of 'raw-unicode-escape' will cause strange error
messages or hide errors, since it encodes non-ASCII code points
using '\xNN' which these codecs don't supports:

b''
>>> base64.b64decode('äöü'.encode('ascii'))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-2: ordinal not in range(128)
msg96404 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009-12-14 21:45
I agree with Marc-André that enforcing ASCII looks like the only
sensible solution. Perhaps Barry knows the purpose of using
"raw-unicode-escape" here, though?
msg96407 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2009-12-14 21:55
The email interface ate part of the code snippet. Here's the complete
version:

>>> import base64
>>> base64.b64decode('äöü'.encode('raw-unicode-escape'))
b''
>>> base64.b64decode('äöü'.encode('ascii'))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position
0-2: ordinal not in range(128)
msg96414 - (view) Author: Florent Xicluna (flox) * (Python committer) Date: 2009-12-14 22:13
I perform a "grep" on the email package (with patch applied).
There's some places where 'raw-unicode-escape' is used.

I understand that all "payload.encode('raw-unicode-escape')" should be
changed to "payload.encode('ascii')" when the payload is processed by
another transfer encoding (line 202, 207 and 214).
The last one (when 'content-transfer-encoding' is not recognized) should
remain unchanged?

All other uses of 'raw-unicode-escape' seem correct.



Lib/email/base64mime.py:112:
        return a2b_base64(string.encode('raw-unicode-escape'))
Lib/email/header.py:111:
        word = bytes(word, 'raw-unicode-escape')
Lib/email/message.py:202:
        payload = payload.encode('raw-unicode-escape')
Lib/email/message.py:207:
        payload = payload.encode('raw-unicode-escape')
Lib/email/message.py:214:
        in_file = BytesIO(payload.encode('raw-unicode-escape'))
Lib/email/message.py:225:
        return payload.encode('raw-unicode-escape')
Lib/email/message.py:767:
        as_bytes = charset[2].encode('raw-unicode-escape')
Lib/email/utils.py:298:
        rawbytes = bytes(text, 'raw-unicode-escape')
msg96418 - (view) Author: Florent Xicluna (flox) * (Python committer) Date: 2009-12-14 23:51
Patch updated:
 * Added documentation
 * Switched to "payload.encode('ascii')" in the "email.message" module

Note: there was no ambiguity on the first line of the documentation:
« [These functions] convert between binary and various ASCII-encoded
binary representations »

So, it should only convert bytes in range(256) to bytes in range(128).
msg97893 - (view) Author: Florent Xicluna (flox) * (Python committer) Date: 2010-01-16 18:27
Patch updated, after 7703 is merged in py3k.
msg98381 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2010-01-26 21:11
Please also add a NEWS entry which mentions the changes to those APIs and the change in the email package.

Thanks.
msg110836 - (view) Author: Mark Lawrence (BreamoreBoy) * Date: 2010-07-19 23:16
Florent, could you add a NEWS entry as requested in msg98381, then we should be able to get this committed.
msg111733 - (view) Author: Florent Xicluna (flox) * (Python committer) Date: 2010-07-27 21:28
Fixed with r83182.
History
Date User Action Args
2010-07-27 21:28:28floxsetstatus: open -> closed
versions: - Python 3.1
messages: + msg111733

assignee: flox
resolution: fixed
stage: patch review -> resolved
2010-07-19 23:16:27BreamoreBoysetnosy: + BreamoreBoy
messages: + msg110836
2010-02-17 16:27:49eric.araujosetnosy: + eric.araujo
2010-01-26 21:11:20lemburgsetmessages: + msg98381
2010-01-16 18:27:49floxsetfiles: + issue4770_binascii_py3k_v3.diff

messages: + msg97893
2010-01-16 18:25:56floxsetfiles: - issue4770_binascii_py3k_v2.diff
2009-12-30 08:32:31floxsettitle: binascii module, crazy error messages, unexpected behavior -> binascii module, inconsistent behavior: some functions accept unicode string input
stage: patch review
2009-12-30 08:21:24floxsetfiles: - issue4770_binascii_py3k.diff
2009-12-14 23:54:16floxsetfiles: + issue4770_binascii_py3k_v2.diff
2009-12-14 23:53:59floxsetfiles: - issue4770_binascii_py3k_v2.diff
2009-12-14 23:51:50floxsetfiles: + issue4770_binascii_py3k_v2.diff

messages: + msg96418
2009-12-14 22:13:17floxsetmessages: + msg96414
2009-12-14 21:55:11lemburgsetmessages: + msg96407
2009-12-14 21:45:55pitrousetnosy: + barry, pitrou
messages: + msg96404
2009-12-14 21:40:43lemburgsetmessages: + msg96403
2009-12-14 21:25:24floxsetmessages: + msg96402
2009-12-14 21:05:12lemburgsetnosy: + lemburg
messages: + msg96401
2009-12-14 20:23:40floxsetfiles: + issue4770_binascii_py3k.diff
keywords: + patch
messages: + msg96399
2009-12-14 19:03:29floxsetfiles: + case_binascii.py
versions: + Python 3.1, Python 3.2, - Python 3.0
nosy: + flox

messages: + msg96394
2008-12-31 00:23:56vstinnersetnosy: + vstinner
messages: + msg78581
2008-12-29 18:54:18beazleysetmessages: + msg78475
2008-12-29 18:36:36amaury.forgeotdarcsetnosy: + amaury.forgeotdarc
messages: + msg78473
2008-12-29 18:00:35beazleycreate