Issue 4770: binascii module, inconsistent behavior: some functions accept unicode string input

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/49020

classification

Title:	binascii module, inconsistent behavior: some functions accept unicode string input
Type:	behavior	Stage:	resolved
Components:	Library (Lib)	Versions:	Python 3.2

process

Status:	closed	Resolution:	fixed
Dependencies:		Superseder:
Assigned To:	flox	Nosy List:	BreamoreBoy, amaury.forgeotdarc, barry, beazley, eric.araujo, flox, lemburg, pitrou, vstinner
Priority:	normal	Keywords:	patch

Created on 2008-12-29 18:00 by beazley, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Files
File name	Uploaded	Description	Edit
case_binascii.py	flox, 2009-12-14 19:03	Test case and output on Python 3.2
issue4770_binascii_py3k_v3.diff	flox, 2010-01-16 18:27	Patch with documentation, apply to py3k

Messages (17)
msg78470 - (view)	Author: David M. Beazley (beazley)	Date: 2008-12-29 18:00
See Issue 4869 for a related bug. Most of the functions in binascii are meant to go from binary data to textual representations (hex digits, base64, binhex, etc.). There are numerous problems: 1. Misleading error messages. For example: >>> binascii.b2a_base64("Some text") Traceback (most recent call last): File "<stdin>", line 1, in <module> TypeError: b2a_base64() argument 1 must be string or buffer, not str >>> binascii.crc32("Some text") Traceback (most recent call last): File "<stdin>", line 1, in <module> TypeError: crc32() argument 1 must be string or buffer, not str >>> Huh? Didn't I just pass a string? Error message should say "argument 1 must be bytes or buffer, not str". This problem shows up with most of the other encoding functions too (i.e., b2a_uu). 2. Expected behavior with encoding/decoding. The functions in this module are going from binary to text. To be consistent, I think the result of encoding operations such as b2a_uu(), b2a_base64(), should be strings, not bytes. Similarly, decoding operations are going from text back to bytes. I think the input arguments should be allowed to be str (in addition to bytes if you want).
msg78473 - (view)	Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) *	Date: 2008-12-29 18:36
Item 1 was most probably fixed recently with r67929. Concerning item 2, I think it was decided that binascii is a bytes-only module. I suggest to close this issue as "out of date".
msg78475 - (view)	Author: David M. Beazley (beazley)	Date: 2008-12-29 18:54
Given the low-level nature of this module, I can understand the motivation to make it all bytes. However, I'm going to respectfully disagree with that and claim that making binascii all bytes really goes against the whole spirit of what Python 3.0 has tried to do for Unicode. For example, throughout Python, you now have a clean separation between binary data (bytes) and text data (str). Well, it's cleanly separated everywhere except in the binascii module (and base64 module) which, ironically, is all about converting between binary data and text. As it stands now, it's a huge wart IMHO.
msg78581 - (view)	Author: STINNER Victor (vstinner) *	Date: 2008-12-31 00:23
See also issue #4769 (want base64.b64decode(str)).
msg96394 - (view)	Author: Florent Xicluna (flox) *	Date: 2009-12-14 19:03
I agree, we need consistency between all functions of this package. I've run a small script to check what happens for all 16 functions of the binascii package when they receive unicode input... See attached script (and sample output). IMHO 4 functions should be fixed to raise the same TypeError: - a2b_hex - a2b_qp - unhexlify - rledecode_hqx Documentation says that the functions in `binary` module « convert between binary and various ASCII-encoded binary representations » So... it's all binary. Implicit encoding should not happen, never.
msg96399 - (view)	Author: Florent Xicluna (flox) *	Date: 2009-12-14 20:23
This patch removes implicit encoding in binascii functions: - a2b_hex (=unhexlify) - a2b_qp - rledecode_hqx * Tests module "test_binascii" is reviewed and simplified. * Fixes for "email", "pickle" and "quopri" modules to encode input. All tests pass.
msg96401 - (view)	Author: Marc-Andre Lemburg (lemburg) *	Date: 2009-12-14 21:05
flox wrote: > > flox <laxyf@yahoo.fr> added the comment: > > This patch removes implicit encoding in binascii functions: > - a2b_hex (=unhexlify) > - a2b_qp > - rledecode_hqx > > * Tests module "test_binascii" is reviewed and simplified. > * Fixes for "email", "pickle" and "quopri" modules to encode input. > > All tests pass. Are you sure that this patch is correct (which RFC says that quoted printable should use our raw-unicode-escape codec ?): Index: Lib/email/message.py =================================================================== --- Lib/email/message.py (revision 76839) +++ Lib/email/message.py (working copy) @@ -198,6 +198,8 @@ return None cte = self.get('content-transfer-encoding', '').lower() if cte == 'quoted-printable': + if isinstance(payload, str): + payload = payload.encode('raw-unicode-escape') return utils._qdecode(payload) elif cte == 'base64': try: The patch also needs to fix the documentation and add a NEWS entry. Making these APIs strict in the sense that they don't accept str-instances anymore needs to be mentioned very clearly. We may even have to go through a deprecation process for them, since they can easily cause perfectly working code to suddenly fail.
msg96402 - (view)	Author: Florent Xicluna (flox) *	Date: 2009-12-14 21:25
> Are you sure that this patch is correct (which RFC says > that quoted printable should use our raw-unicode-escape > codec ?): I am not sure of anything. It is an "educated guess" at the most. Since 'base64' and 'x-uuencode' both use 'raw-unicode-escape'... See longer excerpt below. Index: Lib/email/message.py =================================================================== --- Lib/email/message.py (revision 76839) +++ Lib/email/message.py (working copy) @@ -189,24 +189,26 @@ elif not isinstance(self._payload, list): raise TypeError('Expected list, got %s' % type(self._payload)) else: payload = self._payload[i] if not decode: return payload # Decoded payloads always return bytes. XXX split this part out into # a new method called .get_decoded_payload(). if self.is_multipart(): return None cte = self.get('content-transfer-encoding', '').lower() if cte == 'quoted-printable': + if isinstance(payload, str): + payload = payload.encode('raw-unicode-escape') return utils._qdecode(payload) elif cte == 'base64': try: if isinstance(payload, str): payload = payload.encode('raw-unicode-escape') return base64.b64decode(payload) #return utils._bdecode(payload) except binascii.Error: # Incorrect padding pass elif cte in ('x-uuencode', 'uuencode', 'uue', 'x-uue'): in_file = BytesIO(payload.encode('raw-unicode-escape'))
msg96403 - (view)	Author: Marc-Andre Lemburg (lemburg) *	Date: 2009-12-14 21:40
flox wrote: > > flox <laxyf@yahoo.fr> added the comment: > >> Are you sure that this patch is correct (which RFC says >> that quoted printable should use our raw-unicode-escape >> codec ?): > > I am not sure of anything. It is an "educated guess" at the most. > Since 'base64' and 'x-uuencode' both use 'raw-unicode-escape'... Quoted printable as well as the other two transfer encodings should be encodings that "fit" into the 7-bit ASCII default originally assumed for email messages, so 'ascii' appears to be the more natural choice. The choice of 'raw-unicode-escape' will cause strange error messages or hide errors, since it encodes non-ASCII code points using '\xNN' which these codecs don't supports: b'' >>> base64.b64decode('äöü'.encode('ascii')) Traceback (most recent call last): File "<stdin>", line 1, in <module> UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-2: ordinal not in range(128)
msg96404 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2009-12-14 21:45
I agree with Marc-André that enforcing ASCII looks like the only sensible solution. Perhaps Barry knows the purpose of using "raw-unicode-escape" here, though?
msg96407 - (view)	Author: Marc-Andre Lemburg (lemburg) *	Date: 2009-12-14 21:55
The email interface ate part of the code snippet. Here's the complete version: >>> import base64 >>> base64.b64decode('äöü'.encode('raw-unicode-escape')) b'' >>> base64.b64decode('äöü'.encode('ascii')) Traceback (most recent call last): File "<stdin>", line 1, in <module> UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-2: ordinal not in range(128)
msg96414 - (view)	Author: Florent Xicluna (flox) *	Date: 2009-12-14 22:13
I perform a "grep" on the email package (with patch applied). There's some places where 'raw-unicode-escape' is used. I understand that all "payload.encode('raw-unicode-escape')" should be changed to "payload.encode('ascii')" when the payload is processed by another transfer encoding (line 202, 207 and 214). The last one (when 'content-transfer-encoding' is not recognized) should remain unchanged? All other uses of 'raw-unicode-escape' seem correct. Lib/email/base64mime.py:112: return a2b_base64(string.encode('raw-unicode-escape')) Lib/email/header.py:111: word = bytes(word, 'raw-unicode-escape') Lib/email/message.py:202: payload = payload.encode('raw-unicode-escape') Lib/email/message.py:207: payload = payload.encode('raw-unicode-escape') Lib/email/message.py:214: in_file = BytesIO(payload.encode('raw-unicode-escape')) Lib/email/message.py:225: return payload.encode('raw-unicode-escape') Lib/email/message.py:767: as_bytes = charset[2].encode('raw-unicode-escape') Lib/email/utils.py:298: rawbytes = bytes(text, 'raw-unicode-escape')
msg96418 - (view)	Author: Florent Xicluna (flox) *	Date: 2009-12-14 23:51
Patch updated: * Added documentation * Switched to "payload.encode('ascii')" in the "email.message" module Note: there was no ambiguity on the first line of the documentation: « [These functions] convert between binary and various ASCII-encoded binary representations » So, it should only convert bytes in range(256) to bytes in range(128).
msg97893 - (view)	Author: Florent Xicluna (flox) *	Date: 2010-01-16 18:27
Patch updated, after 7703 is merged in py3k.
msg98381 - (view)	Author: Marc-Andre Lemburg (lemburg) *	Date: 2010-01-26 21:11
Please also add a NEWS entry which mentions the changes to those APIs and the change in the email package. Thanks.
msg110836 - (view)	Author: Mark Lawrence (BreamoreBoy) *	Date: 2010-07-19 23:16
Florent, could you add a NEWS entry as requested in msg98381, then we should be able to get this committed.
msg111733 - (view)	Author: Florent Xicluna (flox) *	Date: 2010-07-27 21:28
Fixed with r83182.

History
Date	User	Action	Args
2022-04-11 14:56:43	admin	set	github: 49020
2010-07-27 21:28:28	flox	set	status: open -> closed versions: - Python 3.1 messages: + msg111733 assignee: flox resolution: fixed stage: patch review -> resolved
2010-07-19 23:16:27	BreamoreBoy	set	nosy: + BreamoreBoy messages: + msg110836
2010-02-17 16:27:49	eric.araujo	set	nosy: + eric.araujo
2010-01-26 21:11:20	lemburg	set	messages: + msg98381
2010-01-16 18:27:49	flox	set	files: + issue4770_binascii_py3k_v3.diff messages: + msg97893
2010-01-16 18:25:56	flox	set	files: - issue4770_binascii_py3k_v2.diff
2009-12-30 08:32:31	flox	set	title: binascii module, crazy error messages, unexpected behavior -> binascii module, inconsistent behavior: some functions accept unicode string input stage: patch review
2009-12-30 08:21:24	flox	set	files: - issue4770_binascii_py3k.diff
2009-12-14 23:54:16	flox	set	files: + issue4770_binascii_py3k_v2.diff
2009-12-14 23:53:59	flox	set	files: - issue4770_binascii_py3k_v2.diff
2009-12-14 23:51:50	flox	set	files: + issue4770_binascii_py3k_v2.diff messages: + msg96418
2009-12-14 22:13:17	flox	set	messages: + msg96414
2009-12-14 21:55:11	lemburg	set	messages: + msg96407
2009-12-14 21:45:55	pitrou	set	nosy: + barry, pitrou messages: + msg96404
2009-12-14 21:40:43	lemburg	set	messages: + msg96403
2009-12-14 21:25:24	flox	set	messages: + msg96402
2009-12-14 21:05:12	lemburg	set	nosy: + lemburg messages: + msg96401
2009-12-14 20:23:40	flox	set	files: + issue4770_binascii_py3k.diff keywords: + patch messages: + msg96399
2009-12-14 19:03:29	flox	set	files: + case_binascii.py versions: + Python 3.1, Python 3.2, - Python 3.0 nosy: + flox messages: + msg96394
2008-12-31 00:23:56	vstinner	set	nosy: + vstinner messages: + msg78581
2008-12-29 18:54:18	beazley	set	messages: + msg78475
2008-12-29 18:36:36	amaury.forgeotdarc	set	nosy: + amaury.forgeotdarc messages: + msg78473
2008-12-29 18:00:35	beazley	create