msg78470 - (view) |
Author: David M. Beazley (beazley) |
Date: 2008-12-29 18:00 |
See Issue 4869 for a related bug.
Most of the functions in binascii are meant to go from binary data to
textual representations (hex digits, base64, binhex, etc.). There are
numerous problems:
1. Misleading error messages. For example:
>>> binascii.b2a_base64("Some text")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: b2a_base64() argument 1 must be string or buffer, not str
>>> binascii.crc32("Some text")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: crc32() argument 1 must be string or buffer, not str
>>>
Huh? Didn't I just pass a string? Error message should say "argument
1 must be bytes or buffer, not str".
This problem shows up with most of the other encoding functions too
(i.e., b2a_uu).
2. Expected behavior with encoding/decoding.
The functions in this module are going from binary to text. To be
consistent, I think the result of encoding operations such as b2a_uu(),
b2a_base64(), should be strings, not bytes.
Similarly, decoding operations are going from text back to bytes. I
think the input arguments should be allowed to be str (in addition to
bytes if you want).
|
msg78473 - (view) |
Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * |
Date: 2008-12-29 18:36 |
Item 1 was most probably fixed recently with r67929.
Concerning item 2, I think it was decided that binascii is a bytes-only
module.
I suggest to close this issue as "out of date".
|
msg78475 - (view) |
Author: David M. Beazley (beazley) |
Date: 2008-12-29 18:54 |
Given the low-level nature of this module, I can understand the
motivation to make it all bytes.
However, I'm going to respectfully disagree with that and claim that
making binascii all bytes really goes against the whole spirit of what
Python 3.0 has tried to do for Unicode. For example, throughout Python,
you now have a clean separation between binary data (bytes) and text
data (str). Well, it's cleanly separated everywhere except in the
binascii module (and base64 module) which, ironically, is all about
converting between binary data and text.
As it stands now, it's a huge wart IMHO.
|
msg78581 - (view) |
Author: STINNER Victor (vstinner) * |
Date: 2008-12-31 00:23 |
See also issue #4769 (want base64.b64decode(str)).
|
msg96394 - (view) |
Author: Florent Xicluna (flox) * |
Date: 2009-12-14 19:03 |
I agree, we need consistency between all functions of this package.
I've run a small script to check what happens for all 16 functions of
the binascii package when they receive unicode input...
See attached script (and sample output).
IMHO 4 functions should be fixed to raise the same TypeError:
- a2b_hex
- a2b_qp
- unhexlify
- rledecode_hqx
Documentation says that the functions in `binary` module « convert
between binary and various ASCII-encoded binary representations »
So... it's all binary.
Implicit encoding should not happen, never.
|
msg96399 - (view) |
Author: Florent Xicluna (flox) * |
Date: 2009-12-14 20:23 |
This patch removes implicit encoding in binascii functions:
- a2b_hex (=unhexlify)
- a2b_qp
- rledecode_hqx
* Tests module "test_binascii" is reviewed and simplified.
* Fixes for "email", "pickle" and "quopri" modules to encode input.
All tests pass.
|
msg96401 - (view) |
Author: Marc-Andre Lemburg (lemburg) * |
Date: 2009-12-14 21:05 |
flox wrote:
>
> flox <laxyf@yahoo.fr> added the comment:
>
> This patch removes implicit encoding in binascii functions:
> - a2b_hex (=unhexlify)
> - a2b_qp
> - rledecode_hqx
>
> * Tests module "test_binascii" is reviewed and simplified.
> * Fixes for "email", "pickle" and "quopri" modules to encode input.
>
> All tests pass.
Are you sure that this patch is correct (which RFC says
that quoted printable should use our raw-unicode-escape
codec ?):
Index: Lib/email/message.py
===================================================================
--- Lib/email/message.py (revision 76839)
+++ Lib/email/message.py (working copy)
@@ -198,6 +198,8 @@
return None
cte = self.get('content-transfer-encoding', '').lower()
if cte == 'quoted-printable':
+ if isinstance(payload, str):
+ payload = payload.encode('raw-unicode-escape')
return utils._qdecode(payload)
elif cte == 'base64':
try:
The patch also needs to fix the documentation and add a NEWS
entry.
Making these APIs strict in the sense that they don't accept
str-instances anymore needs to be mentioned very clearly.
We may even have to go through a deprecation process for them,
since they can easily cause perfectly working code to suddenly
fail.
|
msg96402 - (view) |
Author: Florent Xicluna (flox) * |
Date: 2009-12-14 21:25 |
> Are you sure that this patch is correct (which RFC says
> that quoted printable should use our raw-unicode-escape
> codec ?):
I am not sure of anything. It is an "educated guess" at the most.
Since 'base64' and 'x-uuencode' both use 'raw-unicode-escape'...
See longer excerpt below.
Index: Lib/email/message.py
===================================================================
--- Lib/email/message.py (revision 76839)
+++ Lib/email/message.py (working copy)
@@ -189,24 +189,26 @@
elif not isinstance(self._payload, list):
raise TypeError('Expected list, got %s' % type(self._payload))
else:
payload = self._payload[i]
if not decode:
return payload
# Decoded payloads always return bytes. XXX split this part
out into
# a new method called .get_decoded_payload().
if self.is_multipart():
return None
cte = self.get('content-transfer-encoding', '').lower()
if cte == 'quoted-printable':
+ if isinstance(payload, str):
+ payload = payload.encode('raw-unicode-escape')
return utils._qdecode(payload)
elif cte == 'base64':
try:
if isinstance(payload, str):
payload = payload.encode('raw-unicode-escape')
return base64.b64decode(payload)
#return utils._bdecode(payload)
except binascii.Error:
# Incorrect padding
pass
elif cte in ('x-uuencode', 'uuencode', 'uue', 'x-uue'):
in_file = BytesIO(payload.encode('raw-unicode-escape'))
|
msg96403 - (view) |
Author: Marc-Andre Lemburg (lemburg) * |
Date: 2009-12-14 21:40 |
flox wrote:
>
> flox <laxyf@yahoo.fr> added the comment:
>
>> Are you sure that this patch is correct (which RFC says
>> that quoted printable should use our raw-unicode-escape
>> codec ?):
>
> I am not sure of anything. It is an "educated guess" at the most.
> Since 'base64' and 'x-uuencode' both use 'raw-unicode-escape'...
Quoted printable as well as the other two transfer encodings should
be encodings that "fit" into the 7-bit ASCII default originally
assumed for email messages, so 'ascii' appears to be the more
natural choice.
The choice of 'raw-unicode-escape' will cause strange error
messages or hide errors, since it encodes non-ASCII code points
using '\xNN' which these codecs don't supports:
b''
>>> base64.b64decode('äöü'.encode('ascii'))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-2: ordinal not in range(128)
|
msg96404 - (view) |
Author: Antoine Pitrou (pitrou) * |
Date: 2009-12-14 21:45 |
I agree with Marc-André that enforcing ASCII looks like the only
sensible solution. Perhaps Barry knows the purpose of using
"raw-unicode-escape" here, though?
|
msg96407 - (view) |
Author: Marc-Andre Lemburg (lemburg) * |
Date: 2009-12-14 21:55 |
The email interface ate part of the code snippet. Here's the complete
version:
>>> import base64
>>> base64.b64decode('äöü'.encode('raw-unicode-escape'))
b''
>>> base64.b64decode('äöü'.encode('ascii'))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position
0-2: ordinal not in range(128)
|
msg96414 - (view) |
Author: Florent Xicluna (flox) * |
Date: 2009-12-14 22:13 |
I perform a "grep" on the email package (with patch applied).
There's some places where 'raw-unicode-escape' is used.
I understand that all "payload.encode('raw-unicode-escape')" should be
changed to "payload.encode('ascii')" when the payload is processed by
another transfer encoding (line 202, 207 and 214).
The last one (when 'content-transfer-encoding' is not recognized) should
remain unchanged?
All other uses of 'raw-unicode-escape' seem correct.
Lib/email/base64mime.py:112:
return a2b_base64(string.encode('raw-unicode-escape'))
Lib/email/header.py:111:
word = bytes(word, 'raw-unicode-escape')
Lib/email/message.py:202:
payload = payload.encode('raw-unicode-escape')
Lib/email/message.py:207:
payload = payload.encode('raw-unicode-escape')
Lib/email/message.py:214:
in_file = BytesIO(payload.encode('raw-unicode-escape'))
Lib/email/message.py:225:
return payload.encode('raw-unicode-escape')
Lib/email/message.py:767:
as_bytes = charset[2].encode('raw-unicode-escape')
Lib/email/utils.py:298:
rawbytes = bytes(text, 'raw-unicode-escape')
|
msg96418 - (view) |
Author: Florent Xicluna (flox) * |
Date: 2009-12-14 23:51 |
Patch updated:
* Added documentation
* Switched to "payload.encode('ascii')" in the "email.message" module
Note: there was no ambiguity on the first line of the documentation:
« [These functions] convert between binary and various ASCII-encoded
binary representations »
So, it should only convert bytes in range(256) to bytes in range(128).
|
msg97893 - (view) |
Author: Florent Xicluna (flox) * |
Date: 2010-01-16 18:27 |
Patch updated, after 7703 is merged in py3k.
|
msg98381 - (view) |
Author: Marc-Andre Lemburg (lemburg) * |
Date: 2010-01-26 21:11 |
Please also add a NEWS entry which mentions the changes to those APIs and the change in the email package.
Thanks.
|
msg110836 - (view) |
Author: Mark Lawrence (BreamoreBoy) * |
Date: 2010-07-19 23:16 |
Florent, could you add a NEWS entry as requested in msg98381, then we should be able to get this committed.
|
msg111733 - (view) |
Author: Florent Xicluna (flox) * |
Date: 2010-07-27 21:28 |
Fixed with r83182.
|
|
Date |
User |
Action |
Args |
2022-04-11 14:56:43 | admin | set | github: 49020 |
2010-07-27 21:28:28 | flox | set | status: open -> closed versions:
- Python 3.1 messages:
+ msg111733
assignee: flox resolution: fixed stage: patch review -> resolved |
2010-07-19 23:16:27 | BreamoreBoy | set | nosy:
+ BreamoreBoy messages:
+ msg110836
|
2010-02-17 16:27:49 | eric.araujo | set | nosy:
+ eric.araujo
|
2010-01-26 21:11:20 | lemburg | set | messages:
+ msg98381 |
2010-01-16 18:27:49 | flox | set | files:
+ issue4770_binascii_py3k_v3.diff
messages:
+ msg97893 |
2010-01-16 18:25:56 | flox | set | files:
- issue4770_binascii_py3k_v2.diff |
2009-12-30 08:32:31 | flox | set | title: binascii module, crazy error messages, unexpected behavior -> binascii module, inconsistent behavior: some functions accept unicode string input stage: patch review |
2009-12-30 08:21:24 | flox | set | files:
- issue4770_binascii_py3k.diff |
2009-12-14 23:54:16 | flox | set | files:
+ issue4770_binascii_py3k_v2.diff |
2009-12-14 23:53:59 | flox | set | files:
- issue4770_binascii_py3k_v2.diff |
2009-12-14 23:51:50 | flox | set | files:
+ issue4770_binascii_py3k_v2.diff
messages:
+ msg96418 |
2009-12-14 22:13:17 | flox | set | messages:
+ msg96414 |
2009-12-14 21:55:11 | lemburg | set | messages:
+ msg96407 |
2009-12-14 21:45:55 | pitrou | set | nosy:
+ barry, pitrou messages:
+ msg96404
|
2009-12-14 21:40:43 | lemburg | set | messages:
+ msg96403 |
2009-12-14 21:25:24 | flox | set | messages:
+ msg96402 |
2009-12-14 21:05:12 | lemburg | set | nosy:
+ lemburg messages:
+ msg96401
|
2009-12-14 20:23:40 | flox | set | files:
+ issue4770_binascii_py3k.diff keywords:
+ patch messages:
+ msg96399
|
2009-12-14 19:03:29 | flox | set | files:
+ case_binascii.py versions:
+ Python 3.1, Python 3.2, - Python 3.0 nosy:
+ flox
messages:
+ msg96394
|
2008-12-31 00:23:56 | vstinner | set | nosy:
+ vstinner messages:
+ msg78581 |
2008-12-29 18:54:18 | beazley | set | messages:
+ msg78475 |
2008-12-29 18:36:36 | amaury.forgeotdarc | set | nosy:
+ amaury.forgeotdarc messages:
+ msg78473 |
2008-12-29 18:00:35 | beazley | create | |