classification
Title: email.encoders.encode_7or8bit(): typo "iso-2202". "iso-2022" is correct.
Type: behavior Stage: resolved
Components: Library (Lib) Versions: Python 3.1, Python 3.2, Python 2.7
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: r.david.murray Nosy List: barry, l0nwlf, r.david.murray, ynkdir
Priority: low Keywords: easy, patch

Created on 2009-12-10 17:43 by ynkdir, last changed 2010-05-06 01:54 by r.david.murray. This issue is now closed.

Files
File name Uploaded Description Edit
encoders.diff l0nwlf, 2010-04-06 20:19 encoders.diff to remove typo
test_email.patch l0nwlf, 2010-05-04 22:55 unit-test
encoders.patch l0nwlf, 2010-05-05 01:42 patchfile with unittest to fix the issue
Messages (12)
msg96211 - (view) Author: Yukihiro Nakadaira (ynkdir) Date: 2009-12-10 17:43
email.encoders.encode_7or8bit(): typo "iso-2202".  "iso-2022" is correct.

Index: Lib/email/encoders.py
===================================================================
--- Lib/email/encoders.py	(revision 76749)
+++ Lib/email/encoders.py	(working copy)
@@ -62,7 +62,7 @@
         # iso-2022-* is non-ASCII but still 7-bit
         charset = msg.get_charset()
         output_cset = charset and charset.output_charset
-        if output_cset and output_cset.lower().startswith('iso-2202-'):
+        if output_cset and output_cset.lower().startswith('iso-2022-'):
             msg['Content-Transfer-Encoding'] = '7bit'
         else:
             msg['Content-Transfer-Encoding'] = '8bit'
msg98043 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2010-01-19 02:11
Hmm.  I suspect that that typo is fortunate rather than the source of a bug.  As far as I can see, if a message contains valid iso-2022 characters, it will always encode to ASCII successfully and therefore be given a content-transfer-encoding of 7bit.  If on the other hand it contains 8 bit characters, it should be marked as 8bit, even if it claims to be iso-2022.  (Well, actually it should throw an error, but that's a different bug.)

In other words, I think the correct thing to do is to delete that if test.

Do you have a case where the code produces incorrect behavior that your patch turns into correct behavior?
msg98047 - (view) Author: Yukihiro Nakadaira (ynkdir) Date: 2010-01-19 08:14
> In other words, I think the correct thing to do is to delete that if test.

I think so too.

> Do you have a case where the code produces incorrect behavior that your patch turns into correct behavior?

No, I don't.  I just found a typo.

The code for "iso-2022" was added by issue #804885.  But I don't know why it was requested.
msg102492 - (view) Author: Shashwat Anand (l0nwlf) Date: 2010-04-06 20:19
I am submitting the patch as suggested by Yukihiro
msg102493 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2010-04-06 20:45
Well, as I said, I think it would be better to delete the conditional instead.  (That is, that fixing the typo is more likely to lead to a bug).
msg102548 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2010-04-07 16:48
After tracing the code further I now understand what the original patch is doing, and yes the spelling correction is the correct fix.  (The issue is that there are two different encodings involved: the input encoding in which the body is encoded when encode_7or8 bit is called, and the 'output' encoding that will be used to render the message when it is serialized to bytes.  In the case of iso-2022, it is used as the output encoding for at least one input encoding that is an eight bit encoding.  iso-2022 is pretty much the *only* encoding that appears as an output encoding different from the input encoding.)

Applying the patch will have to wait until the branch unfreezes.  I also think that I will not apply it to 2.6, since it is not fixing a bug that anyone has reported as breaking their code, but does represent a behavior change.

Since we have proven in the course of analyzing this that the code is untested, perhaps someone would like to write some unit tests for it?
msg102949 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2010-04-12 14:53
I've applied the fix to trunk in r79994, and py3k in r79996, but I'm leaving this open because I still want a unit test for it.
msg104983 - (view) Author: Shashwat Anand (l0nwlf) Date: 2010-05-04 22:55
Adding unit-test for the patch
msg104994 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2010-05-05 01:11
Comments on patch:

We prefer patches to be generated from the top level directory of the checkout, so that it can be applied by doing 'patch -p0 <xxx.patch' from the top level directory without having to look in the patch file to see what directory it was generated in.

The test is correct in general outline, but the thing that needs to be tested is that when a byte string in a character encoding that is eight bit, but whose output encoding (the encoding email will use when writing the message) is 7bit, that 7bit will be chosen for the transfer encoding.  If you look in Lib/email/charsets.py, there are only two such character sets, euc-jp and shift-jis.

We discussed this in IRC, and you found a euc-js character, but then said that the test passed even with the fix removed.  The byte string you are using in the bytes test you posted does not appear to be encodable in the output character set (iso-2022-jp). I'm guessing you used this, and _charset='iso-2022-jp', because otherwise the test passes without the fix.

That is, if put in a character, such as 文 ('\xca\xb8' as an euc-jp encoded byte stream), and pass _encoding='euc-jp', the test passes without the fix.  So, you were exactly right in what you said in IRC, and should have posted that version of the unit test :)

Looking at the code even more carefully than I did last time, it turns out that as soon as the 'charset' is set, the payload gets translated from the input character set to the output character set, and *then* encode_7or8bit is called.  As far as I have been able to figure out, there is no way for encode_7or8bit to get called with the payload encoded in the input character set (not even if it is called directly, since it is passed a message instance and so set_charset must already have been called on the message instance it is passed).

So it turns out that the if test is not needed after all.
msg104996 - (view) Author: Shashwat Anand (l0nwlf) Date: 2010-05-05 01:41
I was getting a feel that the inner if-test is not being used in any case.  Submitting the patch which removes the inner if test and adding a unittest.
msg105063 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2010-05-05 17:50
Committed to trunk in r80800.

However, when I ported it to py3k, it turns out the test fails there, but passes if the original fix from this issue has been applied.  More investigation is needed, but clearly something changed in the payload encoding logic (which is not too surprising).
msg105121 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2010-05-06 01:54
It turns out that email5 (py3k), because it is using unicode for the payload, doesn't do the encoding to the output character set until later in the process.  Specifically, charset.body_encode no longer does the input-to-output charset conversion.  The if test in the exception clause in encoders.encode_7or8bit really is needed in email5.  So in r80855 I just port the test to py3k.  Typo fix and test also merged to 3.1 in r80856.

Hopefully that's the last we need to do on this issue.
History
Date User Action Args
2010-05-06 01:54:28r.david.murraysetstatus: open -> closed
resolution: accepted -> fixed
messages: + msg105121

stage: test needed -> resolved
2010-05-05 17:50:27r.david.murraysetmessages: + msg105063
2010-05-05 01:42:33l0nwlfsetfiles: + encoders.patch
2010-05-05 01:41:15l0nwlfsetmessages: + msg104996
2010-05-05 01:11:01r.david.murraysetmessages: + msg104994
2010-05-04 22:55:04l0nwlfsetfiles: + test_email.patch
keywords: + patch
messages: + msg104983
2010-04-12 14:53:33r.david.murraysetkeywords: + easy, - patch

messages: + msg102949
2010-04-07 16:48:19r.david.murraysetpriority: normal -> low
resolution: accepted
messages: + msg102548

versions: - Python 2.6
2010-04-06 20:45:59r.david.murraysetassignee: barry -> r.david.murray
messages: + msg102493
2010-04-06 20:19:34l0nwlfsetfiles: + encoders.diff

nosy: + l0nwlf
messages: + msg102492

keywords: + patch
2010-01-19 08:14:06ynkdirsetmessages: + msg98047
2010-01-19 02:12:27r.david.murraysetversions: + Python 2.6, Python 3.1, Python 2.7, Python 3.2
2010-01-19 02:12:00r.david.murraysetpriority: normal

nosy: + r.david.murray
messages: + msg98043

stage: test needed
2009-12-10 19:24:42benjamin.petersonsetassignee: barry

nosy: + barry
2009-12-10 17:43:01ynkdircreate