Issue 7472: email.encoders.encode_7or8bit(): typo "iso-2202". "iso-2022" is correct.

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/51721

classification

Title:	email.encoders.encode_7or8bit(): typo "iso-2202". "iso-2022" is correct.
Type:	behavior	Stage:	resolved
Components:	Library (Lib)	Versions:	Python 3.1, Python 3.2, Python 2.7

process

Status:	closed	Resolution:	fixed
Dependencies:		Superseder:
Assigned To:	r.david.murray	Nosy List:	barry, l0nwlf, r.david.murray, ynkdir
Priority:	low	Keywords:	easy, patch

Created on 2009-12-10 17:43 by ynkdir, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Files
File name	Uploaded	Description	Edit
encoders.diff	l0nwlf, 2010-04-06 20:19	encoders.diff to remove typo
test_email.patch	l0nwlf, 2010-05-04 22:55	unit-test
encoders.patch	l0nwlf, 2010-05-05 01:42	patchfile with unittest to fix the issue

Messages (12)
msg96211 - (view)	Author: Yukihiro Nakadaira (ynkdir)	Date: 2009-12-10 17:43
email.encoders.encode_7or8bit(): typo "iso-2202". "iso-2022" is correct. Index: Lib/email/encoders.py =================================================================== --- Lib/email/encoders.py (revision 76749) +++ Lib/email/encoders.py (working copy) @@ -62,7 +62,7 @@ # iso-2022-* is non-ASCII but still 7-bit charset = msg.get_charset() output_cset = charset and charset.output_charset - if output_cset and output_cset.lower().startswith('iso-2202-'): + if output_cset and output_cset.lower().startswith('iso-2022-'): msg['Content-Transfer-Encoding'] = '7bit' else: msg['Content-Transfer-Encoding'] = '8bit'
msg98043 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2010-01-19 02:11
Hmm. I suspect that that typo is fortunate rather than the source of a bug. As far as I can see, if a message contains valid iso-2022 characters, it will always encode to ASCII successfully and therefore be given a content-transfer-encoding of 7bit. If on the other hand it contains 8 bit characters, it should be marked as 8bit, even if it claims to be iso-2022. (Well, actually it should throw an error, but that's a different bug.) In other words, I think the correct thing to do is to delete that if test. Do you have a case where the code produces incorrect behavior that your patch turns into correct behavior?
msg98047 - (view)	Author: Yukihiro Nakadaira (ynkdir)	Date: 2010-01-19 08:14
> In other words, I think the correct thing to do is to delete that if test. I think so too. > Do you have a case where the code produces incorrect behavior that your patch turns into correct behavior? No, I don't. I just found a typo. The code for "iso-2022" was added by issue #804885. But I don't know why it was requested.
msg102492 - (view)	Author: Shashwat Anand (l0nwlf)	Date: 2010-04-06 20:19
I am submitting the patch as suggested by Yukihiro
msg102493 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2010-04-06 20:45
Well, as I said, I think it would be better to delete the conditional instead. (That is, that fixing the typo is more likely to lead to a bug).
msg102548 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2010-04-07 16:48
After tracing the code further I now understand what the original patch is doing, and yes the spelling correction is the correct fix. (The issue is that there are two different encodings involved: the input encoding in which the body is encoded when encode_7or8 bit is called, and the 'output' encoding that will be used to render the message when it is serialized to bytes. In the case of iso-2022, it is used as the output encoding for at least one input encoding that is an eight bit encoding. iso-2022 is pretty much the only encoding that appears as an output encoding different from the input encoding.) Applying the patch will have to wait until the branch unfreezes. I also think that I will not apply it to 2.6, since it is not fixing a bug that anyone has reported as breaking their code, but does represent a behavior change. Since we have proven in the course of analyzing this that the code is untested, perhaps someone would like to write some unit tests for it?
msg102949 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2010-04-12 14:53
I've applied the fix to trunk in r79994, and py3k in r79996, but I'm leaving this open because I still want a unit test for it.
msg104983 - (view)	Author: Shashwat Anand (l0nwlf)	Date: 2010-05-04 22:55
Adding unit-test for the patch
msg104994 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2010-05-05 01:11
Comments on patch: We prefer patches to be generated from the top level directory of the checkout, so that it can be applied by doing 'patch -p0 <xxx.patch' from the top level directory without having to look in the patch file to see what directory it was generated in. The test is correct in general outline, but the thing that needs to be tested is that when a byte string in a character encoding that is eight bit, but whose output encoding (the encoding email will use when writing the message) is 7bit, that 7bit will be chosen for the transfer encoding. If you look in Lib/email/charsets.py, there are only two such character sets, euc-jp and shift-jis. We discussed this in IRC, and you found a euc-js character, but then said that the test passed even with the fix removed. The byte string you are using in the bytes test you posted does not appear to be encodable in the output character set (iso-2022-jp). I'm guessing you used this, and _charset='iso-2022-jp', because otherwise the test passes without the fix. That is, if put in a character, such as 文 ('\xca\xb8' as an euc-jp encoded byte stream), and pass _encoding='euc-jp', the test passes without the fix. So, you were exactly right in what you said in IRC, and should have posted that version of the unit test :) Looking at the code even more carefully than I did last time, it turns out that as soon as the 'charset' is set, the payload gets translated from the input character set to the output character set, and then encode_7or8bit is called. As far as I have been able to figure out, there is no way for encode_7or8bit to get called with the payload encoded in the input character set (not even if it is called directly, since it is passed a message instance and so set_charset must already have been called on the message instance it is passed). So it turns out that the if test is not needed after all.
msg104996 - (view)	Author: Shashwat Anand (l0nwlf)	Date: 2010-05-05 01:41
I was getting a feel that the inner if-test is not being used in any case. Submitting the patch which removes the inner if test and adding a unittest.
msg105063 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2010-05-05 17:50
Committed to trunk in r80800. However, when I ported it to py3k, it turns out the test fails there, but passes if the original fix from this issue has been applied. More investigation is needed, but clearly something changed in the payload encoding logic (which is not too surprising).
msg105121 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2010-05-06 01:54
It turns out that email5 (py3k), because it is using unicode for the payload, doesn't do the encoding to the output character set until later in the process. Specifically, charset.body_encode no longer does the input-to-output charset conversion. The if test in the exception clause in encoders.encode_7or8bit really is needed in email5. So in r80855 I just port the test to py3k. Typo fix and test also merged to 3.1 in r80856. Hopefully that's the last we need to do on this issue.

History
Date	User	Action	Args
2022-04-11 14:56:55	admin	set	github: 51721
2010-05-06 01:54:28	r.david.murray	set	status: open -> closed resolution: accepted -> fixed messages: + msg105121 stage: test needed -> resolved
2010-05-05 17:50:27	r.david.murray	set	messages: + msg105063
2010-05-05 01:42:33	l0nwlf	set	files: + encoders.patch
2010-05-05 01:41:15	l0nwlf	set	messages: + msg104996
2010-05-05 01:11:01	r.david.murray	set	messages: + msg104994
2010-05-04 22:55:04	l0nwlf	set	files: + test_email.patch keywords: + patch messages: + msg104983
2010-04-12 14:53:33	r.david.murray	set	keywords: + easy, - patch messages: + msg102949
2010-04-07 16:48:19	r.david.murray	set	priority: normal -> low resolution: accepted messages: + msg102548 versions: - Python 2.6
2010-04-06 20:45:59	r.david.murray	set	assignee: barry -> r.david.murray messages: + msg102493
2010-04-06 20:19:34	l0nwlf	set	files: + encoders.diff nosy: + l0nwlf messages: + msg102492 keywords: + patch
2010-01-19 08:14:06	ynkdir	set	messages: + msg98047
2010-01-19 02:12:27	r.david.murray	set	versions: + Python 2.6, Python 3.1, Python 2.7, Python 3.2
2010-01-19 02:12:00	r.david.murray	set	priority: normal nosy: + r.david.murray messages: + msg98043 stage: test needed
2009-12-10 19:24:42	benjamin.peterson	set	assignee: barry nosy: + barry
2009-12-10 17:43:01	ynkdir	create