
classification
Title: email.Generator should use unknown-8bit encoded words for headers with 8 bit data
Type: behavior Stage: resolved
Components: Versions: Python 3.2
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: r.david.murray Nosy List: barry, r.david.murray, sjt
Priority: high Keywords: patch

Created on 2010-12-12 18:01 by r.david.murray, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Files
File name Uploaded Description
email_unknown-8bit.patch r.david.murray, 2011-01-07 03:31
Messages (8)
msg123842 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2010-12-12 18:01
This is a follow-on to Issue 4661.  The fix for that issue introduced a way to parse messages containing 8bit bytes.  When Generator is called on a model containing 8bit bytes, it converts it to 7bit clean.  There is, however, a bug in this conversion process: currently, when encountering 8bit bytes in headers, it simply replaces them with ?.  According to the RFCs[*], what it should do instead is replace them with encoded words using the 'charset' "unknown-8bit".

[*] I'm specifically referring to RFC 1428...email is effectively acting as a translating gateway when requested to do the 8bit to 7bit conversion.  Although that RFC does not explicitly say that the unknown-8bit charset should be used in encoded words, it does imply it strongly in its section 3 prescription.
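
A minimal sketch of the requested behavior, assuming a Python version that already includes the fix from this issue. The header text is invented for illustration; only `email.message_from_bytes` and `email.generator.Generator` are standard library names.

```python
import email
from io import StringIO
from email.generator import Generator

# A header carrying a raw 8bit byte (0xE9) with no declared charset.
raw = b"Subject: caf\xe9 menu\n\nbody\n"
msg = email.message_from_bytes(raw)

# Generator produces 7bit-clean text, so the unknown byte has to be
# represented somehow; with the fix it becomes an RFC 2047 encoded word
# whose charset is "unknown-8bit" rather than a "?" placeholder.
out = StringIO()
Generator(out).flatten(msg)
text = out.getvalue()

print(text.splitlines()[0])
```

The exact transfer encoding inside the encoded word (q or b) is chosen by the email package, so the sketch does not depend on it.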
msg125615 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2011-01-07 03:31
Here is a patch.  Three of the tests currently fail due to what appears to be a bug in the Header formatting routines.  I'll have to look into that before finishing this issue.

Note that calling str on a message with binary headers can produce overlong lines, since str does not limit line widths.  generator.flatten does, though, so in that case the lengthened lines are correctly rewrapped.  (Well, as correctly as Header rewraps any headers, at least, which is not all that well in certain cases.)
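
The difference between str and generator.flatten can be sketched as follows, assuming a Python version with this patch applied. The long subject is made up, and the sketch only compares line counts, since the exact fold points depend on Header.

```python
import email
from io import StringIO
from email.generator import Generator

# A long header value containing many raw 8bit bytes.
raw = b"Subject: " + b" ".join([b"caf\xe9"] * 30) + b"\n\nbody\n"
msg = email.message_from_bytes(raw)

# str(msg) does not limit header width, so the encoded header
# stays on one (long) line.
str_text = str(msg)

# Generator.flatten applies the default line length limit, so the
# encoded header is folded onto continuation lines.
out = StringIO()
Generator(out).flatten(msg)
flat_text = out.getvalue()

print(len(str_text.splitlines()), len(flat_text.splitlines()))
```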
msg125616 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2011-01-07 03:38
I have a little bit of concern whether or not 'unknown-8bit' is the correct charset to use.  It seems to be the one in the RFCs, but I have a feeling it may not be what is used "in the wild" in headers, so I am looking for opinions.
msg125619 - (view) Author: Stephen J. Turnbull (sjt) * (Python triager) Date: 2011-01-07 04:25
I agree with you that according to RFC 1428, use of unknown-8bit is implicitly recommended.  However, note that the RFC itself is not standards-track.  I agree with your interpretation that in this context the email module should be considered a gateway.  I think it is certainly best to convert to MIME words, as you say.

However, if there isn't already, maybe there should be an option to bounce such headers back to the user?  That is, in an interactive application this should be an error.  Of course we should help the user by allowing and documenting (perhaps even defaulting to) whatever we choose for the unknown encoding.

I don't recall ever seeing unknown-8bit in the wild.  What I do see in the wild a lot, and specifically in Mailman moderation traffic, is simply "unknown".

A quick google for "unknown-8bit" pulled up some old (2002) discussion of unknown-8bit causing problems for some MTAs.  I didn't follow up to see what those were.

I don't have time to do it myself today (but would be willing to help out if you can wait up to two weeks -- I have travel coming up), but I suggest checking for IANA registration of "unknown" and "unknown-8bit".
msg125642 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2011-01-07 12:37
Well, unknown-8bit is registered as a charset with IANA.  It is registered specifically for use in message bodies, but as a registered charset it "should" be acceptable in headers as well.  There is no similar registration for just 'unknown', but it sounds like mailers may be more likely to accept it if it exists in the wild.

I'm hoping to fix this before the RC (which is tomorrow, which means fixing it today), so your suggestion of making the 'unknown charset' token configurable is a good one.  I'm not so worried about providing a way to reject such headers, since this incarnation of email makes a point of not throwing errors on parsing, and if you read binary messages with unknown bytes the best thing to do is generate the outgoing message with BytesGenerator, in which case you get the unknown bytes back without the rfc2047 munging.
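
As a rough illustration of the BytesGenerator route mentioned above (the sample message is invented; the default settings of `BytesGenerator` are assumed):

```python
import email
from io import BytesIO
from email.generator import BytesGenerator

raw = b"Subject: caf\xe9 menu\n\nbody\n"
msg = email.message_from_bytes(raw)

# BytesGenerator writes bytes, so the unknown 8bit byte in the header
# can be emitted as-is instead of being wrapped in an encoded word.
buf = BytesIO()
BytesGenerator(buf).flatten(msg)
data = buf.getvalue()

print(data.splitlines()[0])
```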
msg125648 - (view) Author: Barry A. Warsaw (barry) * (Python committer) Date: 2011-01-07 15:41
I'm a little uncomfortable with relying on a non-standards track RFC for this interpretation, and I'm also not sure I'd say that the email package is a "transport agent", but in cases where it's acting on the user's behalf (i.e. headers created programmatically rather than parsed), I can get on board with that.  Your interpretation and approach to the fix seem reasonable, and I don't have any better ideas.
msg125657 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2011-01-07 16:26
Well, since unknown-8bit is a registered charset, it should be RFC-valid in an encoded word.  Whether or not any other mailer out there is going to be able to handle it is a different question.
msg125728 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2011-01-07 23:28
Committed a revised version of the patch, including doc updates, in r87840.  While I haven't documented the way to alter what encoding name is used for the unknown bytes, I did make it possible to do so (set charset.UNKNOWN8BIT to the desired string).
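
A hedged sketch of that undocumented knob, assuming the attribute lives at `email.charset.UNKNOWN8BIT` and is looked up when the header is generated. Because this changes module-level state for the whole process, the sketch restores the original value afterwards; the replacement name "unknown" is just the one mentioned earlier in this thread.

```python
import email
from email import charset

raw = b"Subject: caf\xe9 menu\n\nbody\n"

# Temporarily swap the charset name used for unknown 8bit bytes.
original = charset.UNKNOWN8BIT
charset.UNKNOWN8BIT = "unknown"
try:
    patched_text = email.message_from_bytes(raw).as_string()
finally:
    # Undocumented global setting: always put it back.
    charset.UNKNOWN8BIT = original

print(patched_text.splitlines()[0])
```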
History
Date User Action Args
2022-04-11 14:57:10  admin  set  github: 54895
2011-01-07 23:28:37  r.david.murray  set  status: open -> closed; nosy: barry, r.david.murray, sjt; messages: + msg125728; resolution: fixed; stage: needs patch -> resolved
2011-01-07 16:26:35  r.david.murray  set  nosy: barry, r.david.murray, sjt; messages: + msg125657
2011-01-07 15:41:35  barry  set  nosy: barry, r.david.murray, sjt; messages: + msg125648
2011-01-07 12:37:16  r.david.murray  set  nosy: barry, r.david.murray, sjt; messages: + msg125642
2011-01-07 04:25:23  sjt  set  nosy: barry, r.david.murray, sjt; messages: + msg125619
2011-01-07 03:38:05  r.david.murray  set  nosy: + sjt; messages: + msg125616
2011-01-07 03:32:25  r.david.murray  set  nosy: + barry
2011-01-07 03:31:53  r.david.murray  set  files: + email_unknown-8bit.patch; messages: + msg125615; keywords: + patch
2010-12-12 18:01:47  r.david.murray  create