email.Generator should use unknown-8bit encoded words for headers with 8 bit data #54895

Closed
bitdancer opened this issue Dec 12, 2010 · 8 comments
Assignees
Labels
type-bug An unexpected behavior, bug, or error

Comments

@bitdancer
Member

BPO 10686
Nosy @warsaw, @bitdancer, @yaseppochi
Files
  • email_unknown-8bit.patch
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = 'https://github.com/bitdancer'
    closed_at = <Date 2011-01-07.23:28:37.421>
    created_at = <Date 2010-12-12.18:01:47.210>
    labels = ['type-bug']
    title = 'email.Generator should use unknown-8bit encoded words for headers with 8 bit data'
    updated_at = <Date 2011-01-07.23:28:37.420>
    user = 'https://github.com/bitdancer'

    bugs.python.org fields:

    activity = <Date 2011-01-07.23:28:37.420>
    actor = 'r.david.murray'
    assignee = 'r.david.murray'
    closed = True
    closed_date = <Date 2011-01-07.23:28:37.421>
    closer = 'r.david.murray'
    components = []
    creation = <Date 2010-12-12.18:01:47.210>
    creator = 'r.david.murray'
    dependencies = []
    files = ['20297']
    hgrepos = []
    issue_num = 10686
    keywords = ['patch']
    message_count = 8.0
    messages = ['123842', '125615', '125616', '125619', '125642', '125648', '125657', '125728']
    nosy_count = 3.0
    nosy_names = ['barry', 'r.david.murray', 'sjt']
    pr_nums = []
    priority = 'high'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue10686'
    versions = ['Python 3.2']

    @bitdancer
    Member Author

    This is a follow-on to bpo-4661. The fix for that issue introduced a way to parse messages containing 8bit bytes. When Generator is called on a model containing 8bit bytes, it converts the message to 7bit clean. There is, however, a bug in this conversion: when it encounters 8bit bytes in headers, it simply replaces them with ?. According to the RFCs[*], it should instead replace them with encoded words using the 'charset' "unknown-8bit".

    [*] I'm specifically referring to RFC 1428...email is effectively acting as a translating gateway when requested to do the 8bit to 7bit conversion. Although that RFC does not explicitly say that the unknown-8bit charset should be used in encoded words, it does imply it strongly in its section 3 prescription.
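
A minimal sketch of the requested behavior, assuming a Python version that includes the fix (3.2 or later) and the default compat32 policy. The sample message and the helper name `flatten_to_text` are made up for illustration. The choice between Q and B encoding is left to the library, so the sketch only checks for the charset label.

```python
import email
from email.generator import Generator
from io import StringIO

# Hypothetical sample: a header carrying one raw 8bit byte (0xE9) with no
# declared charset, as it might arrive off the wire.
RAW = b"Subject: caf\xe9\n\nbody\n"


def flatten_to_text(raw: bytes) -> str:
    """Parse raw bytes, then flatten with the text Generator (7bit output)."""
    msg = email.message_from_bytes(raw)
    out = StringIO()
    Generator(out).flatten(msg)
    return out.getvalue()


text = flatten_to_text(RAW)
# The Subject header should now be an RFC 2047 encoded word labelled
# unknown-8bit rather than "caf?".
print(text)
```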

    @bitdancer bitdancer self-assigned this Dec 12, 2010
    @bitdancer bitdancer added the type-bug An unexpected behavior, bug, or error label Dec 12, 2010
    @bitdancer
    Member Author

    Here is a patch. Three of the tests currently fail due to what appears to be a bug in the Header formatting routines. I'll have to look into that before finishing this issue.

    Note that calling str on a message with binary headers can produce overlong lines, since str does not limit line widths. generator.flatten does, though, so in that case the lengthened lines are correctly rewrapped. (Well, as correctly as Header rewraps any headers, at least, which is not all that well in certain cases.)

    @bitdancer
    Member Author

    I'm a little concerned about whether 'unknown-8bit' is the correct charset to use. It seems to be the one in the RFCs, but I have a feeling it may not be what is used "in the wild" in headers, so I am looking for opinions.

    @yaseppochi
    Mannequin

    yaseppochi mannequin commented Jan 7, 2011

    I agree with you that according to RFC1428, use of unknown-8bit is implicitly recommended. However, note that the RFC itself is not standards-track. I agree with your interpretation that in this context the email module should be considered a gateway. I think it is certainly best to convert to MIME words, as you say.

    However, if there isn't one already, maybe there should be an option to bounce such headers back to the user? That is, in an interactive application this should be an error. Of course we should help the user by allowing and documenting (perhaps even defaulting to) whatever we choose for the unknown encoding.

    I don't recall ever seeing unknown-8bit in the wild. What I do see in the wild a lot, and specifically in Mailman moderation traffic, is simply "unknown".

    A quick google for "unknown-8bit" pulled up some old (2002) discussion of unknown-8bit causing problems for some MTAs. I didn't follow up to see what those were.

    I don't have time to do it myself today (though I would be willing to help out if you can wait up to two weeks, as I have travel coming up), but I suggest checking for IANA registration of "unknown" and "unknown-8bit".

    @bitdancer
    Member Author

    Well, unknown-8bit is registered as a charset with IANA. It is registered specifically for use in message bodies, but as a registered charset it "should" be acceptable in headers as well. There is no similar registration for just 'unknown', but it sounds like mailers may be more likely to accept it if it exists in the wild.

    I'm hoping to fix this before the RC (which is tomorrow, which means fixing it today), so your suggestion of making the 'unknown charset' token configurable is a good one. I'm not so worried about providing a way to reject such headers, since this incarnation of email makes a point of not throwing errors on parsing. If you read binary messages with unknown bytes, the best thing to do is generate the outgoing message with BytesGenerator, in which case you get the unknown bytes back without the rfc2047 munging.
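
The BytesGenerator route mentioned above can be sketched as follows, assuming the default compat32 policy. The sample message and the helper name `flatten_to_bytes` are illustrative only.

```python
import email
from email.generator import BytesGenerator
from io import BytesIO

# Hypothetical sample with one raw 8bit byte (0xE9) in the Subject header.
RAW = b"Subject: caf\xe9\n\nbody\n"


def flatten_to_bytes(raw: bytes) -> bytes:
    """Parse raw bytes, then flatten with BytesGenerator (binary output)."""
    msg = email.message_from_bytes(raw)
    out = BytesIO()
    BytesGenerator(out).flatten(msg)
    return out.getvalue()


data = flatten_to_bytes(RAW)
# The original byte is written back as-is; no encoded word is produced.
print(data)
```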

    @warsaw
    Member

    warsaw commented Jan 7, 2011

    I'm a little uncomfortable with relying on a non-standards track RFC for this interpretation, and I'm also not sure I'd say that the email package is a "transport agent", but in cases where it's acting on the user's behalf (i.e. headers created programmatically rather than parsed), I can get on board with that. Your interpretation and approach to the fix seem reasonable, and I don't have any better ideas.

    @bitdancer
    Member Author

    Well, since unknown-8bit is a registered charset, it should be RFC-valid in an encoded word. Whether or not any other mailer out there is going to be able to handle it is a different question.
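
As a small illustration of what a receiver gets from such an encoded word, here is how Python's own `email.header.decode_header` treats one. The sample header value is made up, and this says nothing about how other mailers behave.

```python
from email.header import decode_header

# Hypothetical encoded word of the kind the generator would emit.
encoded = "=?unknown-8bit?q?caf=E9?="

# decode_header returns (bytes, charset) pairs; the charset label is passed
# through, and no attempt is made to interpret the bytes.
parts = decode_header(encoded)
print(parts)
```

The caller receives the raw bytes together with the `unknown-8bit` label, so the decision about how to display them stays with the receiving application.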

    @bitdancer
    Member Author

    Committed a revised version of the patch, including doc updates, in r87840. While I haven't documented the way to alter what encoding name is used for the unknown bytes, I did make it possible to do so (set charset.UNKNOWN8BIT to the desired string).
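
A hedged sketch of the override described above. This knob is undocumented, so treat it as an assumption that assigning `email.charset.UNKNOWN8BIT` still affects header generation in your Python version. The helper name `as_string_with_token` and the sample message are hypothetical, and the module-level value is restored afterwards because it is global state.

```python
import email
import email.charset


def as_string_with_token(raw: bytes, token: str) -> str:
    """Flatten a parsed message while temporarily overriding UNKNOWN8BIT."""
    saved = email.charset.UNKNOWN8BIT
    email.charset.UNKNOWN8BIT = token  # undocumented knob; assumption
    try:
        return email.message_from_bytes(raw).as_string()
    finally:
        # Restore the default so other code in the process is unaffected.
        email.charset.UNKNOWN8BIT = saved


text = as_string_with_token(b"Subject: caf\xe9\n\nbody\n", "unknown")
print(text)
```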

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022