classification
Title: email.quoprimime over-encodes characters
Type: behavior Stage: resolved
Components: Library (Lib) Versions: Python 3.4
process
Status: closed Resolution: not a bug
Dependencies: Superseder:
Assigned To: Nosy List: gkuenning, r.david.murray
Priority: normal Keywords:

Created on 2017-12-13 01:17 by gkuenning, last changed 2022-04-11 14:58 by admin. This issue is now closed.

Messages (5)
msg308181 - Author: Geoff Kuenning (gkuenning) Date: 2017-12-13 01:17
email.quoprimime creates a map of header and body bytes that need no encoding:

for c in b'-!*+/' + ascii_letters.encode('ascii') + digits.encode('ascii'):
    _QUOPRI_HEADER_MAP[c] = chr(c)

This map is overly restrictive; in fact only two printable characters need to be omitted: the space and the equals sign.  The following revision to the loop creates a correct table:

for c in list(range(33, 61)) + list(range(62, 127)):
    _QUOPRI_HEADER_MAP[c] = chr(c)

Why does this matter?  Well, first, it's wasteful since it creates messages with larger headers than necessary.  But more important, it makes it impossible for other tools to operate on the messages unless they're encoding aware; for example, one can't easily grep for "foo@bar.com" because the at sign is encoded as =40.
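
To see the behavior described above, here is a minimal sketch that calls header_encode from the standard library's email.quoprimime module. The address is just the example from this message, and the charset in the surrounding chrome is whatever header_encode uses by default.

```python
from email.quoprimime import header_encode

# Q-encode a byte string using the module's header map.
encoded = header_encode(b'foo@bar.com')
print(encoded)

# The at sign is not in the map of safe bytes, so it is written as =40.
print('=40' in encoded)
```

A plain-text search for the literal address would therefore miss this encoded form.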
msg308184 - Author: Geoff Kuenning (gkuenning) Date: 2017-12-13 01:28
Oops, that loop is a bit too generous.  Here's a better one:

for c in list(range(33, 61)) + [62] + list(range(64, 95)) + list(range(96,127)):
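
As a quick check of what the revised ranges leave out, this sketch lists the printable ASCII characters (33 through 126) that are absent from them. The variable names are illustrative only.

```python
# Code points accepted by the revised loop.
allowed = set(list(range(33, 61)) + [62] + list(range(64, 95)) + list(range(96, 127)))

# Printable, non-space ASCII characters that the loop skips.
excluded = [chr(c) for c in range(33, 127) if c not in allowed]
print(excluded)  # ['=', '?', '_']
```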
msg308186 - Author: R. David Murray (r.david.murray) * (Python committer) Date: 2017-12-13 02:32
From RFC 2047:

(3) As a replacement for a 'word' entity within a 'phrase', for example,
    one that precedes an address in a From, To, or Cc header.  The ABNF
    definition for 'phrase' from RFC 822 thus becomes:

    phrase = 1*( encoded-word / word )

    In this case the set of characters that may be used in a "Q"-encoded
    'encoded-word' is restricted to: <upper and lower case ASCII
    letters, decimal digits, "!", "*", "+", "-", "/", "=", and "_"
    (underscore, ASCII 95.)>.  An 'encoded-word' that appears within a
    'phrase' MUST be separated from any adjacent 'word', 'text' or
    'special' by 'linear-white-space'.

The reason for this is that things like '@' are syntactically significant in headers and so must be encoded.
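
To make the restriction concrete, the following sketch compares the characters that may appear literally in a phrase, per the RFC text quoted above, with the set proposed in msg308184. It leaves "=" and "_" out of the literal set on the assumption that they are reserved for the escape prefix and for representing a space; the set names are hypothetical.

```python
from string import ascii_letters, digits

# Characters that may stand for themselves in a Q-encoded word inside a phrase.
phrase_literal = set('!*+-/' + ascii_letters + digits)

# Characters the loop proposed in msg308184 would leave unencoded.
proposed = {chr(c) for c in
            list(range(33, 61)) + [62] + list(range(64, 95)) + list(range(96, 127))}

# Characters the proposal would pass through but a phrase does not permit.
not_allowed_in_phrase = sorted(proposed - phrase_literal)
print(''.join(not_allowed_in_phrase))
```

The at sign is among the characters printed, which is why it has to be encoded in this context.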
msg308187 - Author: R. David Murray (r.david.murray) * (Python committer) Date: 2017-12-13 02:54
And of course tools can grep for "foo@bar.com": you can't use encoded words in an address, only in the display name.

However, it occurs to me that in fact the restriction applies only to phrases, so one could use a less restrictive character set in an unstructured header such as the Subject, and that would indeed be nice.  The old header folder (Python 2.7 and the Python 3.x compat32 policy) can't do it, because it doesn't know anything about the syntax of the headers it folds; it just uses a bunch of heuristics.  The new policies in Python 3, however, use a smarter folder from _header_value_parser, and that *does* have access to the full parse tree for the header, and so could make smart decisions about which character set to use for the encoded word encoding.

If you'd like to try your hand at a PR implementing this idea, I'll be happy to provide advice and do a review.  It's not going to be anywhere near as simple as the one line change you proposed here, though :)
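
For a sense of where such a change would take effect, here is a small sketch that folds an unstructured Subject header under the default policy, one of the new policies mentioned above. The subject text is made up, and the exact encoded form may differ between Python versions, so no expected output is shown.

```python
from email import policy
from email.message import EmailMessage

# Build a message under the default policy, which uses the newer header folder.
msg = EmailMessage(policy=policy.default)

# A non-ASCII character forces encoded-word encoding when the header is folded.
msg['Subject'] = 'Caf\u00e9 @ noon'

folded = msg.as_string()
print(folded)
```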
msg308267 - Author: Geoff Kuenning (gkuenning) Date: 2017-12-13 23:45
I should have read that part of RFC 2047 before I submitted.

I'd love to claim that I'm going to write a patch that would do as you suggest.  But the reality is that I'm unlikely to find the time, so I'm going to be wise for once and avoid promising what I can't deliver.
History
Date                 User            Action  Args
2022-04-11 14:58:55  admin           set     github: 76479
2017-12-13 23:45:09  gkuenning       set     messages: + msg308267
2017-12-13 02:54:11  r.david.murray  set     messages: + msg308187
2017-12-13 02:32:04  r.david.murray  set     status: open -> closed; nosy: + r.david.murray; messages: + msg308186; resolution: not a bug; stage: resolved
2017-12-13 01:28:07  gkuenning       set     messages: + msg308184
2017-12-13 01:17:17  gkuenning       create