Message 332290 - Python tracker



Author	mjpieters
Recipients	barry, era, mjpieters, r.david.murray
Date	2018-12-21.11:34:21
Message-id	<1545392061.32.0.788709270274.issue35547@psf.upfronthosting.co.za>

Content

RFC2047 clearly states that an encoder MUST NOT split a multi-byte encoding in the middle of a character (section 5, "Each 'encoded-word' MUST represent an integral number of characters. A multi-octet character may not be split across adjacent 'encoded-word's."). It also states that, to fit length restrictions, CRLF SPACE is used as a delimiter between encoded words (section 2, "If it is desirable to encode more text than will fit in an 'encoded-word' of 75 characters, multiple 'encoded-word's (separated by CRLF SPACE) may be used."). In section 6.2 it states:

   When displaying a particular header field that contains multiple
   'encoded-word's, any 'linear-white-space' that separates a pair of
   adjacent 'encoded-word's is ignored.  (This is to allow the use of
   multiple 'encoded-word's to represent long strings of unencoded text,
   without having to separate 'encoded-word's where spaces occur in the
   unencoded text.)

(linear-white-space is the RFC822 term for foldable whitespace).
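
As a rough illustration of the three quoted rules, the sketch below uses the legacy `email.header` API (`Header`, `decode_header`, `make_header`) rather than the `email._header_value_parser` code this issue is about. The sample text and the helper name `roundtrip` are made up for the example. It encodes a long non-ASCII string into several encoded-words, checks that each word decodes on its own (so no character was split), and then decodes the folded value with the separating whitespace dropped.

```python
from email.header import Header, decode_header, make_header


def roundtrip(text):
    """Encode text as RFC 2047 encoded-words, then decode it again.

    Returns (encoded, words, decoded); purely illustrative helper.
    """
    # Header.encode() folds long values into several encoded-words,
    # one per line, separated by a newline plus a space.
    encoded = Header(text, charset="utf-8").encode()
    # Encoded-words never contain whitespace, so splitting on it
    # yields the individual tokens.
    words = encoded.split()
    # decode_header() + make_header() drop whitespace that only
    # separates adjacent encoded-words (section 6.2).
    decoded = str(make_header(decode_header(encoded)))
    return encoded, words, decoded


text = "\u00fc" * 60  # 60 two-byte UTF-8 characters, too long for one word
encoded, words, decoded = roundtrip(text)

for word in words:
    # Strict decoding raises if a multi-byte character was cut in half.
    (raw, charset), = decode_header(word)
    raw.decode(charset)

print(len(words), decoded == text)
```

If the whitespace between the words were kept, `decoded` would contain stray spaces and no longer equal `text`.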

The parser is leaving spaces between two encoded-word tokens in place, where it must remove them instead. It already removes them correctly for unstructured headers, just not in get_bare_quoted_string, get_atom and get_dot_atom.
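
To see the difference described above, here is a minimal sketch using the public `email.policy.default` header factory. The header values and the address are invented for the example. `Subject` is parsed as an unstructured header, while the display name in `To` goes through the structured-header code (`get_atom` and friends). The result for `To` is deliberately given no expected output, because it depends on the Python version in use.

```python
from email import policy

# Two adjacent encoded-words separated by a single space.
EW_PAIR = "=?utf-8?q?Hello?= =?utf-8?q?World?="

# Unstructured header: the whitespace between the encoded-words is dropped.
subject = policy.default.header_factory("Subject", EW_PAIR)
print(repr(str(subject)))

# Structured header: the same pair used as a display name. On affected
# versions the space is reported as left in place.
to = policy.default.header_factory("To", EW_PAIR + " <user@example.com>")
display_name = to.addresses[0].display_name
print(repr(display_name))
```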

Then there is Postel's law (*be liberal in what you accept from others*), which the email package already applies to RFC2047 elsewhere: RFC2047 also states that "An 'encoded-word' MUST NOT appear within a 'quoted-string'.", yet the quoted-string handling in email._header_value_parser will process EW sections anyway.
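
A small sketch of that liberal behaviour, again using `email.policy.default`; the address is invented for the example. An encoded-word inside a quoted-string is decoded anyway, and the violation is recorded as a defect on the header instead of being rejected.

```python
from email import errors, policy

# RFC 2047 forbids an encoded-word inside a quoted-string, but the
# parser decodes it regardless.
value = '"=?utf-8?q?=C3=89ric?=" <foo@example.com>'
to = policy.default.header_factory("To", value)

display_name = to.addresses[0].display_name
# ascii() keeps the output printable on any terminal encoding.
print(ascii(display_name))

# The violation is still recorded as a defect on the parsed header.
has_defect = any(isinstance(d, errors.InvalidHeaderDefect) for d in to.defects)
print(has_defect)
```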

History
Date	User	Action	Args
2018-12-21 11:34:21	mjpieters	set	recipients: + mjpieters, barry, r.david.murray, era
2018-12-21 11:34:21	mjpieters	set	messageid: <1545392061.32.0.788709270274.issue35547@psf.upfronthosting.co.za>
2018-12-21 11:34:21	mjpieters	link	issue35547 messages
2018-12-21 11:34:21	mjpieters	create