Message 68750 - Python tracker


Author	salty-horse
Recipients	aalbrecht, barry, cjw296, salty-horse, splorgdar
Date	2008-06-25.19:42:04
Message-id	<1214422926.06.0.764906749021.issue1974@psf.upfronthosting.co.za>

Content

I think there's been a little misinterpretation of the standard in the
comments above.

It's important to note that RFC 2822 basically defines folding as
"adding a CRLF before an existing whitespace in the original message". 

See http://tools.ietf.org/html/rfc2822#section-2.2.3

It does *not* allow prepending folded lines with extra characters that
were not in the original message such as '\t' or ' '.

This is exactly what _encode_chunks does in header.py:
    joiner = NL + self._continuation_ws

(Note that the email package docs and the Header docstring use the word
'prepend', which reflects the error in the code.)

With a correct implementation, why would I want a choice of which type
of character to line-break on when folding?
The whole notion of controlling the value of continuation_ws seems wrong.
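
To make the distinction concrete, here is a minimal sketch of folding as the RFC describes it: the line break is only ever inserted in front of whitespace that is already in the value, so no continuation_ws parameter is needed and unfolding gives back the original text. The names fold and unfold, and the 78-character limit applied to the value alone (ignoring the field name), are my own assumptions for illustration. The last few lines are a simplified stand-in for the joiner quoted above, not the actual _encode_chunks code, and they assume the chunks were split on whitespace and stripped.

```python
import re

CRLF = "\r\n"

def fold(value, maxlen=78):
    """Fold only by inserting CRLF in front of whitespace already in value."""
    lines = []
    rest = value
    while len(rest) > maxlen:
        # Skip leading whitespace so a folded line is never whitespace-only.
        start = len(rest) - len(rest.lstrip(" \t")) + 1
        # Last space or tab that keeps the current line within maxlen.
        cut = max(rest.rfind(" ", start, maxlen + 1),
                  rest.rfind("\t", start, maxlen + 1))
        if cut < 0:
            break  # nowhere to fold: leave the remainder as one long line
        lines.append(rest[:cut])
        rest = rest[cut:]  # the whitespace stays, now leading the next line
    lines.append(rest)
    return CRLF.join(lines)

def unfold(value):
    """Undo folding: drop a line break that is directly followed by whitespace."""
    return re.sub(r"\r?\n(?=[ \t])", "", value)

l = ['<%d@dom.ain>' % i for i in range(8)]
for sep in (' ', '\t', '    '):
    original = sep.join(l)
    # Whatever the separator, nothing is added and nothing is lost.
    assert unfold(fold(original)) == original

# Stand-in for "joiner = NL + continuation_ws" with a '\t' continuation
# (assumption: chunks were split on a single space and stripped).
chunks = ['<0@dom.ain>', '<1@dom.ain>']
prepended = ('\n' + '\t').join(chunks)
assert unfold(prepended) != ' '.join(chunks)  # the space has become a tab
```

The point of the sketch is only that a fold which reuses existing whitespace round-trips by construction, while a joiner that prepends its own character does so only when that character happens to match what was there.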

However, changing the default continuation_ws to ' ', as the patch
suggests, will output syntactically correct headers in the majority of
cases (due to other bugs that remove trailing whitespace and merge
consecutive whitespace into one character).


All in all, I agree with the change of the default continuation_ws due
to its lucky side-effects, but as Barry hinted, the algorithm needs some
serious work to really output valid headers.

Some examples of the good and bad behaviors:

>>> from email.Header import Header
>>> l = ['<%d@dom.ain>' % i for i in range(8)]

>>> # this turns out fine
>>> Header(' '.join(l), continuation_ws=' ').encode()
'<0@dom.ain> <1@dom.ain> <2@dom.ain> <3@dom.ain> <4@dom.ain> <5@dom.ain>\n <6@dom.ain> <7@dom.ain>'

# This does not fold even though it should
>>> Header('\t'.join(l), continuation_ws=' ').encode()
'<0@dom.ain>\t<1@dom.ain>\t<2@dom.ain>\t<3@dom.ain>\t<4@dom.ain>\t<5@dom.ain>\t<6@dom.ain>\t<7@dom.ain>'

# And here the 4-character whitespace is collapsed into a single space
>>> Header('    '.join(l), continuation_ws=' ').encode()
'<0@dom.ain> <1@dom.ain> <2@dom.ain> <3@dom.ain> <4@dom.ain> <5@dom.ain>\n <6@dom.ain> <7@dom.ain>'
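
One way to judge these three results against the RFC's definition is to ask two questions of each output: does unfolding it give back the input, and does every line stay within 78 characters? The sketch below does that using the outputs quoted above as string literals rather than re-running the email package, whose behaviour may differ between versions. The helper names unfold and check are hypothetical, and the length check looks at the value only, ignoring the field name.

```python
import re

def unfold(value):
    # RFC 2822 unfolding: drop a line break that is directly followed by whitespace.
    return re.sub(r"\r?\n(?=[ \t])", "", value)

def check(original, encoded, maxlen=78):
    """Return (round_trips, fits) for an encoded header value."""
    round_trips = unfold(encoded) == original
    fits = all(len(line) <= maxlen for line in re.split(r"\r?\n", encoded))
    return round_trips, fits

l = ['<%d@dom.ain>' % i for i in range(8)]
# The folded output reported for both the ' ' and the '    ' inputs.
folded = ' '.join(l[:6]) + '\n ' + ' '.join(l[6:])

print(check(' '.join(l), folded))         # (True, True)
print(check('\t'.join(l), '\t'.join(l)))  # (True, False)
print(check('    '.join(l), folded))      # (False, True)
```

So the first case is valid, the second keeps its content but leaves a 95-character line unfolded, and the third fits the line limit but no longer unfolds to the original value.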
