Message 84595 - Python tracker

Author	tony_nelson
Recipients	barry, tony_nelson
Date	2009-03-30.17:59:34
Message-id	<1238435977.23.0.604038870745.issue5610@psf.upfronthosting.co.za>

Content

feedparser.py does not parse mixed newlines properly.  NLCRE_eol, which
is used to search for the various newlines at End Of Line, uses $ to
match the end of string, but $ also matches just before a final \n at
the end of the string, due to a wise long-ago patch by the Effbot.  For
a line ending in '\r\n\n', this causes feedparser to match the '\r\n'
rather than the final '\n', and then to remove the last two characters
of the line, leaving a trailing '\r', thus eating up a line.  Such mixed
line endings can occur if a message with CRLF line endings is parsed,
written out, and then parsed again.
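
The behaviour can be reproduced outside the email package.  This is a
minimal sketch, not the feedparser code itself: the pattern is the one
quoted in this report (written as a raw string), and strip_eol is a
hypothetical helper that stands in for the step that removes the matched
characters from the end of the line.

```python
import re

# Pattern as quoted in this report, written here as a raw string.
NLCRE_eol_dollar = re.compile(r'(\r\n|\r|\n)$')


def strip_eol(line, pattern):
    """Hypothetical helper: drop len(match) characters from the end of line."""
    mo = pattern.search(line)
    if mo is None:
        return line
    return line[:-len(mo.group(0))]


mixed = 'abc\r\n\n'   # CRLF followed by a stray LF
mo = NLCRE_eol_dollar.search(mixed)

print(repr(mo.group(0)))                         # '\r\n'
print(mo.span())                                 # (3, 5): not at the end of the string
print(repr(strip_eol(mixed, NLCRE_eol_dollar)))  # 'abc\r'
```

The match stops one character short of the end of the string, so
removing two characters from the end takes the wrong ones.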

When explicitly searching for various newlines, the \Z end-of-string
marker should be used instead.  There are two improper uses of $ in
feedparser.py.  I don't see any others in the email package.

NLCRE_eol = re.compile('(\r\n|\r|\n)$')

should be:

NLCRE_eol = re.compile('(\r\n|\r|\n)\Z')

and boundary_re also needs the fix.
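
A quick check of the \Z version, again as a standalone sketch rather
than the library code (the raw-string spelling and the sample strings
are assumptions made for illustration):

```python
import re

# Proposed pattern from this report, written here as a raw string.
NLCRE_eol = re.compile(r'(\r\n|\r|\n)\Z')

mixed = 'abc\r\n\n'   # CRLF followed by a stray LF
mo = NLCRE_eol.search(mixed)

print(repr(mo.group(0)))                # '\n'
print(mo.span())                        # (5, 6): the true end of the string
print(repr(mixed[:-len(mo.group(0))]))  # 'abc\r\n'

# Ordinary single line endings are still matched whole.
for eol in ('\r\n', '\r', '\n'):
    assert NLCRE_eol.search('abc' + eol).group(0) == eol
```

With \Z only the final '\n' is matched, so the preceding '\r\n' is left
intact.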

I can write a test.  Where exactly should it be put?
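
As a starting point, here is a rough sketch of such a test.  It only
exercises the proposed pattern in isolation; the class and method names
are made up, and where it belongs in the email test suite is still the
open question.

```python
import re
import unittest

# Proposed pattern from this report, written here as a raw string.
NLCRE_eol = re.compile(r'(\r\n|\r|\n)\Z')


class TestEolPattern(unittest.TestCase):
    """Hypothetical test case; names are illustrative only."""

    def test_single_line_endings(self):
        for eol in ('\r\n', '\r', '\n'):
            self.assertEqual(NLCRE_eol.search('abc' + eol).group(0), eol)

    def test_mixed_line_endings(self):
        # Only the final LF should match, at the very end of the string.
        mo = NLCRE_eol.search('abc\r\n\n')
        self.assertEqual(mo.group(0), '\n')
        self.assertEqual(mo.start(), 5)

    def test_no_line_ending(self):
        self.assertIsNone(NLCRE_eol.search('abc'))
```

Saved in a file, this can be run with python -m unittest.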

History
Date	User	Action	Args
2009-03-30 17:59:37	tony_nelson	set	recipients: + tony_nelson, barry
2009-03-30 17:59:37	tony_nelson	set	messageid: <1238435977.23.0.604038870745.issue5610@psf.upfronthosting.co.za>
2009-03-30 17:59:36	tony_nelson	link	issue5610 messages
2009-03-30 17:59:35	tony_nelson	create