Author tlynn
Recipients jafo, kael, tlynn
Date 2009-02-03.17:01:59
Message-id <1233680522.35.0.022264315991.issue1079@psf.upfronthosting.co.za>
Content
The only difference between the two regexps is that the email/header.py
version looks for::

  (?=[ \t]|$)           # whitespace or the end of the string

at the end (with re.MULTILINE, so $ also matches just before a '\n').
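
A minimal sketch of what that trailing lookahead changes. It uses a simplified pattern, so ``BASE``, ``without_lookahead``, ``with_lookahead`` and ``found`` are illustrative names, not the exact regexps from either module:

```python
import re

# Simplified encoded-word pattern (an assumption for illustration only).
BASE = r'=\?(?P<charset>[^?]*?)\?(?P<encoding>[qQbB])\?(?P<encoded>.*?)\?='

without_lookahead = re.compile(BASE, re.MULTILINE)
# Same pattern, but the match must be followed by whitespace or end of line.
with_lookahead = re.compile(BASE + r'(?=[ \t]|$)', re.MULTILINE)

h = '=?utf-8?q?=E2=98=BA?=(unicode white smiling face)'
h_spaced = h.replace('?=(', '?= (')


def found(pattern, text):
    """Return True if the pattern finds an encoded-word anywhere in text."""
    return pattern.search(text) is not None


print(found(without_lookahead, h))      # True
print(found(with_lookahead, h))         # False: '(' follows the '?='
print(found(with_lookahead, h_spaced))  # True: a space follows the '?='
```

With the lookahead, the encoded-word is only recognized when whitespace or a line end follows it, which is the stricter behaviour discussed here.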

To expand on "There is nothing about that thing in RFC 2047": the RFC says::

   IMPORTANT: 'encoded-word's are designed to be recognized as 'atom's
   by an RFC 822 parser.

RFC 822 says::

   atom        =  1*<any CHAR except specials, SPACE and CTLs>
      ...
   specials    =  "(" / ")" / "<" / ">" / "@"  ; Must be in quoted-
               /  "," / ";" / ":" / "\" / <">  ;  string, to use
               /  "." / "[" / "]"              ;  within a word.
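
As a rough illustration of that grammar, the sketch below scans the leading atom of a string. ``SPECIALS``, ``ATOM_CHARS`` and ``leading_atom`` are hypothetical helpers written for this message, not part of the email package:

```python
# RFC 822 specials, as listed above.
SPECIALS = '()<>@,;:\\".[]'

# CHAR is 7-bit ASCII; excluding CTLs (0-31, 127) and SPACE (32) leaves 33-126.
ATOM_CHARS = {chr(c) for c in range(33, 127)} - set(SPECIALS)


def leading_atom(text):
    """Return the longest prefix of text made up only of atom characters."""
    n = 0
    while n < len(text) and text[n] in ATOM_CHARS:
        n += 1
    return text[:n]


h = '=?utf-8?q?=E2=98=BA?=(unicode white smiling face)'
print(leading_atom(h))  # =?utf-8?q?=E2=98=BA?=
```

The atom stops at '(' because it is a special, so an RFC 822 parser would see the encoded-word as a complete token even without a following space.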

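Since '(' is one of the specials, it ends an atom (and hence an
encoded-word) without any whitespace. So an example of mis-parsing is::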

   >>> import email.header
   >>> h = '=?utf-8?q?=E2=98=BA?=(unicode white smiling face)'
   >>> email.header.decode_header(h)
   [('=?utf-8?q?=E2=98=BA?=(unicode white smiling face)', None)]

The correct result would be::

   >>> email.header.decode_header(h)
   [('\xe2\x98\xba', 'utf-8'), ('(unicode white smiling face)', None)]

which is what you get if you insert a space before the '(' in h.
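
For reference, the bytes in the expected result can be checked by decoding the Q-encoded payload by hand. This is only a sketch using ``binascii`` directly, not a description of how email.header does it internally; ``payload``, ``raw`` and ``text`` are illustrative names:

```python
import binascii

payload = b'=E2=98=BA'  # the encoded text between '?q?' and '?='

# header=True applies the header variant of quoted-printable ('_' means space).
raw = binascii.a2b_qp(payload, header=True)
text = raw.decode('utf-8')

print(repr(raw))       # b'\xe2\x98\xba'
print(hex(ord(text)))  # 0x263a, WHITE SMILING FACE
```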