Message 216908 - Python tracker

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

Author	valhallasw
Recipients	barry, r.david.murray, valhallasw
Date	2014-04-20.15:58:15
SpamBayes Score	-1.0
Marked as misclassified	Yes
Message-id	<1398009498.59.0.139606640186.issue21315@psf.upfronthosting.co.za>
In-reply-to

Content
Bugzilla sends e-mail in a format where =?UTF-8 is not preceded by whitespace. This makes email.headerregistry.UnstructuredHeader (and email._header_value_parser on the background) not recognise the structure. >>> import email.headerregistry, pprint >>> x = {}; email.headerregistry.UnstructuredHeader.parse('[Bug 64155]\tNew:=?UTF-8?Q?=20non=2Dascii=20bug=20t=C3=A9st?=;\trussian text:=?UTF-8?Q?=20=D0=90=D0=91=D0=92=D0=93=D2=90=D0=94', x); pprint.pprint(x) {'decoded': '[Bug 64155]\tNew:=?UTF-8?Q?=20non=2Dascii=20bug=20t=C3=A9st?=;\t' 'russian text:=?UTF-8?Q?=20=D0=90=D0=91=D0=92=D0=93=D2=90=D0=94', 'parse_tree': UnstructuredTokenList([ValueTerminal('[Bug'), WhiteSpaceTerminal(' '), ValueTerminal('64155]'), WhiteSpaceTerminal('\t'), ValueTerminal('New:=?UTF-8?Q?=20non=2Dascii=20bug=20t=C3=A9st?=;'), WhiteSpaceTerminal('\t'), ValueTerminal('russian'), WhiteSpaceTerminal(' '), ValueTerminal('text:=?UTF-8?Q?=20=D0=90=D0=91=D0=92=D0=93=D2=90=D0=94')])} versus >>> x = {}; email.headerregistry.UnstructuredHeader.parse('[Bug 64155]\tNew: =?UTF-8?Q?=20non=2Dascii=20bug=20t=C3=A9st?=;\trussian text: =?UTF-8?Q?=20=D0=90=D0=91=D0=92=D0=93=D2=90=D0=94', x); pprint.pprint(x) {'decoded': '[Bug 64155]\tNew: non-ascii bug tést;\trussian text: АБВГҐД', 'parse_tree': UnstructuredTokenList([ValueTerminal('[Bug'), WhiteSpaceTerminal(' '), ValueTerminal('64155]'), WhiteSpaceTerminal('\t'), ValueTerminal('New:'), WhiteSpaceTerminal(' '), EncodedWord([WhiteSpaceTerminal(' '), ValueTerminal('non-ascii'), WhiteSpaceTerminal(' '), ValueTerminal('bug'), WhiteSpaceTerminal(' '), ValueTerminal('tést')]), ValueTerminal(';'), WhiteSpaceTerminal('\t'), ValueTerminal('russian'), WhiteSpaceTerminal(' '), ValueTerminal('text:'), WhiteSpaceTerminal(' '), EncodedWord([WhiteSpaceTerminal(' '), ValueTerminal('АБВГҐД')])])} I have attached the raw e-mail as attachment. Judging by the code, this is supposed to work (while raising a Defect -- "missing whitespace before encoded word"), but the code splits by whitespace: tok, *remainder = _wsp_splitter(value, 1) which swallows the encoded section in one go. In a second attachment, I added a patch which 1) adds a test case for this and 2) implements a solution, but the solution is unfortunately not in the style of the rest of the module. In the meanwhile, I've chosen a monkey-patching approach to work around the issue: import email._header_value_parser, email.headerregistry def get_unstructured(value): value = value.replace("=?UTF-8?Q?=20", " =?UTF-8?Q?") return email._header_value_parser.get_unstructured(value) email.headerregistry.UnstructuredHeader.value_parser = staticmethod(get_unstructured)

Bugzilla sends e-mail in a format where =?UTF-8 is not preceded by whitespace. This makes email.headerregistry.UnstructuredHeader (and email._header_value_parser on the background) not recognise the structure.

>>> import email.headerregistry, pprint
>>> x = {}; email.headerregistry.UnstructuredHeader.parse('[Bug 64155]\tNew:=?UTF-8?Q?=20non=2Dascii=20bug=20t=C3=A9st?=;\trussian text:=?UTF-8?Q?=20=D0=90=D0=91=D0=92=D0=93=D2=90=D0=94', x); pprint.pprint(x)
{'decoded': '[Bug 64155]\tNew:=?UTF-8?Q?=20non=2Dascii=20bug=20t=C3=A9st?=;\t'
            'russian text:=?UTF-8?Q?=20=D0=90=D0=91=D0=92=D0=93=D2=90=D0=94',
 'parse_tree': UnstructuredTokenList([ValueTerminal('[Bug'), WhiteSpaceTerminal(' '), ValueTerminal('64155]'), WhiteSpaceTerminal('\t'), ValueTerminal('New:=?UTF-8?Q?=20non=2Dascii=20bug=20t=C3=A9st?=;'), WhiteSpaceTerminal('\t'), ValueTerminal('russian'), WhiteSpaceTerminal(' '), ValueTerminal('text:=?UTF-8?Q?=20=D0=90=D0=91=D0=92=D0=93=D2=90=D0=94')])}

versus

>>> x = {}; email.headerregistry.UnstructuredHeader.parse('[Bug 64155]\tNew: =?UTF-8?Q?=20non=2Dascii=20bug=20t=C3=A9st?=;\trussian text: =?UTF-8?Q?=20=D0=90=D0=91=D0=92=D0=93=D2=90=D0=94', x); pprint.pprint(x)
{'decoded': '[Bug 64155]\tNew:  non-ascii bug tést;\trussian text:  АБВГҐД',
 'parse_tree': UnstructuredTokenList([ValueTerminal('[Bug'), WhiteSpaceTerminal(' '), ValueTerminal('64155]'), WhiteSpaceTerminal('\t'), ValueTerminal('New:'), WhiteSpaceTerminal(' '), EncodedWord([WhiteSpaceTerminal(' '), ValueTerminal('non-ascii'), WhiteSpaceTerminal(' '), ValueTerminal('bug'), WhiteSpaceTerminal(' '), ValueTerminal('tést')]), ValueTerminal(';'), WhiteSpaceTerminal('\t'), ValueTerminal('russian'), WhiteSpaceTerminal(' '), ValueTerminal('text:'), WhiteSpaceTerminal(' '), EncodedWord([WhiteSpaceTerminal(' '), ValueTerminal('АБВГҐД')])])}

I have attached the raw e-mail as attachment.


Judging by the code, this is supposed to work (while raising a Defect --  "missing whitespace before encoded word"), but the code splits by whitespace:

tok, *remainder = _wsp_splitter(value, 1)

which swallows the encoded section in one go. In a second attachment, I added a patch which 1) adds a test case for this and 2) implements a solution, but the solution is unfortunately not in the style of the rest of the module.


In the meanwhile, I've chosen a monkey-patching approach to work around the issue:

import email._header_value_parser, email.headerregistry
def get_unstructured(value):
    value = value.replace("=?UTF-8?Q?=20", " =?UTF-8?Q?")
    return email._header_value_parser.get_unstructured(value)
email.headerregistry.UnstructuredHeader.value_parser = staticmethod(get_unstructured)

History
Date	User	Action	Args
2014-04-20 15:58:19	valhallasw	set	recipients: + valhallasw, barry, r.david.murray
2014-04-20 15:58:18	valhallasw	set	messageid: <1398009498.59.0.139606640186.issue21315@psf.upfronthosting.co.za>
2014-04-20 15:58:18	valhallasw	link	issue21315 messages
2014-04-20 15:58:17	valhallasw	create