Issue 4958: email/header.py ecre regular expression issue

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/49208

classification

Title:	email/header.py ecre regular expression issue
Type:	behavior	Stage:
Components:	Library (Lib)	Versions:	Python 2.6, Python 2.5

process

Status:	closed	Resolution:	duplicate
Dependencies:		Superseder:	decode_header does not follow RFC 2047 View: 1079
Assigned To:		Nosy List:	ggenellina, oxij, tlynn
Priority:	normal	Keywords:

Created on 2009-01-15 23:33 by oxij, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Messages (4)
msg79927 - (view)	Author: Jan Malakhovski (oxij)	Date: 2009-01-15 23:33
Hello. I have a dedicated mail server at home that holds about 1G of mail. Most of the mail is in non-UTF-8 code pages, so today I wrote a little script to recode all messages to UTF-8. But I found that email.header.decode_header parses some headers incorrectly.

For example, the header

Content-Type: application/x-msword; name="2008 =?windows-1251?B?wu7v8O7x+w==?= 2 =?windows-1251?B?4+7kIDgwONUwMC5kb2M=?="

is parsed as

[('application/x-msword; name="2008', None), ('\xc2\xee\xef\xf0\xee\xf1\xfb', 'windows-1251'), ('2 =?windows-1251?B?4+7kIDgwONUwMC5kb2M=?="', None)]

which is obviously wrong: the second encoded word is left undecoded.

I'm now experimenting with the email/header.py file from the Python 2.5 Debian package (it is the same in 2.6.1, except that every <> was changed to !=). If it is patched with

==================BEGIN CUT==================
--- oldheader.py 2009-01-16 01:47:32.553130030 +0300
+++ header.py 2009-01-16 01:47:16.783119846 +0300
@@ -39,7 +39,6 @@
   \?                    # literal ?
   (?P<encoded>.*?)      # non-greedy up to the next ?= is the encoded string
   \?=                   # literal ?=
-  (?=[ \t]|$)           # whitespace or the end of the string
   ''', re.VERBOSE | re.IGNORECASE | re.MULTILINE)

 # Field name regexp, including trailing colon, but not separating whitespace,
==================END CUT==================

it works fine. So I wonder whether the line

(?=[ \t]|$) # whitespace or the end of the string

is really needed. After all, if there is only whitespace after an encoded word, it is simply appended to the list by parts = ecre.split(line).

Also, there is a related mailing list thread: http://mail.python.org/pipermail/python-dev/2009-January/085088.html
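To make the effect of the lookahead concrete, here is a minimal sketch in Python 3 that compiles the pattern with and without the `(?=[ \t]|$)` line and runs both against the header value from this report. The names `strict`, `loose`, `VALUE`, and `decode_words` are made up for illustration. The first five lines of the pattern are not shown in the diff and are filled in here as an assumption about what header.py contained at the time.

```python
import base64
import re

# Body of the encoded-word pattern. The last two lines come from the diff
# above; the first five are an assumption about the rest of ecre.
BODY = r'''
  =\?                   # literal =?
  (?P<charset>[^?]*?)   # non-greedy up to the next ? is the charset
  \?                    # literal ?
  (?P<encoding>[qb])    # either a "q" or a "b", case insensitive
  \?                    # literal ?
  (?P<encoded>.*?)      # non-greedy up to the next ?= is the encoded string
  \?=                   # literal ?=
'''
# The line the patch removes.
LOOKAHEAD = r'''
  (?=[ \t]|$)           # whitespace or the end of the string
'''
FLAGS = re.VERBOSE | re.IGNORECASE | re.MULTILINE

strict = re.compile(BODY + LOOKAHEAD, FLAGS)  # pattern before the patch
loose = re.compile(BODY, FLAGS)               # pattern after the patch

# Header value from the report (without the "Content-Type: " field name).
VALUE = ('application/x-msword; name="2008 =?windows-1251?B?wu7v8O7x+w==?= '
         '2 =?windows-1251?B?4+7kIDgwONUwMC5kb2M=?="')


def decode_words(pattern, text):
    """Decode every base64 encoded-word that `pattern` finds in `text`."""
    words = []
    for match in pattern.finditer(text):
        if match.group('encoding').lower() == 'b':
            raw = base64.b64decode(match.group('encoded'))
            words.append(raw.decode(match.group('charset')))
    return words


strict_words = decode_words(strict, VALUE)
loose_words = decode_words(loose, VALUE)
print(len(strict_words), len(loose_words))
```

With the lookahead, only the first encoded word matches, because the second one is followed by a double quote rather than whitespace or the end of the string. Without it, both match.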
msg79938 - (view)	Author: Gabriel Genellina (ggenellina)	Date: 2009-01-16 07:32
Your example header is invalid. Excerpt from RFC 2047 <http://www.ietf.org/rfc/rfc2047.txt>, section 5:

+ An 'encoded-word' MUST NOT be used in parameter of a MIME Content-Type or Content-Disposition field, or in any structured field body except within a 'comment' or 'phrase'.

Even in the places where an "encoded word" (the sequence =?...?=) is allowed, it must always be surrounded by whitespace; this is by design in the RFC.

If you have many of those invalid headers, you'll have to "cook" the output of decode_header, possibly detecting malformed sequences and calling decode_header again with just the offending substring.

I don't think that Python should accept malformed headers, but if you come up with a good solution you may publish the recipe in the Python Cookbook <http://www.activestate.com/ASPN/Python/Cookbook/>.

I'd close this report as invalid.
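One possible shape for such a "cooking" pass, as a rough Python 3 sketch rather than a vetted recipe: scan every part that came back without a charset for a leftover =?...?= sequence, and hand only that substring to decode_header again. `LEFTOVER`, `cook`, and `REPORTED` are hypothetical names. `REPORTED` is the parse result quoted in the first message, written as bytes because Python 3's decode_header returns decoded words as bytes.

```python
import re
from email.header import decode_header

# Hypothetical helper pattern: anything that still looks like an encoded word.
LEFTOVER = re.compile(rb'=\?[^?]*\?[qb]\?.*?\?=', re.IGNORECASE)

# The parse result quoted in the report, as (bytes, charset) pairs.
REPORTED = [
    (b'application/x-msword; name="2008', None),
    (b'\xc2\xee\xef\xf0\xee\xf1\xfb', 'windows-1251'),
    (b'2 =?windows-1251?B?4+7kIDgwONUwMC5kb2M=?="', None),
]


def cook(parts):
    """Re-decode encoded words left inside undecoded (charset None) parts."""
    cooked = []
    for text, charset in parts:
        if charset is not None:
            cooked.append((text, charset))  # already decoded, keep as-is
            continue
        pos = 0
        for match in LEFTOVER.finditer(text):
            cooked.append((text[pos:match.start()], None))
            # Pass only the offending substring back to decode_header.
            cooked.extend(decode_header(match.group(0).decode('ascii')))
            pos = match.end()
        cooked.append((text[pos:], None))
    # Drop the empty fragments produced around the matches.
    return [(text, charset) for text, charset in cooked if text]


cooked = cook(REPORTED)
for text, charset in cooked:
    print(text.decode(charset or 'ascii'), charset)
```

The sketch deliberately leaves the surrounding text (the `2 ` and the closing quote) as separate undecoded parts. Whether to merge them back into a single string is up to the caller.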
msg81069 - (view)	Author: Tom Lynn (tlynn)	Date: 2009-02-03 17:05
Duplicates issue1047.
msg81070 - (view)	Author: Tom Lynn (tlynn)	Date: 2009-02-03 17:06
Oops, duplicates issue 1079 even.

History
Date	User	Action	Args
2022-04-11 14:56:44	admin	set	github: 49208
2009-03-27 20:50:57	amaury.forgeotdarc	set	status: open -> closed resolution: duplicate superseder: decode_header does not follow RFC 2047
2009-02-03 17:06:36	tlynn	set	messages: + msg81070
2009-02-03 17:05:27	tlynn	set	nosy: + tlynn messages: + msg81069
2009-01-16 07:32:07	ggenellina	set	nosy: + ggenellina messages: + msg79938
2009-01-15 23:33:08	oxij	create