classification
Title: email/header.py ecre regular expression issue
Type: behavior
Stage:
Components: Library (Lib)
Versions: Python 2.6, Python 2.5

process
Status: closed
Resolution: duplicate
Dependencies:
Superseder: decode_header does not follow RFC 2047 (issue 1079)
Assigned To:
Nosy List: ggenellina, oxij, tlynn
Priority: normal
Keywords:

Created on 2009-01-15 23:33 by oxij, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Messages (4)
msg79927 - Author: Jan Malakhovski (oxij) Date: 2009-01-15 23:33
Hello.

I have a dedicated mail server at home
that holds about 1 GB of mail.
Most of the mail is in non-UTF-8 code pages, so today
I wrote a little script that should recode
all the messages to UTF-8. But I found that
email.header.decode_header parses some headers incorrectly.

For example, the header
Content-Type: application/x-msword; name="2008
=?windows-1251?B?wu7v8O7x+w==?= 2 =?windows-1251?B?4+7kIDgwONUwMC5kb2M=?="
is parsed as
[('application/x-msword; name="2008', None),
('\xc2\xee\xef\xf0\xee\xf1\xfb', 'windows-1251'), ('2
=?windows-1251?B?4+7kIDgwONUwMC5kb2M=?="', None)]
which is obviously wrong: the second encoded word is left undecoded.

I'm now experimenting with the email/header.py file from the
Python 2.5 Debian package
(it is the same in version 2.6.1, except that every <> is changed to !=).

If it is patched with
==================BEGIN CUT==================
--- oldheader.py	2009-01-16 01:47:32.553130030 +0300
+++ header.py	2009-01-16 01:47:16.783119846 +0300
@@ -39,7 +39,6 @@
   \?                    # literal ?
   (?P<encoded>.*?)      # non-greedy up to the next ?= is the encoded string
   \?=                   # literal ?=
-  (?=[ \t]|$)           # whitespace or the end of the string
   ''', re.VERBOSE | re.IGNORECASE | re.MULTILINE)
 
 # Field name regexp, including trailing colon, but not separating whitespace,
==================END CUT==================
it works fine.

So I wonder whether this
  (?=[ \t]|$)           # whitespace or the end of the string
is really needed. After all, if there is only
whitespace after an encoded word, it is just
appended to the list by

parts = ecre.split(line)
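
To make the effect of that lookahead concrete, here is a minimal sketch that compiles the pattern with and without it and runs both over the example header. The first four lines of the pattern are not shown in the diff above and are reconstructed here as an assumption; the names `PATTERN`, `LOOKAHEAD`, `strict` and `relaxed` are illustrative and not part of email/header.py.

```python
import base64
import re

# Assumption: lines above the diff context are reconstructed, not quoted
# from the report; only the last three pattern lines appear in the patch.
PATTERN = r'''
  =\?                   # literal =?
  (?P<charset>[^?]*?)   # non-greedy up to the next ? is the charset
  \?                    # literal ?
  (?P<encoding>[qb])    # either a "q" or a "b", case insensitive
  \?                    # literal ?
  (?P<encoded>.*?)      # non-greedy up to the next ?= is the encoded string
  \?=                   # literal ?=
'''
LOOKAHEAD = r'(?=[ \t]|$)   # whitespace or the end of the string'
FLAGS = re.VERBOSE | re.IGNORECASE | re.MULTILINE

strict = re.compile(PATTERN + LOOKAHEAD, FLAGS)   # pattern before the patch
relaxed = re.compile(PATTERN, FLAGS)              # pattern after the patch

header = ('Content-Type: application/x-msword; name="2008 '
          '=?windows-1251?B?wu7v8O7x+w==?= 2 '
          '=?windows-1251?B?4+7kIDgwONUwMC5kb2M=?="')

strict_words = [m.group('encoded') for m in strict.finditer(header)]
relaxed_words = [m.group('encoded') for m in relaxed.finditer(header)]

# The second encoded word is followed by a double quote, not whitespace,
# so only the relaxed pattern finds it.
print(len(strict_words), len(relaxed_words))
for word in relaxed_words:
    print(base64.b64decode(word))
```

The second encoded word is immediately followed by the closing quote, so the lookahead rejects it and it stays inside the trailing undecoded chunk.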

--
Also, there is a related mailing list thread:
http://mail.python.org/pipermail/python-dev/2009-January/085088.html
msg79938 - Author: Gabriel Genellina (ggenellina) Date: 2009-01-16 07:32
Your example header is invalid. Excerpt from RFC2047
<http://www.ietf.org/rfc/rfc2047.txt> section 5:

   + An 'encoded-word' MUST NOT be used in parameter of a MIME
     Content-Type or Content-Disposition field, or in any structured
     field body except within a 'comment' or 'phrase'.

Even in the places where an "encoded word" (the sequence =?...?=) is 
allowed, it must always be surrounded by whitespace -- this is by 
design in the RFC.

If you have many of those invalid headers, you'll have to "cook" the
output of decode_header, possibly detecting malformed sequences and
calling decode_header again with just the offending substring.
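
One way such a "cooking" step might look is sketched below, assuming Python 3. The names `LOOSE_WORD`, `to_text` and `cook` are hypothetical helpers, not part of the email package. The sketch rescans every chunk that decode_header returned without a charset and calls decode_header again on each `=?...?=` run found inside it. It does not handle unknown charsets or the HeaderParseError that decode_header can raise on bad base64.

```python
import re
from email.header import decode_header

# Hypothetical helper: matches any =?charset?q-or-b?text?= run, with no
# requirement that it be surrounded by whitespace.
LOOSE_WORD = re.compile(r'=\?[^?]*\?[qb]\?.*?\?=', re.IGNORECASE)


def to_text(chunk, charset):
    """Return chunk as str; decode_header may yield bytes or str."""
    if isinstance(chunk, bytes):
        return chunk.decode(charset or 'raw-unicode-escape', 'replace')
    return chunk


def cook(header):
    """Decode a header, retrying encoded words left undecoded."""
    cooked = []
    for chunk, charset in decode_header(header):
        text = to_text(chunk, charset)
        if charset is not None:
            cooked.append((text, charset))
            continue
        pos = 0
        for match in LOOSE_WORD.finditer(text):
            if match.start() > pos:
                cooked.append((text[pos:match.start()], None))
            # Call decode_header again with just the offending substring.
            for sub_chunk, sub_charset in decode_header(match.group(0)):
                cooked.append((to_text(sub_chunk, sub_charset), sub_charset))
            pos = match.end()
        if pos < len(text):
            cooked.append((text[pos:], None))
    return cooked


value = ('application/x-msword; name="2008 '
         '=?windows-1251?B?wu7v8O7x+w==?= 2 '
         '=?windows-1251?B?4+7kIDgwONUwMC5kb2M=?="')
parts = cook(value)
for text, charset in parts:
    print(charset, ascii(text))
```

On interpreter versions where decode_header already decodes both words, the rescan finds nothing left to do and the chunks pass through unchanged.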

I don't think that Python should accept malformed headers - but if you 
come to a good solution you may publish the recipe in the Python 
cookbook <http://www.activestate.com/ASPN/Python/Cookbook/>

I'd close this report as invalid.
msg81069 - Author: Tom Lynn (tlynn) Date: 2009-02-03 17:05
Duplicates issue 1047.
msg81070 - Author: Tom Lynn (tlynn) Date: 2009-02-03 17:06
Oops, duplicates issue 1079 even.
History
Date                 User                Action  Args
2022-04-11 14:56:44  admin               set     github: 49208
2009-03-27 20:50:57  amaury.forgeotdarc  set     status: open -> closed; resolution: duplicate; superseder: decode_header does not follow RFC 2047
2009-02-03 17:06:36  tlynn               set     messages: + msg81070
2009-02-03 17:05:27  tlynn               set     nosy: + tlynn; messages: + msg81069
2009-01-16 07:32:07  ggenellina          set     nosy: + ggenellina; messages: + msg79938
2009-01-15 23:33:08  oxij                create