Message 79927 - Python tracker


Author	oxij
Recipients	oxij
Date	2009-01-15.23:33:02
SpamBayes Score	0.014659063
Marked as misclassified	No
Message-id	<1232062389.84.0.560697403346.issue4958@psf.upfronthosting.co.za>
In-reply-to

Content

Hello.

I have a dedicated mail server at home
that holds about 1 GB of mail.
Most of the mail is in non-UTF-8 code pages, so today
I wrote a little script that should recode
all the messages to UTF-8. But I found that
email.header.decode_header parses some headers incorrectly.

For example, the header
Content-Type: application/x-msword; name="2008 =?windows-1251?B?wu7v8O7x+w==?= 2 =?windows-1251?B?4+7kIDgwONUwMC5kb2M=?="
is parsed as
[('application/x-msword; name="2008', None),
('\xc2\xee\xef\xf0\xee\xf1\xfb', 'windows-1251'),
('2 =?windows-1251?B?4+7kIDgwONUwMC5kb2M=?="', None)]
which is obviously wrong: the second encoded word is left undecoded.

I'm now experimenting with the email/header.py file from the
Debian python 2.5 package
(it is the same in version 2.6.1, except that every <> was changed to !=).

If it's patched with
==================BEGIN CUT==================
--- oldheader.py	2009-01-16 01:47:32.553130030 +0300
+++ header.py	2009-01-16 01:47:16.783119846 +0300
@@ -39,7 +39,6 @@
   \?                    # literal ?
   (?P<encoded>.*?)      # non-greedy up to the next ?= is the encoded string
   \?=                   # literal ?=
-  (?=[ \t]|$)           # whitespace or the end of the string
   ''', re.VERBOSE | re.IGNORECASE | re.MULTILINE)
 
 # Field name regexp, including trailing colon, but not separating whitespace,
==================END CUT==================
it works fine.
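
The effect of that lookahead can be checked in isolation. Below is a minimal sketch, assuming the part of the pattern above the diff context looks like the reconstruction in ECRE_BODY (those lines are not shown in the diff, and the names ECRE_BODY, LOOKAHEAD, original and patched are made up for this illustration). It only lists which encoded words each variant finds in the header value from the example; it does not call decode_header itself.

```python
import re

# Reconstructed pattern: the last three lines match the diff context above;
# the charset/encoding lines are an assumption, since the diff does not show them.
ECRE_BODY = r'''
  =\?                   # literal =?
  (?P<charset>[^?]*?)   # non-greedy up to the next ? is the charset
  \?                    # literal ?
  (?P<encoding>[qb])    # either a "q" or a "b", case insensitive
  \?                    # literal ?
  (?P<encoded>.*?)      # non-greedy up to the next ?= is the encoded string
  \?=                   # literal ?=
'''
# The line that the patch removes.
LOOKAHEAD = r'(?=[ \t]|$)  # whitespace or the end of the string'
FLAGS = re.VERBOSE | re.IGNORECASE | re.MULTILINE

original = re.compile(ECRE_BODY + LOOKAHEAD, FLAGS)  # with the lookahead
patched = re.compile(ECRE_BODY, FLAGS)               # without it

# Header value from the example above.
value = ('application/x-msword; name="2008 =?windows-1251?B?wu7v8O7x+w==?= '
         '2 =?windows-1251?B?4+7kIDgwONUwMC5kb2M=?="')

original_words = [m.group('encoded') for m in original.finditer(value)]
patched_words = [m.group('encoded') for m in patched.finditer(value)]

print(original_words)  # only the word that is followed by a space
print(patched_words)   # both words, including the one followed by a quote
```

The second encoded word is followed by a closing double quote rather than whitespace or the end of the string, which is why the variant with the lookahead skips it.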

So I wonder whether this
  (?=[ \t]|$)           # whitespace or the end of the string
is really needed. After all, if there is only
whitespace after an encoded word, it is just
appended to the list by

parts = ecre.split(line)
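
As a rough illustration of that split, here is a sketch assuming the same kind of pattern without the lookahead. The name patched_ecre and the sample line are hypothetical, and the charset/encoding lines of the pattern are reconstructed rather than taken from the diff.

```python
import base64
import re

# Assumed pattern without the trailing lookahead (reconstruction for illustration).
patched_ecre = re.compile(r'''
  =\?                   # literal =?
  (?P<charset>[^?]*?)   # non-greedy up to the next ? is the charset
  \?                    # literal ?
  (?P<encoding>[qb])    # either a "q" or a "b", case insensitive
  \?                    # literal ?
  (?P<encoded>.*?)      # non-greedy up to the next ?= is the encoded string
  \?=                   # literal ?=
  ''', re.VERBOSE | re.IGNORECASE | re.MULTILINE)

# Hypothetical sample: unencoded text, one encoded word, then " 2".
line = '2008 =?windows-1251?B?wu7v8O7x+w==?= 2'

# With capturing groups in the pattern, re.split returns the unencoded pieces
# interleaved with the charset, encoding and payload of every encoded word.
parts = patched_ecre.split(line)
print(parts)

# The payload is base64 here (encoding "B"); decoding it yields the raw bytes.
raw = base64.b64decode(parts[3])
print(raw)
```

The text after the encoded word (' 2' in this sample) comes back as its own list item, so trailing whitespace is not lost when the lookahead is dropped.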

--
Also, there is a related mailing list thread:
http://mail.python.org/pipermail/python-dev/2009-January/085088.html

History
Date	User	Action	Args
2009-01-15 23:33:10	oxij	set	recipients: + oxij
2009-01-15 23:33:09	oxij	set	messageid: <1232062389.84.0.560697403346.issue4958@psf.upfronthosting.co.za>
2009-01-15 23:33:07	oxij	link	issue4958 messages
2009-01-15 23:33:02	oxij	create