Message 349847 - Python tracker


Author	maxking
Recipients	barry, epicfaace, maxking, mytran, r.david.murray
Date	2019-08-16.05:55:46
Message-id	<1565934946.3.0.516174976589.issue37764@roundup.psfhosted.org>
In-reply-to

Content

You have correctly identified that "=aa" is detected as an encoded word and causes get_encoded_word to fail.

However, "=?utf-8?q?somevalue?=aa" should ideally get parsed as "somevalueaa" and not "=?utf-8?q?somevalue?=aa". This is because "=?utf-8?q?somevalue?=" is a valid encoded word; it is just not followed by whitespace.

modified   Lib/email/_header_value_parser.py
@@ -1037,7 +1037,10 @@ def get_encoded_word(value):
         raise errors.HeaderParseError(
             "expected encoded word but found {}".format(value))
     remstr = ''.join(remainder)
-    if len(remstr) > 1 and remstr[0] in hexdigits and remstr[1] in hexdigits:
+    if (len(remstr) > 1 and
+        remstr[0] in hexdigits and
+        remstr[1] in hexdigits and
+        tok.count('?') < 2):
         # The ? after the CTE was followed by an encoded word escape (=XX).
         rest, *remainder = remstr.split('?=', 1)

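This can be avoided by checking whether `?` already occurs twice in `tok`. If it does, `tok` already holds the charset, the CTE and the encoded text, so the `?=` that was found is the real terminator and the hex digits after it are ordinary trailing text.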
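
To make the effect of the extra check easier to see, here is a minimal standalone sketch of just the splitting step from the diff above. The function name `split_encoded_word` and its `use_guard` flag are hypothetical and exist only for illustration; the real get_encoded_word goes on to do more (decoding, error handling) that is left out here.

```python
from string import hexdigits


def split_encoded_word(value, use_guard=True):
    """Hypothetical helper: mimic only the split step of get_encoded_word.

    Returns (tok, remainder), where tok is the text between the leading
    '=?' and the '?=' chosen as the terminator.
    """
    if not value.startswith('=?'):
        raise ValueError("expected encoded word but found {}".format(value))
    tok, *remainder = value[2:].split('?=', 1)
    remstr = ''.join(remainder)
    looks_like_escape = (len(remstr) > 1 and
                         remstr[0] in hexdigits and
                         remstr[1] in hexdigits)
    # Proposed extra condition: only treat '?=XX' as an escape when tok
    # does not yet contain both '?' separators (charset?cte?text).
    if looks_like_escape and (not use_guard or tok.count('?') < 2):
        # The ? after the CTE was followed by an encoded word escape (=XX).
        rest, *remainder = remstr.split('?=', 1)
        tok = tok + '?=' + rest
    return tok, ''.join(remainder)


# With the check, the trailing "aa" stays outside the encoded word.
print(split_encoded_word("=?utf-8?q?somevalue?=aa"))
# Without it, "aa" is swallowed into tok, as in the reported bug.
print(split_encoded_word("=?utf-8?q?somevalue?=aa", use_guard=False))
# A payload that really starts with an =XX escape is still joined back.
print(split_encoded_word("=?us-ascii?q?=61?="))
```

In the third call `tok` contains only one `?` when the first `?=` is found, so the original escape handling still applies.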

The second bug, which needs a better test case, is that an invalid encoded word sends the parser into an infinite loop, which you correctly fixed in your PR. However, the test case you used is more appropriate for the first issue.

You can fix both issues in the PR: add a test case for the second issue and a fix for the first one.

Looking into the PR now.
