classification
Title: email/message.py [Message.get_content_type]: Trivial regex hangs on pathological input
Type: resource usage Stage:
Components: Library (Lib) Versions: Python 3.0, Python 2.6
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: barry Nosy List: ajaksu2, barry, jackdied, pitrou
Priority: high Keywords: patch

Created on 2008-04-23 17:00 by ajaksu2, last changed 2008-08-16 10:31 by pitrou. This issue is now closed.

Files
File name Uploaded Description Edit
message.py.patch ajaksu2, 2008-04-23 17:00
email.message.diff jackdied, 2008-08-01 04:38
Messages (6)
msg65702 - (view) Author: Daniel Diniz (ajaksu2) Date: 2008-04-23 17:00
[Reported by Alberto Casado Martín [1]]

Message.get_content_type() hangs when very large values are split by the
regex:
ctype = paramre.split(value)[0].lower().strip() #line 439

paramre comes from line 26:
paramre = re.compile(r'\s*;\s*')

Unless the full fledged parser cited in the comment before line 26 is in
the works, I suggest splitting the string by ";" to get exactly the same
behavior in a more reliable way.


[1] http://mail.python.org/pipermail/python-dev/2008-April/078840.html
msg70541 - (view) Author: Jack Diederich (jackdied) * (Python committer) Date: 2008-08-01 04:38
Augmented version of Daniel's patch.

This makes an internal function that does the same work.  It uses
txt.find() instead of split() or partition() because for pathologically
long strings find() is noticeably faster.  It also does the strip()
before the lower() which helps with evilly long strings.

I didn't remove the module global "paramre" because an external module
might be using it.  I did update its comment.

Do bugfixes get applied to 2.6 or 3.0?  I'm a bit out of practice.
msg71168 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2008-08-15 10:30
This should really be fixed. Hanging on a rather normal email message
(not a theoretical example) is not right.
msg71183 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2008-08-15 21:03
Fixed in r65700. Thanks for the report!
msg71193 - (view) Author: Jack Diederich (jackdied) * (Python committer) Date: 2008-08-16 02:52
Antoine, I looked at your patch and I'm not sure why you applied it
instead of applying mine (or saying +1 on me applying my patch).

Yours uses str.partition which I pointed out is sub-optimal (same big-Oh
but with a larger constant factor) and also adds a function that returns
two things, one of which is thrown away after having a str.strip
performed on it.

If my patch was deficient please let me know.
msg71207 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2008-08-16 10:31
Hi Jack,

> Antoine, I looked at your patch and I'm not sure why you applied it
> instead of applying mine (or saying +1 on me applying my patch).
> 
> Yours uses str.partition which I pointed out is sub-optimal (same big-Oh
> but with a larger constant factor) and also adds a function that returns
> two things, one of which is thrown away after having a str.strip
> performed on it.

I added that function so that the header splitting facility is
explicitly exposed as an internal API, as was the case with the regular
expression. I tried to mimick the behaviour of the regex as closely as
possible, which meant returning two things as well :-)

I think the point of the issue is to remove the pathological
(exponential) behaviour when parsing some headers, not to try to squeeze
out the last microseconds out of content-type parsing (which shouldn't
be, IMO, the limiting factor in email handling performance as soon as
it's not super-linear).

That said, I've timed the function against the regular expression and
the former is always faster, even for tiny strings (e.g. "a;b").

Your patch was keeping the regular expression as a module-level constant
while replacing all uses of it with a function, which I found a bit
strange (I don't think people are using paramre from the outside since
it's not documented, it's an internal not public API IMO). I also found
it strange to devote a docstring to the discussion of a performance
detail. But I don't have any strong feeling against it either, so you
can still apply it if you think it's important performance-wise.

Regards

Antoine.
History
Date User Action Args
2008-08-16 10:31:53pitrousetmessages: + msg71207
2008-08-16 02:52:43jackdiedsetmessages: + msg71193
2008-08-15 21:03:57pitrousetstatus: open -> closed
resolution: fixed
messages: + msg71183
2008-08-15 10:30:52pitrousetpriority: high
nosy: + pitrou
messages: + msg71168
versions: + Python 3.0
2008-08-01 04:38:18jackdiedsetfiles: + email.message.diff
nosy: + jackdied
messages: + msg70541
2008-04-23 20:44:02benjamin.petersonsetassignee: barry
type: resource usage
2008-04-23 17:00:15ajaksu2create