Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

email/message.py [Message.get_content_type]: Trivial regex hangs on pathological input #46928

Closed
devdanzin mannequin opened this issue Apr 23, 2008 · 6 comments
Closed
Assignees
Labels
performance Performance or resource usage stdlib Python modules in the Lib dir

Comments

@devdanzin
Copy link
Mannequin

devdanzin mannequin commented Apr 23, 2008

BPO 2676
Nosy @warsaw, @pitrou, @jackdied, @devdanzin
Files
  • message.py.patch
  • email.message.diff
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = 'https://github.com/warsaw'
    closed_at = <Date 2008-08-15.21:03:57.374>
    created_at = <Date 2008-04-23.17:00:15.651>
    labels = ['library', 'performance']
    title = 'email/message.py [Message.get_content_type]: Trivial regex hangs on pathological input'
    updated_at = <Date 2008-08-16.10:31:53.211>
    user = 'https://github.com/devdanzin'

    bugs.python.org fields:

    activity = <Date 2008-08-16.10:31:53.211>
    actor = 'pitrou'
    assignee = 'barry'
    closed = True
    closed_date = <Date 2008-08-15.21:03:57.374>
    closer = 'pitrou'
    components = ['Library (Lib)']
    creation = <Date 2008-04-23.17:00:15.651>
    creator = 'ajaksu2'
    dependencies = []
    files = ['10079', '11022']
    hgrepos = []
    issue_num = 2676
    keywords = ['patch']
    message_count = 6.0
    messages = ['65702', '70541', '71168', '71183', '71193', '71207']
    nosy_count = 4.0
    nosy_names = ['barry', 'pitrou', 'jackdied', 'ajaksu2']
    pr_nums = []
    priority = 'high'
    resolution = 'fixed'
    stage = None
    status = 'closed'
    superseder = None
    type = 'resource usage'
    url = 'https://bugs.python.org/issue2676'
    versions = ['Python 2.6', 'Python 3.0']

    @devdanzin
    Copy link
    Mannequin Author

    devdanzin mannequin commented Apr 23, 2008

    [Reported by Alberto Casado Martín [1]]

    Message.get_content_type() hangs when very large values are split by the
    regex:
    ctype = paramre.split(value)[0].lower().strip() #line 439

    paramre comes from line 26:
    paramre = re.compile(r'\s*;\s*')

    Unless the full fledged parser cited in the comment before line 26 is in
    the works, I suggest splitting the string by ";" to get exactly the same
    behavior in a more reliable way.

    [1] http://mail.python.org/pipermail/python-dev/2008-April/078840.html

    @devdanzin devdanzin mannequin added the stdlib Python modules in the Lib dir label Apr 23, 2008
    @benjaminp benjaminp added the performance Performance or resource usage label Apr 23, 2008
    @jackdied
    Copy link
    Contributor

    jackdied commented Aug 1, 2008

    Augmented version of Daniel's patch.

    This makes an internal function that does the same work. It uses
    txt.find() instead of split() or partition() because for pathologically
    long strings find() is noticeably faster. It also does the strip()
    before the lower() which helps with evilly long strings.

    I didn't remove the module global "paramre" because an external module
    might be using it. I did update its comment.

    Do bugfixes get applied to 2.6 or 3.0? I'm a bit out of practice.

    @pitrou
    Copy link
    Member

    pitrou commented Aug 15, 2008

    This should really be fixed. Hanging on a rather normal email message
    (not a theoretical example) is not right.

    @pitrou
    Copy link
    Member

    pitrou commented Aug 15, 2008

    Fixed in r65700. Thanks for the report!

    @pitrou pitrou closed this as completed Aug 15, 2008
    @jackdied
    Copy link
    Contributor

    Antoine, I looked at your patch and I'm not sure why you applied it
    instead of applying mine (or saying +1 on me applying my patch).

    Yours uses str.partition which I pointed out is sub-optimal (same big-Oh
    but with a larger constant factor) and also adds a function that returns
    two things, one of which is thrown away after having a str.strip
    performed on it.

    If my patch was deficient please let me know.

    @pitrou
    Copy link
    Member

    pitrou commented Aug 16, 2008

    Hi Jack,

    Antoine, I looked at your patch and I'm not sure why you applied it
    instead of applying mine (or saying +1 on me applying my patch).

    Yours uses str.partition which I pointed out is sub-optimal (same big-Oh
    but with a larger constant factor) and also adds a function that returns
    two things, one of which is thrown away after having a str.strip
    performed on it.

    I added that function so that the header splitting facility is
    explicitly exposed as an internal API, as was the case with the regular
    expression. I tried to mimick the behaviour of the regex as closely as
    possible, which meant returning two things as well :-)

    I think the point of the issue is to remove the pathological
    (exponential) behaviour when parsing some headers, not to try to squeeze
    out the last microseconds out of content-type parsing (which shouldn't
    be, IMO, the limiting factor in email handling performance as soon as
    it's not super-linear).

    That said, I've timed the function against the regular expression and
    the former is always faster, even for tiny strings (e.g. "a;b").

    Your patch was keeping the regular expression as a module-level constant
    while replacing all uses of it with a function, which I found a bit
    strange (I don't think people are using paramre from the outside since
    it's not documented, it's an internal not public API IMO). I also found
    it strange to devote a docstring to the discussion of a performance
    detail. But I don't have any strong feeling against it either, so you
    can still apply it if you think it's important performance-wise.

    Regards

    Antoine.

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    performance Performance or resource usage stdlib Python modules in the Lib dir
    Projects
    None yet
    Development

    No branches or pull requests

    4 participants