
Email Parser use 100% CPU #65647

Closed
jaderfabiano mannequin opened this issue May 6, 2014 · 30 comments
Labels
performance Performance or resource usage topic-email

Comments


jaderfabiano mannequin commented May 6, 2014

BPO 21448
Nosy @warsaw, @rhettinger, @pitrou, @tiran, @bitdancer, @serhiy-storchaka
Files
  • email_parser_long_lines.patch
  • fix_email_parse.diff: Clean-up code in push()
  • fix_email_parse2.diff: Revise patch to add splitlines.
  • test_parser.diff: Extra test
  • fix_prepending2.diff: Speed-up insertion by using a deque
  • email_parser_long_lines_2.patch: Raymond's patch + tests
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = None
    closed_at = <Date 2015-05-23.00:24:13.457>
    created_at = <Date 2014-05-06.19:20:13.387>
    labels = ['expert-email', 'performance']
    title = 'Email Parser use 100% CPU'
    updated_at = <Date 2015-05-23.00:24:13.456>
    user = 'https://bugs.python.org/jaderfabiano'

    bugs.python.org fields:

    activity = <Date 2015-05-23.00:24:13.456>
    actor = 'rhettinger'
    assignee = 'none'
    closed = True
    closed_date = <Date 2015-05-23.00:24:13.457>
    closer = 'rhettinger'
    components = ['email']
    creation = <Date 2014-05-06.19:20:13.387>
    creator = 'jader.fabiano'
    dependencies = []
    files = ['36210', '36216', '36230', '36231', '36233', '36334']
    hgrepos = []
    issue_num = 21448
    keywords = ['patch']
    message_count = 30.0
    messages = ['218008', '218010', '218011', '218012', '218013', '218014', '218015', '218016', '218123', '224516', '224562', '224591', '224610', '224623', '224624', '224631', '224647', '224649', '224667', '224863', '224866', '224894', '224910', '225155', '225158', '225228', '225229', '225233', '240670', '243873']
    nosy_count = 9.0
    nosy_names = ['barry', 'rhettinger', 'pitrou', 'christian.heimes', 'r.david.murray', 'tshepang', 'python-dev', 'serhiy.storchaka', 'jader.fabiano']
    pr_nums = []
    priority = 'normal'
    resolution = 'fixed'
    stage = 'patch review'
    status = 'closed'
    superseder = None
    type = 'performance'
    url = 'https://bugs.python.org/issue21448'
    versions = ['Python 3.5']


    jaderfabiano mannequin commented May 6, 2014

I use email.parser to read a MIME message's headers. When the message has attachments, the process consumes 100% of the CPU, and it can take up to four minutes to finish parsing the headers.

    @jaderfabiano jaderfabiano mannequin added topic-email performance Performance or resource usage labels May 6, 2014
@bitdancer (Member) commented:

    Can you provide more details on how to reproduce the problem, please? For example, a sample message and the sequence of python calls you use to parse it.


    jaderfabiano mannequin commented May 6, 2014

I am opening a file and passing the file object to Parser().parse(fp).
The file has two attachments.
Example:

self.fileDescriptor( file, 'rb')
headers = Parser().parse(self.fileDescriptor)
# Here the process starts to consume 100% of the CPU, and it takes
# around four minutes to reach the next line.
print 'Headers OK'

The file's size is 12 MB.

Thanks.



    jaderfabiano mannequin commented May 6, 2014

Sorry!
The correct line is:
self.fileDescriptor = open(file, 'rb')
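In Python 3 terms, the intended reproduction looks roughly like this (a sketch; the file name and sample content are placeholders, not the reporter's actual 12 MB message):

```python
from email.parser import Parser

# Write a tiny sample message, then parse it the way the report does.
# With the buggy parser, a multi-megabyte single-line body makes
# Parser().parse() spin at 100% CPU for minutes; this small file is fast.
with open("message.eml", "w") as f:
    f.write("From: email@email.com.br\nSubject: example\n\nbody\n")

with open("message.eml") as fp:
    msg = Parser().parse(fp)
print("Headers OK:", msg.keys())
```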



@bitdancer (Member) commented:

We'll need the data file as well; this is going to be a data-dependent issue. With a 12MB body, I'm guessing there's some decoding pathology involved, which may or may not have already been fixed in Python 3.

    To confirm this you could use HeaderParser instead of Parser, which won't try to decode the body.
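The suggested check might look like this (a sketch with made-up message content; HeaderParser parses the initial headers and stores everything else verbatim):

```python
from email.parser import HeaderParser

# A long opaque body stands in for the attachments.
raw = "From: a@example.com\nSubject: test\n\n" + "x" * 1000

msg = HeaderParser().parsestr(raw)
print(msg["Subject"])          # headers are parsed normally
print(len(msg.get_payload()))  # the body is kept as-is, not parsed as MIME
```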


    jaderfabiano mannequin commented May 6, 2014

No, the file is 12 MB because it has attachments. I am going to show an example.

You can use a file like this:

Date: Tue, 6 May 2014 10:27:17 -0300 (BRT)
    From: email@email.com.br
    MIME-Version: 1.0
    To: example@example.com
    Subject:example

    Content-Type: multipart/mixed; boundary=24f59adc-d522-11e3-a531-00265a0f1361

    --24f59adc-d522-11e3-a531-00265a0f1361
    Content-Type: multipart/alternative;
    boundary=24f59a28-d522-11e3-a531-00265a0f1361

    --24f59a28-d522-11e3-a531-00265a0f1361^M
    Content-Type: text/html; charset="iso-8859-1" ^M
    Content-Transfer-Encoding: 7bit

<br/><font color="#00000" face="verdana" size="3">Test example</b>

    --24f59a28-d522-11e3-a531-00265a0f1361--

    --24f59adc-d522-11e3-a531-00265a0f1361
    Content-Type: application/pdf; name=Example.pdf
    Content-Disposition: attachment; filename=Example.pdf
    Content-Transfer-Encoding: base64

    attachment content in base64......

    --24f59adc-d522-11e3-a531-00265a0f1361--
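A message with this shape can also be generated programmatically, which may help build a reproducer (a sketch using the stdlib MIME classes; the content is made up):

```python
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText
from email.mime.application import MIMEApplication
from email.parser import Parser

# multipart/mixed container, like the sample above
outer = MIMEMultipart("mixed")
outer["From"] = "email@email.com.br"
outer["To"] = "example@example.com"
outer["Subject"] = "example"

# multipart/alternative part holding the HTML body
alt = MIMEMultipart("alternative")
alt.attach(MIMEText('<font face="verdana" size="3">Test example</font>', "html"))
outer.attach(alt)

# PDF attachment; MIMEApplication base64-encodes the payload by default
pdf = MIMEApplication(b"%PDF-1.4 fake bytes", _subtype="pdf")
pdf.add_header("Content-Disposition", "attachment", filename="Example.pdf")
outer.attach(pdf)

# Round-trip through the parser under discussion
msg = Parser().parsestr(outer.as_string())
print([part.get_content_type() for part in msg.walk()])
```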



@bitdancer (Member) commented:

Sorry, I was using RFC-speak. A message is divided into 'headers' and 'body', and all of the attachments are part of the body in RFC terms. But think of it as 'initial headers' and 'everything else'. Please attach the full file, and/or try your test using HeaderParser and report the results.

    However, it occurs to me that the attachments aren't decoded until you retrieve them, so whatever is going on it must be something other than a decoding issue. Nevertheless, Parser actually parses the whole message, attachments included, so we'll need the actual message in order to reproduce this (unless you can reproduce it with a smaller message).

@bitdancer (Member) commented:

Also to clarify: HeaderParser will *also* read the entire message; it just won't look for MIME attachments in the 'everything else'. It will treat the 'everything else' as arbitrary data and record it as the payload of the top-level Message object.


    jaderfabiano mannequin commented May 8, 2014

Hi.
I now understand the problem.
I was writing the MIME attachments incorrectly. I read a 4 MB file, converted it to base64, and wrote the content into the message, but I didn't break it into 76-character lines ending in \r\n; everything went onto a single line. I think that is what made the email parser use 100% of the CPU and take so much time.
After wrapping the lines, the email was sent very quickly.

Thanks
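The wrapping difference the reporter describes can be seen directly in the stdlib base64 module (a sketch; the data is a stand-in for real attachment bytes):

```python
import base64

data = b"\x00" * 600  # stand-in for attachment bytes

# A single unbroken line: this is the shape that triggered the slowdown.
one_line = base64.b64encode(data)

# MIME-style encoding: 76-character lines, each terminated by b'\n'.
wrapped = base64.encodebytes(data)

print(len(one_line.splitlines()))                  # 1
print(max(len(l) for l in wrapped.splitlines()))   # 76
```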



@serhiy-storchaka (Member) commented:

So the bug is that the email parser is dramatically slow on abnormally long lines: it has quadratic complexity in the line length. Minimal example:

    import email.parser
    import time
    data = 'From: example@example.com\n\n' + 'x' * 10000000
    start = time.time()
    email.parser.Parser().parsestr(data)
    print(time.time() - start)

@serhiy-storchaka (Member) commented:

Parser reads small chunks (8192 characters) from the input file and feeds them to FeedParser, which pushes the data into BufferedSubFile. In BufferedSubFile.push(), chunks of incomplete data are accumulated in a buffer, and the whole buffer is repeatedly rescanned for newlines. Every push() is linear in the size of the accumulated buffer, so the total complexity is quadratic.

Here is a patch which fixes the problem with parsing long lines. Feel free to add comments if they are needed (there is an abundance of comments in the module).
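The accumulate-and-rescan pattern can be modelled outside the email package. A toy sketch (not the actual BufferedSubFile code, and it deliberately ignores '\r' handling, which the real patch must deal with): the first version re-splits the whole buffer on every push, which is O(N**2) overall; the second joins pending fragments only when a newline finally arrives, which is linear.

```python
def push_quadratic(chunks):
    # Toy model of the old behavior: the growing buffer is re-scanned
    # for newlines on every chunk, so each push costs O(len(buffer)).
    buf, lines = "", []
    for chunk in chunks:
        buf += chunk
        parts = buf.split("\n")
        buf = parts.pop()  # the last piece is the incomplete line
        lines.extend(p + "\n" for p in parts)
    return lines, buf

def push_linear(chunks):
    # Toy model of the fix: chunks without a newline are appended to a
    # list in O(1); joining happens only when a newline arrives.
    partial, lines = [], []
    for chunk in chunks:
        if "\n" not in chunk:
            partial.append(chunk)
            continue
        pieces = ("".join(partial) + chunk).split("\n")
        partial = [pieces.pop()]
        lines.extend(p + "\n" for p in pieces)
    return lines, "".join(partial)

demo = ["ab", "c\nde", "f", "\n", "gh"]
print(push_quadratic(demo) == push_linear(demo))  # True
```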

    @rhettinger rhettinger self-assigned this Aug 2, 2014
@rhettinger (Contributor) commented:

    I think the push() code can be a little cleaner. Attaching a revised patch that simplifies push() a bit.

@serhiy-storchaka (Member) commented:

fix_email_parse.diff does not work when one chunk ends with '\r' and the next chunk doesn't start with '\n'.

@rhettinger (Contributor) commented:

    Attaching revised patch. I forgot to reapply splitlines.

@rhettinger (Contributor) commented:

Attaching a more extensive test.

@serhiy-storchaka (Member) commented:

fix_email_parse2.diff slightly changes behavior. See my comments on Rietveld.

As for fix_prepending2.diff, could you please provide any benchmark results?

And there is one more bug in the current code. str.splitlines() splits a string not only at '\r', '\n', or '\r\n', but at any Unicode line-break character (e.g. '\x85', '\u2028', etc.). When a chunk ends with such a line-break character, the line will not be broken there. Definitely something should be fixed: either lines should be broken only at '\r', '\n', or '\r\n', or the other line-break characters should be handled correctly when they fall at the end of a chunk. What would you say about this, David?
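The splitlines() behavior Serhiy describes is easy to check (a sketch):

```python
# str.splitlines() recognizes many more boundaries than '\r', '\n', '\r\n':
# NEL ('\x85'), LINE SEPARATOR ('\u2028'), and others also split.
s = "a\rb\nc\r\nd\x85e\u2028f"

print(s.splitlines())  # ['a', 'b', 'c', 'd', 'e', 'f']
print(s.split("\n"))   # only '\n' splits: ['a\rb', 'c\r', 'd\x85e\u2028f']
```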

@rhettinger (Contributor) commented:

> As for fix_prepending2.diff, could you please provide any benchmark results?

No. Inserting at the beginning of a list is always O(n) and inserting at the beginning of a deque is always O(1).

@serhiy-storchaka (Member) commented:

Yes, but if n is bounded, O(n) becomes O(1). In our case n is the number of lines that have been fed but not yet read. I suppose the worst case is a run of empty lines, in which case n = 8192. I tried the following microbenchmark and did not notice a significant difference.

$ ./python -m timeit -s "from email.parser import Parser; d = 'From: example@example.com\n\n' + '\n' * 100000" -- "Parser().parsestr(d)"

@rhettinger (Contributor) commented:

A deque is typically the right data structure when you need to append, pop, and extend on both the left and right side. It is designed specifically for that task. It also nicely cleans up the code by removing the backwards line list and the list reversal prior to insertion on the left (that's what we had to do to achieve decent performance before the introduction of deques in Python 2.4; now you hardly ever see code like "self._lines[:0] = lines[::-1]"). I think fix_prepending2 would be a nice improvement for Py3.5.

For the main patches that directly address the OP's performance issue, feel free to apply either mine or yours. They both work. Either way, please add test_parser.diff, since the original test didn't cover all the cases and didn't make clear the relationship between push() and splitlines().
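The deque operations Raymond refers to, in isolation (a sketch, not the patched BufferedSubFile code):

```python
from collections import deque

q = deque(["c\n", "d\n"])

# O(1) insertion at the left end; list.insert(0, ...) would be O(n).
q.appendleft("b\n")

# Prepend a block of lines, preserving their order: extendleft adds
# items one by one to the left, so the iterable must be reversed first.
q.extendleft(reversed(["a0\n", "a1\n"]))

print(list(q))      # ['a0\n', 'a1\n', 'b\n', 'c\n', 'd\n']
print(q.popleft())  # 'a0\n' (readline-style consumption, also O(1))
```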

@bitdancer (Member) commented:

Serhiy: there was an issue with '\r\n' going across a chunk boundary that was fixed a while back, so there should be a test for that (I hope).

    As for how to handle line breaks, backward compatibility applies: we have to continue to do what we did before, and it doesn't look like this patch changes that. That is, it sounds like you are saying there is a pre-existing bug that we may want to address? In which case it should presumably be a separate issue.


    pitrou commented Aug 5, 2014

    Should this be categorized as a security issue? You could easily DoS a server with that (email.parser is used by http.client to parse HTTP headers, it seems).

@rhettinger (Contributor) commented:

> Should this be categorized as a security issue? You could easily DoS a server with that (email.parser is used by http.client to parse HTTP headers, it seems).

    I think it makes sense to treat this as a security issue.

I don't have a preference about whether to use Serhiy's email_parser_long_lines.patch or my fix_email_parse2.diff, but we should include the extra tests in test_parser.diff.

@serhiy-storchaka (Member) commented:

I found a bug in my patch. The following code

    from email.parser import Parser
    BLOCKSIZE = 8192
    s = 'From: <e@example.com>\nFoo: '
    s += 'x' * ((-len(s) - 1) % BLOCKSIZE) + '\rBar: '
    s += 'y' * ((-len(s) - 1) % BLOCKSIZE) + '\x85Baz: '
    s += 'z' * ((-len(s) - 1) % BLOCKSIZE) + '\n\n'
    print(Parser().parsestr(s).keys())

outputs ['From', 'Foo', 'Bar', 'Baz'] with the current code but ['From', 'Foo', 'Bar'] with my patch. Neither the current code nor Raymond's patch is affected by similar bugs. It is possible to fix my patch, but then it would become too complicated and slower.

I have a doubt about one special case in Raymond's patch, but looking at the current code at a higher level, it doesn't matter. The current code in FeedParser is in any case not very efficient, and it smooths out any implementation details in BufferedSubFile. That is why fix_prepending2.diff has no visible effect on email parsing.

I'll provide additional tests which cover this issue and the bug in my patch.

> That is, it sounds like you are saying there is a pre-existing bug that we may want to address? In which case it should presumably be a separate issue.

I can't create an example; maybe the higher-level code is tolerant of it. I'll create a separate issue if I find an example.

> Should this be categorized as a security issue?

Yes, but it is not very important. You would need to send tens or hundreds of megabytes to hang a server for more than a second.

@serhiy-storchaka (Member) commented:

Here is a patch which combines a fixed version of Raymond's patch with the FeedParser tests. These tests cover this issue, the bug in my patch, and (surprisingly) a bug in Raymond's patch. I didn't include Raymond's test because it doesn't appear to catch any bug. If there are no objections, I'll commit this patch.

@rhettinger (Contributor) commented:

    The test_parser.diff file catches the bug in fix_email_parse.diff and it provides some assurance that push() functions as an incremental version of str.splitlines().

    I would like to have this test included. It does some good and does no harm.


    python-dev mannequin commented Aug 12, 2014

    New changeset ba90bd01c5f1 by Serhiy Storchaka in branch '2.7':
    Issue bpo-21448: Fixed FeedParser feed() to avoid O(N**2) behavior when parsing long line.
    http://hg.python.org/cpython/rev/ba90bd01c5f1

    New changeset 1b1f92e39462 by Serhiy Storchaka in branch '3.4':
    Issue bpo-21448: Fixed FeedParser feed() to avoid O(N**2) behavior when parsing long line.
    http://hg.python.org/cpython/rev/1b1f92e39462

    New changeset f296d7d82675 by Serhiy Storchaka in branch 'default':
    Issue bpo-21448: Fixed FeedParser feed() to avoid O(N**2) behavior when parsing long line.
    http://hg.python.org/cpython/rev/f296d7d82675

@serhiy-storchaka (Member) commented:

> The test_parser.diff file catches the bug in fix_email_parse.diff

    I don't see this. But well, it does no harm.

    Please commit fix_prepending2.diff yourself.


    python-dev mannequin commented Aug 12, 2014

    New changeset 71cb8f605f77 by Serhiy Storchaka in branch '2.7':
    Decreased memory requirements of new tests added in bpo-21448.
    http://hg.python.org/cpython/rev/71cb8f605f77

    New changeset c19d3465965f by Serhiy Storchaka in branch '3.4':
    Decreased memory requirements of new tests added in bpo-21448.
    http://hg.python.org/cpython/rev/c19d3465965f

    New changeset f07b17de3b0d by Serhiy Storchaka in branch 'default':
    Decreased memory requirements of new tests added in bpo-21448.
    http://hg.python.org/cpython/rev/f07b17de3b0d

@bitdancer (Member) commented:

Raymond, are you going to apply the deque patch (maybe after doing a performance measurement), or should we close this?


    python-dev mannequin commented May 23, 2015

    New changeset 830bcf4fb29b by Raymond Hettinger in branch 'default':
    Issue bpo-21448: Improve performance of the email feedparser
    https://hg.python.org/cpython/rev/830bcf4fb29b

    @rhettinger rhettinger removed their assignment May 23, 2015
    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022