classification
Title: too much memory consumption in re.compile unicode
Type: resource usage
Stage: resolved
Components: Regular Expressions
Versions: Python 2.7
process
Status: closed
Resolution: out of date
Dependencies:
Superseder:
Assigned To:
Nosy List: Zhipeng Xie, ezio.melotti, mrabarnett, serhiy.storchaka, zach.ware
Priority: normal
Keywords: patch

Created on 2019-12-28 09:20 by Zhipeng Xie, last changed 2022-04-11 14:59 by admin. This issue is now closed.

Pull Requests
URL  Status  Linked
PR 17728  closed  python-dev, 2019-12-28 09:29
Messages (5)
msg358936 - (view) Author: Zhipeng Xie (Zhipeng Xie) * Date: 2019-12-28 09:20
When running the following script, we found that Python 2 consumes a lot of memory, while Python 3 does not have the issue.

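import re
import time
# Negated class: matches any character outside U+10000..U+10FFFF
NON_PRINTABLE = re.compile(u'[^\U00010000-\U0010ffff]')
# Keep the process alive long enough to inspect it with top
time.sleep(30)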

python2:
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND                            
 6943 root      20   0  109956  93436   3956 S   0.0   1.2   0:00.30 python

python3:
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND                            
 6952 root      20   0   28032   8880   4868 S   0.0   0.1   0:00.02 python3
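
For readers who want to repeat the comparison without watching top, the following is a minimal sketch that reads the peak resident set size before and after the compile call. It is not part of the original report: the `peak_rss` helper is a hypothetical name, the `resource` module is Unix-only, and the units of `ru_maxrss` depend on the platform (kilobytes on Linux, bytes on macOS).

```python
import re
import resource


def peak_rss():
    """Return the peak resident set size reported by getrusage.

    Hypothetical helper; units are platform-dependent
    (kilobytes on Linux, bytes on macOS).
    """
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss


before = peak_rss()
# Same pattern as in the report above
pattern = re.compile(u'[^\U00010000-\U0010ffff]')
after = peak_rss()

# ru_maxrss is a high-water mark, so this is never negative
growth = after - before
print("peak RSS growth during compile: %d" % growth)
```

Running the same file under both interpreters should give numbers that can be compared directly, as long as both runs are on the same platform.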
msg359085 - (view) Author: Zhipeng Xie (Zhipeng Xie) * Date: 2019-12-31 01:45
Hi, I tracked it down and found that this problem was introduced in Python 2.7.9 by the following commit:

https://hg.python.org/cpython/rev/ebd48b4f650d
msg359107 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2019-12-31 08:53
We usually do not backport optimizations to 2.7. It could be backported if a regression was introduced in one of the 2.7 bugfix releases, but range() was here before ebd48b4f650d.

Also, range(0x10000,0x10ffff+1) takes only 32*2**16 = 2 MiB of memory. It is small in comparison with total memory consumption. Obviously there are other causes of the difference between 2.7 and 3.x.
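
The estimate can be rechecked with a few lines of arithmetic. This sketch takes the 32 bytes per element quoted in the message as an assumption and computes the element count from the range bounds instead of assuming it. The bounds span 16 planes of 2**16 code points each, so under the same per-element assumption the count may come out larger than a single plane.

```python
# Per-element cost quoted in the message above; treated as an assumption here.
BYTES_PER_ELEMENT = 32

first, last = 0x10000, 0x10ffff

# Number of code points covered by range(first, last + 1)
count = last - first + 1
estimate = count * BYTES_PER_ELEMENT

print("code points: %d (2**%d)" % (count, count.bit_length() - 1))
print("estimated list size: %.1f MiB" % (estimate / float(2 ** 20)))
```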
msg359109 - (view) Author: Zhipeng Xie (Zhipeng Xie) * Date: 2019-12-31 09:53
> but range() was here before ebd48b4f650d.

Before ebd48b4f650d, _optimize_unicode used xrange, so Python 2.7.8 is fine while Python 2.7.9 consumes much more memory in my test case.
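
The difference between a lazy and a materialized range can be illustrated as follows. This is only a sketch written for Python 3, where `range` is lazy like Python 2's `xrange` and `list(range(...))` stands in for Python 2's `range`. It does not reproduce what `_optimize_unicode` does internally.

```python
import sys

first, last = 0x10000, 0x10ffff

# Python 3 range is lazy, comparable to Python 2 xrange
lazy = range(first, last + 1)
# A real list, comparable to what Python 2 range() returns
materialized = list(lazy)

lazy_size = sys.getsizeof(lazy)
# getsizeof on a list counts only its pointer array,
# not the int objects the pointers refer to
list_size = sys.getsizeof(materialized)

print("lazy range object: %d bytes" % lazy_size)
print("list pointer array: %d bytes for %d items" % (list_size, len(materialized)))
```

The lazy object stays at a small fixed size no matter how wide the range is, while the list grows with the number of elements.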

> Obviously there are other causes of the difference between 2.7 and 3.x.

Maybe it is because my Python was compiled with --enable-unicode=ucs4.
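
Whether an interpreter is a wide (UCS-4) build can be checked at runtime through `sys.maxunicode`. A minimal sketch, with `is_wide` as an illustrative name:

```python
import sys

# 0x10FFFF on wide (UCS-4) Python 2 builds and on Python 3.3 and later;
# 0xFFFF on narrow (UCS-2) Python 2 builds.
is_wide = sys.maxunicode > 0xFFFF
print("maxunicode=%#x wide=%s" % (sys.maxunicode, is_wide))
```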
msg360266 - (view) Author: Zachary Ware (zach.ware) * (Python committer) Date: 2020-01-19 18:51
As mentioned on the attached PR, Python 2.7 has reached EOL and this can no longer be accepted.  Thanks for the report and patch anyway!
History
Date  User  Action  Args
2022-04-11 14:59:24  admin  set  github: 83327
2020-01-19 18:51:39  zach.ware  set  status: open -> closed
    nosy: + zach.ware
    messages: + msg360266
    resolution: out of date
    stage: patch review -> resolved
2020-01-04 05:18:06  petdance  set  nosy: + ezio.melotti, mrabarnett
    components: + Regular Expressions, - Library (Lib)
2019-12-31 09:53:35  Zhipeng Xie  set  messages: + msg359109
2019-12-31 08:53:35  serhiy.storchaka  set  messages: + msg359107
2019-12-31 01:45:17  Zhipeng Xie  set  nosy: + serhiy.storchaka
    messages: + msg359085
2019-12-28 09:29:34  python-dev  set  keywords: + patch
    stage: patch review
    pull_requests: + pull_request17172
2019-12-28 09:20:49  Zhipeng Xie  set  title: to much memory consumption in re.compile unicode -> too much memory consumption in re.compile unicode
2019-12-28 09:20:29  Zhipeng Xie  create