Pack PyASCIIObject fields to reduce memory consumption of pure ASCII strings #58630

vstinner · 2012-03-27T11:14:18Z

BPO	14422
Nosy	@loewis, @jcea, @pitrou, @vstinner, @bitdancer, @serhiy-storchaka
Files	pack_pyasciiobject.patch

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

GitHub fields:

assignee = None
closed_at = <Date 2012-03-30.21:36:51.293>
created_at = <Date 2012-03-27.11:14:17.742>
labels = ['interpreter-core', 'type-feature']
title = 'Pack PyASCIIObject fields to reduce memory consumption of pure ASCII strings'
updated_at = <Date 2012-03-30.21:36:51.291>
user = 'https://github.com/vstinner'

bugs.python.org fields:

activity = <Date 2012-03-30.21:36:51.291>
actor = 'vstinner'
assignee = 'none'
closed = True
closed_date = <Date 2012-03-30.21:36:51.293>
closer = 'vstinner'
components = ['Interpreter Core']
creation = <Date 2012-03-27.11:14:17.742>
creator = 'vstinner'
dependencies = []
files = ['25037']
hgrepos = []
issue_num = 14422
keywords = ['patch']
message_count = 6.0
messages = ['156905', '156908', '156910', '156930', '157149', '157150']
nosy_count = 6.0
nosy_names = ['loewis', 'jcea', 'pitrou', 'vstinner', 'r.david.murray', 'serhiy.storchaka']
pr_nums = []
priority = 'normal'
resolution = 'wont fix'
stage = None
status = 'closed'
superseder = None
type = 'enhancement'
url = 'https://bugs.python.org/issue14422'
versions = ['Python 3.3']

vstinner · 2012-03-27T11:14:16Z

It is possible to shrink PyASCIIObject.state from 32 bits to 8 bits, move it to the end of the structure (swapping wstr and state), and pack the structure. As a result, the structure size is reduced by 3 bytes (the type of state changes from int to char).

I expect little or no performance overhead, because only the PyASCIIObject.state field is affected and this field is 8 bits wide.

See also issue bpo-14419, which relies on memory alignment of the ASCII string data to optimize the ASCII decoder. If I understand correctly, my patch would rule out that optimization.

--

Example on 32-bit Linux:

$ cat x.c 
#include <Python.h>
#include <stdio.h>

int main(void)
{
    /* sizeof yields a size_t, so print it with %zu rather than %u */
    printf("sizeof(PyASCIIObject)=%zu bytes\n", sizeof(PyASCIIObject));
    printf("sizeof(PyCompactUnicodeObject)=%zu bytes\n", sizeof(PyCompactUnicodeObject));
    printf("sizeof(PyUnicodeObject)=%zu bytes\n", sizeof(PyUnicodeObject));
    return 0;
}

# unpatched
$ gcc -I Include/ -I . x.c -o x && ./x
sizeof(PyASCIIObject)=24 bytes
sizeof(PyCompactUnicodeObject)=36 bytes
sizeof(PyUnicodeObject)=40 bytes

# pack the 3 structures
$ gcc -I Include/ -I . x.c -o x && ./x
sizeof(PyASCIIObject)=21 bytes
sizeof(PyCompactUnicodeObject)=33 bytes
sizeof(PyUnicodeObject)=37 bytes

--

We could also pack PyCompactUnicodeObject and PyUnicodeObject, but that would hurt performance because utf8_length, utf8, wstr_length and data would no longer be aligned.

vstinner · 2012-03-27T11:23:01Z

iobench and stringbench results on unpatched Python:

$ ./python Tools/iobench/iobench.py -t
Preparing files...
Python 3.3.0a1+ (default:51016ff7f8c9, Mar 27 2012, 13:19:52) 
[GCC 4.6.1]
Unicode: PEP 393
Linux-3.0.0-16-generic-pae-i686-with-debian-wheezy-sid
Text unit = one character (utf8-decoded)

** Text input **

[ 400KB ] read one unit at a time... 5.4 MB/s
[ 400KB ] read 20 units at a time... 68 MB/s
[ 400KB ] read one line at a time... 174 MB/s
[ 400KB ] read 4096 units at a time... 289 MB/s

[ 20KB ] read whole contents at once... 315 MB/s
[ 400KB ] read whole contents at once... 332 MB/s
[ 10MB ] read whole contents at once... 292 MB/s

[ 400KB ] seek forward one unit at a time... 0.304 MB/s
[ 400KB ] seek forward 1000 units at a time... 312 MB/s

** Text append **

[ 20KB ] write one unit at a time... 3.05 MB/s
[ 400KB ] write 20 units at a time... 43 MB/s
[ 400KB ] write 4096 units at a time... 554 MB/s
[ 10MB ] write 1e6 units at a time... 450 MB/s

** Text overwrite **

[ 20KB ] modify one unit at a time... 1.18 MB/s
[ 400KB ] modify 20 units at a time... 18.9 MB/s
[ 400KB ] modify 4096 units at a time... 400 MB/s

$ ./python stringbench/stringbench.py 
stringbench v2.0
3.3.0a1+ (default:51016ff7f8c9, Mar 27 2012, 13:19:52) 
[GCC 4.6.1]
2012-03-27 13:21:01.217823
bytes   unicode
(in ms) (in ms) %       comment
========== case conversion -- dense
0.37    0.38    97.9    ("WHERE IN THE WORLD IS CARMEN SAN DEIGO?"*10).lower() (*1000)
0.38    0.38    99.3    ("where in the world is carmen san deigo?"*10).upper() (*1000)
========== case conversion -- rare
0.38    0.38    99.9    ("Where in the world is Carmen San Deigo?"*10).lower() (*1000)
0.43    0.38    113.6   ("wHERE IN THE WORLD IS cARMEN sAN dEIGO?"*10).upper() (*1000)
========== concat 20 strings of words length 4 to 15
1.76    1.69    104.2   s1+s2+s3+s4+...+s20 (*1000)
========== concat two strings
0.08    0.07    107.7   "Andrew"+"Dalke" (*1000)
========== count AACT substrings in DNA example
2.15    2.13    100.7   dna.count("AACT") (*10)
========== count newlines
0.65    0.58    110.8   ...text.with.2000.newlines.count("\n") (*10)
========== early match, single character
0.20    0.19    107.9   ("A"*1000).find("A") (*1000)
0.36    0.05    745.8   "A" in "A"*1000 (*1000)
0.18    0.19    96.4    ("A"*1000).index("A") (*1000)
0.18    0.21    85.5    ("A"*1000).partition("A") (*1000)
0.21    0.20    103.6   ("A"*1000).rfind("A") (*1000)
0.21    0.30    69.8    ("A"*1000).rindex("A") (*1000)
0.37    0.21    171.7   ("A"*1000).rpartition("A") (*1000)
0.38    0.39    98.4    ("A"*1000).rsplit("A", 1) (*1000)
0.37    0.37    100.7   ("A"*1000).split("A", 1) (*1000)
========== early match, two characters
0.20    0.19    107.7   ("AB"*1000).find("AB") (*1000)
0.36    0.05    702.1   "AB" in "AB"*1000 (*1000)
0.18    0.19    96.9    ("AB"*1000).index("AB") (*1000)
0.20    0.24    83.9    ("AB"*1000).partition("AB") (*1000)
0.20    0.20    103.6   ("AB"*1000).rfind("AB") (*1000)
0.20    0.19    102.9   ("AB"*1000).rindex("AB") (*1000)
0.20    0.23    86.7    ("AB"*1000).rpartition("AB") (*1000)
0.39    0.40    97.7    ("AB"*1000).rsplit("AB", 1) (*1000)
0.40    0.42    94.4    ("AB"*1000).split("AB", 1) (*1000)
========== endswith multiple characters
0.17    0.19    92.6    "Andrew".endswith("Andrew") (*1000)
========== endswith multiple characters - not!
0.17    0.18    95.2    "Andrew".endswith("Anders") (*1000)
========== endswith single character
0.17    0.18    92.3    "Andrew".endswith("w") (*1000)
========== formatting a string type with a dict
N/A     0.91    0.0     "The %(k1)s is %(k2)s the %(k3)s."%{"k1":"x","k2":"y","k3":"z",} (*1000)
========== join empty string, with 1 character sep
N/A     0.04    0.0     "A".join("") (*100)
========== join empty string, with 5 character sep
N/A     0.04    0.0     "ABCDE".join("") (*100)
========== join list of 100 words, with 1 character sep
1.37    1.71    80.0    "A".join(["Bob"]*100)) (*1000)
========== join list of 100 words, with 5 character sep
1.50    1.86    80.8    "ABCDE".join(["Bob"]*100)) (*1000)
========== join list of 26 characters, with 1 character sep
0.48    0.49    99.6    "A".join(list("ABC..Z")) (*1000)
========== join list of 26 characters, with 5 character sep
0.49    0.54    91.3    "ABCDE".join(list("ABC..Z")) (*1000)
========== join string with 26 characters, with 1 character sep
N/A     1.17    0.0     "A".join("ABC..Z") (*1000)
========== join string with 26 characters, with 5 character sep
N/A     1.22    0.0     "ABCDE".join("ABC..Z") (*1000)
========== late match, 100 characters
8.48    8.46    100.2   s="ABC"*33; ((s+"D")*500+s+"E").find(s+"E") (*100)
4.19    3.50    119.9   s="ABC"*33; ((s+"D")*500+"E"+s).find("E"+s) (*100)
5.30    5.11    103.7   s="ABC"*33; (s+"E") in ((s+"D")*300+s+"E") (*100)
8.47    8.45    100.2   s="ABC"*33; ((s+"D")*500+s+"E").index(s+"E") (*100)
8.68    8.68    100.0   s="ABC"*33; ((s+"D")*500+s+"E").partition(s+"E") (*100)
6.36    6.37    99.8    s="ABC"*33; ("E"+s+("D"+s)*500).rfind("E"+s) (*100)
2.33    2.27    102.4   s="ABC"*33; (s+"E"+("D"+s)*500).rfind(s+"E") (*100)
6.58    6.58    100.1   s="ABC"*33; ("E"+s+("D"+s)*500).rindex("E"+s) (*100)
7.34    6.56    111.9   s="ABC"*33; ("E"+s+("D"+s)*500).rpartition("E"+s) (*100)
6.69    7.65    87.5    s="ABC"*33; ("E"+s+("D"+s)*500).rsplit("E"+s, 1) (*100)
8.47    8.87    95.4    s="ABC"*33; ((s+"D")*500+s+"E").split(s+"E", 1) (*100)
========== late match, two characters
1.30    1.26    102.7   ("AB"*300+"C").find("BC") (*1000)
1.30    1.27    102.0   ("AB"*300+"CA").find("CA") (*1000)
1.42    1.10    129.6   "BC" in ("AB"*300+"C") (*1000)
1.20    1.20    100.2   ("AB"*300+"C").index("BC") (*1000)
1.16    1.26    92.3    ("AB"*300+"C").partition("BC") (*1000)
0.95    0.94    101.0   ("C"+"AB"*300).rfind("CA") (*1000)
0.90    0.69    131.2   ("BC"+"AB"*300).rfind("BC") (*1000)
0.94    0.94    100.1   ("C"+"AB"*300).rindex("CA") (*1000)
1.02    0.94    108.6   ("C"+"AB"*300).rpartition("CA") (*1000)
1.12    1.08    103.7   ("C"+"AB"*300).rsplit("CA", 1) (*1000)
1.27    1.38    91.8    ("AB"*300+"C").split("BC", 1) (*1000)
========== no match, single character
0.45    0.41    111.1   ("A"*1000).find("B") (*1000)
0.59    0.29    205.4   "B" in "A"*1000 (*1000)
0.30    0.31    97.4    ("A"*1000).partition("B") (*1000)
0.49    0.48    102.5   ("A"*1000).rfind("B") (*1000)
0.36    0.37    96.5    ("A"*1000).rpartition("B") (*1000)
0.77    0.76    101.4   ("A"*1000).rsplit("B", 1) (*1000)
0.83    0.81    101.6   ("A"*1000).split("B", 1) (*1000)
========== no match, two characters
3.80    3.78    100.6   ("AB"*1000).find("BC") (*1000)
4.08    3.68    111.0   ("AB"*1000).find("CA") (*1000)
3.71    3.40    109.2   "BC" in "AB"*1000 (*1000)
3.44    3.42    100.8   ("AB"*1000).partition("BC") (*1000)
2.56    1.86    137.9   ("AB"*1000).rfind("BC") (*1000)
2.69    2.69    100.2   ("AB"*1000).rfind("CA") (*1000)
2.50    1.84    135.6   ("AB"*1000).rpartition("BC") (*1000)
2.03    1.94    104.7   ("AB"*1000).rsplit("BC", 1) (*1000)
3.27    3.56    91.8    ("AB"*1000).split("BC", 1) (*1000)
========== quick replace multiple character match
0.08    0.08    99.7    ("A" + ("Z"*128*1024)).replace("AZZ", "BBZZ", 1) (*10)
========== quick replace single character match
0.08    0.09    89.5    ("A" + ("Z"*128*1024)).replace("A", "BB", 1) (*10)
========== repeat 1 character 10 times
0.06    0.07    87.0    "A"*10 (*1000)
========== repeat 1 character 1000 times
0.13    0.15    89.3    "A"*1000 (*1000)
========== repeat 5 characters 10 times
0.12    0.09    128.8   "ABCDE"*10 (*1000)
========== repeat 5 characters 1000 times
0.33    0.34    94.8    "ABCDE"*1000 (*1000)
========== replace and expand multiple characters, big string
1.83    2.11    86.4    "...text.with.2000.newlines...replace("\n", "\r\n") (*10)
========== replace multiple characters, dna
3.21    3.23    99.5    dna.replace("ATC", "ATT") (*10)
========== replace single character
0.18    0.25    70.9    "This is a test".replace(" ", "\t") (*1000)
========== replace single character, big string
0.65    0.92    70.1    "...text.with.2000.lines...replace("\n", " ") (*10)
========== replace/remove multiple characters
0.27    0.34    78.7    "When shall we three meet again?".replace("ee", "") (*1000)
========== split 1 whitespace
0.12    0.14    82.7    ("Here are some words. "*2).partition(" ") (*1000)
0.08    0.11    75.9    ("Here are some words. "*2).rpartition(" ") (*1000)
0.23    0.26    87.4    ("Here are some words. "*2).rsplit(None, 1) (*1000)
0.24    0.25    95.9    ("Here are some words. "*2).split(None, 1) (*1000)
========== split 2000 newlines
1.59    1.75    90.8    "...text...".rsplit("\n") (*10)
1.64    1.68    97.5    "...text...".split("\n") (*10)
1.83    2.03    90.1    "...text...".splitlines() (*10)
========== split newlines
0.26    0.29    88.8    "this\nis\na\ntest\n".rsplit("\n") (*1000)
0.27    0.29    92.2    "this\nis\na\ntest\n".split("\n") (*1000)
0.26    0.30    85.8    "this\nis\na\ntest\n".splitlines() (*1000)
========== split on multicharacter separator (dna)
2.18    1.86    117.5   dna.rsplit("ACTAT") (*10)
2.53    2.48    102.0   dna.split("ACTAT") (*10)
========== split on multicharacter separator (small)
0.53    0.59    88.8    "this--is--a--test--of--the--emergency--broadcast--system".rsplit("--") (*1000)
0.59    0.57    102.6   "this--is--a--test--of--the--emergency--broadcast--system".split("--") (*1000)
========== split whitespace (huge)
1.50    1.73    86.9    human_text.rsplit() (*10)
1.49    1.75    85.5    human_text.split() (*10)
========== split whitespace (small)
0.43    0.50    87.0    ("Here are some words. "*2).rsplit() (*1000)
0.40    0.50    79.4    ("Here are some words. "*2).split() (*1000)
========== startswith multiple characters
0.17    0.18    92.0    "Andrew".startswith("Andrew") (*1000)
========== startswith multiple characters - not!
0.17    0.17    99.5    "Andrew".startswith("Anders") (*1000)
========== startswith single character
0.17    0.18    94.0    "Andrew".startswith("A") (*1000)
========== strip terminal newline
0.07    0.15    46.9    s="Hello!\n"; s[:-1] if s[-1]=="\n" else s (*1000)
0.06    0.07    78.1    "\nHello!".rstrip() (*1000)
0.05    0.13    42.1    "Hello!\n".rstrip() (*1000)
0.06    0.07    77.1    "\nHello!\n".strip() (*1000)
0.06    0.07    77.6    "\nHello!".strip() (*1000)
0.05    0.07    75.0    "Hello!\n".strip() (*1000)
========== strip terminal spaces and tabs
0.06    0.08    74.2    "\t   \tHello".rstrip() (*1000)
0.06    0.07    79.4    "Hello\t   \t".rstrip() (*1000)
0.04    0.05    87.1    "Hello\t   \t".strip() (*1000)
========== tab split
0.44    0.51    87.2    GFF3_example.rsplit("\t", 8) (*1000)
0.42    0.47    89.9    GFF3_example.rsplit("\t") (*1000)
0.39    0.44    88.7    GFF3_example.split("\t", 8) (*1000)
0.41    0.47    86.1    GFF3_example.split("\t") (*1000)
158.46  160.84  98.5    TOTAL

iobench and stringbench results on patched Python (pack the 3 structures):

$ ./python Tools/iobench/iobench.py -t
Preparing files...
Python 3.3.0a1+ (default:51016ff7f8c9+, Mar 27 2012, 13:11:28) 
[GCC 4.6.1]
Unicode: PEP 393
Linux-3.0.0-16-generic-pae-i686-with-debian-wheezy-sid
Text unit = one character (utf8-decoded)

** Text input **

[ 400KB ] read one unit at a time... 5.4 MB/s
[ 400KB ] read 20 units at a time... 68.5 MB/s
[ 400KB ] read one line at a time... 163 MB/s
[ 400KB ] read 4096 units at a time... 295 MB/s

[ 20KB ] read whole contents at once... 322 MB/s
[ 400KB ] read whole contents at once... 336 MB/s
[ 10MB ] read whole contents at once... 289 MB/s

[ 400KB ] seek forward one unit at a time... 0.32 MB/s
[ 400KB ] seek forward 1000 units at a time... 325 MB/s

** Text append **

[ 20KB ] write one unit at a time... 2.99 MB/s
[ 400KB ] write 20 units at a time... 44 MB/s
[ 400KB ] write 4096 units at a time... 556 MB/s
[ 10MB ] write 1e6 units at a time... 456 MB/s

** Text overwrite **

[ 20KB ] modify one unit at a time... 1.16 MB/s
[ 400KB ] modify 20 units at a time... 19.5 MB/s
[ 400KB ] modify 4096 units at a time... 401 MB/s

$ ./python stringbench/stringbench.py 
stringbench v2.0
3.3.0a1+ (default:51016ff7f8c9+, Mar 27 2012, 13:11:28) 
[GCC 4.6.1]
2012-03-27 13:17:42.363789
bytes   unicode
(in ms) (in ms) %       comment
========== case conversion -- dense
0.37    0.38    98.6    ("WHERE IN THE WORLD IS CARMEN SAN DEIGO?"*10).lower() (*1000)
0.37    0.38    98.4    ("where in the world is carmen san deigo?"*10).upper() (*1000)
========== case conversion -- rare
0.37    0.38    98.6    ("Where in the world is Carmen San Deigo?"*10).lower() (*1000)
0.37    0.38    98.4    ("wHERE IN THE WORLD IS cARMEN sAN dEIGO?"*10).upper() (*1000)
========== concat 20 strings of words length 4 to 15
1.86    1.85    100.9   s1+s2+s3+s4+...+s20 (*1000)
========== concat two strings
0.08    0.07    108.0   "Andrew"+"Dalke" (*1000)
========== count AACT substrings in DNA example
2.16    2.12    101.8   dna.count("AACT") (*10)
========== count newlines
0.59    0.58    101.3   ...text.with.2000.newlines.count("\n") (*10)
========== early match, single character
0.18    0.17    103.7   ("A"*1000).find("A") (*1000)
0.36    0.05    775.5   "A" in "A"*1000 (*1000)
0.17    0.17    102.0   ("A"*1000).index("A") (*1000)
0.17    0.20    84.7    ("A"*1000).partition("A") (*1000)
0.19    0.19    102.2   ("A"*1000).rfind("A") (*1000)
0.19    0.38    50.7    ("A"*1000).rindex("A") (*1000)
0.18    0.20    90.0    ("A"*1000).rpartition("A") (*1000)
0.59    0.36    166.9   ("A"*1000).rsplit("A", 1) (*1000)
0.34    0.36    93.5    ("A"*1000).split("A", 1) (*1000)
========== early match, two characters
0.18    0.19    95.8    ("AB"*1000).find("AB") (*1000)
0.44    0.05    891.0   "AB" in "AB"*1000 (*1000)
0.23    0.31    73.4    ("AB"*1000).index("AB") (*1000)
0.22    0.31    70.7    ("AB"*1000).partition("AB") (*1000)
0.19    0.19    101.2   ("AB"*1000).rfind("AB") (*1000)
0.19    0.19    102.0   ("AB"*1000).rindex("AB") (*1000)
0.17    0.21    78.7    ("AB"*1000).rpartition("AB") (*1000)
0.35    0.38    93.0    ("AB"*1000).rsplit("AB", 1) (*1000)
0.39    0.42    93.0    ("AB"*1000).split("AB", 1) (*1000)
========== endswith multiple characters
0.16    0.17    93.0    "Andrew".endswith("Andrew") (*1000)
========== endswith multiple characters - not!
0.16    0.16    101.4   "Andrew".endswith("Anders") (*1000)
========== endswith single character
0.16    0.17    93.7    "Andrew".endswith("w") (*1000)
========== formatting a string type with a dict
N/A     0.86    0.0     "The %(k1)s is %(k2)s the %(k3)s."%{"k1":"x","k2":"y","k3":"z",} (*1000)
========== join empty string, with 1 character sep
N/A     0.04    0.0     "A".join("") (*100)
========== join empty string, with 5 character sep
N/A     0.04    0.0     "ABCDE".join("") (*100)
========== join list of 100 words, with 1 character sep
1.42    1.74    81.3    "A".join(["Bob"]*100)) (*1000)
========== join list of 100 words, with 5 character sep
1.62    1.95    83.3    "ABCDE".join(["Bob"]*100)) (*1000)
========== join list of 26 characters, with 1 character sep
0.51    0.57    89.7    "A".join(list("ABC..Z")) (*1000)
========== join list of 26 characters, with 5 character sep
0.58    0.53    108.1   "ABCDE".join(list("ABC..Z")) (*1000)
========== join string with 26 characters, with 1 character sep
N/A     1.30    0.0     "A".join("ABC..Z") (*1000)
========== join string with 26 characters, with 5 character sep
N/A     1.22    0.0     "ABCDE".join("ABC..Z") (*1000)
========== late match, 100 characters
8.50    8.45    100.6   s="ABC"*33; ((s+"D")*500+s+"E").find(s+"E") (*100)
3.70    3.46    107.0   s="ABC"*33; ((s+"D")*500+"E"+s).find("E"+s) (*100)
5.11    5.08    100.6   s="ABC"*33; (s+"E") in ((s+"D")*300+s+"E") (*100)
8.62    8.47    101.7   s="ABC"*33; ((s+"D")*500+s+"E").index(s+"E") (*100)
8.80    8.67    101.5   s="ABC"*33; ((s+"D")*500+s+"E").partition(s+"E") (*100)
6.39    6.46    99.0    s="ABC"*33; ("E"+s+("D"+s)*500).rfind("E"+s) (*100)
2.31    2.18    105.9   s="ABC"*33; (s+"E"+("D"+s)*500).rfind(s+"E") (*100)
6.41    6.35    100.9   s="ABC"*33; ("E"+s+("D"+s)*500).rindex("E"+s) (*100)
7.41    6.56    112.9   s="ABC"*33; ("E"+s+("D"+s)*500).rpartition("E"+s) (*100)
6.59    6.59    100.0   s="ABC"*33; ("E"+s+("D"+s)*500).rsplit("E"+s, 1) (*100)
8.00    8.69    92.0    s="ABC"*33; ((s+"D")*500+s+"E").split(s+"E", 1) (*100)
========== late match, two characters
1.20    1.21    99.6    ("AB"*300+"C").find("BC") (*1000)
1.29    1.25    103.1   ("AB"*300+"CA").find("CA") (*1000)
1.41    1.07    130.9   "BC" in ("AB"*300+"C") (*1000)
1.20    1.21    99.3    ("AB"*300+"C").index("BC") (*1000)
1.17    1.20    97.5    ("AB"*300+"C").partition("BC") (*1000)
0.95    0.93    101.4   ("C"+"AB"*300).rfind("CA") (*1000)
0.90    0.69    129.3   ("BC"+"AB"*300).rfind("BC") (*1000)
0.95    0.94    101.2   ("C"+"AB"*300).rindex("CA") (*1000)
1.01    0.94    106.8   ("C"+"AB"*300).rpartition("CA") (*1000)
1.11    1.10    101.5   ("C"+"AB"*300).rsplit("CA", 1) (*1000)
1.28    1.37    93.6    ("AB"*300+"C").split("BC", 1) (*1000)
========== no match, single character
0.41    0.40    101.2   ("A"*1000).find("B") (*1000)
0.59    0.29    203.8   "B" in "A"*1000 (*1000)
0.29    0.30    95.7    ("A"*1000).partition("B") (*1000)
0.49    0.48    101.4   ("A"*1000).rfind("B") (*1000)
0.37    0.38    97.3    ("A"*1000).rpartition("B") (*1000)
0.76    0.75    101.1   ("A"*1000).rsplit("B", 1) (*1000)
0.76    0.75    100.9   ("A"*1000).split("B", 1) (*1000)
========== no match, two characters
3.53    3.52    100.2   ("AB"*1000).find("BC") (*1000)
3.92    3.67    106.9   ("AB"*1000).find("CA") (*1000)
3.71    3.39    109.6   "BC" in "AB"*1000 (*1000)
3.40    3.42    99.5    ("AB"*1000).partition("BC") (*1000)
2.55    1.90    134.2   ("AB"*1000).rfind("BC") (*1000)
2.69    2.68    100.1   ("AB"*1000).rfind("CA") (*1000)
2.43    1.81    133.9   ("AB"*1000).rpartition("BC") (*1000)
2.02    1.92    104.8   ("AB"*1000).rsplit("BC", 1) (*1000)
3.27    3.54    92.4    ("AB"*1000).split("BC", 1) (*1000)
========== quick replace multiple character match
0.09    0.08    107.7   ("A" + ("Z"*128*1024)).replace("AZZ", "BBZZ", 1) (*10)
========== quick replace single character match
0.09    0.08    108.7   ("A" + ("Z"*128*1024)).replace("A", "BB", 1) (*10)
========== repeat 1 character 10 times
0.06    0.07    87.5    "A"*10 (*1000)
========== repeat 1 character 1000 times
0.16    0.12    135.0   "A"*1000 (*1000)
========== repeat 5 characters 10 times
0.11    0.10    104.9   "ABCDE"*10 (*1000)
========== repeat 5 characters 1000 times
0.35    0.37    93.7    "ABCDE"*1000 (*1000)
========== replace and expand multiple characters, big string
1.78    2.04    87.3    "...text.with.2000.newlines...replace("\n", "\r\n") (*10)
========== replace multiple characters, dna
3.20    3.25    98.5    dna.replace("ATC", "ATT") (*10)
========== replace single character
0.17    0.24    73.0    "This is a test".replace(" ", "\t") (*1000)
========== replace single character, big string
0.62    0.88    69.7    "...text.with.2000.lines...replace("\n", " ") (*10)
========== replace/remove multiple characters
0.25    0.32    78.3    "When shall we three meet again?".replace("ee", "") (*1000)
========== split 1 whitespace
0.10    0.13    78.9    ("Here are some words. "*2).partition(" ") (*1000)
0.08    0.11    76.8    ("Here are some words. "*2).rpartition(" ") (*1000)
0.23    0.25    91.7    ("Here are some words. "*2).rsplit(None, 1) (*1000)
0.23    0.26    87.1    ("Here are some words. "*2).split(None, 1) (*1000)
========== split 2000 newlines
1.60    1.75    91.7    "...text...".rsplit("\n") (*10)
1.56    1.65    94.3    "...text...".split("\n") (*10)
1.78    2.04    87.0    "...text...".splitlines() (*10)
========== split newlines
0.27    0.29    92.6    "this\nis\na\ntest\n".rsplit("\n") (*1000)
0.27    0.29    94.2    "this\nis\na\ntest\n".split("\n") (*1000)
0.26    0.29    90.4    "this\nis\na\ntest\n".splitlines() (*1000)
========== split on multicharacter separator (dna)
2.09    1.92    108.5   dna.rsplit("ACTAT") (*10)
2.56    2.64    96.9    dna.split("ACTAT") (*10)
========== split on multicharacter separator (small)
0.72    0.89    81.1    "this--is--a--test--of--the--emergency--broadcast--system".rsplit("--") (*1000)
0.75    0.65    114.5   "this--is--a--test--of--the--emergency--broadcast--system".split("--") (*1000)
========== split whitespace (huge)
1.50    1.73    86.3    human_text.rsplit() (*10)
2.25    2.68    83.8    human_text.split() (*10)
========== split whitespace (small)
0.42    0.51    82.0    ("Here are some words. "*2).rsplit() (*1000)
0.41    0.48    86.7    ("Here are some words. "*2).split() (*1000)
========== startswith multiple characters
0.16    0.18    88.9    "Andrew".startswith("Andrew") (*1000)
========== startswith multiple characters - not!
0.19    0.17    112.0   "Andrew".startswith("Anders") (*1000)
========== startswith single character
0.16    0.18    88.2    "Andrew".startswith("A") (*1000)
========== strip terminal newline
0.07    0.16    45.5    s="Hello!\n"; s[:-1] if s[-1]=="\n" else s (*1000)
0.05    0.07    79.2    "\nHello!".rstrip() (*1000)
0.05    0.07    76.5    "Hello!\n".rstrip() (*1000)
0.06    0.07    80.9    "\nHello!\n".strip() (*1000)
0.06    0.07    80.7    "\nHello!".strip() (*1000)
0.05    0.07    77.4    "Hello!\n".strip() (*1000)
========== strip terminal spaces and tabs
0.06    0.08    77.6    "\t   \tHello".rstrip() (*1000)
0.06    0.07    81.8    "Hello\t   \t".rstrip() (*1000)
0.04    0.05    77.5    "Hello\t   \t".strip() (*1000)
========== tab split
0.47    0.50    94.5    GFF3_example.rsplit("\t", 8) (*1000)
0.43    0.47    91.3    GFF3_example.rsplit("\t") (*1000)
0.38    0.43    88.7    GFF3_example.split("\t", 8) (*1000)
0.40    0.46    87.4    GFF3_example.split("\t") (*1000)
157.65  160.53  98.2    TOTAL

vstinner · 2012-03-27T11:29:44Z

Compare the stringbench totals (unicode column): 160.84 ms (unpatched) vs 160.53 ms (patched). I don't see any difference in the benchmark results. The small differences are just benchmark noise.

loewis · 2012-03-27T14:43:17Z

-1. Using packed structures may violate all kinds of expectations in extension modules. I consider it important that the data block of a string is well-aligned.

bitdancer · 2012-03-30T21:26:09Z

Looks like this should be closed as rejected?

vstinner · 2012-03-30T21:36:51Z

> I consider it important that the data block of a string is well-aligned.

I suppose that it doesn't matter for latin1, but it can be a problem for UCS-2 and UCS-4. There are more drawbacks than advantages, so I agree to close this issue. Let's focus instead on enabling optimizations based on memory alignment, like bpo-14419 :-)

vstinner added the interpreter-core (Objects, Python, Grammar, and Parser dirs) label Mar 27, 2012

bitdancer added the type-feature A feature request or enhancement label Mar 30, 2012

vstinner closed this as completed Mar 30, 2012

ezio-melotti transferred this issue from another repository Apr 10, 2022
