classification
Title: Pack PyASCIIObject fields to reduce memory consumption of pure ASCII strings
Type: enhancement Stage:
Components: Interpreter Core Versions: Python 3.3
process
Status: closed Resolution: wont fix
Dependencies: Superseder:
Assigned To: Nosy List: jcea, loewis, pitrou, r.david.murray, serhiy.storchaka, vstinner
Priority: normal Keywords: patch

Created on 2012-03-27 11:14 by vstinner, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Files
File name Uploaded Description Edit
pack_pyasciiobject.patch vstinner, 2012-03-27 11:14 review
Messages (6)
msg156905 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2012-03-27 11:14
It is possible to reduce PyASCIIObject.state from 32 bits to 8 bits, move it to the end of the structure (swapping wstr and state), and pack the structure. As a result, the structure size is reduced by 3 bytes (the type of state changes from int to char).

I expect little or no performance overhead, because only the PyASCIIObject.state field is affected and this field is 8 bits wide.

See also issue #14419, which relies on memory alignment (of the ASCII string data) to optimize the ASCII decoder. If I understand correctly, my patch makes that optimization impossible.

--

Example on Linux 32 bits:

$ cat x.c 
#include <Python.h>
#include <stdio.h>

int main(void)
{
    /* sizeof yields a size_t, so print it with %zu */
    printf("sizeof(PyASCIIObject)=%zu bytes\n", sizeof(PyASCIIObject));
    printf("sizeof(PyCompactUnicodeObject)=%zu bytes\n", sizeof(PyCompactUnicodeObject));
    printf("sizeof(PyUnicodeObject)=%zu bytes\n", sizeof(PyUnicodeObject));
    return 0;
}

# unpatched
$ gcc -I Include/ -I . x.c -o x && ./x
sizeof(PyASCIIObject)=24 bytes
sizeof(PyCompactUnicodeObject)=36 bytes
sizeof(PyUnicodeObject)=40 bytes

# pack the 3 structures
$ gcc -I Include/ -I . x.c -o x && ./x
sizeof(PyASCIIObject)=21 bytes
sizeof(PyCompactUnicodeObject)=33 bytes
sizeof(PyUnicodeObject)=37 bytes

--

We could also pack PyCompactUnicodeObject and PyUnicodeObject, but that would hurt performance because utf8_length, utf8, wstr_length and data would no longer be aligned.
msg156908 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2012-03-27 11:23
iobench and stringbench results on unpatched Python:

$ ./python Tools/iobench/iobench.py -t
Preparing files...
Python 3.3.0a1+ (default:51016ff7f8c9, Mar 27 2012, 13:19:52) 
[GCC 4.6.1]
Unicode: PEP 393
Linux-3.0.0-16-generic-pae-i686-with-debian-wheezy-sid
Text unit = one character (utf8-decoded)

** Text input **

[ 400KB ] read one unit at a time...                    5.4 MB/s
[ 400KB ] read 20 units at a time...                     68 MB/s
[ 400KB ] read one line at a time...                    174 MB/s
[ 400KB ] read 4096 units at a time...                  289 MB/s

[  20KB ] read whole contents at once...                315 MB/s
[ 400KB ] read whole contents at once...                332 MB/s
[  10MB ] read whole contents at once...                292 MB/s

[ 400KB ] seek forward one unit at a time...          0.304 MB/s
[ 400KB ] seek forward 1000 units at a time...          312 MB/s

** Text append **

[  20KB ] write one unit at a time...                  3.05 MB/s
[ 400KB ] write 20 units at a time...                    43 MB/s
[ 400KB ] write 4096 units at a time...                 554 MB/s
[  10MB ] write 1e6 units at a time...                  450 MB/s

** Text overwrite **

[  20KB ] modify one unit at a time...                 1.18 MB/s
[ 400KB ] modify 20 units at a time...                 18.9 MB/s
[ 400KB ] modify 4096 units at a time...                400 MB/s

$ ./python stringbench/stringbench.py 
stringbench v2.0
3.3.0a1+ (default:51016ff7f8c9, Mar 27 2012, 13:19:52) 
[GCC 4.6.1]
2012-03-27 13:21:01.217823
bytes   unicode
(in ms) (in ms) %       comment
========== case conversion -- dense
0.37    0.38    97.9    ("WHERE IN THE WORLD IS CARMEN SAN DEIGO?"*10).lower() (*1000)
0.38    0.38    99.3    ("where in the world is carmen san deigo?"*10).upper() (*1000)
========== case conversion -- rare
0.38    0.38    99.9    ("Where in the world is Carmen San Deigo?"*10).lower() (*1000)
0.43    0.38    113.6   ("wHERE IN THE WORLD IS cARMEN sAN dEIGO?"*10).upper() (*1000)
========== concat 20 strings of words length 4 to 15
1.76    1.69    104.2   s1+s2+s3+s4+...+s20 (*1000)
========== concat two strings
0.08    0.07    107.7   "Andrew"+"Dalke" (*1000)
========== count AACT substrings in DNA example
2.15    2.13    100.7   dna.count("AACT") (*10)
========== count newlines
0.65    0.58    110.8   ...text.with.2000.newlines.count("\n") (*10)
========== early match, single character
0.20    0.19    107.9   ("A"*1000).find("A") (*1000)
0.36    0.05    745.8   "A" in "A"*1000 (*1000)
0.18    0.19    96.4    ("A"*1000).index("A") (*1000)
0.18    0.21    85.5    ("A"*1000).partition("A") (*1000)
0.21    0.20    103.6   ("A"*1000).rfind("A") (*1000)
0.21    0.30    69.8    ("A"*1000).rindex("A") (*1000)
0.37    0.21    171.7   ("A"*1000).rpartition("A") (*1000)
0.38    0.39    98.4    ("A"*1000).rsplit("A", 1) (*1000)
0.37    0.37    100.7   ("A"*1000).split("A", 1) (*1000)
========== early match, two characters
0.20    0.19    107.7   ("AB"*1000).find("AB") (*1000)
0.36    0.05    702.1   "AB" in "AB"*1000 (*1000)
0.18    0.19    96.9    ("AB"*1000).index("AB") (*1000)
0.20    0.24    83.9    ("AB"*1000).partition("AB") (*1000)
0.20    0.20    103.6   ("AB"*1000).rfind("AB") (*1000)
0.20    0.19    102.9   ("AB"*1000).rindex("AB") (*1000)
0.20    0.23    86.7    ("AB"*1000).rpartition("AB") (*1000)
0.39    0.40    97.7    ("AB"*1000).rsplit("AB", 1) (*1000)
0.40    0.42    94.4    ("AB"*1000).split("AB", 1) (*1000)
========== endswith multiple characters
0.17    0.19    92.6    "Andrew".endswith("Andrew") (*1000)
========== endswith multiple characters - not!
0.17    0.18    95.2    "Andrew".endswith("Anders") (*1000)
========== endswith single character
0.17    0.18    92.3    "Andrew".endswith("w") (*1000)
========== formatting a string type with a dict
N/A     0.91    0.0     "The %(k1)s is %(k2)s the %(k3)s."%{"k1":"x","k2":"y","k3":"z",} (*1000)
========== join empty string, with 1 character sep
N/A     0.04    0.0     "A".join("") (*100)
========== join empty string, with 5 character sep
N/A     0.04    0.0     "ABCDE".join("") (*100)
========== join list of 100 words, with 1 character sep
1.37    1.71    80.0    "A".join(["Bob"]*100)) (*1000)
========== join list of 100 words, with 5 character sep
1.50    1.86    80.8    "ABCDE".join(["Bob"]*100)) (*1000)
========== join list of 26 characters, with 1 character sep
0.48    0.49    99.6    "A".join(list("ABC..Z")) (*1000)
========== join list of 26 characters, with 5 character sep
0.49    0.54    91.3    "ABCDE".join(list("ABC..Z")) (*1000)
========== join string with 26 characters, with 1 character sep
N/A     1.17    0.0     "A".join("ABC..Z") (*1000)
========== join string with 26 characters, with 5 character sep
N/A     1.22    0.0     "ABCDE".join("ABC..Z") (*1000)
========== late match, 100 characters
8.48    8.46    100.2   s="ABC"*33; ((s+"D")*500+s+"E").find(s+"E") (*100)
4.19    3.50    119.9   s="ABC"*33; ((s+"D")*500+"E"+s).find("E"+s) (*100)
5.30    5.11    103.7   s="ABC"*33; (s+"E") in ((s+"D")*300+s+"E") (*100)
8.47    8.45    100.2   s="ABC"*33; ((s+"D")*500+s+"E").index(s+"E") (*100)
8.68    8.68    100.0   s="ABC"*33; ((s+"D")*500+s+"E").partition(s+"E") (*100)
6.36    6.37    99.8    s="ABC"*33; ("E"+s+("D"+s)*500).rfind("E"+s) (*100)
2.33    2.27    102.4   s="ABC"*33; (s+"E"+("D"+s)*500).rfind(s+"E") (*100)
6.58    6.58    100.1   s="ABC"*33; ("E"+s+("D"+s)*500).rindex("E"+s) (*100)
7.34    6.56    111.9   s="ABC"*33; ("E"+s+("D"+s)*500).rpartition("E"+s) (*100)
6.69    7.65    87.5    s="ABC"*33; ("E"+s+("D"+s)*500).rsplit("E"+s, 1) (*100)
8.47    8.87    95.4    s="ABC"*33; ((s+"D")*500+s+"E").split(s+"E", 1) (*100)
========== late match, two characters
1.30    1.26    102.7   ("AB"*300+"C").find("BC") (*1000)
1.30    1.27    102.0   ("AB"*300+"CA").find("CA") (*1000)
1.42    1.10    129.6   "BC" in ("AB"*300+"C") (*1000)
1.20    1.20    100.2   ("AB"*300+"C").index("BC") (*1000)
1.16    1.26    92.3    ("AB"*300+"C").partition("BC") (*1000)
0.95    0.94    101.0   ("C"+"AB"*300).rfind("CA") (*1000)
0.90    0.69    131.2   ("BC"+"AB"*300).rfind("BC") (*1000)
0.94    0.94    100.1   ("C"+"AB"*300).rindex("CA") (*1000)
1.02    0.94    108.6   ("C"+"AB"*300).rpartition("CA") (*1000)
1.12    1.08    103.7   ("C"+"AB"*300).rsplit("CA", 1) (*1000)
1.27    1.38    91.8    ("AB"*300+"C").split("BC", 1) (*1000)
========== no match, single character
0.45    0.41    111.1   ("A"*1000).find("B") (*1000)
0.59    0.29    205.4   "B" in "A"*1000 (*1000)
0.30    0.31    97.4    ("A"*1000).partition("B") (*1000)
0.49    0.48    102.5   ("A"*1000).rfind("B") (*1000)
0.36    0.37    96.5    ("A"*1000).rpartition("B") (*1000)
0.77    0.76    101.4   ("A"*1000).rsplit("B", 1) (*1000)
0.83    0.81    101.6   ("A"*1000).split("B", 1) (*1000)
========== no match, two characters
3.80    3.78    100.6   ("AB"*1000).find("BC") (*1000)
4.08    3.68    111.0   ("AB"*1000).find("CA") (*1000)
3.71    3.40    109.2   "BC" in "AB"*1000 (*1000)
3.44    3.42    100.8   ("AB"*1000).partition("BC") (*1000)
2.56    1.86    137.9   ("AB"*1000).rfind("BC") (*1000)
2.69    2.69    100.2   ("AB"*1000).rfind("CA") (*1000)
2.50    1.84    135.6   ("AB"*1000).rpartition("BC") (*1000)
2.03    1.94    104.7   ("AB"*1000).rsplit("BC", 1) (*1000)
3.27    3.56    91.8    ("AB"*1000).split("BC", 1) (*1000)
========== quick replace multiple character match
0.08    0.08    99.7    ("A" + ("Z"*128*1024)).replace("AZZ", "BBZZ", 1) (*10)
========== quick replace single character match
0.08    0.09    89.5    ("A" + ("Z"*128*1024)).replace("A", "BB", 1) (*10)
========== repeat 1 character 10 times
0.06    0.07    87.0    "A"*10 (*1000)
========== repeat 1 character 1000 times
0.13    0.15    89.3    "A"*1000 (*1000)
========== repeat 5 characters 10 times
0.12    0.09    128.8   "ABCDE"*10 (*1000)
========== repeat 5 characters 1000 times
0.33    0.34    94.8    "ABCDE"*1000 (*1000)
========== replace and expand multiple characters, big string
1.83    2.11    86.4    "...text.with.2000.newlines...replace("\n", "\r\n") (*10)
========== replace multiple characters, dna
3.21    3.23    99.5    dna.replace("ATC", "ATT") (*10)
========== replace single character
0.18    0.25    70.9    "This is a test".replace(" ", "\t") (*1000)
========== replace single character, big string
0.65    0.92    70.1    "...text.with.2000.lines...replace("\n", " ") (*10)
========== replace/remove multiple characters
0.27    0.34    78.7    "When shall we three meet again?".replace("ee", "") (*1000)
========== split 1 whitespace
0.12    0.14    82.7    ("Here are some words. "*2).partition(" ") (*1000)
0.08    0.11    75.9    ("Here are some words. "*2).rpartition(" ") (*1000)
0.23    0.26    87.4    ("Here are some words. "*2).rsplit(None, 1) (*1000)
0.24    0.25    95.9    ("Here are some words. "*2).split(None, 1) (*1000)
========== split 2000 newlines
1.59    1.75    90.8    "...text...".rsplit("\n") (*10)
1.64    1.68    97.5    "...text...".split("\n") (*10)
1.83    2.03    90.1    "...text...".splitlines() (*10)
========== split newlines
0.26    0.29    88.8    "this\nis\na\ntest\n".rsplit("\n") (*1000)
0.27    0.29    92.2    "this\nis\na\ntest\n".split("\n") (*1000)
0.26    0.30    85.8    "this\nis\na\ntest\n".splitlines() (*1000)
========== split on multicharacter separator (dna)
2.18    1.86    117.5   dna.rsplit("ACTAT") (*10)
2.53    2.48    102.0   dna.split("ACTAT") (*10)
========== split on multicharacter separator (small)
0.53    0.59    88.8    "this--is--a--test--of--the--emergency--broadcast--system".rsplit("--") (*1000)
0.59    0.57    102.6   "this--is--a--test--of--the--emergency--broadcast--system".split("--") (*1000)
========== split whitespace (huge)
1.50    1.73    86.9    human_text.rsplit() (*10)
1.49    1.75    85.5    human_text.split() (*10)
========== split whitespace (small)
0.43    0.50    87.0    ("Here are some words. "*2).rsplit() (*1000)
0.40    0.50    79.4    ("Here are some words. "*2).split() (*1000)
========== startswith multiple characters
0.17    0.18    92.0    "Andrew".startswith("Andrew") (*1000)
========== startswith multiple characters - not!
0.17    0.17    99.5    "Andrew".startswith("Anders") (*1000)
========== startswith single character
0.17    0.18    94.0    "Andrew".startswith("A") (*1000)
========== strip terminal newline
0.07    0.15    46.9    s="Hello!\n"; s[:-1] if s[-1]=="\n" else s (*1000)
0.06    0.07    78.1    "\nHello!".rstrip() (*1000)
0.05    0.13    42.1    "Hello!\n".rstrip() (*1000)
0.06    0.07    77.1    "\nHello!\n".strip() (*1000)
0.06    0.07    77.6    "\nHello!".strip() (*1000)
0.05    0.07    75.0    "Hello!\n".strip() (*1000)
========== strip terminal spaces and tabs
0.06    0.08    74.2    "\t   \tHello".rstrip() (*1000)
0.06    0.07    79.4    "Hello\t   \t".rstrip() (*1000)
0.04    0.05    87.1    "Hello\t   \t".strip() (*1000)
========== tab split
0.44    0.51    87.2    GFF3_example.rsplit("\t", 8) (*1000)
0.42    0.47    89.9    GFF3_example.rsplit("\t") (*1000)
0.39    0.44    88.7    GFF3_example.split("\t", 8) (*1000)
0.41    0.47    86.1    GFF3_example.split("\t") (*1000)
158.46  160.84  98.5    TOTAL

*****************

iobench and stringbench results on patched Python (pack the 3 structures):

$ ./python Tools/iobench/iobench.py -t
Preparing files...
Python 3.3.0a1+ (default:51016ff7f8c9+, Mar 27 2012, 13:11:28) 
[GCC 4.6.1]
Unicode: PEP 393
Linux-3.0.0-16-generic-pae-i686-with-debian-wheezy-sid
Text unit = one character (utf8-decoded)

** Text input **

[ 400KB ] read one unit at a time...                    5.4 MB/s
[ 400KB ] read 20 units at a time...                   68.5 MB/s
[ 400KB ] read one line at a time...                    163 MB/s
[ 400KB ] read 4096 units at a time...                  295 MB/s

[  20KB ] read whole contents at once...                322 MB/s
[ 400KB ] read whole contents at once...                336 MB/s
[  10MB ] read whole contents at once...                289 MB/s

[ 400KB ] seek forward one unit at a time...           0.32 MB/s
[ 400KB ] seek forward 1000 units at a time...          325 MB/s

** Text append **

[  20KB ] write one unit at a time...                  2.99 MB/s
[ 400KB ] write 20 units at a time...                    44 MB/s
[ 400KB ] write 4096 units at a time...                 556 MB/s
[  10MB ] write 1e6 units at a time...                  456 MB/s

** Text overwrite **

[  20KB ] modify one unit at a time...                 1.16 MB/s
[ 400KB ] modify 20 units at a time...                 19.5 MB/s
[ 400KB ] modify 4096 units at a time...                401 MB/s

$ ./python stringbench/stringbench.py 
stringbench v2.0
3.3.0a1+ (default:51016ff7f8c9+, Mar 27 2012, 13:11:28) 
[GCC 4.6.1]
2012-03-27 13:17:42.363789
bytes   unicode
(in ms) (in ms) %       comment
========== case conversion -- dense
0.37    0.38    98.6    ("WHERE IN THE WORLD IS CARMEN SAN DEIGO?"*10).lower() (*1000)
0.37    0.38    98.4    ("where in the world is carmen san deigo?"*10).upper() (*1000)
========== case conversion -- rare
0.37    0.38    98.6    ("Where in the world is Carmen San Deigo?"*10).lower() (*1000)
0.37    0.38    98.4    ("wHERE IN THE WORLD IS cARMEN sAN dEIGO?"*10).upper() (*1000)
========== concat 20 strings of words length 4 to 15
1.86    1.85    100.9   s1+s2+s3+s4+...+s20 (*1000)
========== concat two strings
0.08    0.07    108.0   "Andrew"+"Dalke" (*1000)
========== count AACT substrings in DNA example
2.16    2.12    101.8   dna.count("AACT") (*10)
========== count newlines
0.59    0.58    101.3   ...text.with.2000.newlines.count("\n") (*10)
========== early match, single character
0.18    0.17    103.7   ("A"*1000).find("A") (*1000)
0.36    0.05    775.5   "A" in "A"*1000 (*1000)
0.17    0.17    102.0   ("A"*1000).index("A") (*1000)
0.17    0.20    84.7    ("A"*1000).partition("A") (*1000)
0.19    0.19    102.2   ("A"*1000).rfind("A") (*1000)
0.19    0.38    50.7    ("A"*1000).rindex("A") (*1000)
0.18    0.20    90.0    ("A"*1000).rpartition("A") (*1000)
0.59    0.36    166.9   ("A"*1000).rsplit("A", 1) (*1000)
0.34    0.36    93.5    ("A"*1000).split("A", 1) (*1000)
========== early match, two characters
0.18    0.19    95.8    ("AB"*1000).find("AB") (*1000)
0.44    0.05    891.0   "AB" in "AB"*1000 (*1000)
0.23    0.31    73.4    ("AB"*1000).index("AB") (*1000)
0.22    0.31    70.7    ("AB"*1000).partition("AB") (*1000)
0.19    0.19    101.2   ("AB"*1000).rfind("AB") (*1000)
0.19    0.19    102.0   ("AB"*1000).rindex("AB") (*1000)
0.17    0.21    78.7    ("AB"*1000).rpartition("AB") (*1000)
0.35    0.38    93.0    ("AB"*1000).rsplit("AB", 1) (*1000)
0.39    0.42    93.0    ("AB"*1000).split("AB", 1) (*1000)
========== endswith multiple characters
0.16    0.17    93.0    "Andrew".endswith("Andrew") (*1000)
========== endswith multiple characters - not!
0.16    0.16    101.4   "Andrew".endswith("Anders") (*1000)
========== endswith single character
0.16    0.17    93.7    "Andrew".endswith("w") (*1000)
========== formatting a string type with a dict
N/A     0.86    0.0     "The %(k1)s is %(k2)s the %(k3)s."%{"k1":"x","k2":"y","k3":"z",} (*1000)
========== join empty string, with 1 character sep
N/A     0.04    0.0     "A".join("") (*100)
========== join empty string, with 5 character sep
N/A     0.04    0.0     "ABCDE".join("") (*100)
========== join list of 100 words, with 1 character sep
1.42    1.74    81.3    "A".join(["Bob"]*100)) (*1000)
========== join list of 100 words, with 5 character sep
1.62    1.95    83.3    "ABCDE".join(["Bob"]*100)) (*1000)
========== join list of 26 characters, with 1 character sep
0.51    0.57    89.7    "A".join(list("ABC..Z")) (*1000)
========== join list of 26 characters, with 5 character sep
0.58    0.53    108.1   "ABCDE".join(list("ABC..Z")) (*1000)
========== join string with 26 characters, with 1 character sep
N/A     1.30    0.0     "A".join("ABC..Z") (*1000)
========== join string with 26 characters, with 5 character sep
N/A     1.22    0.0     "ABCDE".join("ABC..Z") (*1000)
========== late match, 100 characters
8.50    8.45    100.6   s="ABC"*33; ((s+"D")*500+s+"E").find(s+"E") (*100)
3.70    3.46    107.0   s="ABC"*33; ((s+"D")*500+"E"+s).find("E"+s) (*100)
5.11    5.08    100.6   s="ABC"*33; (s+"E") in ((s+"D")*300+s+"E") (*100)
8.62    8.47    101.7   s="ABC"*33; ((s+"D")*500+s+"E").index(s+"E") (*100)
8.80    8.67    101.5   s="ABC"*33; ((s+"D")*500+s+"E").partition(s+"E") (*100)
6.39    6.46    99.0    s="ABC"*33; ("E"+s+("D"+s)*500).rfind("E"+s) (*100)
2.31    2.18    105.9   s="ABC"*33; (s+"E"+("D"+s)*500).rfind(s+"E") (*100)
6.41    6.35    100.9   s="ABC"*33; ("E"+s+("D"+s)*500).rindex("E"+s) (*100)
7.41    6.56    112.9   s="ABC"*33; ("E"+s+("D"+s)*500).rpartition("E"+s) (*100)
6.59    6.59    100.0   s="ABC"*33; ("E"+s+("D"+s)*500).rsplit("E"+s, 1) (*100)
8.00    8.69    92.0    s="ABC"*33; ((s+"D")*500+s+"E").split(s+"E", 1) (*100)
========== late match, two characters
1.20    1.21    99.6    ("AB"*300+"C").find("BC") (*1000)
1.29    1.25    103.1   ("AB"*300+"CA").find("CA") (*1000)
1.41    1.07    130.9   "BC" in ("AB"*300+"C") (*1000)
1.20    1.21    99.3    ("AB"*300+"C").index("BC") (*1000)
1.17    1.20    97.5    ("AB"*300+"C").partition("BC") (*1000)
0.95    0.93    101.4   ("C"+"AB"*300).rfind("CA") (*1000)
0.90    0.69    129.3   ("BC"+"AB"*300).rfind("BC") (*1000)
0.95    0.94    101.2   ("C"+"AB"*300).rindex("CA") (*1000)
1.01    0.94    106.8   ("C"+"AB"*300).rpartition("CA") (*1000)
1.11    1.10    101.5   ("C"+"AB"*300).rsplit("CA", 1) (*1000)
1.28    1.37    93.6    ("AB"*300+"C").split("BC", 1) (*1000)
========== no match, single character
0.41    0.40    101.2   ("A"*1000).find("B") (*1000)
0.59    0.29    203.8   "B" in "A"*1000 (*1000)
0.29    0.30    95.7    ("A"*1000).partition("B") (*1000)
0.49    0.48    101.4   ("A"*1000).rfind("B") (*1000)
0.37    0.38    97.3    ("A"*1000).rpartition("B") (*1000)
0.76    0.75    101.1   ("A"*1000).rsplit("B", 1) (*1000)
0.76    0.75    100.9   ("A"*1000).split("B", 1) (*1000)
========== no match, two characters
3.53    3.52    100.2   ("AB"*1000).find("BC") (*1000)
3.92    3.67    106.9   ("AB"*1000).find("CA") (*1000)
3.71    3.39    109.6   "BC" in "AB"*1000 (*1000)
3.40    3.42    99.5    ("AB"*1000).partition("BC") (*1000)
2.55    1.90    134.2   ("AB"*1000).rfind("BC") (*1000)
2.69    2.68    100.1   ("AB"*1000).rfind("CA") (*1000)
2.43    1.81    133.9   ("AB"*1000).rpartition("BC") (*1000)
2.02    1.92    104.8   ("AB"*1000).rsplit("BC", 1) (*1000)
3.27    3.54    92.4    ("AB"*1000).split("BC", 1) (*1000)
========== quick replace multiple character match
0.09    0.08    107.7   ("A" + ("Z"*128*1024)).replace("AZZ", "BBZZ", 1) (*10)
========== quick replace single character match
0.09    0.08    108.7   ("A" + ("Z"*128*1024)).replace("A", "BB", 1) (*10)
========== repeat 1 character 10 times
0.06    0.07    87.5    "A"*10 (*1000)
========== repeat 1 character 1000 times
0.16    0.12    135.0   "A"*1000 (*1000)
========== repeat 5 characters 10 times
0.11    0.10    104.9   "ABCDE"*10 (*1000)
========== repeat 5 characters 1000 times
0.35    0.37    93.7    "ABCDE"*1000 (*1000)
========== replace and expand multiple characters, big string
1.78    2.04    87.3    "...text.with.2000.newlines...replace("\n", "\r\n") (*10)
========== replace multiple characters, dna
3.20    3.25    98.5    dna.replace("ATC", "ATT") (*10)
========== replace single character
0.17    0.24    73.0    "This is a test".replace(" ", "\t") (*1000)
========== replace single character, big string
0.62    0.88    69.7    "...text.with.2000.lines...replace("\n", " ") (*10)
========== replace/remove multiple characters
0.25    0.32    78.3    "When shall we three meet again?".replace("ee", "") (*1000)
========== split 1 whitespace
0.10    0.13    78.9    ("Here are some words. "*2).partition(" ") (*1000)
0.08    0.11    76.8    ("Here are some words. "*2).rpartition(" ") (*1000)
0.23    0.25    91.7    ("Here are some words. "*2).rsplit(None, 1) (*1000)
0.23    0.26    87.1    ("Here are some words. "*2).split(None, 1) (*1000)
========== split 2000 newlines
1.60    1.75    91.7    "...text...".rsplit("\n") (*10)
1.56    1.65    94.3    "...text...".split("\n") (*10)
1.78    2.04    87.0    "...text...".splitlines() (*10)
========== split newlines
0.27    0.29    92.6    "this\nis\na\ntest\n".rsplit("\n") (*1000)
0.27    0.29    94.2    "this\nis\na\ntest\n".split("\n") (*1000)
0.26    0.29    90.4    "this\nis\na\ntest\n".splitlines() (*1000)
========== split on multicharacter separator (dna)
2.09    1.92    108.5   dna.rsplit("ACTAT") (*10)
2.56    2.64    96.9    dna.split("ACTAT") (*10)
========== split on multicharacter separator (small)
0.72    0.89    81.1    "this--is--a--test--of--the--emergency--broadcast--system".rsplit("--") (*1000)
0.75    0.65    114.5   "this--is--a--test--of--the--emergency--broadcast--system".split("--") (*1000)
========== split whitespace (huge)
1.50    1.73    86.3    human_text.rsplit() (*10)
2.25    2.68    83.8    human_text.split() (*10)
========== split whitespace (small)
0.42    0.51    82.0    ("Here are some words. "*2).rsplit() (*1000)
0.41    0.48    86.7    ("Here are some words. "*2).split() (*1000)
========== startswith multiple characters
0.16    0.18    88.9    "Andrew".startswith("Andrew") (*1000)
========== startswith multiple characters - not!
0.19    0.17    112.0   "Andrew".startswith("Anders") (*1000)
========== startswith single character
0.16    0.18    88.2    "Andrew".startswith("A") (*1000)
========== strip terminal newline
0.07    0.16    45.5    s="Hello!\n"; s[:-1] if s[-1]=="\n" else s (*1000)
0.05    0.07    79.2    "\nHello!".rstrip() (*1000)
0.05    0.07    76.5    "Hello!\n".rstrip() (*1000)
0.06    0.07    80.9    "\nHello!\n".strip() (*1000)
0.06    0.07    80.7    "\nHello!".strip() (*1000)
0.05    0.07    77.4    "Hello!\n".strip() (*1000)
========== strip terminal spaces and tabs
0.06    0.08    77.6    "\t   \tHello".rstrip() (*1000)
0.06    0.07    81.8    "Hello\t   \t".rstrip() (*1000)
0.04    0.05    77.5    "Hello\t   \t".strip() (*1000)
========== tab split
0.47    0.50    94.5    GFF3_example.rsplit("\t", 8) (*1000)
0.43    0.47    91.3    GFF3_example.rsplit("\t") (*1000)
0.38    0.43    88.7    GFF3_example.split("\t", 8) (*1000)
0.40    0.46    87.4    GFF3_example.split("\t") (*1000)
157.65  160.53  98.2    TOTAL
msg156910 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2012-03-27 11:29
Compare the stringbench totals (unicode column): 160.84 (unpatched) vs 160.53 (patched). I don't see any difference in the benchmark results. The small differences are just benchmark noise.
msg156930 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2012-03-27 14:43
-1. Using packed structures may violate all kinds of expectations in extension modules. I consider it important that the data block of a string is well-aligned.
msg157149 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2012-03-30 21:26
Looks like this should be closed rejected?
msg157150 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2012-03-30 21:36
> I consider it important that the data block of a string is well-aligned.

I suppose that it doesn't matter for latin1, but it can be a problem for UCS-2 and UCS-4. There are more drawbacks than advantages, so I agree to close this issue. And let's focus on enabling optimizations based on memory alignment like #14419 :-)
History
Date                 User            Action  Args
2022-04-11 14:57:28  admin           set     github: 58630
2012-03-30 21:36:51  vstinner        set     status: open -> closed; resolution: wont fix; messages: + msg157150
2012-03-30 21:26:09  r.david.murray  set     type: enhancement; messages: + msg157149; nosy: + r.david.murray
2012-03-30 16:49:31  jcea            set     nosy: + jcea
2012-03-27 14:43:17  loewis          set     messages: + msg156930
2012-03-27 11:29:43  vstinner        set     messages: + msg156910
2012-03-27 11:23:02  vstinner        set     messages: + msg156908
2012-03-27 11:14:17  vstinner        create