
Pack PyASCIIObject fields to reduce memory consumption of pure ASCII strings #58630

Closed
vstinner opened this issue Mar 27, 2012 · 6 comments
Labels
interpreter-core (Objects, Python, Grammar, and Parser dirs) type-feature A feature request or enhancement

Comments

@vstinner
Member

BPO 14422
Nosy @loewis, @jcea, @pitrou, @vstinner, @bitdancer, @serhiy-storchaka
Files
  • pack_pyasciiobject.patch
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = None
    closed_at = <Date 2012-03-30.21:36:51.293>
    created_at = <Date 2012-03-27.11:14:17.742>
    labels = ['interpreter-core', 'type-feature']
    title = 'Pack PyASCIIObject fields to reduce memory consumption of pure ASCII strings'
    updated_at = <Date 2012-03-30.21:36:51.291>
    user = 'https://github.com/vstinner'

    bugs.python.org fields:

    activity = <Date 2012-03-30.21:36:51.291>
    actor = 'vstinner'
    assignee = 'none'
    closed = True
    closed_date = <Date 2012-03-30.21:36:51.293>
    closer = 'vstinner'
    components = ['Interpreter Core']
    creation = <Date 2012-03-27.11:14:17.742>
    creator = 'vstinner'
    dependencies = []
    files = ['25037']
    hgrepos = []
    issue_num = 14422
    keywords = ['patch']
    message_count = 6.0
    messages = ['156905', '156908', '156910', '156930', '157149', '157150']
    nosy_count = 6.0
    nosy_names = ['loewis', 'jcea', 'pitrou', 'vstinner', 'r.david.murray', 'serhiy.storchaka']
    pr_nums = []
    priority = 'normal'
    resolution = 'wont fix'
    stage = None
    status = 'closed'
    superseder = None
    type = 'enhancement'
    url = 'https://bugs.python.org/issue14422'
    versions = ['Python 3.3']

    @vstinner
    Member Author

    PyASCIIObject.state can be reduced from 32 bits to 8, moved to the end of the structure (swapping wstr and state), and the structure can then be packed. As a result, the structure shrinks by 3 bytes (the type of state changes from int to char).

    I expect little or no performance overhead, because only the PyASCIIObject.state field is affected and that field is 8 bits wide.
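
The size effect of the proposal can be sketched with `ctypes`, which follows the platform's C layout rules. The two classes below are hypothetical stand-ins whose field lists are assumed from the description above; they are not the real `PyASCIIObject` definition, and `COMMON_FIELDS`, `AsciiHeader` and `PackedAsciiHeader` are names invented for this illustration.

```python
import ctypes

# Hypothetical stand-in for the string header discussed in the issue.
# The field list is an assumption for illustration, not the real
# PyASCIIObject definition from CPython's headers.
COMMON_FIELDS = [
    ("ob_refcnt", ctypes.c_ssize_t),
    ("ob_type", ctypes.c_void_p),
    ("length", ctypes.c_ssize_t),
    ("hash", ctypes.c_ssize_t),
]


class AsciiHeader(ctypes.Structure):
    # Current order: 32-bit state, then the wstr pointer, natural alignment.
    _fields_ = COMMON_FIELDS + [
        ("state", ctypes.c_uint),
        ("wstr", ctypes.c_void_p),
    ]


class PackedAsciiHeader(ctypes.Structure):
    # Proposed order: wstr first, 8-bit state last, no padding anywhere.
    _pack_ = 1
    # Python 3.14+ asks for an explicit layout whenever _pack_ is set;
    # older versions simply ignore this attribute.
    _layout_ = "ms"
    _fields_ = COMMON_FIELDS + [
        ("wstr", ctypes.c_void_p),
        ("state", ctypes.c_ubyte),
    ]


current_size = ctypes.sizeof(AsciiHeader)
packed_size = ctypes.sizeof(PackedAsciiHeader)
print(f"current: {current_size} bytes, packed: {packed_size} bytes, "
      f"saved: {current_size - packed_size} bytes")
```

With 4-byte pointers and a 4-byte `Py_ssize_t`, as on the 32-bit Linux system used in this issue, this model gives 24 and 21 bytes. On builds with wider pointers the absolute numbers differ, but the packed model always ends one byte past a pointer-sized boundary.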

    See also bpo-14419, which relies on the memory alignment of the ASCII string data to optimize the ASCII decoder. If I understand correctly, my patch would rule out that optimization.

    --

    Example on Linux 32 bits:

    $ cat x.c 
    #include <Python.h>
    #include <stdio.h>

    int main(void)
    {
        /* sizeof yields a size_t, so print it with %zu rather than %u */
        printf("sizeof(PyASCIIObject)=%zu bytes\n", sizeof(PyASCIIObject));
        printf("sizeof(PyCompactUnicodeObject)=%zu bytes\n", sizeof(PyCompactUnicodeObject));
        printf("sizeof(PyUnicodeObject)=%zu bytes\n", sizeof(PyUnicodeObject));
        return 0;
    }

    # unpatched
    $ gcc -I Include/ -I . x.c -o x && ./x
    sizeof(PyASCIIObject)=24 bytes
    sizeof(PyCompactUnicodeObject)=36 bytes
    sizeof(PyUnicodeObject)=40 bytes

    # pack the 3 structures
    $ gcc -I Include/ -I . x.c -o x && ./x
    sizeof(PyASCIIObject)=21 bytes
    sizeof(PyCompactUnicodeObject)=33 bytes
    sizeof(PyUnicodeObject)=37 bytes

    --

    We could also pack PyCompactUnicodeObject and PyUnicodeObject, but that would hurt performance because utf8_length, utf8, wstr_length and data would no longer be aligned.
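
The alignment concern can be checked on a model of the compact header. This is a sketch under assumptions: the field lists are taken from the names mentioned in this issue rather than from CPython's headers, the `data` member of `PyUnicodeObject` is left out, and `CompactHeader`, `PackedCompactHeader` and `misaligned_fields` are hypothetical names.

```python
import ctypes

SSIZE = ctypes.c_ssize_t
PTR = ctypes.c_void_p


class CompactHeader(ctypes.Structure):
    # Assumed current layout: natural alignment, 32-bit state before wstr.
    _fields_ = [
        ("ob_refcnt", SSIZE), ("ob_type", PTR), ("length", SSIZE),
        ("hash", SSIZE), ("state", ctypes.c_uint), ("wstr", PTR),
        ("utf8_length", SSIZE), ("utf8", PTR), ("wstr_length", SSIZE),
    ]


class PackedCompactHeader(ctypes.Structure):
    # Assumed packed layout: 8-bit state moved after wstr, no padding.
    _pack_ = 1
    # Python 3.14+ asks for an explicit layout whenever _pack_ is set;
    # older versions simply ignore this attribute.
    _layout_ = "ms"
    _fields_ = [
        ("ob_refcnt", SSIZE), ("ob_type", PTR), ("length", SSIZE),
        ("hash", SSIZE), ("wstr", PTR), ("state", ctypes.c_ubyte),
        ("utf8_length", SSIZE), ("utf8", PTR), ("wstr_length", SSIZE),
    ]


def misaligned_fields(struct_type):
    """Return the fields whose offset is not a multiple of their alignment."""
    return [
        name
        for name, field_type in struct_type._fields_
        if getattr(struct_type, name).offset % ctypes.alignment(field_type) != 0
    ]


print("current:", misaligned_fields(CompactHeader))
print("packed: ", misaligned_fields(PackedCompactHeader))
```

In this model every field after the one-byte state lands on an odd offset once padding is removed, which is the effect the paragraph above warns about.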

    @vstinner vstinner added the interpreter-core (Objects, Python, Grammar, and Parser dirs) label Mar 27, 2012
    @vstinner
    Member Author

    iobench and stringbench results on unpatched Python:

    $ ./python Tools/iobench/iobench.py -t
    Preparing files...
    Python 3.3.0a1+ (default:51016ff7f8c9, Mar 27 2012, 13:19:52) 
    [GCC 4.6.1]
    Unicode: PEP 393
    Linux-3.0.0-16-generic-pae-i686-with-debian-wheezy-sid
    Text unit = one character (utf8-decoded)

    ** Text input **

    [ 400KB ] read one unit at a time... 5.4 MB/s
    [ 400KB ] read 20 units at a time... 68 MB/s
    [ 400KB ] read one line at a time... 174 MB/s
    [ 400KB ] read 4096 units at a time... 289 MB/s

    [ 20KB ] read whole contents at once... 315 MB/s
    [ 400KB ] read whole contents at once... 332 MB/s
    [ 10MB ] read whole contents at once... 292 MB/s

    [ 400KB ] seek forward one unit at a time... 0.304 MB/s
    [ 400KB ] seek forward 1000 units at a time... 312 MB/s

    ** Text append **

    [ 20KB ] write one unit at a time... 3.05 MB/s
    [ 400KB ] write 20 units at a time... 43 MB/s
    [ 400KB ] write 4096 units at a time... 554 MB/s
    [ 10MB ] write 1e6 units at a time... 450 MB/s

    ** Text overwrite **

    [ 20KB ] modify one unit at a time... 1.18 MB/s
    [ 400KB ] modify 20 units at a time... 18.9 MB/s
    [ 400KB ] modify 4096 units at a time... 400 MB/s

    $ ./python stringbench/stringbench.py 
    stringbench v2.0
    3.3.0a1+ (default:51016ff7f8c9, Mar 27 2012, 13:19:52) 
    [GCC 4.6.1]
    2012-03-27 13:21:01.217823
    bytes   unicode
    (in ms) (in ms) %       comment
    ========== case conversion -- dense
    0.37    0.38    97.9    ("WHERE IN THE WORLD IS CARMEN SAN DEIGO?"*10).lower() (*1000)
    0.38    0.38    99.3    ("where in the world is carmen san deigo?"*10).upper() (*1000)
    ========== case conversion -- rare
    0.38    0.38    99.9    ("Where in the world is Carmen San Deigo?"*10).lower() (*1000)
    0.43    0.38    113.6   ("wHERE IN THE WORLD IS cARMEN sAN dEIGO?"*10).upper() (*1000)
    ========== concat 20 strings of words length 4 to 15
    1.76    1.69    104.2   s1+s2+s3+s4+...+s20 (*1000)
    ========== concat two strings
    0.08    0.07    107.7   "Andrew"+"Dalke" (*1000)
    ========== count AACT substrings in DNA example
    2.15    2.13    100.7   dna.count("AACT") (*10)
    ========== count newlines
    0.65    0.58    110.8   ...text.with.2000.newlines.count("\n") (*10)
    ========== early match, single character
    0.20    0.19    107.9   ("A"*1000).find("A") (*1000)
    0.36    0.05    745.8   "A" in "A"*1000 (*1000)
    0.18    0.19    96.4    ("A"*1000).index("A") (*1000)
    0.18    0.21    85.5    ("A"*1000).partition("A") (*1000)
    0.21    0.20    103.6   ("A"*1000).rfind("A") (*1000)
    0.21    0.30    69.8    ("A"*1000).rindex("A") (*1000)
    0.37    0.21    171.7   ("A"*1000).rpartition("A") (*1000)
    0.38    0.39    98.4    ("A"*1000).rsplit("A", 1) (*1000)
    0.37    0.37    100.7   ("A"*1000).split("A", 1) (*1000)
    ========== early match, two characters
    0.20    0.19    107.7   ("AB"*1000).find("AB") (*1000)
    0.36    0.05    702.1   "AB" in "AB"*1000 (*1000)
    0.18    0.19    96.9    ("AB"*1000).index("AB") (*1000)
    0.20    0.24    83.9    ("AB"*1000).partition("AB") (*1000)
    0.20    0.20    103.6   ("AB"*1000).rfind("AB") (*1000)
    0.20    0.19    102.9   ("AB"*1000).rindex("AB") (*1000)
    0.20    0.23    86.7    ("AB"*1000).rpartition("AB") (*1000)
    0.39    0.40    97.7    ("AB"*1000).rsplit("AB", 1) (*1000)
    0.40    0.42    94.4    ("AB"*1000).split("AB", 1) (*1000)
    ========== endswith multiple characters
    0.17    0.19    92.6    "Andrew".endswith("Andrew") (*1000)
    ========== endswith multiple characters - not!
    0.17    0.18    95.2    "Andrew".endswith("Anders") (*1000)
    ========== endswith single character
    0.17    0.18    92.3    "Andrew".endswith("w") (*1000)
    ========== formatting a string type with a dict
    N/A     0.91    0.0     "The %(k1)s is %(k2)s the %(k3)s."%{"k1":"x","k2":"y","k3":"z",} (*1000)
    ========== join empty string, with 1 character sep
    N/A     0.04    0.0     "A".join("") (*100)
    ========== join empty string, with 5 character sep
    N/A     0.04    0.0     "ABCDE".join("") (*100)
    ========== join list of 100 words, with 1 character sep
    1.37    1.71    80.0    "A".join(["Bob"]*100)) (*1000)
    ========== join list of 100 words, with 5 character sep
    1.50    1.86    80.8    "ABCDE".join(["Bob"]*100)) (*1000)
    ========== join list of 26 characters, with 1 character sep
    0.48    0.49    99.6    "A".join(list("ABC..Z")) (*1000)
    ========== join list of 26 characters, with 5 character sep
    0.49    0.54    91.3    "ABCDE".join(list("ABC..Z")) (*1000)
    ========== join string with 26 characters, with 1 character sep
    N/A     1.17    0.0     "A".join("ABC..Z") (*1000)
    ========== join string with 26 characters, with 5 character sep
    N/A     1.22    0.0     "ABCDE".join("ABC..Z") (*1000)
    ========== late match, 100 characters
    8.48    8.46    100.2   s="ABC"*33; ((s+"D")*500+s+"E").find(s+"E") (*100)
    4.19    3.50    119.9   s="ABC"*33; ((s+"D")*500+"E"+s).find("E"+s) (*100)
    5.30    5.11    103.7   s="ABC"*33; (s+"E") in ((s+"D")*300+s+"E") (*100)
    8.47    8.45    100.2   s="ABC"*33; ((s+"D")*500+s+"E").index(s+"E") (*100)
    8.68    8.68    100.0   s="ABC"*33; ((s+"D")*500+s+"E").partition(s+"E") (*100)
    6.36    6.37    99.8    s="ABC"*33; ("E"+s+("D"+s)*500).rfind("E"+s) (*100)
    2.33    2.27    102.4   s="ABC"*33; (s+"E"+("D"+s)*500).rfind(s+"E") (*100)
    6.58    6.58    100.1   s="ABC"*33; ("E"+s+("D"+s)*500).rindex("E"+s) (*100)
    7.34    6.56    111.9   s="ABC"*33; ("E"+s+("D"+s)*500).rpartition("E"+s) (*100)
    6.69    7.65    87.5    s="ABC"*33; ("E"+s+("D"+s)*500).rsplit("E"+s, 1) (*100)
    8.47    8.87    95.4    s="ABC"*33; ((s+"D")*500+s+"E").split(s+"E", 1) (*100)
    ========== late match, two characters
    1.30    1.26    102.7   ("AB"*300+"C").find("BC") (*1000)
    1.30    1.27    102.0   ("AB"*300+"CA").find("CA") (*1000)
    1.42    1.10    129.6   "BC" in ("AB"*300+"C") (*1000)
    1.20    1.20    100.2   ("AB"*300+"C").index("BC") (*1000)
    1.16    1.26    92.3    ("AB"*300+"C").partition("BC") (*1000)
    0.95    0.94    101.0   ("C"+"AB"*300).rfind("CA") (*1000)
    0.90    0.69    131.2   ("BC"+"AB"*300).rfind("BC") (*1000)
    0.94    0.94    100.1   ("C"+"AB"*300).rindex("CA") (*1000)
    1.02    0.94    108.6   ("C"+"AB"*300).rpartition("CA") (*1000)
    1.12    1.08    103.7   ("C"+"AB"*300).rsplit("CA", 1) (*1000)
    1.27    1.38    91.8    ("AB"*300+"C").split("BC", 1) (*1000)
    ========== no match, single character
    0.45    0.41    111.1   ("A"*1000).find("B") (*1000)
    0.59    0.29    205.4   "B" in "A"*1000 (*1000)
    0.30    0.31    97.4    ("A"*1000).partition("B") (*1000)
    0.49    0.48    102.5   ("A"*1000).rfind("B") (*1000)
    0.36    0.37    96.5    ("A"*1000).rpartition("B") (*1000)
    0.77    0.76    101.4   ("A"*1000).rsplit("B", 1) (*1000)
    0.83    0.81    101.6   ("A"*1000).split("B", 1) (*1000)
    ========== no match, two characters
    3.80    3.78    100.6   ("AB"*1000).find("BC") (*1000)
    4.08    3.68    111.0   ("AB"*1000).find("CA") (*1000)
    3.71    3.40    109.2   "BC" in "AB"*1000 (*1000)
    3.44    3.42    100.8   ("AB"*1000).partition("BC") (*1000)
    2.56    1.86    137.9   ("AB"*1000).rfind("BC") (*1000)
    2.69    2.69    100.2   ("AB"*1000).rfind("CA") (*1000)
    2.50    1.84    135.6   ("AB"*1000).rpartition("BC") (*1000)
    2.03    1.94    104.7   ("AB"*1000).rsplit("BC", 1) (*1000)
    3.27    3.56    91.8    ("AB"*1000).split("BC", 1) (*1000)
    ========== quick replace multiple character match
    0.08    0.08    99.7    ("A" + ("Z"*128*1024)).replace("AZZ", "BBZZ", 1) (*10)
    ========== quick replace single character match
    0.08    0.09    89.5    ("A" + ("Z"*128*1024)).replace("A", "BB", 1) (*10)
    ========== repeat 1 character 10 times
    0.06    0.07    87.0    "A"*10 (*1000)
    ========== repeat 1 character 1000 times
    0.13    0.15    89.3    "A"*1000 (*1000)
    ========== repeat 5 characters 10 times
    0.12    0.09    128.8   "ABCDE"*10 (*1000)
    ========== repeat 5 characters 1000 times
    0.33    0.34    94.8    "ABCDE"*1000 (*1000)
    ========== replace and expand multiple characters, big string
    1.83    2.11    86.4    "...text.with.2000.newlines...replace("\n", "\r\n") (*10)
    ========== replace multiple characters, dna
    3.21    3.23    99.5    dna.replace("ATC", "ATT") (*10)
    ========== replace single character
    0.18    0.25    70.9    "This is a test".replace(" ", "\t") (*1000)
    ========== replace single character, big string
    0.65    0.92    70.1    "...text.with.2000.lines...replace("\n", " ") (*10)
    ========== replace/remove multiple characters
    0.27    0.34    78.7    "When shall we three meet again?".replace("ee", "") (*1000)
    ========== split 1 whitespace
    0.12    0.14    82.7    ("Here are some words. "*2).partition(" ") (*1000)
    0.08    0.11    75.9    ("Here are some words. "*2).rpartition(" ") (*1000)
    0.23    0.26    87.4    ("Here are some words. "*2).rsplit(None, 1) (*1000)
    0.24    0.25    95.9    ("Here are some words. "*2).split(None, 1) (*1000)
    ========== split 2000 newlines
    1.59    1.75    90.8    "...text...".rsplit("\n") (*10)
    1.64    1.68    97.5    "...text...".split("\n") (*10)
    1.83    2.03    90.1    "...text...".splitlines() (*10)
    ========== split newlines
    0.26    0.29    88.8    "this\nis\na\ntest\n".rsplit("\n") (*1000)
    0.27    0.29    92.2    "this\nis\na\ntest\n".split("\n") (*1000)
    0.26    0.30    85.8    "this\nis\na\ntest\n".splitlines() (*1000)
    ========== split on multicharacter separator (dna)
    2.18    1.86    117.5   dna.rsplit("ACTAT") (*10)
    2.53    2.48    102.0   dna.split("ACTAT") (*10)
    ========== split on multicharacter separator (small)
    0.53    0.59    88.8    "this--is--a--test--of--the--emergency--broadcast--system".rsplit("--") (*1000)
    0.59    0.57    102.6   "this--is--a--test--of--the--emergency--broadcast--system".split("--") (*1000)
    ========== split whitespace (huge)
    1.50    1.73    86.9    human_text.rsplit() (*10)
    1.49    1.75    85.5    human_text.split() (*10)
    ========== split whitespace (small)
    0.43    0.50    87.0    ("Here are some words. "*2).rsplit() (*1000)
    0.40    0.50    79.4    ("Here are some words. "*2).split() (*1000)
    ========== startswith multiple characters
    0.17    0.18    92.0    "Andrew".startswith("Andrew") (*1000)
    ========== startswith multiple characters - not!
    0.17    0.17    99.5    "Andrew".startswith("Anders") (*1000)
    ========== startswith single character
    0.17    0.18    94.0    "Andrew".startswith("A") (*1000)
    ========== strip terminal newline
    0.07    0.15    46.9    s="Hello!\n"; s[:-1] if s[-1]=="\n" else s (*1000)
    0.06    0.07    78.1    "\nHello!".rstrip() (*1000)
    0.05    0.13    42.1    "Hello!\n".rstrip() (*1000)
    0.06    0.07    77.1    "\nHello!\n".strip() (*1000)
    0.06    0.07    77.6    "\nHello!".strip() (*1000)
    0.05    0.07    75.0    "Hello!\n".strip() (*1000)
    ========== strip terminal spaces and tabs
    0.06    0.08    74.2    "\t   \tHello".rstrip() (*1000)
    0.06    0.07    79.4    "Hello\t   \t".rstrip() (*1000)
    0.04    0.05    87.1    "Hello\t   \t".strip() (*1000)
    ========== tab split
    0.44    0.51    87.2    GFF3_example.rsplit("\t", 8) (*1000)
    0.42    0.47    89.9    GFF3_example.rsplit("\t") (*1000)
    0.39    0.44    88.7    GFF3_example.split("\t", 8) (*1000)
    0.41    0.47    86.1    GFF3_example.split("\t") (*1000)
    158.46  160.84  98.5    TOTAL

    iobench and stringbench results on patched Python (pack the 3 structures):

    $ ./python Tools/iobench/iobench.py -t
    Preparing files...
    Python 3.3.0a1+ (default:51016ff7f8c9+, Mar 27 2012, 13:11:28) 
    [GCC 4.6.1]
    Unicode: PEP 393
    Linux-3.0.0-16-generic-pae-i686-with-debian-wheezy-sid
    Text unit = one character (utf8-decoded)

    ** Text input **

    [ 400KB ] read one unit at a time... 5.4 MB/s
    [ 400KB ] read 20 units at a time... 68.5 MB/s
    [ 400KB ] read one line at a time... 163 MB/s
    [ 400KB ] read 4096 units at a time... 295 MB/s

    [ 20KB ] read whole contents at once... 322 MB/s
    [ 400KB ] read whole contents at once... 336 MB/s
    [ 10MB ] read whole contents at once... 289 MB/s

    [ 400KB ] seek forward one unit at a time... 0.32 MB/s
    [ 400KB ] seek forward 1000 units at a time... 325 MB/s

    ** Text append **

    [ 20KB ] write one unit at a time... 2.99 MB/s
    [ 400KB ] write 20 units at a time... 44 MB/s
    [ 400KB ] write 4096 units at a time... 556 MB/s
    [ 10MB ] write 1e6 units at a time... 456 MB/s

    ** Text overwrite **

    [ 20KB ] modify one unit at a time... 1.16 MB/s
    [ 400KB ] modify 20 units at a time... 19.5 MB/s
    [ 400KB ] modify 4096 units at a time... 401 MB/s

    $ ./python stringbench/stringbench.py 
    stringbench v2.0
    3.3.0a1+ (default:51016ff7f8c9+, Mar 27 2012, 13:11:28) 
    [GCC 4.6.1]
    2012-03-27 13:17:42.363789
    bytes   unicode
    (in ms) (in ms) %       comment
    ========== case conversion -- dense
    0.37    0.38    98.6    ("WHERE IN THE WORLD IS CARMEN SAN DEIGO?"*10).lower() (*1000)
    0.37    0.38    98.4    ("where in the world is carmen san deigo?"*10).upper() (*1000)
    ========== case conversion -- rare
    0.37    0.38    98.6    ("Where in the world is Carmen San Deigo?"*10).lower() (*1000)
    0.37    0.38    98.4    ("wHERE IN THE WORLD IS cARMEN sAN dEIGO?"*10).upper() (*1000)
    ========== concat 20 strings of words length 4 to 15
    1.86    1.85    100.9   s1+s2+s3+s4+...+s20 (*1000)
    ========== concat two strings
    0.08    0.07    108.0   "Andrew"+"Dalke" (*1000)
    ========== count AACT substrings in DNA example
    2.16    2.12    101.8   dna.count("AACT") (*10)
    ========== count newlines
    0.59    0.58    101.3   ...text.with.2000.newlines.count("\n") (*10)
    ========== early match, single character
    0.18    0.17    103.7   ("A"*1000).find("A") (*1000)
    0.36    0.05    775.5   "A" in "A"*1000 (*1000)
    0.17    0.17    102.0   ("A"*1000).index("A") (*1000)
    0.17    0.20    84.7    ("A"*1000).partition("A") (*1000)
    0.19    0.19    102.2   ("A"*1000).rfind("A") (*1000)
    0.19    0.38    50.7    ("A"*1000).rindex("A") (*1000)
    0.18    0.20    90.0    ("A"*1000).rpartition("A") (*1000)
    0.59    0.36    166.9   ("A"*1000).rsplit("A", 1) (*1000)
    0.34    0.36    93.5    ("A"*1000).split("A", 1) (*1000)
    ========== early match, two characters
    0.18    0.19    95.8    ("AB"*1000).find("AB") (*1000)
    0.44    0.05    891.0   "AB" in "AB"*1000 (*1000)
    0.23    0.31    73.4    ("AB"*1000).index("AB") (*1000)
    0.22    0.31    70.7    ("AB"*1000).partition("AB") (*1000)
    0.19    0.19    101.2   ("AB"*1000).rfind("AB") (*1000)
    0.19    0.19    102.0   ("AB"*1000).rindex("AB") (*1000)
    0.17    0.21    78.7    ("AB"*1000).rpartition("AB") (*1000)
    0.35    0.38    93.0    ("AB"*1000).rsplit("AB", 1) (*1000)
    0.39    0.42    93.0    ("AB"*1000).split("AB", 1) (*1000)
    ========== endswith multiple characters
    0.16    0.17    93.0    "Andrew".endswith("Andrew") (*1000)
    ========== endswith multiple characters - not!
    0.16    0.16    101.4   "Andrew".endswith("Anders") (*1000)
    ========== endswith single character
    0.16    0.17    93.7    "Andrew".endswith("w") (*1000)
    ========== formatting a string type with a dict
    N/A     0.86    0.0     "The %(k1)s is %(k2)s the %(k3)s."%{"k1":"x","k2":"y","k3":"z",} (*1000)
    ========== join empty string, with 1 character sep
    N/A     0.04    0.0     "A".join("") (*100)
    ========== join empty string, with 5 character sep
    N/A     0.04    0.0     "ABCDE".join("") (*100)
    ========== join list of 100 words, with 1 character sep
    1.42    1.74    81.3    "A".join(["Bob"]*100)) (*1000)
    ========== join list of 100 words, with 5 character sep
    1.62    1.95    83.3    "ABCDE".join(["Bob"]*100)) (*1000)
    ========== join list of 26 characters, with 1 character sep
    0.51    0.57    89.7    "A".join(list("ABC..Z")) (*1000)
    ========== join list of 26 characters, with 5 character sep
    0.58    0.53    108.1   "ABCDE".join(list("ABC..Z")) (*1000)
    ========== join string with 26 characters, with 1 character sep
    N/A     1.30    0.0     "A".join("ABC..Z") (*1000)
    ========== join string with 26 characters, with 5 character sep
    N/A     1.22    0.0     "ABCDE".join("ABC..Z") (*1000)
    ========== late match, 100 characters
    8.50    8.45    100.6   s="ABC"*33; ((s+"D")*500+s+"E").find(s+"E") (*100)
    3.70    3.46    107.0   s="ABC"*33; ((s+"D")*500+"E"+s).find("E"+s) (*100)
    5.11    5.08    100.6   s="ABC"*33; (s+"E") in ((s+"D")*300+s+"E") (*100)
    8.62    8.47    101.7   s="ABC"*33; ((s+"D")*500+s+"E").index(s+"E") (*100)
    8.80    8.67    101.5   s="ABC"*33; ((s+"D")*500+s+"E").partition(s+"E") (*100)
    6.39    6.46    99.0    s="ABC"*33; ("E"+s+("D"+s)*500).rfind("E"+s) (*100)
    2.31    2.18    105.9   s="ABC"*33; (s+"E"+("D"+s)*500).rfind(s+"E") (*100)
    6.41    6.35    100.9   s="ABC"*33; ("E"+s+("D"+s)*500).rindex("E"+s) (*100)
    7.41    6.56    112.9   s="ABC"*33; ("E"+s+("D"+s)*500).rpartition("E"+s) (*100)
    6.59    6.59    100.0   s="ABC"*33; ("E"+s+("D"+s)*500).rsplit("E"+s, 1) (*100)
    8.00    8.69    92.0    s="ABC"*33; ((s+"D")*500+s+"E").split(s+"E", 1) (*100)
    ========== late match, two characters
    1.20    1.21    99.6    ("AB"*300+"C").find("BC") (*1000)
    1.29    1.25    103.1   ("AB"*300+"CA").find("CA") (*1000)
    1.41    1.07    130.9   "BC" in ("AB"*300+"C") (*1000)
    1.20    1.21    99.3    ("AB"*300+"C").index("BC") (*1000)
    1.17    1.20    97.5    ("AB"*300+"C").partition("BC") (*1000)
    0.95    0.93    101.4   ("C"+"AB"*300).rfind("CA") (*1000)
    0.90    0.69    129.3   ("BC"+"AB"*300).rfind("BC") (*1000)
    0.95    0.94    101.2   ("C"+"AB"*300).rindex("CA") (*1000)
    1.01    0.94    106.8   ("C"+"AB"*300).rpartition("CA") (*1000)
    1.11    1.10    101.5   ("C"+"AB"*300).rsplit("CA", 1) (*1000)
    1.28    1.37    93.6    ("AB"*300+"C").split("BC", 1) (*1000)
    ========== no match, single character
    0.41    0.40    101.2   ("A"*1000).find("B") (*1000)
    0.59    0.29    203.8   "B" in "A"*1000 (*1000)
    0.29    0.30    95.7    ("A"*1000).partition("B") (*1000)
    0.49    0.48    101.4   ("A"*1000).rfind("B") (*1000)
    0.37    0.38    97.3    ("A"*1000).rpartition("B") (*1000)
    0.76    0.75    101.1   ("A"*1000).rsplit("B", 1) (*1000)
    0.76    0.75    100.9   ("A"*1000).split("B", 1) (*1000)
    ========== no match, two characters
    3.53    3.52    100.2   ("AB"*1000).find("BC") (*1000)
    3.92    3.67    106.9   ("AB"*1000).find("CA") (*1000)
    3.71    3.39    109.6   "BC" in "AB"*1000 (*1000)
    3.40    3.42    99.5    ("AB"*1000).partition("BC") (*1000)
    2.55    1.90    134.2   ("AB"*1000).rfind("BC") (*1000)
    2.69    2.68    100.1   ("AB"*1000).rfind("CA") (*1000)
    2.43    1.81    133.9   ("AB"*1000).rpartition("BC") (*1000)
    2.02    1.92    104.8   ("AB"*1000).rsplit("BC", 1) (*1000)
    3.27    3.54    92.4    ("AB"*1000).split("BC", 1) (*1000)
    ========== quick replace multiple character match
    0.09    0.08    107.7   ("A" + ("Z"*128*1024)).replace("AZZ", "BBZZ", 1) (*10)
    ========== quick replace single character match
    0.09    0.08    108.7   ("A" + ("Z"*128*1024)).replace("A", "BB", 1) (*10)
    ========== repeat 1 character 10 times
    0.06    0.07    87.5    "A"*10 (*1000)
    ========== repeat 1 character 1000 times
    0.16    0.12    135.0   "A"*1000 (*1000)
    ========== repeat 5 characters 10 times
    0.11    0.10    104.9   "ABCDE"*10 (*1000)
    ========== repeat 5 characters 1000 times
    0.35    0.37    93.7    "ABCDE"*1000 (*1000)
    ========== replace and expand multiple characters, big string
    1.78    2.04    87.3    "...text.with.2000.newlines...replace("\n", "\r\n") (*10)
    ========== replace multiple characters, dna
    3.20    3.25    98.5    dna.replace("ATC", "ATT") (*10)
    ========== replace single character
    0.17    0.24    73.0    "This is a test".replace(" ", "\t") (*1000)
    ========== replace single character, big string
    0.62    0.88    69.7    "...text.with.2000.lines...replace("\n", " ") (*10)
    ========== replace/remove multiple characters
    0.25    0.32    78.3    "When shall we three meet again?".replace("ee", "") (*1000)
    ========== split 1 whitespace
    0.10    0.13    78.9    ("Here are some words. "*2).partition(" ") (*1000)
    0.08    0.11    76.8    ("Here are some words. "*2).rpartition(" ") (*1000)
    0.23    0.25    91.7    ("Here are some words. "*2).rsplit(None, 1) (*1000)
    0.23    0.26    87.1    ("Here are some words. "*2).split(None, 1) (*1000)
    ========== split 2000 newlines
    1.60    1.75    91.7    "...text...".rsplit("\n") (*10)
    1.56    1.65    94.3    "...text...".split("\n") (*10)
    1.78    2.04    87.0    "...text...".splitlines() (*10)
    ========== split newlines
    0.27    0.29    92.6    "this\nis\na\ntest\n".rsplit("\n") (*1000)
    0.27    0.29    94.2    "this\nis\na\ntest\n".split("\n") (*1000)
    0.26    0.29    90.4    "this\nis\na\ntest\n".splitlines() (*1000)
    ========== split on multicharacter separator (dna)
    2.09    1.92    108.5   dna.rsplit("ACTAT") (*10)
    2.56    2.64    96.9    dna.split("ACTAT") (*10)
    ========== split on multicharacter separator (small)
    0.72    0.89    81.1    "this--is--a--test--of--the--emergency--broadcast--system".rsplit("--") (*1000)
    0.75    0.65    114.5   "this--is--a--test--of--the--emergency--broadcast--system".split("--") (*1000)
    ========== split whitespace (huge)
    1.50    1.73    86.3    human_text.rsplit() (*10)
    2.25    2.68    83.8    human_text.split() (*10)
    ========== split whitespace (small)
    0.42    0.51    82.0    ("Here are some words. "*2).rsplit() (*1000)
    0.41    0.48    86.7    ("Here are some words. "*2).split() (*1000)
    ========== startswith multiple characters
    0.16    0.18    88.9    "Andrew".startswith("Andrew") (*1000)
    ========== startswith multiple characters - not!
    0.19    0.17    112.0   "Andrew".startswith("Anders") (*1000)
    ========== startswith single character
    0.16    0.18    88.2    "Andrew".startswith("A") (*1000)
    ========== strip terminal newline
    0.07    0.16    45.5    s="Hello!\n"; s[:-1] if s[-1]=="\n" else s (*1000)
    0.05    0.07    79.2    "\nHello!".rstrip() (*1000)
    0.05    0.07    76.5    "Hello!\n".rstrip() (*1000)
    0.06    0.07    80.9    "\nHello!\n".strip() (*1000)
    0.06    0.07    80.7    "\nHello!".strip() (*1000)
    0.05    0.07    77.4    "Hello!\n".strip() (*1000)
    ========== strip terminal spaces and tabs
    0.06    0.08    77.6    "\t   \tHello".rstrip() (*1000)
    0.06    0.07    81.8    "Hello\t   \t".rstrip() (*1000)
    0.04    0.05    77.5    "Hello\t   \t".strip() (*1000)
    ========== tab split
    0.47    0.50    94.5    GFF3_example.rsplit("\t", 8) (*1000)
    0.43    0.47    91.3    GFF3_example.rsplit("\t") (*1000)
    0.38    0.43    88.7    GFF3_example.split("\t", 8) (*1000)
    0.40    0.46    87.4    GFF3_example.split("\t") (*1000)
    157.65  160.53  98.2    TOTAL

    @vstinner
    Member Author

    Compare the stringbench totals (unicode column): 160.84 ms (unpatched) vs 160.53 ms (patched). I don't see any difference in the benchmark results. The small differences are just benchmark noise.
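
As a rough check of that reading, the relative change of the total can be compared with the swings of a few individual rows. The numbers are copied from the unicode column of the two stringbench runs above; the choice of rows and the `relative_change` helper are only for illustration, and a single pair of runs cannot by itself prove that a difference is noise.

```python
# Unicode-column timings in ms, as (unpatched, patched).
TOTAL = (160.84, 160.53)
ROWS = {
    '("A"*1000).rindex("A")': (0.30, 0.38),
    '"A"*1000': (0.15, 0.12),
    "human_text.split()": (1.75, 2.68),
}


def relative_change(before, after):
    """Signed change of `after` relative to `before`."""
    return (after - before) / before


total_change = relative_change(*TOTAL)
row_changes = {name: relative_change(*pair) for name, pair in ROWS.items()}

print(f"TOTAL: {total_change:+.2%}")
for name, change in row_changes.items():
    print(f"{name}: {change:+.1%}")
```

The total moves by about 0.2%, while single rows move by tens of percent in both directions, so repeated runs would be needed before attributing any one row to the patch.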

    @loewis
    Mannequin

    loewis mannequin commented Mar 27, 2012

    -1. Using packed structures may violate all kinds of expectations in extension modules. I consider it important that the data block of a string is well-aligned.

    @bitdancer
    Member

    Looks like this should be closed as rejected?

    @bitdancer bitdancer added the type-feature A feature request or enhancement label Mar 30, 2012
    @vstinner
    Member Author

    > I consider it important that the data block of a string is well-aligned.

    I suppose that it doesn't matter for latin1, but it can be a problem for UCS-2 and UCS-4. There are more drawbacks than advantages, so I agree to close this issue. And let's focus on enabling optimizations based on memory alignment like bpo-14419 :-)
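
The latin1 versus UCS-2/UCS-4 distinction follows from simple arithmetic on the sizes reported earlier in this issue, since the character data of a compact string starts right after its header (PEP 393). The sketch below assumes that the object itself starts at an address that is a multiple of 4; `HEADER_SIZES`, `CHAR_WIDTHS` and `data_is_aligned` are hypothetical names.

```python
# sizeof(PyCompactUnicodeObject) on 32-bit Linux, as reported in this issue.
HEADER_SIZES = {"unpatched": 36, "packed": 33}

# Bytes per character for each storage kind.
CHAR_WIDTHS = {"latin1": 1, "UCS-2": 2, "UCS-4": 4}


def data_is_aligned(header_size, char_width):
    """True if data placed right after the header is aligned for char_width.

    Assumes the object's base address is itself a multiple of char_width.
    """
    return header_size % char_width == 0


for variant, size in HEADER_SIZES.items():
    for kind, width in CHAR_WIDTHS.items():
        status = "aligned" if data_is_aligned(size, width) else "misaligned"
        print(f"{variant:9} {kind:6} data at offset {size}: {status}")
```

With the 33-byte packed header, one-byte latin1 characters are unaffected, while 2-byte and 4-byte characters start on an odd offset.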

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022