classification
Title: String performance issue using single quotes
Type: performance Stage: resolved
Components: Interpreter Core Versions: Python 3.5
process
Status: closed Resolution: not a bug
Dependencies: Superseder:
Assigned To: Nosy List: SilentGhost, eric.smith, poostenr, steven.daprano, ubehera, vstinner
Priority: normal Keywords:

Created on 2016-01-14 22:47 by poostenr, last changed 2016-01-15 17:01 by SilentGhost. This issue is now closed.

Messages (13)
msg258243 - (view) Author: poostenr (poostenr) Date: 2016-01-14 22:47
There appears to be a significant performance difference between the following two statements, and I am unable to explain the impact.

s = "{0},".format(columnvalue)   # fast
s = "'{0}',".format(columnvalue) # ~30x slower

So far I have had no luck finding alternative statements that avoid the slowdown, such as:
s = "\'{0}\',".format(columnvalue)
s = "'" + "%s" % (columnvalue) + "'" + ","
s = "{0}{1}{2},".format("'", columnvalue, "'")
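[Editor's note: the comparison above can be measured directly with the timeit module, as suggested later in this thread. A minimal sketch; `columnvalue` here is an arbitrary stand-in value, not the reporter's actual data:]

```python
# Hedged microbenchmark of the two statements from the report.
# The setup value is a made-up stand-in for a real CSV column value.
import timeit

setup = "columnvalue = 'some,csv,value'"
without_quotes = timeit.timeit('s = "{0},".format(columnvalue)',
                               setup=setup, number=1_000_000)
with_quotes = timeit.timeit('s = "\'{0}\',".format(columnvalue)',
                            setup=setup, number=1_000_000)
print(f"without quotes: {without_quotes:.3f}s")
print(f"with quotes:    {with_quotes:.3f}s")
```

On a typical machine both runs land within a small constant factor of each other, nowhere near 30x.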
msg258247 - (view) Author: Eric V. Smith (eric.smith) * (Python committer) Date: 2016-01-14 22:58
Please show us how you're measuring the performance. Also, please show the output of "python -V".
msg258252 - (view) Author: poostenr (poostenr) Date: 2016-01-14 23:55
My initial observations with my Python script using:

s = "{0},".format(columnvalue)   # fast
Processed ~360MB of data from 2:16PM - 2:51PM (35 minutes, ~10MB/min)
One particular file 6.5MB took ~1 minute.

When I changed this line of code to:
s = "'{0}',".format(columnvalue) # ~30x slower (1 min. vs 30 min.)
Same particular file of 6.5MB took ~30 minutes (228KB/min).

My Python environment is:
C:\Data>python -V
Python 3.5.1 :: Anaconda 2.4.1 (32-bit)

I did some more testing with a very simplified piece of code. It is not conclusive, but there is a significant jump when I introduce the single quotes: from 0m2.410s to 0m3.875s.

$ python -V
Python 3.5.1

// s='test'
// for x in range(10000000):
//     y = "{0}".format(s)

$ time python test.py

real	0m2.410s
user	0m2.356s
sys	0m0.048s

// s='test'
// for x in range(10000000):
//     y = "'%s'" % (s)

$ time python test2.py 

real	0m2.510s
user	0m2.453s
sys	0m0.051s

// s='test'
// for x in range(10000000):
//     y = "'{0}'".format(s)

$ time python test3.py 

real	0m3.875s
user	0m3.819s
sys	0m0.048s
msg258256 - (view) Author: Eric V. Smith (eric.smith) * (Python committer) Date: 2016-01-15 00:29
I see a small difference, but I think it's related to the fact that in the first example you're concatenating 2 strings (',' and the result of {0}), and in the 2nd example it's 3 strings ("'", {0}, "',"):

$ echo '",{0}".format(x)'
",{0}".format(x)
$ python -m timeit -s 'x=4' '",{0}".format(x)'
1000000 loops, best of 3: 0.182 usec per loop

$ echo '"'\''{0}'\'',".format(x)'
"'{0}',".format(x)
$ python -m timeit -s 'x=4' '"'\''{0}'\'',".format(x)'
1000000 loops, best of 3: 0.205 usec per loop

If you see a factor of 30x difference in your code, I suspect it's not related to str.format(), but some other processing in your code.
msg258260 - (view) Author: Steven D'Aprano (steven.daprano) * (Python committer) Date: 2016-01-15 01:40
I cannot replicate that performance difference under Linux. There's a small difference (about 0.1 second per million iterations, or a tenth of a microsecond) on my computer, but I don't think that's meaningful:

    py> from timeit import Timer
    py> t1 = Timer('"{0},".format(999)')
    py> t2 = Timer('"\'{0}\',".format(999)')
    py> min(t1.repeat())
    4.671058462932706
    py> min(t2.repeat())
    4.774653751403093


Please re-run your tests using the timeit module, and see if you can still see a consistent difference with and without single quotes. Perhaps this is specific to Windows?

Otherwise, I can only suggest that the timing difference is unrelated to the difference in quotes in the script. Are you sure that this is absolutely the only change between the run that took one minute and the run that took 30 minutes? No other changes to the script, running on the same data file, on the same disk, no difference in what other processes are running? (E.g. if one run is fighting for disk access with, say, a memory-hungry anti-virus scan, that would easily explain the difference.)
msg258264 - (view) Author: poostenr (poostenr) Date: 2016-01-15 03:29
Eric, Steven, thank you for your feedback so far.

I am using Windows7, Intel i7.
That one particular file of 6.5MB took ~1 minute on my machine.
When I ran that same test on Linux with Python 3.5.1, it took about 3 seconds. I was amazed to see a 20x difference.

Steven suggested that this phenomenon might be specific to Windows, and I agree that is what it looks like. Or is Python doing something in the background?

The Python script is straightforward: a loop reads a line from a CSV file, splits the column values, and saves each value as '<value>' to another file, basically building an SQL statement.
I had no issues until I added the encapsulating single quotes around the value.

Because I can reproduce this performance difference at will by alternating which line I comment out, I believe it cannot be the HDD, antivirus, or anything else outside the Python script interfering.

I repeated the simplified test, that I ran earlier on a Linux system, but this time on my Windows system.
I don't see anything spectacular.
I am just puzzled that using one statement or the other causes such a huge performance impact somehow.

I will try some more tests and copy your examples.

import time
loopcount = 10000000

# Using string value
s="test 1"
v="test 1"
start_ms = int(round(time.time() * 1000))
for x in range (loopcount):
    y = "{0}".format(v)
end_ms = int(round(time.time() * 1000))
print("Start {0}: {1}".format(s,start_ms))
print("End   {0}: {1}".format(s,end_ms))
print("Diff  {0}: {1} ms\n\n".format(s,end_ms-start_ms))
# Start test 1: 1452828394523
# End   test 1: 1452828397957
# Diff  test 1: 3434 ms


s="test 2"
v="test 2"
start_ms = int(round(time.time() * 1000))
for x in range (loopcount):
    y = "'%s'" % (v)
end_ms = int(round(time.time() * 1000))
print("Start {0}: {1}".format(s,start_ms))
print("End   {0}: {1}".format(s,end_ms))
print("Diff  {0}: {1} ms\n\n".format(s,end_ms-start_ms))
# Start test 2: 1452828397957
# End   test 2: 1452828401233
# Diff  test 2: 3276 ms


s="test 3"
v="test 3"
start_ms = int(round(time.time() * 1000))
for x in range (loopcount):
    y = "'{0}'".format(v)
end_ms = int(round(time.time() * 1000))
print("Start {0}: {1}".format(s,start_ms))
print("End   {0}: {1}".format(s,end_ms))
print("Diff  {0}: {1} ms\n\n".format(s,end_ms-start_ms))
# Start test 3: 1452828401233
# End   test 3: 1452828406320
# Diff  test 3: 5087 ms

# Using integer value
s="test 4"
v=123456
start_ms = int(round(time.time() * 1000))
for x in range (loopcount):
    y = "{0}".format(v)
end_ms = int(round(time.time() * 1000))
print("Start {0}: {1}".format(s,start_ms))
print("End   {0}: {1}".format(s,end_ms))
print("Diff  {0}: {1} ms\n\n".format(s,end_ms-start_ms))
# Start test 4: 1452828406320
# End   test 4: 1452828411378
# Diff  test 4: 5058 ms


s="test 5"
v=123456
start_ms = int(round(time.time() * 1000))
for x in range (loopcount):
    y = "'%s'" % (v)
end_ms = int(round(time.time() * 1000))
print("Start {0}: {1}".format(s,start_ms))
print("End   {0}: {1}".format(s,end_ms))
print("Diff  {0}: {1} ms\n\n".format(s,end_ms-start_ms))
# Start test 5: 1452828411378
# End   test 5: 1452828415264
# Diff  test 5: 3886 ms

s="test 6"
v=123456
start_ms = int(round(time.time() * 1000))
for x in range (loopcount):
    y = "'{0}'".format(v)
end_ms = int(round(time.time() * 1000))
print("Start {0}: {1}".format(s,start_ms))
print("End   {0}: {1}".format(s,end_ms))
print("Diff  {0}: {1} ms\n\n".format(s,end_ms-start_ms))
# Start test 6: 1452828415264
# End   test 6: 1452828421292
# Diff  test 6: 6028 ms
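[Editor's note: the six hand-timed tests above can be condensed with timeit.repeat, which measures each variant the same way; the minimum of the repeats is the usual figure to compare. The loop count here is an arbitrary choice:]

```python
# Compact sketch of the six tests above, one statement per variant,
# run once with a string value and once with an integer value.
import timeit

variants = {
    "format, no quotes": '"{0}".format(v)',
    "percent, quotes":   '"\'%s\'" % (v,)',
    "format, quotes":    '"\'{0}\'".format(v)',
}
for v in ("test", 123456):
    for name, stmt in variants.items():
        best = min(timeit.repeat(stmt, setup=f"v = {v!r}",
                                 number=100_000, repeat=3))
        print(f"{v!r:>10} {name:<20} {best:.4f}s")
```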
msg258265 - (view) Author: poostenr (poostenr) Date: 2016-01-15 03:43
Eric,

I just tried your examples. 
My loop count is 100x higher, but the per-loop results differ by about a factor of 10.

Test1:
My results:
C:\Data>python -m timeit -s 'x=4' '",{0}".format(x)'
100000000 loops, best of 3: 0.0116 usec per loop 

Eric's results:
$ python -m timeit -s 'x=4' '",{0}".format(x)'
1000000 loops, best of 3: 0.182 usec per loop

Test2:
My results:
C:\Data>python -m timeit -s 'x=4' '"\'{0}\',".format(x)'
100000000 loops, best of 3: 0.0122 usec per loop 

Eric's results:
$ python -m timeit -s 'x=4' '"'\''{0}'\'',".format(x)'
1000000 loops, best of 3: 0.205 usec per loop
msg258275 - (view) Author: poostenr (poostenr) Date: 2016-01-15 07:56
Eric, Steven,

During further testing I was not able to find any real evidence that the statement I was focused on had a real performance issue.

As I did more testing I noticed that appending data to the file slowed down. The file initially grew in ~30-50KB increments; around 500KB it had slowed to ~3-5KB/s, and by around 1MB the file was growing at only ~1KB/s. I found this odd, and because Steven had mentioned other processes, I started looking at some other statements.

After quite a lot of trial and error, I was able to use single quotes and increase my performance to acceptable levels.
Example 3 below is how I resolved it.

Can you explain to me why there was a performance penalty in example 2?
Python must be doing something under the hood that I am overlooking.

Did conv.escape_string() change something about columnvalue, so that adding a single quote before and after it introduced some odd behavior when writing to the file? I am not an expert on Python, and I remember reading something about dynamic typing.

Example 1: Fast performance, variable s is not encapsulated with single quotes
6.5MB parsed in ~1 minute.
for key in listkeys:
    keyvalue = self.recordstats[key]
    fieldtype   = keyvalue[0]
    columnvalue = record[key]
    columnvalue = conv.escape_string(columnvalue)
    if (count > 1):
        s = "{0},".format(columnvalue)  # No single quotes
    else:
        s = "{0},".format(columnvalue)  # No single quotes
    count -= 1
    # append s to file

Example 2: Slow performance, pre- and post-fixed variable s with single quotes
6.5MB parsed in 35 minutes.
for key in listkeys:
    keyvalue = self.recordstats[key]
    fieldtype   = keyvalue[0]
    columnvalue = record[key]
    columnvalue = conv.escape_string(columnvalue)
    if (count > 1):
        s = "'{0}',".format(columnvalue) # Added single quotes
    else:
        s = "'{0}',".format(columnvalue) # Added single quotes
    count -= 1
    # append s to file

Example 3: Fast performance, variable columnvalue is pre- and post-fixed with single quotes
6.5MB parsed in ~45 seconds.
for key in listkeys:
    keyvalue = self.recordstats[key]
    fieldtype   = keyvalue[0]
    columnvalue = record[key]
    columnvalue = conv.escape_string("'" + columnvalue + "'") # Moved single quotes to this statement.
    if (count > 1):
        s = "{0},".format(columnvalue)
    else:
        s = "{0},".format(columnvalue)
    count -= 1
    # append s to file
msg258276 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-01-15 08:15
I implemented an overkill optimization in the _PyUnicodeWriter API used by str.format() and str%args. If the result is the input, the string is not copied; a reference to the input is returned.

>>> x="hello"
>>> ("%s" % x) is x
True
>>> ("{}".format(x)) is x
True

If the format string adds something before/after, the string is duplicated:

>>> ("x=%s" % x) is x
False
>>> ("x={}".format(x)) is x
False

The optimization is implemented in _PyUnicodeWriter_WriteStr():

https://hg.python.org/cpython/file/32a4e7b337c9/Objects/unicodeobject.c#l13604
msg258277 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-01-15 08:24
> If you see a factor of 30x difference in your code, I suspect it's not related to str.format(), but some other processing in your code.

The performance of instructions like ("x=%s" % x) or ("x={}".format(x)) depend on the length of the string. Maybe poostenr has some very long strings?
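[Editor's note: Victor's point about string length can be checked with a small sketch; the lengths below are arbitrary choices:]

```python
# Time "'{0}',".format(s) for increasingly long input strings.
# The cost should grow roughly with the length of s, since the
# result must be a new, larger string.
import timeit

for n in (10, 100, 1_000, 10_000):
    t = min(timeit.repeat('"\'{0}\',".format(s)',
                          setup=f"s = 'x' * {n}",
                          number=10_000, repeat=3))
    print(f"len={n:>6}: {t:.4f}s")
```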

--

Closed issues related to _PyUnicodeWriter API:

* issue #14716: Use unicode_writer API for str.format()
* issue #14687: Optimize str%tuple for the PEP 393
* issue #14744: Use _PyUnicodeWriter API in str.format() internals
* issue #16147: Rewrite PyUnicode_FromFormatV() to use the _PyUnicodeWriter API
* issue #17694: Enhance _PyUnicodeWriter API to control minimum buffer length without overallocation
* issue #19513: Use PyUnicodeWriter instead of PyAccu in repr(tuple) and repr(list)
* issue #19646: Use PyUnicodeWriter in repr(dict)

I'm listing these issues because they contain microbenchmarks script if you would like to play with them.
msg258287 - (view) Author: Steven D'Aprano (steven.daprano) * (Python committer) Date: 2016-01-15 10:43
On Fri, Jan 15, 2016 at 07:56:39AM +0000, poostenr wrote:
> As I did more testing I noticed that appending data to the file slowed 
> down. The file grew initially with ~30-50KB increments and around 
> 500KB it had slowed down to ~3-5KB/s, until around 1MB the file grew 
> at ~1KB/s.

How are you appending to the file? In the code snippets below you merely 
say "append s to file", which is not Python code and could hide a 
multitude of sins. You're asking us to diagnose slow performance in code 
we can't see.

Or perhaps your disk is just badly fragmented, and as the file gets 
bigger, performance gets worse.

> Did conv.escape_string() change something about columnvalue

I don't know. What's conv.escape_string? Where does it come from, and 
what does it do? Again, you're asking us to diagnose code we can't see.

Perhaps you should run the profiler and see where your code is spending 
most of its time.
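[Editor's note: a minimal cProfile sketch along the lines Steven suggests; `process_file` is a hypothetical stand-in for the reporter's parsing function, not code from this issue:]

```python
# Profile a workload and print the top entries by cumulative time.
import cProfile
import io
import pstats

def process_file():
    # placeholder workload standing in for the real CSV-to-SQL loop
    return ["'{0}',".format(v) for v in range(100_000)]

pr = cProfile.Profile()
pr.enable()
process_file()
pr.disable()

buf = io.StringIO()
pstats.Stats(pr, stream=buf).sort_stats("cumulative").print_stats(5)
print(buf.getvalue())
```

The `print_stats` listing shows which functions dominate the runtime, which would distinguish a str.format() cost from a file-I/O cost.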
msg258308 - (view) Author: poostenr (poostenr) Date: 2016-01-15 16:56
Thank you for your feedback Victor and Steven.

I just copied my scripts and 360MB of CSV files over to Linux.
The entire process finished in 4 minutes exactly, using the original python scripts.
So there is something different between my environments.
If it were a fragmentation issue, I would expect consistently slow performance on the Windows system. But I can influence the performance just by alternating between the two original statements:
s = "{0},".format(columnvalue)   # fast
s = "'{0}',".format(columnvalue) # ~30x slower

I apologize for not being able to provide the entire code.
There is too much code to post at this time.

I am opening a file like this:
# logger = open(filename, mode, buffering, encoding)
logger = open('output.sql', 'a', 1, 'iso-8859-1')

I write to file:
logger.write(text+'\n')

I'm using a library to escape the string before saving to file.
import pymysql.converters as conv
<...>
for key in listkeys:
    keyvalue = self.recordstats[key]
    fieldtype   = keyvalue[0]
    columnvalue = record[key]
    columnvalue = conv.escape_string(columnvalue)
    if (count > 1):
        s = "{0},".format(columnvalue)  # No single quotes
    else:
        s = "{0},".format(columnvalue)  # No single quotes
    count -= 1
    logger.write(s+'\n')
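[Editor's note: a self-contained sketch of the loop above with hypothetical stand-in data. Since pymysql may not be installed, conv.escape_string is replaced here by a trivial escaper, and io.StringIO stands in for the output file; both substitutions are assumptions for illustration only:]

```python
import io

def escape_string(s):
    # trivial stand-in for pymysql.converters.escape_string
    return s.replace("'", "\\'")

# made-up record data standing in for one parsed CSV row
record = {"a": "alpha", "b": "be'ta"}
listkeys = list(record)

logger = io.StringIO()  # stand-in for open('output.sql', 'a', ...)
for key in listkeys:
    columnvalue = escape_string(record[key])
    s = "'{0}',".format(columnvalue)  # quoted column value plus comma
    logger.write(s + '\n')
```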

I appreciate the feedback and ideas so far.
Trying the profiler is on my list to see if it provides more insight.
I am not using Anaconda3 on Linux. Perhaps that has an impact somehow?

I never suspected that inserting the two single quotes could cause such a performance problem. I noticed it when I parsed ~40GB of data and it took almost a week to complete instead of the 6-7 hours I expected.
Just the other day I decided to remove the single quotes, because it was the only change left. I had dismissed that change for the past two weeks because I assumed it couldn't possibly be causing the performance problem.

Today, I wasn't expecting such a big difference between running my script on Linux or Windows.

If I discover anything else, I will post an update.
When I get the chance I can remove redundant code and post the source.
msg258310 - (view) Author: SilentGhost (SilentGhost) * (Python triager) Date: 2016-01-15 17:01
poostenr, this is demonstrably not a problem with CPython, which is what this bug tracker is about. There are a few options available on the internet if you need help with your code: mailing lists and IRC are among them.
History
Date User Action Args
2016-01-15 17:01:30  SilentGhost      set  status: open -> closed

nosy: + SilentGhost
messages: + msg258310

resolution: not a bug
stage: resolved
2016-01-15 16:56:02  poostenr         set  messages: + msg258308
2016-01-15 10:43:29  steven.daprano   set  messages: + msg258287
2016-01-15 08:24:01  vstinner         set  messages: + msg258277
2016-01-15 08:15:26  vstinner         set  nosy: + vstinner
messages: + msg258276
2016-01-15 07:56:39  poostenr         set  messages: + msg258275
2016-01-15 03:43:00  poostenr         set  messages: + msg258265
2016-01-15 03:29:53  poostenr         set  messages: + msg258264
2016-01-15 01:40:52  steven.daprano   set  nosy: + steven.daprano
messages: + msg258260
2016-01-15 00:29:00  eric.smith       set  messages: + msg258256
2016-01-14 23:55:40  poostenr         set  messages: + msg258252
2016-01-14 22:58:23  eric.smith       set  nosy: + eric.smith, - paul.moore, tim.golden, zach.ware, steve.dower
messages: + msg258247
components: + Interpreter Core, - Windows
2016-01-14 22:56:23  ubehera          set  nosy: + ubehera
2016-01-14 22:47:44  poostenr         create