classification
Title: String performance issue using single quotes
Type: performance Stage: resolved
Components: Interpreter Core Versions: Python 3.5
process
Status: closed Resolution: not a bug
Dependencies: Superseder:
Assigned To: Nosy List: SilentGhost, eric.smith, poostenr, steven.daprano, ubehera, vstinner
Priority: normal Keywords:

Created on 2016-01-14 22:47 by poostenr, last changed 2016-01-15 17:01 by SilentGhost. This issue is now closed.

Messages (13)
msg258243 - (view) Author: poostenr (poostenr) Date: 2016-01-14 22:47
There appears to be a significant performance difference between the following two statements, and I am unable to explain the impact.

s = "{0},".format(columnvalue)   # fast
s = "'{0}',".format(columnvalue) # ~30x slower

So far I have had no luck finding alternative statements that avoid the slowdown, such as:
s = "\'{0}\',".format(columnvalue)
s = "'" + "%s" % (columnvalue) + "'" + ","
s = "{0}{1}{2},".format("'", columnvalue, "'")
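[Editor's note: the comparison above can be measured directly with the timeit module, as suggested later in this thread. A minimal sketch; `columnvalue` here is an arbitrary stand-in value, not the reporter's actual data:]

```python
# Hedged microbenchmark of the two statements from the report.
# The setup value is a made-up stand-in for a real CSV column value.
import timeit

setup = "columnvalue = 'some,csv,value'"
without_quotes = timeit.timeit('s = "{0},".format(columnvalue)',
                               setup=setup, number=1_000_000)
with_quotes = timeit.timeit('s = "\'{0}\',".format(columnvalue)',
                            setup=setup, number=1_000_000)
print(f"without quotes: {without_quotes:.3f}s")
print(f"with quotes:    {with_quotes:.3f}s")
```

On a typical machine both runs land within a small constant factor of each other, nowhere near 30x.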
msg258247 - (view) Author: Eric V. Smith (eric.smith) * (Python committer) Date: 2016-01-14 22:58
Please show us how you're measuring the performance. Also, please show the output of "python -V".
msg258252 - (view) Author: poostenr (poostenr) Date: 2016-01-14 23:55
My initial observations with my Python script using:

s = "{0},".format(columnvalue)   # fast
Processed ~360MB of data from 2:16PM - 2:51PM (35 minutes, ~10MB/min)
One particular file 6.5MB took ~1 minute.

When I changed this line of code to:
s = "'{0}',".format(columnvalue) # ~30x slower (1 min. vs 30 min.)
Same particular file of 6.5MB took ~30 minutes (228KB/min).

My Python environment is:
C:\Data>python -V
Python 3.5.1 :: Anaconda 2.4.1 (32-bit)

I did some more testing with a very simplified piece of code. It is not conclusive, but there is a significant jump when I introduce the single quotes: from 0m2.410s to 0m3.875s.

$ python -V
Python 3.5.1

// s='test'
// for x in range(10000000):
//     y = "{0}".format(s)

$ time python test.py

real	0m2.410s
user	0m2.356s
sys	0m0.048s

// s='test'
// for x in range(10000000):
//     y = "'%s'" % (s)

$ time python test2.py 

real	0m2.510s
user	0m2.453s
sys	0m0.051s

// s='test'
// for x in range(10000000):
//     y = "'{0}'".format(s)

$ time python test3.py 

real	0m3.875s
user	0m3.819s
sys	0m0.048s
msg258256 - (view) Author: Eric V. Smith (eric.smith) * (Python committer) Date: 2016-01-15 00:29
I see a small difference, but I think it's related to the fact that in the first example you're concatenating 2 strings (',' and the result of {0}), and in the 2nd example it's 3 strings ("'", {0}, "',"):

$ echo '",{0}".format(x)'
",{0}".format(x)
$ python -m timeit -s 'x=4' '",{0}".format(x)'
1000000 loops, best of 3: 0.182 usec per loop

$ echo '"'\''{0}'\'',".format(x)'
"'{0}',".format(x)
$ python -m timeit -s 'x=4' '"'\''{0}'\'',".format(x)'
1000000 loops, best of 3: 0.205 usec per loop

If you see a factor of 30x difference in your code, I suspect it's not related to str.format(), but some other processing in your code.
msg258260 - (view) Author: Steven D'Aprano (steven.daprano) * (Python committer) Date: 2016-01-15 01:40
I cannot replicate that performance difference under Linux. There's a small difference (about 0.1 second per million iterations, or a tenth of a microsecond) on my computer, but I don't think that's meaningful:

    py> from timeit import Timer
    py> t1 = Timer('"{0},".format(999)')
    py> t2 = Timer('"\'{0}\',".format(999)')
    py> min(t1.repeat())
    4.671058462932706
    py> min(t2.repeat())
    4.774653751403093


Please re-run your tests using the timeit module, and see if you can still see a consistent difference with and without single quotes. Perhaps this is specific to Windows?

Otherwise, I can only suggest that the timing difference is unrelated to the difference in quotes in the script. Are you sure that this is absolutely the only change between the run that took one minute and the run that took 30 minutes? No other changes to the script, running on the same data file, on the same disk, no difference in what other processes are running? (E.g. if one run is fighting for disk access with, say, a memory-hungry anti-virus scan, that would easily explain the difference.)
msg258264 - (view) Author: poostenr (poostenr) Date: 2016-01-15 03:29
Eric, Steven, thank you for your feedback so far.

I am using Windows7, Intel i7.
That one particular file of 6.5MB took ~1 minute on my machine.
When I ran that same test on Linux with Python 3.5.1, it took about 3 seconds. I was amazed to see a 20x difference.

Steven suggested that this phenomenon might be specific to Windows, and I agree that is what it looks like. Or is Python doing something in the background?

The Python script is straightforward: a loop reads a line from a CSV file, splits the column values, and saves each value as '<value>' to another file, basically building an SQL statement.
I had no issues until I added the encapsulating single quotes around the value.

Because I can reproduce this performance difference at will by alternating which line I comment out, I believe it cannot be the HDD, antivirus, or anything else outside the Python script interfering.

I repeated the simplified test, that I ran earlier on a Linux system, but this time on my Windows system.
I don't see anything spectacular.
I am just puzzled that using one statement or the other causes such a huge performance impact somehow.

I will try some more tests and copy your examples.

import time
loopcount = 10000000

# Using string value
s="test 1"
v="test 1"
start_ms = int(round(time.time() * 1000))
for x in range (loopcount):
    y = "{0}".format(v)
end_ms = int(round(time.time() * 1000))
print("Start {0}: {1}".format(s,start_ms))
print("End   {0}: {1}".format(s,end_ms))
print("Diff  {0}: {1} ms\n\n".format(s,end_ms-start_ms))
# Start test 1: 1452828394523
# End   test 1: 1452828397957
# Diff  test 1: 3434 ms


s="test 2"
v="test 2"
start_ms = int(round(time.time() * 1000))
for x in range (loopcount):
    y = "'%s'" % (v)
end_ms = int(round(time.time() * 1000))
print("Start {0}: {1}".format(s,start_ms))
print("End   {0}: {1}".format(s,end_ms))
print("Diff  {0}: {1} ms\n\n".format(s,end_ms-start_ms))
# Start test 2: 1452828397957
# End   test 2: 1452828401233
# Diff  test 2: 3276 ms


s="test 3"
v="test 3"
start_ms = int(round(time.time() * 1000))
for x in range (loopcount):
    y = "'{0}'".format(v)
end_ms = int(round(time.time() * 1000))
print("Start {0}: {1}".format(s,start_ms))
print("End   {0}: {1}".format(s,end_ms))
print("Diff  {0}: {1} ms\n\n".format(s,end_ms-start_ms))
# Start test 3: 1452828401233
# End   test 3: 1452828406320
# Diff  test 3: 5087 ms

# Using integer value
s="test 4"
v=123456
start_ms = int(round(time.time() * 1000))
for x in range (loopcount):
    y = "{0}".format(v)
end_ms = int(round(time.time() * 1000))
print("Start {0}: {1}".format(s,start_ms))
print("End   {0}: {1}".format(s,end_ms))
print("Diff  {0}: {1} ms\n\n".format(s,end_ms-start_ms))
# Start test 4: 1452828406320
# End   test 4: 1452828411378
# Diff  test 4: 5058 ms


s="test 5"
v=123456
start_ms = int(round(time.time() * 1000))
for x in range (loopcount):
    y = "'%s'" % (v)
end_ms = int(round(time.time() * 1000))
print("Start {0}: {1}".format(s,start_ms))
print("End   {0}: {1}".format(s,end_ms))
print("Diff  {0}: {1} ms\n\n".format(s,end_ms-start_ms))
# Start test 5: 1452828411378
# End   test 5: 1452828415264
# Diff  test 5: 3886 ms

s="test 6"
v=123456
start_ms = int(round(time.time() * 1000))
for x in range (loopcount):
    y = "'{0}'".format(v)
end_ms = int(round(time.time() * 1000))
print("Start {0}: {1}".format(s,start_ms))
print("End   {0}: {1}".format(s,end_ms))
print("Diff  {0}: {1} ms\n\n".format(s,end_ms-start_ms))
# Start test 6: 1452828415264
# End   test 6: 1452828421292
# Diff  test 6: 6028 ms
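[Editor's note: the six hand-timed tests above can be condensed with timeit.repeat, which measures each variant the same way; the minimum of the repeats is the usual figure to compare. The loop count here is an arbitrary choice:]

```python
# Compact sketch of the six tests above, one statement per variant,
# run once with a string value and once with an integer value.
import timeit

variants = {
    "format, no quotes": '"{0}".format(v)',
    "percent, quotes":   '"\'%s\'" % (v,)',
    "format, quotes":    '"\'{0}\'".format(v)',
}
for v in ("test", 123456):
    for name, stmt in variants.items():
        best = min(timeit.repeat(stmt, setup=f"v = {v!r}",
                                 number=100_000, repeat=3))
        print(f"{v!r:>10} {name:<20} {best:.4f}s")
```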
msg258265 - (view) Author: poostenr (poostenr) Date: 2016-01-15 03:43
Eric,

I just tried your examples. 
My loop count is 100x higher, but the per-loop results differ by about a factor of 10.

Test1:
My results:
C:\Data>python -m timeit -s 'x=4' '",{0}".format(x)'
100000000 loops, best of 3: 0.0116 usec per loop 

Eric's results:
$ python -m timeit -s 'x=4' '",{0}".format(x)'
1000000 loops, best of 3: 0.182 usec per loop

Test2:
My results:
C:\Data>python -m timeit -s 'x=4' '"\'{0}\',".format(x)'
100000000 loops, best of 3: 0.0122 usec per loop 

Eric's results:
$ python -m timeit -s 'x=4' '"'\''{0}'\'',".format(x)'
1000000 loops, best of 3: 0.205 usec per loop
msg258275 - (view) Author: poostenr (poostenr) Date: 2016-01-15 07:56
Eric, Steven,

During further testing I was not able to find any real evidence that the statement I was focused on had a real performance issue.

As I did more testing I noticed that appending data to the file slowed down. The file initially grew in ~30-50KB increments; around 500KB it had slowed to ~3-5KB/s, and by around 1MB the file was growing at only ~1KB/s. I found this odd, and because Steven had mentioned other processes, I started looking at some other statements.

After quite a lot of trial and error, I was able to use single quotes and increase my performance to acceptable levels.
Example 3 below is how I resolved it.

Can you explain to me why there was a performance penalty in example 2?
Python must be doing something under the hood that I am overlooking.

Did conv.escape_string() change something about columnvalue, so that adding a single quote before and after it introduced some odd behavior when writing to the file? I am not an expert on Python, and I remember reading something about dynamic typing.

Example 1: Fast performance, variable s is not encapsulated with single quotes
6.5MB parsed in ~1 minute.
for key in listkeys:
    keyvalue = self.recordstats[key]
    fieldtype   = keyvalue[0]
    columnvalue = record[key]
    columnvalue = conv.escape_string(columnvalue)
    if (count > 1):
        s = "{0},".format(columnvalue)  # No single quotes
    else:
        s = "{0},".format(columnvalue)  # No single quotes
    count -= 1
    # append s to file

Example 2: Slow performance, pre- and post-fixed variable s with single quotes
6.5MB parsed in 35 minutes.
for key in listkeys:
    keyvalue = self.recordstats[key]
    fieldtype   = keyvalue[0]
    columnvalue = record[key]
    columnvalue = conv.escape_string(columnvalue)
    if (count > 1):
        s = "'{0}',".format(columnvalue) # Added single quotes
    else:
        s = "'{0}',".format(columnvalue) # Added single quotes
    count -= 1
    # append s to file

Example 3: Fast performance, variable columnvalue is pre- and post-fixed with single quotes
6.5MB parsed in ~45 seconds.
for key in listkeys:
    keyvalue = self.recordstats[key]
    fieldtype   = keyvalue[0]
    columnvalue = record[key]
    columnvalue = conv.escape_string("'" + columnvalue + "'") # Moved single quotes to this statement.
    if (count > 1):
        s = "{0},".format(columnvalue)
    else:
        s = "{0},".format(columnvalue)
    count -= 1
    # append s to file
msg258276 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-01-15 08:15
I implemented an overkill optimization in the _PyUnicodeWriter API used by str.format() and str%args. If the result is the input, the string is not copied; a reference to the input is returned.

>>> x="hello"
>>> ("%s" % x) is x
True
>>> ("{}".format(x)) is x
True

If the format string adds something before/after, the string is duplicated:

>>> ("x=%s" % x) is x
False
>>> ("x={}".format(x)) is x
False

The optimization is implemented in _PyUnicodeWriter_WriteStr():

https://hg.python.org/cpython/file/32a4e7b337c9/Objects/unicodeobject.c#l13604
msg258277 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-01-15 08:24
> If you see a factor of 30x difference in your code, I suspect it's not related to str.format(), but some other processing in your code.

The performance of instructions like ("x=%s" % x) or ("x={}".format(x)) depend on the length of the string. Maybe poostenr has some very long strings?
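[Editor's note: Victor's point about string length can be checked with a small sketch; the lengths below are arbitrary choices:]

```python
# Time "'{0}',".format(s) for increasingly long input strings.
# The cost should grow roughly with the length of s, since the
# result must be a new, larger string.
import timeit

for n in (10, 100, 1_000, 10_000):
    t = min(timeit.repeat('"\'{0}\',".format(s)',
                          setup=f"s = 'x' * {n}",
                          number=10_000, repeat=3))
    print(f"len={n:>6}: {t:.4f}s")
```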

--

Closed issues related to _PyUnicodeWriter API:

* issue #14716: Use unicode_writer API for str.format()
* issue #14687: Optimize str%tuple for the PEP 393
* issue #14744: Use _PyUnicodeWriter API in str.format() internals
* issue #16147: Rewrite PyUnicode_FromFormatV() to use the _PyUnicodeWriter API
* issue #17694: Enhance _PyUnicodeWriter API to control minimum buffer length without overallocation
* issue #19513: Use PyUnicodeWriter instead of PyAccu in repr(tuple) and repr(list)
* issue #19646: Use PyUnicodeWriter in repr(dict)

I'm listing these issues because they contain microbenchmarks script if you would like to play with them.
msg258287 - (view) Author: Steven D'Aprano (steven.daprano) * (Python committer) Date: 2016-01-15 10:43
On Fri, Jan 15, 2016 at 07:56:39AM +0000, poostenr wrote:
> As I did more testing I noticed that appending data to the file slowed 
> down. The file grew initially with ~30-50KB increments and around 
> 500KB it had slowed down to ~3-5KB/s, until around 1MB the file grew 
> at ~1KB/s.

How are you appending to the file? In the code snippets below you merely 
say "append s to file", which is not Python code and could hide a 
multitude of sins. You're asking us to diagnose slow performance in code 
we can't see.

Or perhaps your disk is just badly fragmented, and as the file gets 
bigger, performance gets worse.

> Did conv.escape_string() change something about columnvalue

I don't know. What's conv.escape_string? Where does it come from, and 
what does it do? Again, you're asking us to diagnose code we can't see.

Perhaps you should run the profiler and see where your code is spending 
most of its time.
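[Editor's note: a minimal cProfile sketch along the lines Steven suggests; `process_file` is a hypothetical stand-in for the reporter's parsing function, not code from this issue:]

```python
# Profile a workload and print the top entries by cumulative time.
import cProfile
import io
import pstats

def process_file():
    # placeholder workload standing in for the real CSV-to-SQL loop
    return ["'{0}',".format(v) for v in range(100_000)]

pr = cProfile.Profile()
pr.enable()
process_file()
pr.disable()

buf = io.StringIO()
pstats.Stats(pr, stream=buf).sort_stats("cumulative").print_stats(5)
print(buf.getvalue())
```

The `print_stats` listing shows which functions dominate the runtime, which would distinguish a str.format() cost from a file-I/O cost.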
msg258308 - (view) Author: poostenr (poostenr) Date: 2016-01-15 16:56
Thank you for your feedback Victor and Steven.

I just copied my scripts and 360MB of CSV files over to Linux.
The entire process finished in 4 minutes exactly, using the original python scripts.
So there is something different between my environments.
If it were a fragmentation issue, I would expect consistently slow performance on the Windows system. But I can influence the performance just by alternating between the two original statements:
s = "{0},".format(columnvalue)   # fast
s = "'{0}',".format(columnvalue) # ~30x slower

I apologize for not being able to provide the entire code.
There is too much code to post at this time.

I am opening a file like this:
# logger = open(filename, mode, buffering, encoding)
logger = open('output.sql', 'a', 1, 'iso-8859-1')

I write to file:
logger.write(text+'\n')

I'm using a library to escape the string before saving to file.
import pymysql.converters as conv
<...>
for key in listkeys:
    keyvalue = self.recordstats[key]
    fieldtype   = keyvalue[0]
    columnvalue = record[key]
    columnvalue = conv.escape_string(columnvalue)
    if (count > 1):
        s = "{0},".format(columnvalue)  # No single quotes
    else:
        s = "{0},".format(columnvalue)  # No single quotes
    count -= 1
    logger.write(s+'\n')
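[Editor's note: a self-contained sketch of the loop above with hypothetical stand-in data. Since pymysql may not be installed, conv.escape_string is replaced here by a trivial escaper, and io.StringIO stands in for the output file; both substitutions are assumptions for illustration only:]

```python
import io

def escape_string(s):
    # trivial stand-in for pymysql.converters.escape_string
    return s.replace("'", "\\'")

# made-up record data standing in for one parsed CSV row
record = {"a": "alpha", "b": "be'ta"}
listkeys = list(record)

logger = io.StringIO()  # stand-in for open('output.sql', 'a', ...)
for key in listkeys:
    columnvalue = escape_string(record[key])
    s = "'{0}',".format(columnvalue)  # quoted column value plus comma
    logger.write(s + '\n')
```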

I appreciate the feedback and ideas so far.
Trying the profiler is on my list to see if it provides more insight.
I am not using Anaconda3 on Linux. Perhaps that has an impact somehow?

I never suspected that inserting the two single quotes could cause such a performance problem. I noticed it when I parsed ~40GB of data and it took almost a week to complete instead of the 6-7 hours I expected.
Just the other day I decided to remove the single quotes, because it was the only change left. I had dismissed that change for the past two weeks because I assumed it couldn't possibly be causing the performance problem.

Today, I wasn't expecting such a big difference between running my script on Linux or Windows.

If I discover anything else, I will post an update.
When I get the chance I can remove redundant code and post the source.
msg258310 - (view) Author: SilentGhost (SilentGhost) * (Python triager) Date: 2016-01-15 17:01
poostenr, this is demonstrably not a problem with CPython, which is what this bug tracker is about. There are a few options available on the internet if you need help with your code: mailing lists and IRC are among them.
History
Date User Action Args
2016-01-15 17:01:30  SilentGhost      set  status: open -> closed

nosy: + SilentGhost
messages: + msg258310

resolution: not a bug
stage: resolved
2016-01-15 16:56:02  poostenr         set  messages: + msg258308
2016-01-15 10:43:29  steven.daprano   set  messages: + msg258287
2016-01-15 08:24:01  vstinner         set  messages: + msg258277
2016-01-15 08:15:26  vstinner         set  nosy: + vstinner
messages: + msg258276
2016-01-15 07:56:39  poostenr         set  messages: + msg258275
2016-01-15 03:43:00  poostenr         set  messages: + msg258265
2016-01-15 03:29:53  poostenr         set  messages: + msg258264
2016-01-15 01:40:52  steven.daprano   set  nosy: + steven.daprano
messages: + msg258260
2016-01-15 00:29:00  eric.smith       set  messages: + msg258256
2016-01-14 23:55:40  poostenr         set  messages: + msg258252
2016-01-14 22:58:23  eric.smith       set  nosy: + eric.smith, - paul.moore, tim.golden, zach.ware, steve.dower
messages: + msg258247
components: + Interpreter Core, - Windows
2016-01-14 22:56:23  ubehera          set  nosy: + ubehera
2016-01-14 22:47:44  poostenr         create