Issue 17183: Small enhancements to Lib/_markupbase.py

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/61385

classification

Title:	Small enhancements to Lib/_markupbase.py
Type:	performance	Stage:	resolved
Components:	Library (Lib)	Versions:	Python 3.4

process

Status:	closed	Resolution:	rejected
Dependencies:		Superseder:
Assigned To:	ezio.melotti	Nosy List:	ezio.melotti, guido.reina, iritkatriel, serhiy.storchaka, terry.reedy
Priority:	normal	Keywords:	patch

Created on 2013-02-11 16:11 by guido.reina, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Files
File name	Uploaded	Description	Edit
test.tgz	guido.reina, 2013-02-16 10:36	Tests.
issue17183.diff	ezio.melotti, 2013-02-22 06:35		review

Messages (9)
msg181903 - (view)	Author: Guido Reina (guido.reina)	Date: 2013-02-11 16:11
In the file: Lib/_markupbase.py, function: "_parse_doctype_element" there is: if '>' in rawdata[j:]: return rawdata.find(">", j) + 1 rawdata[j:] is being scanned twice. It would be better to do: pos = rawdata.find(">", j) if pos != -1: return pos + 1 Same thing in the function: "_parse_doctype_attlist": if ")" in rawdata[j:]: j = rawdata.find(")", j) + 1 else: return -1 It would be better to do: pos = rawdata.find(")", j) if pos != -1: j = pos + 1 else: return -1
msg181904 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2013-02-11 16:18
> if '>' in rawdata[j:]: > return rawdata.find(">", j) + 1 See issue17170 for this idiom.
msg182179 - (view)	Author: Ezio Melotti (ezio.melotti) *	Date: 2013-02-15 21:42
We should add some benchmarks to see if there is any difference between the two forms.
msg182183 - (view)	Author: Terry J. Reedy (terry.reedy) *	Date: 2013-02-15 22:10
'Enhancement' issues are for visible behavior additions (or occasionally, changes). This is intended to be an invisible small speedup, hence it is a 'performance' issue, and gets a different title. As explained in #17170, the change will not be a speedup if the substring being looked for is usually not there. The reason is the .find lookup and function call versus the direct syntax. Even if it is faster, I strongly doubt it would be hardly noticeable in the context of this function, which itself is a small piece of parsing an entire document, and it is against our policy to make such micro-optimizations in working code. The complete block in question Lib/_markupbase.py, 254:7 is rawdata = self.rawdata if '>' in rawdata[j:]: return rawdata.find(">", j) + 1 return -1 [Ugh. Localizing rawdata negates some of whatever advantage is gained from the double scan.] If I were to rewrite it, I would replace it with try: return self.rawdata.index(">", j) + 1 except ValueError: return -1 as better style, and a better example for readers, regardless of micro speed differences. But style-only changes in working code is also against our policy. So I would be closing this if Ezio had not grabbed it ;-).
msg182185 - (view)	Author: Ezio Melotti (ezio.melotti) *	Date: 2013-02-15 22:28
I would still do a benchmark, for these reasons: 1) IIRC rawdata might be the whole document (or at least everything that has not been parsed yet); 2) the '>' is very likely to be found; This situation is fairly different from the one presented in #17170, where the strings are shorts and the character is not present in the majority of the strings. Profiling and improving html.parser (and hence _markupbase) was already on my todo list (even if admittedly not anywhere near the top :), so writing a benchmark for it might be useful for further enhancements too. (Note: HTMLParser is already fairly fast, parsing ~1.3MB/s according to http://www.crummy.com/2012/02/06/0, but I've never done anything to make it even faster, so there might still be room for improvements.)
msg182215 - (view)	Author: Guido Reina (guido.reina)	Date: 2013-02-16 10:36
I am attaching a .tgz file with the tests I have performed. The .tgz file contains also a README.txt file with more detailed information. I have done the following test: The script loads the HTML file 'search.html' in 'rawdata' and searches '>' in a loop from the position 'i', being i in: range(len(rawdata)). with the three variants: "in" + "find" (test1.py), "find" (test2.py), "index" (test3.py). Result: Script First run Second run Third run --------------------------------------------------------- test1.py 2.33 2.32 2.33 test2.py 0.75 0.74 0.76 test3.py 0.75 0.74 0.74 I don't know if the test is representative and whether it helps. If you think that the test could be improved/changed, just let me know, I will be happy to help.
msg182626 - (view)	Author: Ezio Melotti (ezio.melotti) *	Date: 2013-02-22 03:46
I did some macro-benchmarks and the proposed changes don't seem to affect the result (most likely because they are in _parse_doctype_element and _parse_doctype_attlist which should be called only once per document). I did some profiling, and this is the result: 4437196 function calls (4436748 primitive calls) in 36.582 seconds Ordered by: internal time ncalls tottime percall cumtime percall filename:lineno(function) 92931 7.400 0.000 17.082 0.000 parser.py:320(parse_starttag) 202 6.363 0.032 36.281 0.180 parser.py:171(goahead) 673285 5.302 0.000 5.302 0.000 {method 'match' of '_sre.SRE_Pattern' objects} 369418 3.272 0.000 4.554 0.000 _markupbase.py:48(updatepos) 83243 2.698 0.000 4.639 0.000 parser.py:421(parse_endtag) 308882 2.006 0.000 2.006 0.000 {method 'group' of '_sre.SRE_Match' objects} 270074 1.521 0.000 1.521 0.000 {method 'search' of '_sre.SRE_Pattern' objects} 92931 1.150 0.000 2.643 0.000 parser.py:378(check_for_whole_start_tag) 291079 1.028 0.000 1.028 0.000 {method 'count' of 'str' objects} 295892 0.883 0.000 0.883 0.000 {method 'startswith' of 'str' objects} 387439 0.733 0.000 0.733 0.000 {method 'lower' of 'str' objects} 403922 0.642 0.000 0.642 0.000 {method 'end' of '_sre.SRE_Match' objects} 124512 0.406 0.000 1.156 0.000 parser.py:504(unescape) 186775 0.326 0.000 0.326 0.000 {method 'start' of '_sre.SRE_Match' objects} 96213 0.255 0.000 0.255 0.000 {method 'endswith' of 'str' objects} 59522 0.253 0.000 0.253 0.000 {method 'rindex' of 'str' objects} 83226 0.215 0.000 0.215 0.000 parser.py:164(clear_cdata_mode) 6428 0.194 0.000 0.337 0.000 parser.py:507(replaceEntities) 106487 0.183 0.000 0.183 0.000 parser.py:484(handle_data) Excluding string and regex methods, the 3 slowest methods are parse_starttag, goahead, and updatepos. The attached patch adds a couple of simple optimizations to the first two -- I couldn't think a way to optimize updatepos. The resulting speedup is however fairly small, so I'm not sure it's worth applying the patch. I might try doing other benchmarks in future (should I add them somewhere in Tools?).
msg381297 - (view)	Author: Irit Katriel (iritkatriel) *	Date: 2020-11-17 22:20
I there anything left to do here or should we close it?
msg381347 - (view)	Author: Guido Reina (guido.reina)	Date: 2020-11-18 14:30
It can be closed.

History
Date	User	Action	Args
2022-04-11 14:57:41	admin	set	github: 61385
2020-11-18 14:48:55	iritkatriel	set	status: open -> closed resolution: rejected stage: needs patch -> resolved
2020-11-18 14:30:26	guido.reina	set	status: pending -> open messages: + msg381347
2020-11-17 22:20:46	iritkatriel	set	status: open -> pending nosy: + iritkatriel messages: + msg381297
2013-02-22 06:35:18	ezio.melotti	set	files: + issue17183.diff keywords: + patch
2013-02-22 03:46:48	ezio.melotti	set	messages: + msg182626
2013-02-16 10:36:55	guido.reina	set	files: + test.tgz messages: + msg182215
2013-02-15 22:28:07	ezio.melotti	set	type: enhancement -> performance messages: + msg182185
2013-02-15 22:10:02	terry.reedy	set	nosy: + terry.reedy messages: + msg182183
2013-02-15 21:42:02	ezio.melotti	set	messages: + msg182179
2013-02-11 16:18:46	serhiy.storchaka	set	nosy: + serhiy.storchaka messages: + msg181904
2013-02-11 16:12:27	ezio.melotti	set	versions: - Python 2.6, Python 3.1, Python 2.7, Python 3.2, Python 3.3, Python 3.5 nosy: + ezio.melotti assignee: ezio.melotti components: + Library (Lib) stage: needs patch
2013-02-11 16:11:08	guido.reina	create