This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

classification
Title: Small enhancements to Lib/_markupbase.py
Type: performance Stage: resolved
Components: Library (Lib) Versions: Python 3.4
process
Status: closed Resolution: rejected
Dependencies: Superseder:
Assigned To: ezio.melotti Nosy List: ezio.melotti, guido.reina, iritkatriel, serhiy.storchaka, terry.reedy
Priority: normal Keywords: patch

Created on 2013-02-11 16:11 by guido.reina, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Files
File name Uploaded Description Edit
test.tgz guido.reina, 2013-02-16 10:36 Tests.
issue17183.diff ezio.melotti, 2013-02-22 06:35 review
Messages (9)
msg181903 - (view) Author: Guido Reina (guido.reina) Date: 2013-02-11 16:11
In the file: Lib/_markupbase.py, function: "_parse_doctype_element" there is:

if '>' in rawdata[j:]:
    return rawdata.find(">", j) + 1

rawdata[j:] is being scanned twice.

It would be better to do:
pos = rawdata.find(">", j)
if pos != -1:
    return pos + 1


Same thing in the function: "_parse_doctype_attlist":

if ")" in rawdata[j:]:
    j = rawdata.find(")", j) + 1
else:
    return -1

It would be better to do:
pos = rawdata.find(")", j)
if pos != -1:
    j = pos + 1
else:
    return -1
msg181904 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-02-11 16:18
> if '>' in rawdata[j:]:
>     return rawdata.find(">", j) + 1

See issue17170 for this idiom.
msg182179 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2013-02-15 21:42
We should add some benchmarks to see if there is any difference between the two forms.
msg182183 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2013-02-15 22:10
'Enhancement' issues are for visible behavior additions (or occasionally, changes). This is intended to be an invisible small speedup, hence it is a 'performance' issue, and gets a different title.

As explained in #17170, the change will not be a speedup if the substring being looked for is usually not there. The reason is the .find lookup and function call versus the direct syntax. Even if it is faster, I strongly doubt it would be hardly noticeable in the context of this function, which itself is a small piece of parsing an entire document, and it is against our policy to make such micro-optimizations in working code.

The complete block in question Lib/_markupbase.py, 254:7 is

  rawdata = self.rawdata
  if '>' in rawdata[j:]:
    return rawdata.find(">", j) + 1
  return -1

[Ugh. Localizing rawdata negates some of whatever advantage is gained from the double scan.]

If I were to rewrite it, I would replace it with

  try:
    return self.rawdata.index(">", j) + 1
  except ValueError:
    return -1

as better style, and a better example for readers, regardless of micro speed differences. But style-only changes in working code is also against our policy. So I would be closing this if Ezio had not grabbed it ;-).
msg182185 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2013-02-15 22:28
I would still do a benchmark, for these reasons:
1) IIRC rawdata might be the whole document (or at least everything that has not been parsed yet);
2) the '>' is very likely to be found;

This situation is fairly different from the one presented in #17170, where the strings are shorts and the character is not present in the majority of the strings.

Profiling and improving html.parser (and hence _markupbase) was already on my todo list (even if admittedly not anywhere near the top :), so writing a benchmark for it might be useful for further enhancements too.

(Note: HTMLParser is already fairly fast, parsing ~1.3MB/s according to http://www.crummy.com/2012/02/06/0, but I've never done anything to make it even faster, so there might still be room for improvements.)
msg182215 - (view) Author: Guido Reina (guido.reina) Date: 2013-02-16 10:36
I am attaching a .tgz file with the tests I have performed.

The .tgz file contains also a README.txt file with more detailed information.

I have done the following test:
The script loads the HTML file 'search.html' in 'rawdata' and searches '>' in a loop from the position 'i', being i in: range(len(rawdata)).

with the three variants: "in" + "find" (test1.py), "find" (test2.py), "index" (test3.py).

Result:
Script          First run       Second run      Third run
---------------------------------------------------------
test1.py        2.33            2.32            2.33
test2.py        0.75            0.74            0.76
test3.py        0.75            0.74            0.74


I don't know if the test is representative and whether it helps.
If you think that the test could be improved/changed, just let me know, I will be happy to help.
msg182626 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2013-02-22 03:46
I did some macro-benchmarks and the proposed changes don't seem to affect the result (most likely because they are in _parse_doctype_element and _parse_doctype_attlist which should be called only once per document).

I did some profiling, and this is the result:
         4437196 function calls (4436748 primitive calls) in 36.582 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    92931    7.400    0.000   17.082    0.000 parser.py:320(parse_starttag)
      202    6.363    0.032   36.281    0.180 parser.py:171(goahead)
   673285    5.302    0.000    5.302    0.000 {method 'match' of '_sre.SRE_Pattern' objects}
   369418    3.272    0.000    4.554    0.000 _markupbase.py:48(updatepos)
    83243    2.698    0.000    4.639    0.000 parser.py:421(parse_endtag)
   308882    2.006    0.000    2.006    0.000 {method 'group' of '_sre.SRE_Match' objects}
   270074    1.521    0.000    1.521    0.000 {method 'search' of '_sre.SRE_Pattern' objects}
    92931    1.150    0.000    2.643    0.000 parser.py:378(check_for_whole_start_tag)
   291079    1.028    0.000    1.028    0.000 {method 'count' of 'str' objects}
   295892    0.883    0.000    0.883    0.000 {method 'startswith' of 'str' objects}
   387439    0.733    0.000    0.733    0.000 {method 'lower' of 'str' objects}
   403922    0.642    0.000    0.642    0.000 {method 'end' of '_sre.SRE_Match' objects}
   124512    0.406    0.000    1.156    0.000 parser.py:504(unescape)
   186775    0.326    0.000    0.326    0.000 {method 'start' of '_sre.SRE_Match' objects}
    96213    0.255    0.000    0.255    0.000 {method 'endswith' of 'str' objects}
    59522    0.253    0.000    0.253    0.000 {method 'rindex' of 'str' objects}
    83226    0.215    0.000    0.215    0.000 parser.py:164(clear_cdata_mode)
     6428    0.194    0.000    0.337    0.000 parser.py:507(replaceEntities)
   106487    0.183    0.000    0.183    0.000 parser.py:484(handle_data)

Excluding string and regex methods, the 3 slowest methods are parse_starttag, goahead, and updatepos.
The attached patch adds a couple of simple optimizations to the first two -- I couldn't think a way to optimize updatepos.
The resulting speedup is however fairly small, so I'm not sure it's worth applying the patch.
I might try doing other benchmarks in future (should I add them somewhere in Tools?).
msg381297 - (view) Author: Irit Katriel (iritkatriel) * (Python committer) Date: 2020-11-17 22:20
I there anything left to do here or should we close it?
msg381347 - (view) Author: Guido Reina (guido.reina) Date: 2020-11-18 14:30
It can be closed.
History
Date User Action Args
2022-04-11 14:57:41adminsetgithub: 61385
2020-11-18 14:48:55iritkatrielsetstatus: open -> closed
resolution: rejected
stage: needs patch -> resolved
2020-11-18 14:30:26guido.reinasetstatus: pending -> open

messages: + msg381347
2020-11-17 22:20:46iritkatrielsetstatus: open -> pending
nosy: + iritkatriel
messages: + msg381297

2013-02-22 06:35:18ezio.melottisetfiles: + issue17183.diff
keywords: + patch
2013-02-22 03:46:48ezio.melottisetmessages: + msg182626
2013-02-16 10:36:55guido.reinasetfiles: + test.tgz

messages: + msg182215
2013-02-15 22:28:07ezio.melottisettype: enhancement -> performance
messages: + msg182185
2013-02-15 22:10:02terry.reedysetnosy: + terry.reedy
messages: + msg182183
2013-02-15 21:42:02ezio.melottisetmessages: + msg182179
2013-02-11 16:18:46serhiy.storchakasetnosy: + serhiy.storchaka
messages: + msg181904
2013-02-11 16:12:27ezio.melottisetversions: - Python 2.6, Python 3.1, Python 2.7, Python 3.2, Python 3.3, Python 3.5
nosy: + ezio.melotti

assignee: ezio.melotti
components: + Library (Lib)
stage: needs patch
2013-02-11 16:11:08guido.reinacreate