classification
Title: HTMLParser class is not thread safe
Type: behavior Stage: resolved
Components: Library (Lib) Versions: Python 2.7
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: ale2017, ezio.melotti, serhiy.storchaka
Priority: normal Keywords: easy

Created on 2017-04-07 08:26 by ale2017, last changed 2017-04-15 16:01 by serhiy.storchaka. This issue is now closed.

Pull Requests
URL Status Linked Edit
PR 1140 merged serhiy.storchaka, 2017-04-14 17:44
Messages (5)
msg291256 - (view) Author: Alessandro Vesely (ale2017) * Date: 2017-04-07 08:26
SYMPTOM:
When used in a multithreaded program, instances of a class derived from HTMLParser may convert an entity or leave it alone, in an apparently random fashion.

CAUSE:
The class has a static attribute, entitydefs, which, on first use, is initialized from None to a dictionary of entity definitions.  Initialization is not atomic.  Therefore, instances in concurrent threads assume that initialization is complete and catch a KeyError if the entity at hand hasn't been set yet.  In that case, the entity is left alone as if it were invalid.

WORKAROUND:
class Dummy(HTMLParser):
	"""this class is defined here so that we can initialize its base class"""
	def __init__(self):
		HTMLParser.__init__(self)

# Initialize HTMLParser by loading htmlentitydefs
dummy = Dummy()
dummy.feed('<a href="&amp;">')
del dummy, Dummy
msg291704 - (view) Author: Alessandro Vesely (ale2017) * Date: 2017-04-15 05:46
On Fri 14/Apr/2017 19:44:29 +0200 Serhiy Storchaka wrote:
> 
> Changes by Serhiy Storchaka <storchaka+cpython@gmail.com>:
> 
> 
> ----------
> pull_requests: +1272

Thank you for your fix, Serhiy.  It makes the class behave consistently.
 However, busy processes are going to concurrently build multiple
temporary entitydefs objects before one of them wins, which is probably
worse than the greedy starting that such lazy initialization tries to
avoid in the first place.  Doesn't that design deserve a comment in the
code, at least?

Greetings
Ale
msg291706 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2017-04-15 06:12
There is nothing wrong with building entitydefs multiple times since the result is same. An alternative is using locking, but this is more cumbersome solution. And building entitydefs is much faster than importing the threading module.

$ ./python -m timeit -s 'from HTMLParser import HTMLParser; p = HTMLParser()' -- 'HTMLParser.entitydefs = None; p.unescape("&amp;")'
1000 loops, best of 3: 412 usec per loop

$ ./python -m timeit -s 'import sys; m = sys.modules.copy()' -- 'import threading; sys.modules.clear(); sys.modules.update(m)'
100 loops, best of 3: 5.43 msec per loop

Current solution is faster in single-thread case, correct likely fast enough in multi-thread case.
msg291709 - (view) Author: Alessandro Vesely (ale2017) * Date: 2017-04-15 08:21
Serhiy's analysis is correct.  If anything more than a comment is going
to make its way to the code, I'd suggest to move dictionary building to
its own function, so that it can be called either on first use --like
now-- or before threading if the user is concerned.

I agree there is nothing wrong with multiple builds.  My point is just a
minor, bearable inefficiency.  It can be neglected.  Its most annoying
case is probably with test suites, which are more likely to shoot up a
bunch of new threads all at once.

Greetings
Ale
msg291717 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2017-04-15 15:35
New changeset 50f948edda0e6465e194ecc50b85fa2646039b8d by Serhiy Storchaka in branch '2.7':
bpo-30011: Fixed race condition in HTMLParser.unescape(). (#1140)
https://github.com/python/cpython/commit/50f948edda0e6465e194ecc50b85fa2646039b8d
History
Date User Action Args
2017-04-15 16:01:18serhiy.storchakasetstatus: open -> closed
resolution: fixed
stage: resolved
2017-04-15 15:35:48serhiy.storchakasetmessages: + msg291717
2017-04-15 08:21:05ale2017setmessages: + msg291709
2017-04-15 06:12:21serhiy.storchakasetnosy: + serhiy.storchaka
messages: + msg291706
2017-04-15 05:46:47ale2017setmessages: + msg291704
2017-04-14 17:44:29serhiy.storchakasetpull_requests: + pull_request1272
2017-04-07 08:54:19serhiy.storchakasetkeywords: + easy
nosy: + ezio.melotti
2017-04-07 08:26:37ale2017create