Issue 513840: entity unescape for sgml/htmllib

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/36039

classification

Title:	entity unescape for sgml/htmllib
Type:	enhancement	Stage:	resolved
Components:	Library (Lib)	Versions:	Python 3.4

process

Status:	closed	Resolution:	duplicate
Dependencies:		Superseder:	expose html.parser.unescape (#2927)
Assigned To:	ezio.melotti	Nosy List:	BreamoreBoy, ezio.melotti, fdrake, glchapman
Priority:	normal	Keywords:	easy

Created on 2002-02-06 17:55 by glchapman, last changed 2022-04-10 16:04 by admin. This issue is now closed.

Messages (4)
msg61076 - (view)	Author: Greg Chapman (glchapman)	Date: 2002-02-06 17:55
The parsers defined in htmllib and sgmllib do not provide any facilities for unescaping a tag attribute which has an embedded html entityref (i.e., they do not provide a way to convert "a&amp;b" to "a&b"). The parser in HTMLParser unescapes all tag attributes automatically. I'm not sure that's the right approach for sgmllib and htmllib (since it might break existing code), but it seems to me that one of the modules ought to provide a function or method which can do the unescaping if needed. (I'm not familiar with either the SGML or the HTML specification, but I assume one of them mandates the escaping of '&' (e.g.) in tag attributes. If so, then it seems appropriate for one of the modules to provide a function which undoes the mandated transformation.)
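For illustration only, the conversion requested above can be sketched with `html.unescape`, which is in the standard library from Python 3.4 onward. It postdates this message and is not part of sgmllib or htmllib. The sketch also shows the automatic attribute unescaping done by the modern `html.parser.HTMLParser`. The `AttrCollector` class and the sample strings are made up for the example.

```python
import html
from html.parser import HTMLParser

# An attribute value as it appears in markup, with '&' escaped as an entity.
raw_attr = "a&amp;b"

# html.unescape replaces named and numeric character references with text.
decoded = html.unescape(raw_attr)
print(decoded)  # a&b


class AttrCollector(HTMLParser):
    """Hypothetical helper that records the attributes of every start tag."""

    def __init__(self):
        super().__init__()
        self.attrs = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes attribute values with references already replaced.
        self.attrs.extend(attrs)


collector = AttrCollector()
collector.feed('<a href="x?a=1&amp;b=2">link</a>')
collector.close()
print(collector.attrs)
```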
msg61077 - (view)	Author: Fred Drake (fdrake)	Date: 2006-06-22 03:57
This request is making me reconsider some other changes that have already been made on the trunk (and are now in 2.5b1). Reading this, I thought "Doesn't it already do that?" Turns out that in Python 2.4, it doesn't. Both versions handle this in parsed character data; the difference is confined to attribute values. I'd like to propose adding a Boolean configuration attribute on the parser instance that, when set, causes the parser to decode entity and character references. By default, it would be unset. This would support backward compatibility and make it easier to get attribute value decoding. Another possibility would be to revert the new feature and add a separate method to perform the decoding.
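A minimal sketch of the Boolean-attribute idea, under stated assumptions: `SketchParser`, `decode_attr_refs`, and `process_attrs` are hypothetical names invented here, not the sgmllib API or the change that landed on the trunk, and `html.unescape` (Python 3.4+) stands in for whatever decoding the parser would do.

```python
import html


class SketchParser:
    """Hypothetical stand-in for a parser; not the real sgmllib interface."""

    # Proposed flag: unset by default so existing code sees no change.
    decode_attr_refs = False

    def process_attrs(self, attrs):
        """Return (name, value) pairs, decoding references only if the flag is set."""
        if not self.decode_attr_refs:
            return list(attrs)
        return [(name, html.unescape(value)) for name, value in attrs]


parser = SketchParser()
sample = [("href", "x?a=1&amp;b=2")]

# Default behaviour: attribute values are passed through untouched.
default_attrs = parser.process_attrs(sample)

# Opt in on the instance to get decoded attribute values.
parser.decode_attr_refs = True
decoded_attrs = parser.process_attrs(sample)

print(default_attrs)
print(decoded_attrs)
```

Keeping the flag as a class attribute that instances override is one way to make the default backward compatible; the alternative mentioned above, a separate decoding method, would leave the parser's callbacks unchanged and let callers decode on demand.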
msg114175 - (view)	Author: Mark Lawrence (BreamoreBoy) *	Date: 2010-08-17 21:41
Is anyone aware whether this was implemented in 2.5 or later, as hinted at in msg61077? If yes, please close this. If not, is there any point in putting this into 3.2?
msg185129 - (view)	Author: Ezio Melotti (ezio.melotti) *	Date: 2013-03-24 11:33
See also #2927.

History
Date	User	Action	Args
2022-04-10 16:04:57	admin	set	github: 36039
2013-11-18 09:54:25	ezio.melotti	set	status: open -> closed assignee: ezio.melotti superseder: expose html.parser.unescape resolution: duplicate stage: test needed -> resolved
2013-03-24 11:33:06	ezio.melotti	set	messages: + msg185129 versions: + Python 3.4, - Python 3.2
2013-03-23 22:22:01	ezio.melotti	set	nosy: + ezio.melotti
2010-08-17 21:41:06	BreamoreBoy	set	nosy: + BreamoreBoy messages: + msg114175 versions: + Python 3.2, - Python 2.7
2009-02-12 20:03:12	ajaksu2	set	keywords: + easy stage: test needed versions: + Python 2.7
2002-02-06 17:55:02	glchapman	create