Title: entity unescape for sgml/htmllib
Type: enhancement Stage: resolved
Components: Library (Lib) Versions: Python 3.4
Status: closed Resolution: duplicate
Dependencies: Superseder: expose html.parser.unescape (#2927)
Assigned To: ezio.melotti Nosy List: BreamoreBoy, ezio.melotti, fdrake, glchapman
Priority: normal Keywords: easy

Created on 2002-02-06 17:55 by glchapman, last changed 2022-04-10 16:04 by admin. This issue is now closed.

Messages (4)
msg61076 - (view) Author: Greg Chapman (glchapman) Date: 2002-02-06 17:55
The parsers defined in htmllib and sgmllib do not 
provide any facilities for unescaping a tag attribute 
which has an embedded html entityref (i.e., they do 
not provide a way to convert "a&amp;b" to "a&b").  The 
parser in HTMLParser unescapes all tag attributes 
automatically.  I'm not sure that's the right approach 
for sgmllib and htmllib (since it might break existing 
code), but it seems to me that one of the modules 
ought to provide a function or method which can do the 
unescaping if needed.  (I'm not familiar with either 
the SGML or the HTML specification, but I assume one 
of them mandates the escaping of '&' (e.g.) in tag 
attributes.  If so, then it seems appropriate for one 
of the modules to provide a function which undoes the 
mandated transformation.)
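
For illustration, the kind of helper requested above can be sketched with `html.unescape`, which is available in the standard library from Python 3.4 on. This is only a sketch: `unescape_attrs` is a hypothetical name, not part of sgmllib, htmllib, or HTMLParser, and the attribute list is made-up sample data in the `(name, value)` shape those parsers pass to start-tag handlers.

```python
from html import unescape  # standard library, Python 3.4+


def unescape_attrs(attrs):
    """Return a copy of a (name, value) attribute list with entity and
    character references in the values decoded.

    Hypothetical helper for illustration only; valueless attributes
    (value is None) are passed through unchanged.
    """
    return [
        (name, unescape(value) if value is not None else None)
        for name, value in attrs
    ]


# Sample data as a parser that does not decode attribute values might report it.
raw_attrs = [("href", "page?a=1&amp;b=2"), ("title", "a&amp;b"), ("disabled", None)]
decoded_attrs = unescape_attrs(raw_attrs)
print(decoded_attrs)
```

Keeping the decoding in a separate function leaves existing handlers untouched, which addresses the backward-compatibility concern raised in the report.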
msg61077 - (view) Author: Fred Drake (fdrake) (Python committer) Date: 2006-06-22 03:57
This request is making me reconsider some other changes that
have already been made on the trunk (and are now in 2.5b1).

Reading this, I thought "Doesn't it already do that?"  Turns
out that in Python 2.4, it doesn't.  Both versions handle
this in parsed character data; the difference is confined to
attribute values.

I'd like to propose adding a Boolean configuration attribute
on the parser instance that, when set, causes the parser to
decode entity and character references.  By default, it
would be unset.  This would support backward compatibility
and make it easier to get attribute value decoding.

Another possibility would be to revert the new feature and
add a separate method to perform the decoding.
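
A rough sketch of the first proposal, an opt-in Boolean attribute on the parser instance, might look like the following. Everything here is an assumption made for illustration: `TinyAttrParser`, its `decode_entities` flag, and the simplified regular expression are hypothetical and are not the sgmllib or htmllib API; the decoding itself relies on `html.unescape` from Python 3.4+.

```python
import re
from html import unescape  # standard library, Python 3.4+

# Simplified pattern: only handles name="value" with double quotes.
_ATTR_RE = re.compile(r'([a-zA-Z_][-.\w]*)\s*=\s*"([^"]*)"')


class TinyAttrParser:
    """Hypothetical toy parser showing an opt-in decoding flag."""

    # Unset by default, so existing callers keep seeing raw values.
    decode_entities = False

    def parse_attrs(self, start_tag):
        attrs = []
        for name, value in _ATTR_RE.findall(start_tag):
            if self.decode_entities:
                value = unescape(value)
            attrs.append((name.lower(), value))
        return attrs


parser = TinyAttrParser()
tag = '<a href="x?a=1&amp;b=2">'
raw = parser.parse_attrs(tag)       # flag unset: value left as written
parser.decode_entities = True       # opt in on this instance only
decoded = parser.parse_attrs(tag)   # flag set: references decoded
print(raw, decoded)
```

Because the flag is a class attribute that defaults to False, callers opt in per instance, and code that expects undecoded attribute values behaves as before.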
msg114175 - (view) Author: Mark Lawrence (BreamoreBoy) * Date: 2010-08-17 21:41
Is anyone aware whether this was implemented in 2.5 or later, as hinted at in msg61077?  If yes, please close this.  If not, is there any point in putting this into 3.2?
msg185129 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2013-03-24 11:33
See also #2927.
Date User Action Args
2022-04-10 16:04:57 admin set github: 36039
2013-11-18 09:54:25 ezio.melotti set status: open -> closed; assignee: ezio.melotti; superseder: expose html.parser.unescape; resolution: duplicate; stage: test needed -> resolved
2013-03-24 11:33:06 ezio.melotti set messages: + msg185129; versions: + Python 3.4, - Python 3.2
2013-03-23 22:22:01 ezio.melotti set nosy: + ezio.melotti
2010-08-17 21:41:06 BreamoreBoy set nosy: + BreamoreBoy; messages: + msg114175; versions: + Python 3.2, - Python 2.7
2009-02-12 20:03:12 ajaksu2 set keywords: + easy; stage: test needed; versions: + Python 2.7
2002-02-06 17:55:02 glchapman create