Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

expose html.parser.unescape #47176

Closed
thomaspinckney3 mannequin opened this issue May 20, 2008 · 14 comments
Closed

expose html.parser.unescape #47176

thomaspinckney3 mannequin opened this issue May 20, 2008 · 14 comments
Assignees
Labels
stdlib Python modules in the Lib dir type-feature A feature request or enhancement

Comments

@thomaspinckney3
Copy link
Mannequin

thomaspinckney3 mannequin commented May 20, 2008

BPO 2927
Nosy @brettcannon, @vstinner, @ezio-melotti, @rnk, @vadmium, @serhiy-storchaka
Files
  • unescape.diff
  • unescape.diff
  • issue2927.diff
  • issue2927-2.diff
  • issue2927-3.diff
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = 'https://github.com/ezio-melotti'
    closed_at = <Date 2013-11-19.18:51:50.357>
    created_at = <Date 2008-05-20.05:43:53.602>
    labels = ['type-feature', 'library']
    title = 'expose html.parser.unescape'
    updated_at = <Date 2013-11-19.18:51:50.356>
    user = 'https://bugs.python.org/thomaspinckney3'

    bugs.python.org fields:

    activity = <Date 2013-11-19.18:51:50.356>
    actor = 'ezio.melotti'
    assignee = 'ezio.melotti'
    closed = True
    closed_date = <Date 2013-11-19.18:51:50.357>
    closer = 'ezio.melotti'
    components = ['Library (Lib)']
    creation = <Date 2008-05-20.05:43:53.602>
    creator = 'thomaspinckney3'
    dependencies = []
    files = ['10383', '19550', '32682', '32700', '32703']
    hgrepos = []
    issue_num = 2927
    keywords = ['patch', 'needs review']
    message_count = 14.0
    messages = ['67102', '67103', '110636', '110657', '111461', '111566', '120269', '120826', '203341', '203369', '203403', '203410', '203413', '203415']
    nosy_count = 9.0
    nosy_names = ['brett.cannon', 'vstinner', 'ezio.melotti', 'thomaspinckney3', 'rnk', 'BreamoreBoy', 'python-dev', 'martin.panter', 'serhiy.storchaka']
    pr_nums = []
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'enhancement'
    url = 'https://bugs.python.org/issue2927'
    versions = ['Python 3.4']

    @thomaspinckney3
    Copy link
    Mannequin Author

    thomaspinckney3 mannequin commented May 20, 2008

    There is currently a private method inside of html.parser.HTMLParser to
    unescape HTML &...; style escapes. This would be useful to expose for
    other users who want to unescape a piece of HTML.

    Additionally, many websites don't use proper unicode or iso-8859-1
    encodings and accidentally use Microsoft Code Page 1252 extensions. I
    added code to map these to their appropriate unicode values.

    The unescaping logic was slightly simplified too.

    This is my first Python patch submission, so please let me know if I've
    done anything wrong.

    A new test case was also added for this functionality.

    @thomaspinckney3 thomaspinckney3 mannequin added stdlib Python modules in the Lib dir type-feature A feature request or enhancement labels May 20, 2008
    @brettcannon
    Copy link
    Member

    The plan is to add html.escape(). Adding html.unescape() wouldn't hurt.

    @BreamoreBoy
    Copy link
    Mannequin

    BreamoreBoy mannequin commented Jul 18, 2010

    Trying to run the test and I get:-

    c:\py3k\Lib>..\PCbuild\python_d.exe test\test_htmlparser.py
    File "test\test_htmlparser.py", line 326
    escaped = u"<p>There�s the Côte</p>"
    ^
    SyntaxError: invalid syntax

    @rnk
    Copy link
    Mannequin

    rnk mannequin commented Jul 18, 2010

    It's using the old Python 2 unicode string literal syntax.

    It also doesn't keep to 80 cols.

    I'd also rather continue using a lazily initialized dict instead of catching a KeyError for '.

    I also feel that with the changes to Unicode in py3k, the cp1252 stuff won't work as desired and should be cut.

    ===

    Is anyone still interested in html.unescape or html.escape anyway? Every web framework seems to have their own support routines already. Otherwise I'd recommend close -> wontfix.

    @BreamoreBoy
    Copy link
    Mannequin

    BreamoreBoy mannequin commented Jul 24, 2010

    msg110657 recommends close -> wontfix. Does anybody want this kept open or can it be closed?

    @vstinner
    Copy link
    Member

    I'm not sure that using an hardcoded mapping CP1252 => unicode is a good idea.

    @thomaspinckney3
    Copy link
    Mannequin Author

    thomaspinckney3 mannequin commented Nov 2, 2010

    I don't think Django includes an HTML unescape. I'm not familiar with other frameworks. So I'd still find this useful to include in the stdlib.

    @thomaspinckney3
    Copy link
    Mannequin Author

    thomaspinckney3 mannequin commented Nov 9, 2010

    New patch attached, tested against Python 3.2. This is my first Python patch so apologies if I've done something wrong here. Feedback appreciated!

    Changes:

    • fit everything to 80 cols
    • just made changes to the HTMLParser.unescape function instead of providing a standalone unescape function
    • fixed test case to fix string literals to work in python 3k
    • left the cp1252 hacks in there since it looks like they work still, but if there's a problem with them let me know. In practice I have to this at work in order to make unescaping actual web pages work.

    @ezio-melotti ezio-melotti self-assigned this Nov 18, 2013
    @serhiy-storchaka
    Copy link
    Member

    I added comments on Rietveld.

    Yet one thing. For now the html module is very simple and has no dependencies. The patch adds an import of re and html.escapes and relative heavy re.compile operations. Due to the fact that the html module is implicitly imported when any of the html submodules is imported, this can affect a code which doesn't use unescape(). However a cure for this problem (lazy import and initialization) may be worse than the problem itself, perhaps we should live with it.

    @ezio-melotti
    Copy link
    Member

    Here's an updated patch that addresses comments on rietveld and adds a few more tests and docs.
    I should also update the what's new, but I have other upcoming changes in the html package so I'll probably do it at the end.

    Regarding your concern:

    • if people are only using html.escape, then they will get a couple of extra imports, including all the html5 entities, and a re.compile;
    • if people are using html.parser, they already have plenty of re.compiles there, and soon html.parser will use unescape too;
    • if people are using html.entities they only get an extra re.compile;

    Overall I don't think it's a big problem.

    As a side node, the "if '&' in s:" in the unescape function could be removed -- I'm not sure it brings any real advantage. This could/should be proved by benchmarks.

    @ezio-melotti
    Copy link
    Member

    Here is the last iteration with a few minor tweaks and a couple more tests.

    @serhiy-storchaka
    Copy link
    Member

    LGTM.

    @python-dev
    Copy link
    Mannequin

    python-dev mannequin commented Nov 19, 2013

    New changeset 7b9235852b3b by Ezio Melotti in branch 'default':
    bpo-2927: Added the unescape() function to the html module.
    http://hg.python.org/cpython/rev/7b9235852b3b

    @ezio-melotti
    Copy link
    Member

    Fixed, thanks for the reviews!

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    stdlib Python modules in the Lib dir type-feature A feature request or enhancement
    Projects
    None yet
    Development

    No branches or pull requests

    5 participants