This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

classification
Title: Update html.entities.html5 dictionary and parseentities.py
Type: behavior Stage: resolved
Components: Library (Lib) Versions: Python 3.4
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: ezio.melotti Nosy List: Ramchandra Apte, eric.araujo, ezio.melotti, georg.brandl, iuliia.proskurnia, kushal.das, larry, python-dev, terry.reedy
Priority: deferred blocker Keywords: patch

Created on 2012-10-16 09:41 by ezio.melotti, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Files
File name Uploaded Description Edit
issue16245.diff ezio.melotti, 2012-10-16 11:51 New Tools/scripts/parse_html5_entities.py review
issue16245-2.diff ezio.melotti, 2012-10-23 11:26 review
issue16245-3.diff iuliia.proskurnia, 2012-10-23 12:11 review
Messages (16)
msg173021 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2012-10-16 09:41
A JSON file containing all the HTML5 entities is now available at http://dev.w3.org/html5/spec/entities.json.
I tested from the interpreter to see if it matches the values in html.entities.html5, and there are a dozen entities that need to be updated:

>>> import json
>>> s = json.load(open('entities.json'))
>>> from html.entities import html5
>>> for (k1,i1),(k2,i2) in zip(sorted(s.items()), sorted(html5.items())):
...   if i1['characters'] != i2: (k1, k2, i1['characters'], i2, i1['codepoints'], list(map(ord, i2)))
... 
('&DotDot;', 'DotDot;', '⃜', '◌⃜', [8412], [9676, 8412])
('&DownBreve;', 'DownBreve;', '̑', '◌̑', [785], [9676, 785])
('&LeftAngleBracket;', 'LeftAngleBracket;', '⟨', '〈', [10216], [9001])
('&NewLine;', 'NewLine;', '\n', '␊', [10], [9226])
('&RightAngleBracket;', 'RightAngleBracket;', '⟩', '〉', [10217], [9002])
('&Tab;', 'Tab;', '\t', '␉', [9], [9225])
('&TripleDot;', 'TripleDot;', '⃛', '◌⃛', [8411], [9676, 8411])
('&lang;', 'lang;', '⟨', '〈', [10216], [9001])
('&langle;', 'langle;', '⟨', '〈', [10216], [9001])
('&rang;', 'rang;', '⟩', '〉', [10217], [9002])
('&rangle;', 'rangle;', '⟩', '〉', [10217], [9002])
('&tdot;', 'tdot;', '⃛', '◌⃛', [8411], [9676, 8411])

The Tools/scripts/parseentities.py script should also be updated (or possibly a new, separate script should be added), so it can be used to generate the html5 dict.  I'm setting this as a release blocker so that the update gets done before the release (other values might change in the meantime).
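
The interpreter session above pairs the two sorted item lists with zip(), which only lines up if both mappings contain exactly the same names. A key-based comparison avoids that assumption. The following is a minimal sketch, not the script attached to this issue; the function name diff_entities and its return shape are made up for illustration, and it assumes the JSON layout shown above ('&name;' keys with 'characters' and 'codepoints' fields).

```python
from html.entities import html5


def diff_entities(spec, current=html5):
    """Compare parsed entities.json data against an html5-style dict.

    spec    -- maps names such as '&lang;' to a dict with a 'characters' field
    current -- maps names such as 'lang;' to the replacement string
    (diff_entities is a hypothetical helper, not part of the stdlib.)
    """
    # entities.json keys carry a leading '&'; html.entities.html5 keys do not.
    expected = {name.lstrip('&'): info['characters']
                for name, info in spec.items()}
    # Names present on only one side.
    missing = sorted(expected.keys() - current.keys())
    extra = sorted(current.keys() - expected.keys())
    # Names present on both sides whose values differ: (current, expected).
    changed = {
        name: (current[name], expected[name])
        for name in expected.keys() & current.keys()
        if current[name] != expected[name]
    }
    return missing, extra, changed
```

Called as diff_entities(json.load(open('entities.json'))), it would report names missing from the dict, names no longer in the spec, and names whose values differ, independent of sort order.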
msg173345 - (view) Author: Éric Araujo (eric.araujo) * (Python committer) Date: 2012-10-19 16:22
I say replace the code.  HTML 4.01 won’t be updated.
msg173589 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2012-10-23 10:34
I think it's ok to have a separate file rather than patching the existing one (see attached patch).  If the old script is not used anymore it could be removed, otherwise we could just leave it there.
msg173601 - (view) Author: Iuliia Proskurnia (iuliia.proskurnia) Date: 2012-10-23 12:11
Version with --patch to modify Lib/html/entities.py automatically
msg173618 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2012-10-23 13:46
New changeset dd8b969d7459 by Ezio Melotti in branch 'default':
#16245: add a script to generate the html.entities.html5 dict.
http://hg.python.org/cpython/rev/dd8b969d7459
msg173619 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2012-10-23 13:54
New changeset 1eb1c6942ac8 by Ezio Melotti in branch '3.3':
#16245: Fix the value of a few entities in html.entities.html5.
http://hg.python.org/cpython/rev/1eb1c6942ac8

New changeset 70fab10cd542 by Ezio Melotti in branch 'default':
#16245: merge with 3.3.
http://hg.python.org/cpython/rev/70fab10cd542
msg173629 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2012-10-23 18:14
New changeset fb80df16c4ff by Ezio Melotti in branch 'default':
Add Misc/NEWS entry for dd8b969d7459/#16245.
http://hg.python.org/cpython/rev/fb80df16c4ff
msg173631 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2012-10-23 18:22
I have now committed an improved version of the script (thanks Iuliia!) and updated the html.entities.html5 dictionary accordingly.

I'm leaving this open because we will have to check whether the dictionary is still up to date before the release of Python 3.4.
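
For readers who want a feel for what generating the dict involves, here is a rough sketch. It is not the committed Tools/scripts/parse_html5_entities.py; the function name format_html5_dict and the output layout are assumptions for illustration. It takes already-parsed entities.json data and returns ASCII-only Python source for an html5-style dict.

```python
def format_html5_dict(spec):
    """Return Python source for an html5-style dict built from entities.json data.

    format_html5_dict is a hypothetical helper; the layout of the generated
    source is an assumption, not the format used in Lib/html/entities.py.
    """
    lines = ['html5 = {']
    for name in sorted(spec):
        key = name.lstrip('&')  # drop the leading '&' used in entities.json
        # Escape every code point so the generated source stays pure ASCII.
        value = ''.join(
            '\\U%08x' % cp if cp > 0xFFFF else '\\u%04x' % cp
            for cp in spec[name]['codepoints']
        )
        lines.append("    %r: '%s'," % (key, value))
    lines.append('}')
    return '\n'.join(lines)
```

Writing escapes rather than literal characters keeps combining marks and control characters such as Tab and NewLine readable in the generated file.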
msg182506 - (view) Author: Ramchandra Apte (Ramchandra Apte) * Date: 2013-02-20 14:19
Shouldn't this be deferred blocker?
msg193983 - (view) Author: Larry Hastings (larry) * (Python committer) Date: 2013-07-31 07:15
This is still marked as a release blocker.  I guess this is a "tickler" for Ezio to go check and see if there's a new entities file.

Ezio: can you get this issue closed or downgraded in the next two days?
msg194033 - (view) Author: Larry Hastings (larry) * (Python committer) Date: 2013-08-01 08:21
I'm downgrading this to "deferred blocker".  We'll make sure it happens before Python 3.4.0, but there's no need to hold up Python 3.4a1 for this.
msg194527 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2013-08-06 05:59
I ran Tools/scripts/parse_html5_entities.py and it says that "The current dictionary is updated.".  We should check this again, and eventually close the issue, when 3.4 is released.
msg213714 - (view) Author: Larry Hastings (larry) * (Python committer) Date: 2014-03-16 05:15
Was this done?  I'm tagging 3.4.0 final soon.
msg213762 - (view) Author: Éric Araujo (eric.araujo) * (Python committer) Date: 2014-03-16 22:01
I just ran the script:

$ Tools/scripts/parse_html5_entities.py 
The current dictionary is updated.

This is done :‑)
msg213763 - (view) Author: Éric Araujo (eric.araujo) * (Python committer) Date: 2014-03-16 22:03
BTW this message does not mean that the dictionary was just updated, but that it was already up to date.
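
As a quick spot check that does not need the script or the JSON file, the twelve entities listed in the first message can be compared against the installed dictionary. This is only a sketch: the helper name stale_entities is made up, the expected code points are copied from the output in the first message, and html.unescape assumes Python 3.4 or later.

```python
import html
from html.entities import html5

# Code points reported by entities.json in the first message of this issue.
EXPECTED = {
    'DotDot;': [8412],
    'DownBreve;': [785],
    'LeftAngleBracket;': [10216],
    'NewLine;': [10],
    'RightAngleBracket;': [10217],
    'Tab;': [9],
    'TripleDot;': [8411],
    'lang;': [10216],
    'langle;': [10216],
    'rang;': [10217],
    'rangle;': [10217],
    'tdot;': [8411],
}


def stale_entities(expected, table=html5):
    """Return the names whose value in `table` differs from `expected`.

    stale_entities is a hypothetical helper used only for this check.
    """
    return sorted(
        name for name, codepoints in expected.items()
        if [ord(ch) for ch in table.get(name, '')] != codepoints
    )


if __name__ == '__main__':
    print(stale_entities(EXPECTED))
    # html.unescape() resolves named references through the same html5 dict.
    print(ascii(html.unescape('&lang;x&rang;')))
```

An empty list means the fixed values are in place; it says nothing about entities outside this list, which is what the full script is for.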
msg356281 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2019-11-09 02:00
According to git blame, the html5 dict in https://github.com/python/cpython/blob/master/Lib/html/entities.py has not changed in 7 years.  On the other hand, the standard on which it is based, https://html.spec.whatwg.org/multipage/named-characters.html, was last revised yesterday, and I presume several other times since.  On the third hand, I just ran the update script and there was no change to entities.py, so maybe it has been run with every release.

Should a comment be added to the file listing the unicode source and the update script?
History
Date User Action Args
2022-04-11 14:57:37 admin set github: 60449
2019-11-09 02:00:34 terry.reedy set nosy: + terry.reedy; messages: + msg356281
2014-03-16 22:03:39 eric.araujo set messages: + msg213763
2014-03-16 22:01:33 eric.araujo set status: open -> closed; resolution: fixed; messages: + msg213762
2014-03-16 05:15:50 larry set messages: + msg213714
2013-08-06 05:59:29 ezio.melotti set messages: + msg194527
2013-08-01 08:21:10 larry set priority: release blocker -> deferred blocker; messages: + msg194033
2013-07-31 07:15:08 larry set messages: + msg193983
2013-02-20 14:19:58 Ramchandra Apte set nosy: + Ramchandra Apte; messages: + msg182506
2013-02-10 18:27:42 pitrou set versions: - Python 3.3
2012-11-01 16:28:07 serhiy.storchaka set nosy: - serhiy.storchaka
2012-10-23 18:22:35 ezio.melotti set messages: + msg173631; components: + Library (Lib); stage: patch review -> resolved
2012-10-23 18:14:54 python-dev set messages: + msg173629
2012-10-23 13:54:36 python-dev set messages: + msg173619
2012-10-23 13:46:42 python-dev set nosy: + python-dev; messages: + msg173618
2012-10-23 12:11:48 iuliia.proskurnia set files: + issue16245-3.diff; nosy: + iuliia.proskurnia; messages: + msg173601
2012-10-23 11:26:01 ezio.melotti set files: + issue16245-2.diff
2012-10-23 10:34:43 ezio.melotti set messages: + msg173589; stage: needs patch -> patch review
2012-10-19 16:22:10 eric.araujo set nosy: + larry, eric.araujo, georg.brandl; messages: + msg173345
2012-10-16 12:00:57 serhiy.storchaka set nosy: + serhiy.storchaka
2012-10-16 11:51:41 ezio.melotti set files: + issue16245.diff; keywords: + patch
2012-10-16 09:57:06 kushal.das set nosy: + kushal.das
2012-10-16 09:41:51 ezio.melotti create