Issue 41622: Add support for emoji-data.txt and emoji-variation-sequences.txt to unicodedata

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/85788

classification

Title:	Add support for emoji-data.txt and emoji-variation-sequences.txt to unicodedata
Type:	enhancement	Stage:
Components:	Unicode	Versions:	Python 3.10, Python 3.9

process

Status:	open	Resolution:
Dependencies:		Superseder:
Assigned To:		Nosy List:	ezio.melotti, jack1142, terry.reedy
Priority:	normal	Keywords:

Created on 2020-08-23 20:34 by jack1142, last changed 2022-04-11 14:59 by admin.

Messages (2)
msg375826	Author: Jakub Kuczys (jack1142)	Date: 2020-08-23 20:34
`emoji-data.txt` and `emoji-variation-sequences.txt` files were formally pulled into the UCD as of Version 13.0 [1] so I think that unicodedata as a package providing access to UCD could support those as well. In particular:

- `emoji-data.txt` lists character properties for emoji characters [2]
- `emoji-variation-sequences.txt` lists valid text and emoji presentation sequences [3]

Data from `emoji-variation-sequences.txt` can be used to ensure consistent rendering of emoji characters across devices [4] (`StandardizedVariants.txt` has a similar purpose for non-emoji characters). I'm not entirely sure of the use cases for `emoji-data.txt`, but because it's also newly added in UCD 13.0.0, I figured I at least shouldn't omit it when making this issue.

[1] https://www.unicode.org/reports/tr44/#Change_History - Changes in Unicode 13.0.0, "Emoji Data" section
[2] https://www.unicode.org/reports/tr51/#Emoji_Properties_and_Data_Files
[3] https://www.unicode.org/reports/tr51/#Emoji_Variation_Sequences
[4] https://unicode.org/faq/vs.html#1
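
To illustrate what the presentation sequences mentioned above look like in practice, here is a minimal sketch that appends a variation selector to a base character. The helper names `as_text` and `as_emoji` are hypothetical and are not part of unicodedata; whether a renderer honours the request depends on the font and platform.

```python
import unicodedata

VS15 = "\ufe0e"  # VARIATION SELECTOR-15: requests text presentation
VS16 = "\ufe0f"  # VARIATION SELECTOR-16: requests emoji presentation


def as_text(ch: str) -> str:
    """Hypothetical helper: build a text presentation sequence."""
    return ch + VS15


def as_emoji(ch: str) -> str:
    """Hypothetical helper: build an emoji presentation sequence."""
    return ch + VS16


number_sign = "\u0023"
# Show the code points of the resulting two-character sequence.
print([f"{ord(c):04X}" for c in as_emoji(number_sign)])
# The selectors themselves are ordinary characters with names in the UCD.
print(unicodedata.name(VS15), "/", unicodedata.name(VS16))
```

This only builds the sequences; it does not check whether a sequence is listed as valid in `emoji-variation-sequences.txt`, which is the part the requested support would add.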
msg376084	Author: Terry J. Reedy (terry.reedy)	Date: 2020-08-29 22:12
Base facts: The Unicode Character Database, UCD, is defined in Tech Report 44, https://www.unicode.org/reports/tr44/. The latest files (now for 13.0) are at https://www.unicode.org/Public/UCD/latest/ and in particular, in the ucd subdirectory. ucd/UnicodeData.txt has a sequential list of current codepoints, including emoji codepoints. Version 13 added subdirectory ucd/emoji with the 2 files listed above.

emoji-variation-sequences.txt comprises 177 highly redundant pairs of lines like this:

0023 FE0E ; text style; # (1.1) NUMBER SIGN
0023 FE0F ; emoji style; # (1.1) NUMBER SIGN

The only difference between the lines is 'FE0E; text' versus 'FE0F; emoji', that is, 'TEXT PRESENTATION SELECTOR' versus 'EMOJI PRESENTATION SELECTOR'. tr51 does not explicitly say that every line is paired, but perusal suggests that this is true, making the file highly redundant.

The 177 characters include some non-emoji symbols, like #, and omit most emoji, including SNAKE, '\U0001f40d', '🐍' (colored coiled snake). And yet, here, at least in Firefox, is the supposedly invalid text snake, '\U0001f40d\ufe0e': '🐍︎' (a flat black-only, uncoiled wiggling snake head). I don't know how '#\ufe0f' might be different from plain '#'.

Our UCD copy is accessed via 13 functions in the unicodedata module. Support for the file could consist of a new function, such as 'emoji_text'. The implementation could be 'chr in emoji_text_set', where the latter is the set of 177 characters. But given the accidental experiment above with an unauthorized sequence, I don't know how useful it would be.
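
The 'chr in emoji_text_set' idea could be sketched roughly as follows. This is only an illustration under stated assumptions: `emoji_text` and `emoji_text_set` are the hypothetical names suggested above, `parse_variation_sequences` is an invented helper, and the sample data contains just the two lines quoted from the file rather than all 177 pairs.

```python
# Two sample lines in the format quoted above; the real file has many more.
SAMPLE = """\
# (sample, not the full emoji-variation-sequences.txt)
0023 FE0E  ; text style;  # (1.1) NUMBER SIGN
0023 FE0F  ; emoji style; # (1.1) NUMBER SIGN
"""


def parse_variation_sequences(text):
    """Return the set of base characters that have a listed sequence."""
    bases = set()
    for line in text.splitlines():
        data = line.split("#", 1)[0].strip()  # drop trailing comment
        if not data:
            continue  # blank or comment-only line
        sequence = data.split(";", 1)[0].split()
        base, selector = (int(cp, 16) for cp in sequence)
        if selector in (0xFE0E, 0xFE0F):
            bases.add(chr(base))
    return bases


emoji_text_set = parse_variation_sequences(SAMPLE)


def emoji_text(ch):
    """Hypothetical API: is there a listed presentation sequence for ch?"""
    return ch in emoji_text_set


print(emoji_text("#"))           # listed in the sample
print(emoji_text("\U0001f40d"))  # SNAKE is not in the sample
```

Because the text and emoji lines come in pairs, collecting the base characters into one set loses nothing as long as every line really is paired, which is the assumption noted above.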

History
Date	User	Action	Args
2022-04-11 14:59:35	admin	set	github: 85788
2020-09-01 09:46:27	vstinner	set	nosy: - vstinner
2020-08-29 22:12:46	terry.reedy	set	nosy: + terry.reedy messages: + msg376084
2020-08-23 20:34:50	jack1142	create