Issue 43520: Make Fraction(string) handle non-ascii slashes


This issue has been migrated to GitHub: https://github.com/python/cpython/issues/87686

classification

Title:	Make Fraction(string) handle non-ascii slashes
Type:	enhancement	Stage:	resolved
Components:	Library (Lib), Unicode	Versions:	Python 3.7

process

Status:	closed	Resolution:	rejected
Dependencies:		Superseder:
Assigned To:		Nosy List:	ezio.melotti, frederic.grosshans, gregory.p.smith, mark.dickinson, rhettinger, serhiy.storchaka, terry.reedy, weightwatchers-carlanderson
Priority:	normal	Keywords:

Created on 2021-03-16 18:09 by weightwatchers-carlanderson, last changed 2022-04-11 14:59 by admin. This issue is now closed.

Messages (16)
msg388865 - (view)	Author: Carl Anderson (weightwatchers-carlanderson)	Date: 2021-03-16 18:09
Fraction works with a regular slash:
>>> from fractions import Fraction
>>> Fraction("1/2")
Fraction(1, 2)
but there are other similar slashes such as (0x2044) in which it throws an error:
>>> Fraction("0⁄2")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/anaconda3/lib/python3.7/fractions.py", line 138, in __new__
    numerator)
ValueError: Invalid literal for Fraction: '0⁄2'
This seems to come from the (?:/(?P<denom>\d+))? section of the regex _RATIONAL_FORMAT in fractions.py
msg388866 - (view)	Author: Mark Dickinson (mark.dickinson) *	Date: 2021-03-16 18:50
There's a bigger issue here about what characters should be accepted in numeric literals. The Unicode minus sign (U+2212) "−" is also not currently accepted for Fractions or any other built-in numeric type.
> but there are other similar slashes such as (0x2044) in which it throws an error
Do you have a proposal for the set of slashes that should be accepted, or a non-arbitrary rule for determining that set? U+2044 (FRACTION SLASH), U+2215 (DIVISION SLASH) and U+FF0F (FULLWIDTH SOLIDUS) all seem like potential candidates. Are there others?
msg388867 - (view)	Author: Mark Dickinson (mark.dickinson) *	Date: 2021-03-16 18:54
Seems worth noting that Unicode fractions like ⅔ produce a FRACTION SLASH character when normalized:
>>> import unicodedata
>>> unicodedata.normalize('NFKC', '⅔')
'2⁄3'
>>> list(map(unicodedata.name, unicodedata.normalize('NFKC', '⅔')))
['DIGIT TWO', 'FRACTION SLASH', 'DIGIT THREE']
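The normalization behaviour noted above suggests one possible preprocessing step: normalize with NFKC, then map the resulting FRACTION SLASH to an ASCII slash before calling Fraction. The following is only a sketch; the helper name fraction_from_text is hypothetical and not part of the fractions module.

```python
import unicodedata
from fractions import Fraction

FRACTION_SLASH = "\u2044"


def fraction_from_text(text):
    """Parse text such as '⅔' or '1⁄2' (hypothetical helper, not stdlib)."""
    # NFKC expands a vulgar fraction such as U+2154 into digit, U+2044, digit.
    normalized = unicodedata.normalize("NFKC", text)
    # Fraction only accepts the ASCII slash, so map U+2044 onto it.
    return Fraction(normalized.replace(FRACTION_SLASH, "/"))


print(fraction_from_text("\u2154"))  # the single character ⅔
```

One caveat with this approach: a mixed number written as "1⅔" normalizes to "12⁄3", which would parse as twelve thirds, so whole and fractional parts need to be split before normalizing.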
msg388869 - (view)	Author: Mark Dickinson (mark.dickinson) *	Date: 2021-03-16 19:04
Related: #6632
msg388884 - (view)	Author: Carl Anderson (weightwatchers-carlanderson)	Date: 2021-03-16 21:08
From https://en.wikipedia.org/wiki/Slash_(punctuation) there is:
U+002F / SOLIDUS
U+2044 ⁄ FRACTION SLASH
U+2215 ∕ DIVISION SLASH
U+29F8 ⧸ BIG SOLIDUS
U+FF0F ／ FULLWIDTH SOLIDUS (fullwidth version of solidus)
U+1F67C 🙼 VERY HEAVY SOLIDUS
In XML and HTML, the slash can also be represented with the character entity &sol; or &#47; or &#x2F;.[42]
There are a couple more listed here: https://unicode-search.net/unicode-namesearch.pl?term=SLASH
msg388886 - (view)	Author: Carl Anderson (weightwatchers-carlanderson)	Date: 2021-03-16 21:20
I guess if we are doing slashes, then the division sign ÷ (U+00F7) should be included too. There are at least 2 minus signs too (U+002D, U+02D7).
msg388892 - (view)	Author: Raymond Hettinger (rhettinger) *	Date: 2021-03-16 23:22
I think we should stick with forward slashes. That is what the rest of the language does. Adding more options is a recipe for confusion.
>>> 38 / 5
7.6
>>> 38 ∕ 5
SyntaxError: invalid character '∕' (U+2215)
msg389132 - (view)	Author: Terry J. Reedy (terry.reedy) *	Date: 2021-03-20 00:36
I agree with Raymond, at least for now. I would expect the string argument to Fraction to be quoted legal Python code. Without a lot of thought and discussion leading to a change in python design with respect to unicode and operators, this limits '/' to ascii '/'. I believe that we accept non-ascii digits in at least some places, but operators are a different case.
msg389141 - (view)	Author: Raymond Hettinger (rhettinger) *	Date: 2021-03-20 03:12
DrRacket supports fraction conversions but insists on a forward slash just like we do.
Welcome to DrRacket, version 7.9.0.17--2020-12-24(f6b7f93/a) [cs].
Language: racket, with debugging; memory limit: 128 MB.
> (/ 1 2)
1/2
> (string->number "3/5")
3/5
> (string->number "2⁄3")
#f
msg389151 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2021-03-20 08:59
It would be nice to have a utility function in unicodedata to convert Unicode characters to their ASCII equivalents (if they exist). It would make it possible to explicitly convert all slashes to / (and all digits to 0-9) before passing the string to the Fraction constructor. AFAIK there is a special Unicode document and tables for this.
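No such function exists in unicodedata today, but the idea can be approximated in user code. The sketch below assumes a hand-picked set of slash lookalikes (the three candidates named earlier in this thread, not an official Unicode table) and uses unicodedata.decimal for digits; the name to_ascii_numeric is hypothetical.

```python
import unicodedata

# Assumption: a hand-maintained set, not derived from any Unicode data file.
SLASH_LOOKALIKES = {"\u2044", "\u2215", "\uff0f"}


def to_ascii_numeric(text):
    """Map slash lookalikes to '/' and decimal digits to 0-9 (hypothetical)."""
    out = []
    for ch in text:
        if ch in SLASH_LOOKALIKES:
            out.append("/")
        elif ch.isdecimal():
            # unicodedata.decimal gives the integer value of any decimal digit.
            out.append(str(unicodedata.decimal(ch)))
        else:
            out.append(ch)
    return "".join(out)


# ARABIC-INDIC DIGIT THREE, FRACTION SLASH, ARABIC-INDIC DIGIT FOUR
print(to_ascii_numeric("\u0663\u2044\u0664"))
```

The result is a plain ASCII string that can then be passed to Fraction or int as usual.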
msg389152 - (view)	Author: Mark Dickinson (mark.dickinson) *	Date: 2021-03-20 09:10
Carl: can you say more about the problem that motivated this issue?
msg389158 - (view)	Author: STINNER Victor (vstinner) *	Date: 2021-03-20 14:41
Usually, constructors try to accept the format returned by repr(obj), or even str(obj). That's the case for Fraction:
>>> str(fractions.Fraction(1, 2))
'1/2'
>>> fractions.Fraction("1/2")
Fraction(1, 2)
It works as expected.
I dislike the idea of trying to handle more Unicode characters which "look like" "/", or characters like "⅔". It sounds like a can of worms, and I don't think that such a feature belongs in the stdlib. You can easily write your own helper function accepting a string and returning a fraction.
If someone is motivated to accept more characters, I would prefer to have a unified proposal covering all Python number types (int, float, Fraction, complex, etc.) and listing all characters. Maybe a PEP would make sense.
msg389186 - (view)	Author: Gregory P. Smith (gregory.p.smith) *	Date: 2021-03-20 22:01
The proposal I like is for a unicode numeric normalization functions that return the ascii equivalent to exist. These ideally belong in a third-party PyPI library anyway, as they're the kind of thing that needs updating every time a new Unicode revision comes out. And there are often multiple cultural interpretations for some symbols, despite any standard, so you'd wind up with a variety of functions and options for which behavior to obtain. That isn't the kind of thing that makes for a good stdlib.
Doing this by default within the language syntax itself (and thus stdlib constructors) is potentially dangerous and confusing, as everything in existence in the world today that processes Python source code already has baked-in single-ASCII-token assumptions. While parsing and tooling could be evolved for that, it'd be a major ecosystem-impacting change.
msg389309 - (view)	Author: Carl Anderson (weightwatchers-carlanderson)	Date: 2021-03-22 12:25
> Carl: can you say more about the problem that motivated this issue?
@mark.dickinson I was parsing a large corpus of ingredient strings from web-scraped recipes. My code to interpret strings such as "1/2 cup sugar" would fall over every so often due to this issue, as they used the fraction slash and other visually similar characters.
msg389399 - (view)	Author: Carl Anderson (weightwatchers-carlanderson)	Date: 2021-03-23 18:19
> The proposal I like is for a unicode numeric normalization functions that return the ascii equivalent to exist.
@Gregory P. Smith this makes sense to me. That does feel like the cleanest solution. I'm currently doing
s = s.replace("⁄","/")
but it would be good to have a well-maintained normalization method that contained all the relevant mappings. Using it as an independent preprocessing step before Fraction would work well.
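For the recipe-parsing use case, the single replace call can be generalized to a translation table applied before the quantity is extracted. This is a minimal sketch: the table covers only the slash candidates mentioned in this thread, and the QUANTITY pattern and parse_ingredient name are hypothetical, handling just a leading "n" or "n/d" quantity.

```python
import re
from fractions import Fraction

# Assumption: hand-picked lookalikes from this thread, not a complete mapping.
SLASH_TABLE = str.maketrans({"\u2044": "/", "\u2215": "/", "\uff0f": "/"})

# Hypothetical pattern: a leading "n" or "n/d" quantity, then the remainder.
QUANTITY = re.compile(r"\s*(\d+(?:/\d+)?)\s+(.*)")


def parse_ingredient(line):
    """Return (quantity, rest) for a line such as '1⁄2 cup sugar'."""
    match = QUANTITY.match(line.translate(SLASH_TABLE))
    if match is None:
        raise ValueError(f"no leading quantity in {line!r}")
    return Fraction(match.group(1)), match.group(2)


print(parse_ingredient("1\u20442 cup sugar"))
```

Keeping the table separate from the parser means new lookalike characters found in the corpus can be added without touching the parsing logic.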
msg391776 - (view)	Author: Frédéric Grosshans-André (frederic.grosshans)	Date: 2021-04-24 13:09
@Gregory P. Smith unicodedata.numeric, in the standard library, already handles non-ASCII fractions in many scripts. The current “problem” is that it outputs a float (even for integers):
>>> unicodedata.numeric('⅔')
0.6666666666666666
The UnicodeData.txt file from the Unicode standard, from which it takes its data, however, contains the corresponding “ascii fractions”. For example, below are two lines of this file for two (very) different ways of encoding two thirds:
2154;VULGAR FRACTION TWO THIRDS;No;0;ON;<fraction> 0032 2044 0033;;;2/3;N;FRACTION TWO THIRDS;;;;
1245B;CUNEIFORM NUMERIC SIGN TWO THIRDS DISH;Nl;0;L;;;;2/3;N;;;;;
Adding an exact value extraction to unicodedata should be doable, either via a function or an extra keyword to the unicodedata.numeric function. The only information that would be lost (but which is unavailable now anyway) would be for the few codepoints which encode reducible fractions. As of Unicode 13.0, these codepoints are:
* ↉ U+2189 VULGAR FRACTION ZERO THIRDS
* 𐧷 U+109F7 MEROITIC CURSIVE FRACTION TWO TWELFTHS
* 𐧸 U+109F8 MEROITIC CURSIVE FRACTION THREE TWELFTHS
* 𐧹 U+109F9 MEROITIC CURSIVE FRACTION FOUR TWELFTHS
* 𐧻 U+109FB MEROITIC CURSIVE FRACTION SIX TWELFTHS
* 𐧽 U+109FD MEROITIC CURSIVE FRACTION EIGHT TWELFTHS
* 𐧾 U+109FE MEROITIC CURSIVE FRACTION NINE TWELFTHS
* 𐧿 U+109FF MEROITIC CURSIVE FRACTION TEN TWELFTHS
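Until unicodedata offers an exact value, one workaround is to recover the fraction from the float that unicodedata.numeric returns. The sketch below is not the proposed API: the name exact_numeric and the default denominator bound of 1000 are assumptions, and the approach only works while the denominators in the Unicode data stay below that bound.

```python
import unicodedata
from fractions import Fraction


def exact_numeric(ch, max_denominator=1000):
    """Approximate an exact unicodedata.numeric (hypothetical helper)."""
    # unicodedata.numeric returns a float; limit_denominator picks the
    # closest fraction whose denominator does not exceed the bound.
    value = unicodedata.numeric(ch)
    return Fraction(value).limit_denominator(max_denominator)


print(exact_numeric("\u2154"))  # VULGAR FRACTION TWO THIRDS
print(exact_numeric("\u2189"))  # VULGAR FRACTION ZERO THIRDS
```

As with the data file itself, reducible fractions come back in lowest terms, so ZERO THIRDS yields plain zero.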

History
Date	User	Action	Args
2022-04-11 14:59:42	admin	set	github: 87686
2021-04-27 14:19:43	vstinner	set	nosy: - vstinner
2021-04-24 13:09:34	frederic.grosshans	set	nosy: + frederic.grosshans messages: + msg391776
2021-03-23 18:19:03	weightwatchers-carlanderson	set	messages: + msg389399
2021-03-22 12:25:08	weightwatchers-carlanderson	set	messages: + msg389309
2021-03-20 22:02:11	gregory.p.smith	set	status: open -> closed resolution: rejected stage: resolved
2021-03-20 22:01:36	gregory.p.smith	set	nosy: + gregory.p.smith messages: + msg389186
2021-03-20 14:41:27	vstinner	set	messages: + msg389158
2021-03-20 09:10:27	mark.dickinson	set	messages: + msg389152
2021-03-20 08:59:59	serhiy.storchaka	set	nosy: + vstinner, serhiy.storchaka messages: + msg389151 components: + Unicode
2021-03-20 03:12:23	rhettinger	set	messages: + msg389141
2021-03-20 00:36:24	terry.reedy	set	nosy: + terry.reedy messages: + msg389132 title: Fraction only handles regular slashes ("/") and fails with other similar slashes -> Make Fraction(string) handle non-ascii slashes
2021-03-16 23:22:58	rhettinger	set	nosy: + rhettinger messages: + msg388892
2021-03-16 21:20:10	weightwatchers-carlanderson	set	messages: + msg388886
2021-03-16 21:11:22	ezio.melotti	set	nosy: + ezio.melotti
2021-03-16 21:08:59	weightwatchers-carlanderson	set	messages: + msg388884
2021-03-16 19:04:47	mark.dickinson	set	messages: + msg388869
2021-03-16 18:54:43	mark.dickinson	set	messages: + msg388867
2021-03-16 18:50:29	mark.dickinson	set	nosy: + mark.dickinson messages: + msg388866
2021-03-16 18:09:04	weightwatchers-carlanderson	create