Title: Enhanced \N{} escapes for Unicode strings
Components: Unicode
Versions: Python 3.4
Nosy List: ezio.melotti, mrabarnett, steven.daprano, terry.reedy
Created on 2013-08-01 13:54 by steven.daprano, last changed 2013-08-01 22:04 by terry.reedy.

issue18614.patch mrabarnett, 2013-08-01 16:46
Author: Steven D'Aprano (steven.daprano) Date: 2013-08-01 13:54
As per the discussion here:

\N{} escapes should support the Unicode code point notation U+xxxx (where there are four, five or six hex digits after the U+).

E.g. '\N{U+03BB}' => 'λ'

unicodedata.lookup should also support such numeric names, e.g.:

unicodedata.lookup('U+03BB') => 'λ'

As '+' is otherwise prohibited in Unicode character names, there should never be ambiguity between 'U+xxxx' as a code point and an actual name, and a single lookup function can handle both.

(See for details on characters allowed in names.)
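
A minimal sketch of how a single lookup function could handle both forms. The wrapper name lookup_extended and the choice to accept lowercase hex digits are assumptions made for illustration; they are not taken from the attached patch.

```python
import re
import unicodedata

# 'U+' followed by four, five or six hex digits, and nothing else.
_U_NOTATION = re.compile(r'U\+([0-9A-Fa-f]{4,6})\Z')


def lookup_extended(name):
    """Hypothetical wrapper: resolve 'U+xxxx' or a regular character name."""
    match = _U_NOTATION.match(name)
    if match:
        # chr() raises ValueError for values above 0x10FFFF.
        return chr(int(match.group(1), 16))
    # No real character name contains '+', so this fallback is unambiguous.
    return unicodedata.lookup(name)
```

Because '+' never appears in a character name, anything that fails the U+ pattern can be passed straight to the existing name lookup, which raises KeyError for unknown names.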

Also add a function for the reverse mapping, e.g.:

unicodedata.codepoint('λ') => 'U+03BB'

def codepoint(c):
    return 'U+{:04X}'.format(ord(c))
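
A short usage sketch of the helper above, showing that the {:04X} format pads to at least four digits and grows to five or six for code points beyond the Basic Multilingual Plane. The sample characters are chosen only for illustration.

```python
def codepoint(c):
    # Zero-pad to a minimum of four uppercase hex digits.
    return 'U+{:04X}'.format(ord(c))


samples = ['A', 'λ', '\U0001F600', '\U0010FFFF']
labels = [codepoint(c) for c in samples]
```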
Author: Matthew Barnett (mrabarnett) Date: 2013-08-01 16:46
I've attached a patch for this.
Author: Terry J. Reedy (terry.reedy) Date: 2013-08-01 22:04
I agree with the proposal.

Some of the code seems redundant with code we already have.
In Python, I would write

def codepoint_from_U_notation(name, namelen):
    # name holds only the hex digits after 'U+'; namelen is their count
    if not (4 <= namelen <= 6):
        raise ValueError('expected 4 to 6 hex digits')
    return chr(int(name, 16))

maybe with try-except to re-write error messages like
ValueError: invalid literal for int() with base 16: '99x3'
ValueError: chr() arg not in range(0x110000)

My point is that we already have code to convert hex strings to int; I presume PyUnicode_FromOrdinal(code) is the C version of 'chr' that already checks the max value.
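
One possible shape for the try-except variant, as a sketch only. The lowercase function name, the choice of KeyError (to match what unicodedata.lookup raises for unknown names), and the message text are assumptions. Note that int() with base 16 also accepts a sign, surrounding whitespace, underscores and a '0x' prefix, so the sketch checks the digits explicitly first.

```python
import string


def codepoint_from_u_notation(digits):
    """Hypothetical helper: digits is the text after 'U+'."""
    if not 4 <= len(digits) <= 6:
        raise KeyError('U+ notation needs 4 to 6 hex digits: ' + repr(digits))
    # int(x, 16) alone would accept '0x3B', '+3BB' or '1_00'; reject those.
    if not all(ch in string.hexdigits for ch in digits):
        raise KeyError('invalid hex digits in U+ notation: ' + repr(digits))
    try:
        return chr(int(digits, 16))
    except ValueError:
        # chr() rejects values outside range(0x110000).
        raise KeyError('code point out of range: U+' + digits) from None
```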
