
classification
Title: Unicode word boundaries
Type: behavior Stage: resolved
Components: Regular Expressions Versions:
process
Status: closed Resolution: not a bug
Dependencies: Superseder:
Assigned To: Nosy List: SilentGhost, ezio.melotti, mrabarnett, revo
Priority: normal Keywords:

Created on 2016-08-27 14:36 by revo, last changed 2022-04-11 14:58 by admin. This issue is now closed.

Messages (2)
msg273782 - Author: mohammad (revo) Date: 2016-08-27 14:36
According to [UAX #29](http://unicode.org/reports/tr29) on Unicode word boundaries (rule WB5a), "apostrophe" covers both U+0027 ( ' ) APOSTROPHE and U+2019 ( ’ ) RIGHT SINGLE QUOTATION MARK (curly apostrophe).

However, the regex module only checks for U+0027; the second kind (U+2019) is not handled:

/* Break between apostrophe and vowels (French, Italian). */
/* WB5a */
if (pos_m1 >= 0 && char_at(state->text, pos_m1) == '\'' &&
    is_unicode_vowel(char_at(state->text, text_pos)))
        return TRUE;


[Source code](https://bitbucket.org/mrabarnett/mrab-regex/src/f21447bf288780d8dd9b1633820480484ce8f677/regex_3/regex/_regex.c?at=default&fileviewer=file-view-default#_regex.c-1657)
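For illustration, here is a rough Python sketch of what the check might look like if it accepted both apostrophe characters. This is not the regex module's code: `is_wb5a_break`, `is_vowel`, and `APOSTROPHES` are hypothetical names, and the vowel test is a simplified stand-in for `is_unicode_vowel`, whose exact behavior is assumed rather than reproduced here.

```python
import unicodedata

# Both characters that the report says count as an apostrophe for WB5a.
APOSTROPHES = {"\u0027", "\u2019"}


def is_vowel(ch):
    """Simplified vowel test (an assumption, not the module's is_unicode_vowel).

    Decomposes the character so that accented vowels such as "é" are
    reduced to their base letter before the comparison.
    """
    base = unicodedata.normalize("NFD", ch)[0]
    return base.lower() in "aeiou"


def is_wb5a_break(text, pos):
    """Return True if there is a break before text[pos] under the rule:
    apostrophe (either kind) followed by a vowel."""
    return (
        0 < pos < len(text)
        and text[pos - 1] in APOSTROPHES
        and is_vowel(text[pos])
    )
```

With this sketch, `is_wb5a_break("l'objectif", 2)` and `is_wb5a_break("l\u2019objectif", 2)` both report a break, while `is_wb5a_break("don\u2019t", 4)` does not because "t" is not a vowel.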
msg273783 - Author: SilentGhost (SilentGhost) * (Python triager) Date: 2016-08-27 14:56
The regex module is not in the standard library. On the latest 3.6 branch, the re module breaks on the curly apostrophe just fine. Perhaps try reporting this issue on the Bitbucket tracker?
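A minimal way to check the standard library behavior, assuming Python 3 where `str` patterns are Unicode-aware by default. The helper name `words` is only for this example. Neither apostrophe character is a word character for `\w`, so both split the text:

```python
import re


def words(text):
    # \w+ matches runs of Unicode word characters; U+0027 and U+2019
    # are punctuation, so each one ends the current run.
    return re.findall(r"\w+", text)


straight = words("l'objectif")      # U+0027 APOSTROPHE
curly = words("l\u2019objectif")    # U+2019 RIGHT SINGLE QUOTATION MARK
print(straight, curly)
```

Note that re splits at every apostrophe regardless of what follows, so this is a coarser behavior than the vowel-sensitive WB5a tailoring described in the report.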
History
Date User Action Args
2022-04-11 14:58:35 admin set github: 72065
2016-08-27 14:56:48 SilentGhost set status: open -> closed
versions: - Python 2.7, Python 3.2, Python 3.3, Python 3.4, Python 3.5, Python 3.6
nosy: + SilentGhost
messages: + msg273783
resolution: not a bug
stage: resolved
2016-08-27 14:36:09 revo create