Author serhiy.storchaka
Recipients azsorkin, methane, ned.deily, pitrou, python-dev, serhiy.storchaka, vstinner, xiang.zhang
Date 2017-03-29.07:11:23
SpamBayes Score -1.0
Marked as misclassified Yes
Message-id <1490771483.63.0.799608509345.issue24821@psf.upfronthosting.co.za>
In-reply-to
Content
Ukrainian "Є" is not the only character suffered from this issue. I suppose much CJK characters are suffered too. The problem is occurred when the lower byte of searched character matches the upper byte of most  characters in the text. For example, searching "乎" (U+4e4e) in the sequence of characters U+4eXX (but not containing U+4e4e and U+4e4f):

$ ./python -m perf timeit -s 's = "一丁丂七丄丅丆万丈三上下丌不与丏丐丑丒专且丕世丗丘丙业丛东丝丞丟丠両丢丣两严並丧丨丩个丫丬中丮丯丰丱串丳临丵丶丷丸丹为主丼丽举丿乀乁乂乃乄久乆乇么义乊之乌乍乐乑乒乓乔乕乖乗乘乙乚乛乜九乞也习乡乢乣乤乥书乧乨乩乪乫乬乭乮乯买乱乲乳乴乵乶乷乸乹乺乻乼乽乾乿亀亁亂亃亄亅了亇予争 亊事二亍于亏亐云互亓五井亖亗亘亙亚些亜亝亞亟亠亡亢亣交亥亦产亨亩亪享京亭亮亯亰亱亲亳亴亵亶亷亸亹人亻亼亽亾亿什仁仂仃仄仅仆仇仈仉今介仌仍从仏仐仑仒仓仔仕他仗付仙仚仛仜 仝仞仟仠仡仢代令以仦仧仨仩仪仫们仭仮仯仰仱仲仳仴仵件价仸仹仺任仼份仾仿"*100' -- 's.find("乎")'

Unpatched:  Median +- std dev: 761 us +- 108 us
Patched:    Median +- std dev: 117 us +- 9 us

For comparison, searching "乏" (U+4e4f) in the same sequence:

$ ./python -m perf timeit -s 's = "一丁丂七丄丅丆万丈三上下丌不与丏丐丑丒专且丕世丗丘丙业丛东丝丞丟丠両丢丣两严並丧丨丩个丫丬中丮丯丰丱串丳临丵丶丷丸丹为主丼丽举丿乀乁乂乃乄久乆乇么义乊之乌乍乐乑乒乓乔乕乖乗乘乙乚乛乜九乞也习乡乢乣乤乥书乧乨乩乪乫乬乭乮乯买乱乲乳乴乵乶乷乸乹乺乻乼乽乾乿亀亁亂亃亄亅了亇予争 亊事二亍于亏亐云互亓五井亖亗亘亙亚些亜亝亞亟亠亡亢亣交亥亦产亨亩亪享京亭亮亯亰亱亲亳亴亵亶亷亸亹人亻亼亽亾亿什仁仂仃仄仅仆仇仈仉今介仌仍从仏仐仑仒仓仔仕他仗付仙仚仛仜 仝仞仟仠仡仢代令以仦仧仨仩仪仫们仭仮仯仰仱仲仳仴仵件价仸仹仺任仼份仾仿"*100' -- 's.find("乏")'

Unpatched:  Median +- std dev: 12.8 us +- 1.4 us
Patched:    Median +- std dev: 12.6 us +- 1.7 us

Sorry, I don't know Chinese or Japanese and can't provide more realistic examples.
History
Date User Action Args
2017-03-29 07:11:23serhiy.storchakasetrecipients: + serhiy.storchaka, pitrou, vstinner, ned.deily, methane, python-dev, azsorkin, xiang.zhang
2017-03-29 07:11:23serhiy.storchakasetmessageid: <1490771483.63.0.799608509345.issue24821@psf.upfronthosting.co.za>
2017-03-29 07:11:23serhiy.storchakalinkissue24821 messages
2017-03-29 07:11:23serhiy.storchakacreate