classification
Title: Improve utf7 encoder memory usage
Type: enhancement Stage: patch review
Components: Interpreter Core Versions: Python 3.7
process
Status: closed Resolution: rejected
Dependencies: Superseder:
Assigned To: vstinner Nosy List: ezio.melotti, serhiy.storchaka, vstinner, xiang.zhang
Priority: normal Keywords: patch

Created on 2016-10-25 16:16 by xiang.zhang, last changed 2016-11-18 16:45 by vstinner. This issue is now closed.

Files
File name Uploaded Description Edit
utf7_encoder.patch xiang.zhang, 2016-10-25 16:33 review
utf7_encoder_v2.patch xiang.zhang, 2016-10-27 17:16 review
Messages (8)
msg279419 - (view) Author: Xiang Zhang (xiang.zhang) * (Python committer) Date: 2016-10-25 16:16
Currently utf7 encoder uses an aggressive memory allocation strategy: use the worst case 8. We can tighten the worst case.

For 1 byte and 2 byte unicodes, the worst case could be 3*n + 2. For 4 byte unicodes, the worst case could be 6*n + 2.

There are 2 cases. First, all characters needs to be encoded, the result length should be upper_round(2.67*n) + 2 <= 3*n + 2. Second, encode and not encode characters appear one by one. For even length, it's 3n < 3n + 2. For odd length, it's exactly 3n + 2.

This won't benefit much when the string is short. But when the string is long, it speeds up.

Without patch:

[bin]$ ./python3 -m perf timeit -s 's = "abc"*10' 's.encode("utf7")'
....................
Median +- std dev: 2.79 us +- 0.09 us
[bin]$ ./python3 -m perf timeit -s 's = "abc"*100' 's.encode("utf7")'
....................
Median +- std dev: 4.55 us +- 0.13 us
[bin]$ ./python3 -m perf timeit -s 's = "abc"*1000' 's.encode("utf7")'
....................
Median +- std dev: 14.0 us +- 0.4 us
[bin]$ ./python3 -m perf timeit -s 's = "abc"*10000' 's.encode("utf7")'
....................
Median +- std dev: 178 us +- 1 us

With patch:

[bin]$ ./python3 -m perf timeit -s 's = "abc"*10' 's.encode("utf7")'
....................
Median +- std dev: 2.87 us +- 0.09 us
[bin]$ ./python3 -m perf timeit -s 's = "abc"*100' 's.encode("utf7")'
....................
Median +- std dev: 4.50 us +- 0.23 us
[bin]$ ./python3 -m perf timeit -s 's = "abc"*1000' 's.encode("utf7")'
....................
Median +- std dev: 13.3 us +- 0.4 us
[bin]$ ./python3 -m perf timeit -s 's = "abc"*10000' 's.encode("utf7")'
....................
Median +- std dev: 102 us +- 1 us

The patch also removes a check, base64bits can only be not 0 when inShift is not 0.
msg279550 - (view) Author: Xiang Zhang (xiang.zhang) * (Python committer) Date: 2016-10-27 17:16
v2 uses _PyBytesWriter so we can use on stack buffer for short string.
msg279556 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016-10-27 18:03
The performance of the UTF-7 codec is not important. Unlikely to other UTF-* encodings this is not standard Unicode encoding. It is used in minority of applications and unlikely is a bottleneck. It is rather in the line of idna and punycode than UTF-8 and UTF-32. Actually I'm going to propose replacing it with Python implementation.

This encoder was omitted form _PyBytesWriter-using optimizations for purpose. The patch complicates the implementation. Since the codec is rarely used some bugs lived long time in it. Any change risks to add new long living bug.
msg279574 - (view) Author: Xiang Zhang (xiang.zhang) * (Python committer) Date: 2016-10-28 03:33
Actually the patch is not going to speed up the encoder but just make the memory allocation strategy better, make the memory upper bound tighter. The speedup is just a good side effect. 

> It is rather in the line of idna and punycode than UTF-8 and UTF-32.

Agree.

> This encoder was omitted form _PyBytesWriter-using optimizations for purpose. The patch complicates the implementation.

Not much. I think the implementation still remains simple.
msg281134 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-11-18 15:56
Serhiy Storchaka: "The performance of the UTF-7 codec is not important."

Right.


"Actually I'm going to propose replacing it with Python implementation."

Oh. Sadly, PyUnicode_DecodeUTF7() is part of the stable ABI. Do you want to call the Python codec from the C function for backward compatibility?

I dislike UTF-7 because it's complex, but it's not as optimized as the UTF-8 codec, so the code remains not too big and so not too expensive to matain.


"This encoder was omitted form _PyBytesWriter-using optimizations for purpose."

Ah? I don't recall that. When I wrote _PyBytesWriter, I skipped UTF-7 because I don't know well this codec and I preferred to keep the code unchanged to avoid bugs :-)


"The patch complicates the implementation."

Hum, I have to disagree. For me, the patched new is no more complex than the current code. The main change is that it adds code checking the kind to better estimate the output length. It's not hard to understand the link between the Unicode kind of the max_char_size.


I vote +1 on this patch because I consider that it makes the code simpler, not because it makes the codec faster (I don't really care of UTF-7 codec performance).

But again (as in issue #28398), it's up to you Serhiy: I'm also ok to leave the code unchanged if you are against the patch.
msg281143 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016-11-18 16:40
I fixed many long living bugs in the UTF-7 codec in the past, and I remember that we fixed bugs introduced by using _PyUnicodeWriter or _PyBytesWriter many months after changing the code. Since the UTF-7 codec is rarely used, there is a risk of introducing new long living bug. You should peruse not just the code near the changed lines, but all the codec.

I'm not strongly against this patch, if you Victor takes the responsibility for it, I left it on you.
msg281144 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-11-18 16:44
Oh no, now I'm afraid of breaking something :-D I don't trust anymore our test suite for the UTF-7 codec, so I close the issue :-)

Sorry Xiang, but as we said, this rarely used codec is not important enough to require optimization.
msg281145 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-11-18 16:45
> I remember that we fixed bugs introduced by using _PyUnicodeWriter or _PyBytesWriter many months after changing the code.

Yeah, now I recall it (vaguely), that's why I closed the bug :-)
History
Date User Action Args
2016-11-18 16:45:40vstinnersetmessages: + msg281145
2016-11-18 16:44:27vstinnersetstatus: open -> closed
resolution: rejected
messages: + msg281144
2016-11-18 16:40:36serhiy.storchakasetassignee: vstinner
messages: + msg281143
2016-11-18 15:56:24vstinnersetmessages: + msg281134
2016-10-28 03:33:28xiang.zhangsetmessages: + msg279574
2016-10-27 18:03:09serhiy.storchakasetmessages: + msg279556
2016-10-27 17:16:14xiang.zhangsetfiles: + utf7_encoder_v2.patch

messages: + msg279550
2016-10-25 16:33:37xiang.zhangsetfiles: + utf7_encoder.patch
2016-10-25 16:33:28xiang.zhangsetfiles: - utf7_encoder.patch
2016-10-25 16:30:00ezio.melottisetnosy: + ezio.melotti
2016-10-25 16:16:46xiang.zhangcreate