
Author jeethu
Recipients jeethu, pitrou, rhettinger, serhiy.storchaka, vstinner
Date 2018-01-17.13:36:07
SpamBayes Score -1.0
Marked as misclassified Yes
Message-id <1516196167.31.0.467229070634.issue32534@psf.upfronthosting.co.za>
In-reply-to
Content
> FWIW, we've encountered a number of situations in the past when something that improved the timings on one compiler would make timings worse on another compiler.  There was also variance between timings on 32-bit builds versus 64-bit builds.

I've verified that both clang and gcc generate similar assembly in 32-bit mode (memcpy is not inlined and the loop is not vectorized). I'd wager that the improvement from vectorization (in memmove) would be even more pronounced on 32-bit systems, given that pointers are half the size and cache lines are still 64 bytes wide.

> It's 1.08x faster (-7.8%). It's small for a microbenchmark, usually an optimization should make a *microbenchmark* at least 10% faster.

That's true if we assume lists have 100 or fewer elements. On the other hand, in the pyperformance comparison I posted yesterday[1], there seems to be an average improvement of 1.27x on the first seven benchmarks, and the worst slowdown is only 1.03x. That said, the vectorized loop in memmove can only improve things by a constant factor.

> Using memmove() for large copy is a good idea. The main question is the "if (n <= INS1_MEMMOVE_THRESHOLD)" test. Is it slower if we always call memmove()?

The overhead of calling memmove() makes it slower for small lists; that's how I arrived at this patch in the first place. I first tried replacing the loop with an unconditional memmove(), which was slower on pyperformance, and only switching to memmove() above a threshold turned out to be faster.
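
For reference, the shape of the patch is roughly the following. This is a simplified sketch, not the literal listobject.c diff: the INS1_MEMMOVE_THRESHOLD name comes from the quoted test, but the value 16 and the helper name are just illustrative.

    #include "Python.h"     /* PyObject, Py_ssize_t */
    #include <string.h>     /* memmove */

    /* Hypothetical cutoff; the real value was picked empirically. */
    #define INS1_MEMMOVE_THRESHOLD 16

    /* Shift items[where..n-1] up by one slot to make room for an insert,
     * roughly what the copy loop in listobject.c's ins1() does. */
    static void
    shift_up_for_insert(PyObject **items, Py_ssize_t n, Py_ssize_t where)
    {
        Py_ssize_t k = n - where;   /* number of pointers to move */

        if (k <= INS1_MEMMOVE_THRESHOLD) {
            /* Small moves: a plain loop avoids the memmove() call overhead. */
            Py_ssize_t i;
            for (i = n; --i >= where; )
                items[i + 1] = items[i];
        }
        else {
            /* Large moves: memmove()'s vectorized copy wins. */
            memmove(&items[where + 1], &items[where], k * sizeof(PyObject *));
        }
    }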

> Previously, Python had a Py_MEMCPY() macro which also had such threshold. Basically, it's a workaround for compiler performance issues:

That's very interesting! I think it boils down to the pointer aliasing problem: the pointers in memcpy()'s signature have the `restrict` qualifier, which gives the compiler more leeway to optimize calls to memcpy(), whereas it has to be more conservative with memmove(). I wonder if it's worth trying out a Py_MEMMOVE() :)
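
To make the Py_MEMMOVE() idea concrete, something along the lines of the old Py_MEMCPY() macro could work. This is only a sketch; the cutoff value is made up and untested:

    #include <string.h>

    #define Py_MEMMOVE_THRESHOLD 16   /* hypothetical cutoff */

    /* Below the threshold, copy byte-by-byte to dodge the call overhead;
     * above it, defer to the library memmove() and its vectorized loop.
     * Unlike Py_MEMCPY(), the small-copy path must handle overlap. */
    #define Py_MEMMOVE(target, source, length) do {                   \
            size_t _n = (length);                                     \
            char *_t = (char *)(target);                              \
            const char *_s = (const char *)(source);                  \
            if (_n > Py_MEMMOVE_THRESHOLD)                            \
                memmove(_t, _s, _n);                                  \
            else if (_t <= _s || _t >= _s + _n) {                     \
                while (_n--)            /* no overlap: copy forward */\
                    *_t++ = *_s++;                                    \
            }                                                         \
            else {                                                    \
                _t += _n;               /* overlap: copy backwards */ \
                _s += _n;                                             \
                while (_n--)                                          \
                    *--_t = *--_s;                                    \
            }                                                         \
        } while (0)

That way small copies stay inline where the compiler can see them, while anything larger keeps memmove()'s semantics and its fast library implementation.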


[1]: https://gist.github.com/jeethu/d6e4045f7932136548a806380eddd030