Fwiw, I made a trivial benchmark in C that loads aligned and misaligned shorts.  It shows that the memcpy() version takes only 65% of the time taken by the two-bytes-loaded version on a 2010 laptop.  It takes 75% of the time on a modern server.  On a recent little-endian PowerPC machine, it takes 96%.  On aarch64, it takes only 45% of the time (i.e. more than twice as fast).  This is all with gcc.  It seems that using memcpy() is definitely a win nowadays.
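
For reference, here is a minimal sketch of the two kinds of load being compared.  This is not the benchmark itself; the function names are made up for illustration, and I'm assuming the two-bytes version assembles the value in little-endian order, whereas the memcpy() version yields the host's byte order.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical "two-bytes-loaded" version: read each byte separately and
   combine them, here assuming little-endian order.  Works at any alignment. */
static uint16_t load_u16_bytes(const unsigned char *p)
{
    return (uint16_t)(p[0] | (p[1] << 8));
}

/* Hypothetical memcpy() version: copy two bytes into a local variable.
   Also safe at any alignment; the result is in the host's byte order.
   Compilers such as gcc can typically turn this into a single load. */
static uint16_t load_u16_memcpy(const unsigned char *p)
{
    uint16_t v;
    memcpy(&v, p, sizeof v);
    return v;
}
```

On a little-endian host both functions return the same value for the same pointer, so they can be swapped freely in a timing loop over aligned and misaligned addresses.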
