> unsigned long long m(unsigned long long a, unsigned long b)
> {
>        return a*b;
> }

I think that's doing a 64 x 32 -> 64 multiplication (a is already
unsigned long long); what's being used is a 32 x 32 -> 64 multiply,
more like this:

unsigned long long m(unsigned long a, unsigned long b)
{
    return (unsigned long long)a*b;
}

which gcc -O3 compiles to:

	pushl	%ebp
	movl	%esp, %ebp
	movl	12(%ebp), %eax
	mull	8(%ebp)

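For anyone who wants to check the effect of the cast from C rather than
from the assembly, here is a minimal sketch.  It uses the fixed-width
types from <stdint.h> so that it behaves the same on targets where
unsigned long is not 32 bits; the names mul_wide and mul_narrow are made
up for illustration and do not come from the code under discussion.

```c
#include <stdint.h>

/* 32 x 32 -> 64: widening one operand before the multiply makes the
 * compiler keep the full 64-bit product. */
uint64_t mul_wide(uint32_t a, uint32_t b)
{
    return (uint64_t)a * b;
}

/* Without the cast the multiply is done in 32 bits and the high half
 * of the product is discarded before the result is widened. */
uint64_t mul_narrow(uint32_t a, uint32_t b)
{
    return a * b;
}
```

With a = b = 0xFFFFFFFF, mul_wide returns 0xFFFFFFFE00000001 while
mul_narrow returns 1, since only the low 32 bits of the product survive.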