Spelling out the numerator/denominator calculation in the __mod__ special method, and actually implementing __divmod__, speeds up both operations by 23x. This is due to avoiding repeated Fraction instantiation and normalisation, as well as less arithmetic operations.
$ ./python m timeit s 'from fractions import Fraction as F; a = F(7, 3); b = F(3, 2)' 'a%b'
50000 loops, best of 5: 9.53 usec per loop
$ ./python m timeit s 'from fractions import Fraction as F; a = F(7, 3); b = F(3, 2)' 'a%3'
50000 loops, best of 5: 6.61 usec per loop
$ ./python m timeit s 'from fractions import Fraction as F; a = F(7, 3); b = F(3, 2)' 'divmod(a, b)'
20000 loops, best of 5: 14.1 usec per loop
$ ./python m timeit s 'from fractions import Fraction as F; a = F(7, 3); b = F(3, 2)' 'divmod(a, 3)'
20000 loops, best of 5: 10.2 usec per loop
$ ./python m timeit s 'from fractions import Fraction as F; a = F(7, 3); b = F(3, 2)' 'a%b'
100000 loops, best of 5: 2.96 usec per loop
$ ./python m timeit s 'from fractions import Fraction as F; a = F(7, 3); b = F(3, 2)' 'a%3'
100000 loops, best of 5: 2.78 usec per loop
$ ./python m timeit s 'from fractions import Fraction as F; a = F(7, 3); b = F(3, 2)' 'divmod(a, b)'
100000 loops, best of 5: 3.93 usec per loop
$ ./python m timeit s 'from fractions import Fraction as F; a = F(7, 3); b = F(3, 2)' 'divmod(a, 3)'
50000 loops, best of 5: 3.82 usec per loop

Thanks for your review and ideas, Serhiy. I added a couple of test cases, but failed to find any case where the new implementation is not much faster.
I also tried "divmod(n_div, d_div)" for implementing __divmod__(), and the results are mixed, e.g.
Arithmetic operators:
$ ./python m timeit s 'from fractions import Fraction as F; a = F(7, 3); b = F(3, 2)' 'divmod(a,b)'
100000 loops, best of 5: 3.11 usec per loop
$ ./python m timeit s 'from fractions import Fraction as F; a, b = F(10**15+1, 10**27+1), F(10**91, 10**71)' 'divmod(a, b)'
100000 loops, best of 5: 3.48 usec per loop
$ ./python m timeit s 'from fractions import Fraction as F; a, b = F(10**350+1, 10**207+1), F(10**891, 10**621)' 'divmod(a, b)'
20000 loops, best of 5: 17.7 usec per loop
$ ./python m timeit s 'from fractions import Fraction as F; a, b = F(10**891, 10**6111), F(10**350+1, 10**207+1)' 'divmod(a, b)'
20000 loops, best of 5: 18.2 usec per loop
$ ./python m timeit s 'from fractions import Fraction as F; a, b = F(10**350+1, 10**207+1), F(10**891, 10**6121)' 'divmod(a, b)'
5000 loops, best of 5: 34.4 usec per loop
divmod():
$ ./python m timeit s 'from fractions import Fraction as F; a = F(7, 3); b = F(3, 2)' 'divmod(a,b)'
100000 loops, best of 5: 3.04 usec per loop
$ ./python m timeit s 'from fractions import Fraction as F; a, b = F(10**15+1, 10**27+1), F(10**91, 10**71)' 'divmod(a, b)'
100000 loops, best of 5: 3.56 usec per loop
$ ./python m timeit s 'from fractions import Fraction as F; a, b = F(10**350+1, 10**207+1), F(10**891, 10**621)' 'divmod(a, b)'
20000 loops, best of 5: 17.3 usec per loop
$ ./python m timeit s 'from fractions import Fraction as F; a, b = F(10**891, 10**6111), F(10**350+1, 10**207+1)' 'divmod(a, b)'
20000 loops, best of 5: 18.2 usec per loop
$ ./python m timeit s 'from fractions import Fraction as F; a, b = F(10**350+1, 10**207+1), F(10**891, 10**6121)' 'divmod(a, b)'
10000 loops, best of 5: 31.7 usec per loop
Current master, for comparison:
$ ./python m timeit s 'from fractions import Fraction as F; a = F(7, 3); b = F(3, 2)' 'divmod(a,b)'
20000 loops, best of 5: 14.1 usec per loop
$ ./python m timeit s 'from fractions import Fraction as F; a, b = F(10**15+1, 10**27+1), F(10**91, 10**71)' 'divmod(a, b)'
20000 loops, best of 5: 16 usec per loop
$ ./python m timeit s 'from fractions import Fraction as F; a, b = F(10**350+1, 10**207+1), F(10**891, 10**621)' 'divmod(a, b)'
5000 loops, best of 5: 61.2 usec per loop
$ ./python m timeit s 'from fractions import Fraction as F; a, b = F(10**891, 10**6111), F(10**350+1, 10**207+1)' 'divmod(a, b)'
5000 loops, best of 5: 65.3 usec per loop
$ ./python m timeit s 'from fractions import Fraction as F; a, b = F(10**350+1, 10**207+1), F(10**891, 10**6121)' 'divmod(a, b)'
2000 loops, best of 5: 120 usec per loop
Definitely not an obvious decision, although there is a tendency towards faster execution for very large numbers. Whether it's faster or slower would probably depend on the data and the application at hand.
I could live with either choice, but would use divmod() for now since it simplifies the implementation.
