Why is pure-Go crypto so slow?17 Nov 2016
- Because small functions have their runtime dominated by function-call overhead.
- Because data is stored on the heap unnecessarily.
- Because you can't take advantage of the featurefulness of assembly.
Function calls in Go almost always translate into function calls at the assembly level, which requires storing the contents of the CPU's registers and making space on the stack for the function's arguments, return, and internal variables. This process takes 2.5ns to 3ns. The runtime of small, highly-optimized functions is easily dominated by the function-call overhead. This in turn causes the runtime of larger functions to become dominated by the function-call overhead of their constituents. The only way to avoid this is to make sure that the small functions are inlined into the larger functions. You can check at compile-time with special flags:<pre> $ go build -gcflags -m main.go # command-line-arguments ./main.go:16: can inline add <b>./main.go:10: inlining call to add</b> ./main.go:12: res escapes to heap ./main.go:12: main ... argument does not escape </pre>
Unfortunately, Go's inliner is very weak and sometimes code has to be jiggled around to get it to kick in. It computes a metric of each function called its "hairiness" and will refuse to inline functions which are too hairy. You can also use the same compiler flags to check that certain variables are being stored on the stack instead of the heap -- often data will get put on the heap simply because it is used in a call-stack that is slightly too deep.
Also, like all high-level languages, Go doesn't expose a lot of the features of raw assembly. The biggest offender in this case is the full 64-bit multiplication operator, which allows you to multiply two 64-bit integers and get a 128-bit intermediate. Within the language, the best you can do it multiply two 32-bit integers and get a 64-bit intermediate, or multiply the same 64-bit integers together in several different ways such that you can recover the 128-bit intermediate. Both of these approaches require 3 to 4 multiplications (rather than 1), and carry the additional cost of combining the output of these operations.
Another problem is efficiently handling carries. CPUs have a 1-bit 'flag' which
is set to 1 when the output of an addition or multiplication overflowed, and
many instructions have variants that will incorporate the carry flag for free.
There's no feasible way to access this feature in higher-level languages, which
forces them to resort to slower methods of overflow detection like
z := x + y;
z < x || z < y and
z := x * y; z/x != y (missing edge-cases) [1, 2].
One of the algorithms I wanted to try for fast modular inversion  had this in pseudo-code:
Find the largest e such that 2^e evenly divides y.
The obvious, but slow way to implement this is with a loop, shifting right by
increasing amounts until
y>>e is odd. Initially, I tried to make it fast with
x & -x to find 2^e directly, but this made other things in the
algorithm harder. It turns out there's an instruction in x86 assembly that does
exactly this: 
BSFQ R12, CX
One instruction! One fraction of a nanosecond, rather than hundreds of nanoseconds in a loop which is already going to be run hundreds of times.
- Prime numbers: a computational perspective by Crandall, Richard, and Carl Pomerance. Section 9.4.2, Algorithm 9.4.5 (Inversion modulo a Mersenne prime).