In my eagerness to eliminate a branch which is taken once per 2^38 bytes of keystream, I forgot that the state words are in host order. Thus, the counter increment code worked fine on little-endian machines, but not on big-endian ones. Switch to a simpler (branchful) solution.
Single-DES is now a special case of triple-DES with all three keys being the same. This is significantly slower than a pure single-DES implementation, but that's fine since nobody should be using it anyway.
We now have separate encryption and decryption methods, and can process an arbitrary amount of plaintext or ciphertext per call, rounded down to the block size (if applicable). For stream ciphers, we also have a keystream method which fills the provided buffer with an arbitrary amount of keystream (once again, rounded down if applicable).
The rk pointer in struct aes_ctx always pointed to the context's buffer and served no purpose whatsoever, but the compiler had no way of knowing that and could therefore not optimize away assignments to and from it.
Note that the removal of rk breaks the ABI, since it changes the size of struct aes_ctx, but we allow ourselves that because neither the API nor the ABI have been fixed yet.