On systems that have strlcat() and strlcpy() in libc, run the tests twice (once with our implementation and once with the system's) to verify that our tests are correct.
Travis forces _FORTIFY_SOURCE, which enables warn_unused_result annotations in glibc. Some of those annotations are of dubious value; in the case of asprintf(3) and vasprintf(3), they flag code that doesn't check the return value as unsafe even if it checks the pointer instead (which is guaranteed to be NULL in case of failure, and arguably more useful than the return value). Unfortunately, gcc intentionally ignores (void) casts, so we have no choice but to quench the warning with -Wno-unused-result. However, some of the compilers we wish to support don't recognize it, so we move it from the developer flags to the Travis environment.
While there, switch Travis from Precise to Trusty.
It was wrong to remove $(AM_CPPFLAGS) in d43a6bf2, because automake only applies AM_CPPFLAGS to code for which there is no explicit *_CPPFLAGS, so any target that sets its own *_CPPFLAGS must include it explicitly. It is not entirely clear why this did not trip us (or Travis) up until now, although it is possible that it only breaks when $(builddir) != $(srcdir).
On the other hand, there is no reason to use $(INCLUDES).
The count we passed to memcmp() in mpi_eq() and mpi_eq_abs() was actually the number of significant words in the MPI, rather than the number of bytes we wanted to compare. Multiply by 4 to get the correct value.
To make the intent of the code more apparent, introduce a private MPI_MSW() macro which evaluates to the number of significant words (or 1-based index of the most significant word). This also comes in handy in mpi_{add,sub,mul}_abs().
Add a couple of test cases which not only demonstrate the bug we fixed here but also demonstrate why we must compare whole words: on a big-endian machine, we would be comparing the unused upper bytes of the first and only word instead of the lower bytes which actually hold a value...
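A minimal sketch of the corrected comparison, for illustration only. The mpi_sketch type, its fixed-size layout and the mpi_sketch_eq_abs() name are hypothetical stand-ins, not the library's actual definitions; the sketch assumes 32-bit words stored least significant first and an msb field holding the 1-based index of the most significant bit. It scales by the word size with sizeof rather than a literal 4, which is equivalent for 32-bit words.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified MPI: a fixed array of 32-bit words. */
#define MPI_SKETCH_WORDS 8
typedef struct {
    uint32_t words[MPI_SKETCH_WORDS]; /* words[0] is the least significant */
    unsigned int msb;                 /* 1-based index of the top bit, 0 if zero */
    int neg;                          /* sign flag, ignored by the _abs variant */
} mpi_sketch;

/* Number of significant words (1-based index of the most significant word). */
#define MPI_MSW(n) (((n)->msb + 31) / 32)

/* Returns non-zero if |a| == |b|. */
int
mpi_sketch_eq_abs(const mpi_sketch *a, const mpi_sketch *b)
{
    assert(a != NULL && b != NULL);
    if (a->msb != b->msb)
        return 0;
    /* memcmp() takes a byte count, so scale the word count by the word size. */
    return memcmp(a->words, b->words, MPI_MSW(a) * sizeof a->words[0]) == 0;
}
```

With the unscaled count, 0x0300 and 0x0200 would compare equal, since only one byte of the single significant word would be examined and that byte is zero in both values on either byte order.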
In my eagerness to eliminate a branch which is taken once per 2^38 bytes of keystream, I forgot that the state words are in host order. Thus, the counter increment code worked fine on little-endian machines, but not on big-endian ones. Switch to a simpler (branchful) solution.
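A branchful increment of this kind might look like the sketch below. The two-word counter layout and the function names are assumptions made for illustration; the 2^38 figure is consistent with a 32-bit low counter word and 64-byte keystream blocks, which is also assumed here.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical layout: a 64-bit block counter split across two 32-bit
 * state words, both held in host byte order (ctr[0] is the low word).
 */
void
counter_increment(uint32_t ctr[2])
{
    /* Plain arithmetic on host-order words behaves the same on any byte order. */
    if (++ctr[0] == 0)
        ++ctr[1];       /* carry into the high word, once per 2^32 blocks */
}

/* Usage example: the carry propagates when the low word wraps. */
void
counter_increment_example(void)
{
    uint32_t ctr[2] = { 0xffffffffU, 0 };

    counter_increment(ctr);
    assert(ctr[0] == 0 && ctr[1] == 1);
}
```

Because the words are only touched through integer arithmetic, never through byte-wise access, the result does not depend on how the host lays them out in memory.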
The current version of rolN / rorN invokes undefined behavior when the count is negative, zero, or equal to or greater than the width of the operand. The new version masks the count to avoid these situations. Although branchless, it is relatively inefficient if the compiler does not recognize it and translate it to a rol or ror instruction. Empirical tests show that both clang and gcc get it right for constant counts, and recent versions of clang (but not gcc) get it right for variable counts as well. Note that our current code base has no instances of rolN / rorN with a variable count.
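For reference, a commonly used masked-rotate idiom is sketched below for the 32-bit case. This is not necessarily the exact code adopted here; the rol32 / ror32 names and the signed count parameter are assumptions for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Rotate left; the count is reduced modulo 32, so any int value is safe. */
static inline uint32_t
rol32(uint32_t x, int n)
{
    unsigned int k = (unsigned int)n & 31;  /* well-defined for negative n */

    /* When k == 0 both shifts are by 0, so no shift ever reaches the width. */
    return (uint32_t)(x << k) | (x >> (-k & 31));
}

/* Rotate right, using the same masking. */
static inline uint32_t
ror32(uint32_t x, int n)
{
    unsigned int k = (unsigned int)n & 31;

    return (x >> k) | (uint32_t)(x << (-k & 31));
}

/* Usage example: counts of 0, 32 and -1 are all well-defined. */
void
rotate_example(void)
{
    assert(rol32(0x80000001U, 1) == 0x00000003U);
    assert(rol32(0x80000001U, 0) == 0x80000001U);
    assert(rol32(0x80000001U, 32) == 0x80000001U);
    assert(rol32(0x80000001U, -1) == ror32(0x80000001U, 1));
}
```

The complementary shift count is computed as -k & 31 on an unsigned value, so it is 0 when k is 0 rather than 32, which is what keeps both shifts strictly below the operand width.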