If my math doesn't fail me here, the table size in Laksen's example is much too large. It is enough to consider the lowest 8 bits for the calculation, so the table needs only 256 words. Those 512 bytes in total will not degrade cache efficiency (as Lucamar assumed), except on the smallest MCUs.
Forgot to mention that the approach with the reduced table still requires one xor operation in addition to the table lookup; this might not have been clear from my explanation above. Nevertheless, I think it might be the best compromise between speed and density for the specific environment.
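To make the reduced-table idea concrete, here is a rough sketch in C. It is only an illustration under assumptions: this post does not restate the actual operation, so the sketch assumes a 16-bit right-shifting update with xor feedback (as in a Galois LFSR or a reflected CRC-16). The polynomial `POLY` and the names `step1`, `step8_bitwise`, `step8_table`, `init_table` and `count_mismatches` are made up for the example and are not taken from Laksen's code.

```c
#include <stdint.h>

/* Assumed feedback polynomial, chosen only for illustration. */
#define POLY 0xA001u

/* 256 entries of 16 bits each = 512 bytes. */
static uint16_t table8[256];

/* One update step, branch-free: the mask is all ones when bit 0 is set,
   all zeros otherwise, so POLY is xored in only when needed. */
static uint16_t step1(uint16_t s)
{
    uint16_t mask = (uint16_t)(0u - (s & 1u));
    return (uint16_t)((s >> 1) ^ (POLY & mask));
}

/* Eight single steps in a row; used as the reference and to fill the table. */
static uint16_t step8_bitwise(uint16_t s)
{
    for (int i = 0; i < 8; i++)
        s = step1(s);
    return s;
}

/* Fill the table for the 256 possible values of the low byte. */
static void init_table(void)
{
    for (unsigned i = 0; i < 256; i++)
        table8[i] = step8_bitwise((uint16_t)i);
}

/* Eight steps at once: one lookup on the low 8 bits plus one xor
   with the high byte shifted down. */
static uint16_t step8_table(uint16_t s)
{
    return (uint16_t)((s >> 8) ^ table8[s & 0xFFu]);
}

/* Compare both versions over all 65536 states; returns the mismatch count. */
static unsigned long count_mismatches(void)
{
    unsigned long bad = 0;
    for (unsigned long s = 0; s <= 0xFFFFu; s++)
        if (step8_table((uint16_t)s) != step8_bitwise((uint16_t)s))
            bad++;
    return bad;
}
```

The reason the small table is enough in this sketch is that the update is linear over xor: the high byte contributes nothing but a plain shift during the eight steps, so only the low byte has to go through the table. Whether that carries over depends on the real operation being xor-linear in the same way.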
@Laksen - did you by any chance also compare the speed of the branch-free version with the branched one? It was a shot from the hip, to some degree.
Cheers,
MathMan