I will check the loop unrolling - at least for the encoding part
[/quote]
mormot is doing 48byte->64byte per loop iteration, you are doing 24byte->32byte
[/quote]
Yes, this is correct. Though... I did some small tests trying to emulate the 48-byte variant and still did
not reach the throughput they get.
Which I find really weird, since they basically use the same algorithm. What I noticed is
that they use a weird load/store scheme. Instead of loading a register in one op, they
first load the lower parts of both registers and then the upper parts, basically doubling the number of
load/store operations. It seems that does the trick, but nevertheless... it's weird.
Also, I just duplicated the block that does the encoding magic, whereas they use
an interleaved double block. That may give better throughput too, but I'm still testing a bit (though I'm quite
happy with the throughput anyway...).
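
To make the "duplicated block" vs. "interleaved double block" difference concrete, here is a minimal scalar sketch in C. It is only a stand-in: it works on 3-byte groups instead of SIMD registers, it is neither my code nor mORMot's, and the names encode_duplicated and encode_interleaved are made up for illustration. Both variants produce the same output; the only difference is the order of the steps inside one loop iteration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

static const char B64[] =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

/* Duplicated block: finish group A completely (load, split, store),
   then run the identical code again for group B.
   n must be a multiple of 6; dst needs n / 3 * 4 bytes. */
static void encode_duplicated(const uint8_t *src, size_t n, char *dst)
{
    assert(n % 6 == 0);
    for (size_t i = 0; i < n; i += 6, src += 6, dst += 8) {
        uint32_t a = (uint32_t)src[0] << 16 | (uint32_t)src[1] << 8 | src[2];
        dst[0] = B64[a >> 18];
        dst[1] = B64[(a >> 12) & 63];
        dst[2] = B64[(a >> 6) & 63];
        dst[3] = B64[a & 63];

        uint32_t b = (uint32_t)src[3] << 16 | (uint32_t)src[4] << 8 | src[5];
        dst[4] = B64[b >> 18];
        dst[5] = B64[(b >> 12) & 63];
        dst[6] = B64[(b >> 6) & 63];
        dst[7] = B64[b & 63];
    }
}

/* Interleaved double block: both loads first, then the table lookups
   of A and B alternate, then the stores. The two dependency chains are
   independent, so the CPU has more freedom to overlap them. */
static void encode_interleaved(const uint8_t *src, size_t n, char *dst)
{
    assert(n % 6 == 0);
    for (size_t i = 0; i < n; i += 6, src += 6, dst += 8) {
        uint32_t a = (uint32_t)src[0] << 16 | (uint32_t)src[1] << 8 | src[2];
        uint32_t b = (uint32_t)src[3] << 16 | (uint32_t)src[4] << 8 | src[5];

        char a0 = B64[a >> 18],        b0 = B64[b >> 18];
        char a1 = B64[(a >> 12) & 63], b1 = B64[(b >> 12) & 63];
        char a2 = B64[(a >> 6) & 63],  b2 = B64[(b >> 6) & 63];
        char a3 = B64[a & 63],         b3 = B64[b & 63];

        dst[0] = a0; dst[4] = b0;
        dst[1] = a1; dst[5] = b1;
        dst[2] = a2; dst[6] = b2;
        dst[3] = a3; dst[7] = b3;
    }
}
```

Whether the interleaved form is actually faster depends on the compiler and the CPU (an optimizing compiler may reorder the scalar version on its own), so this only shows the shape of the idea and would need to be measured on the real SIMD code.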