To get the R's, G's and B's together in a register, I first vpshufb with this:
const
splitsh6 : array[0..31] of byte = ( $00,$04,$08,$0C,$01,$05,$09,$0d,
$02,$06,$0A,$0E,$03,$07,$0B,$0F,
$00,$04,$08,$0C,$01,$05,$09,$0d,
$02,$06,$0A,$0E,$03,$07,$0B,$0F);
This sorts the bytes by color channel within each 128-bit lane. I then permute the dwords across lanes using vpermd with this table:
permto8 : array[0..31] of byte = ( $00,$00,$00,$00,$04,$00,$00,$00,
$01,$00,$00,$00,$05,$00,$00,$00,
$02,$00,$00,$00,$06,$00,$00,$00,
$03,$00,$00,$00,$07,$00,$00,$00);
This means that after two instructions I have
r0..r7,g0..g7,b0..b7,a0..a7 in one register. (In reality it is abs(r0-r0')..abs(r7-r7') and so on, since I had already done a subtraction with saturation.)
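To check the two tables without writing any assembler, here is a minimal scalar sketch in Free Pascal that models what vpshufb and vpermd do to a 32-byte register. The names TYmm, ModelPshufb and ModelPermd are made up for this illustration, and it is a model of the instruction semantics, not the actual routine.

```pascal
program ShuffleModel;
{$mode objfpc}

type
  TYmm = array[0..31] of byte;   // hypothetical stand-in for one ymm register

const
  splitsh6 : TYmm = ( $00,$04,$08,$0C,$01,$05,$09,$0D,
                      $02,$06,$0A,$0E,$03,$07,$0B,$0F,
                      $00,$04,$08,$0C,$01,$05,$09,$0D,
                      $02,$06,$0A,$0E,$03,$07,$0B,$0F);
  permto8  : TYmm = ( $00,$00,$00,$00,$04,$00,$00,$00,
                      $01,$00,$00,$00,$05,$00,$00,$00,
                      $02,$00,$00,$00,$06,$00,$00,$00,
                      $03,$00,$00,$00,$07,$00,$00,$00);

// Scalar model of 256-bit vpshufb: each 128-bit lane is shuffled on its own.
// A control byte with bit 7 set gives 0, otherwise its low 4 bits pick a
// source byte from the same lane.
function ModelPshufb(const src, ctl: TYmm): TYmm;
var
  i, lane: integer;
begin
  for i := 0 to 31 do
  begin
    lane := i and 16;            // 0 for the low lane, 16 for the high lane
    if (ctl[i] and $80) <> 0 then
      Result[i] := 0
    else
      Result[i] := src[lane + (ctl[i] and $0F)];
  end;
end;

// Scalar model of vpermd: destination dword i is a copy of the source dword
// selected by the low 3 bits of control dword i (little endian, so the
// lowest byte of each control dword is enough).
function ModelPermd(const src, idx: TYmm): TYmm;
var
  i, k, s: integer;
begin
  for i := 0 to 7 do
  begin
    s := idx[4 * i] and 7;
    for k := 0 to 3 do
      Result[4 * i + k] := src[4 * s + k];
  end;
end;

var
  px, t: TYmm;
  i: integer;
begin
  // 8 RGBA pixels. Channel c of pixel p is encoded as c*16 + p, so the
  // R bytes are $00..$07, G is $10..$17, B is $20..$27 and A is $30..$37.
  for i := 0 to 31 do
    px[i] := (i and 3) * 16 + (i shr 2);

  t := ModelPermd(ModelPshufb(px, splitsh6), permto8);

  // Print 8 bytes per row: one row per channel if the tables are right.
  for i := 0 to 31 do
  begin
    write(HexStr(t[i], 2), ' ');
    if (i and 7) = 7 then
      writeln;
  end;
end.
```

With the test pattern above, each printed row should hold one channel for pixels 0 to 7, which is the r0..r7,g0..g7,b0..b7,a0..a7 layout.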
I then make a register each with just the r's, g's and b's (ignoring a), add them together, and store the result as an 8-bit colour distance.
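As a scalar sketch of that last step, the Free Pascal program below computes a per-pixel 8-bit distance from the three channel differences. Two things here are assumptions, not taken from the real routine: abs(a-b) is built from two saturating subtractions ORed together (the usual psubusb trick), and the channel sums are clamped to 255 (as paddusb would do). All names are hypothetical.

```pascal
program ColourDistanceModel;
{$mode objfpc}

type
  TChan8 = array[0..7] of byte;  // one channel for 8 pixels

// Saturating unsigned byte subtraction, like psubusb does per byte.
function SubSat(a, b: byte): byte;
begin
  if a > b then
    Result := a - b
  else
    Result := 0;
end;

// abs(a-b): one of the two saturating differences is always 0,
// so ORing them gives the absolute difference.
function AbsDiff(a, b: byte): byte;
begin
  Result := SubSat(a, b) or SubSat(b, a);
end;

// Saturating unsigned byte addition, like paddusb (assumed, see above).
function AddSat(a, b: byte): byte;
var
  s: integer;
begin
  s := a + b;
  if s > 255 then
    s := 255;
  Result := s;
end;

// 8-bit colour distance per pixel: dr + dg + db, clamped to 255.
function Distance(const dr, dg, db: TChan8): TChan8;
var
  i: integer;
begin
  for i := 0 to 7 do
    Result[i] := AddSat(AddSat(dr[i], dg[i]), db[i]);
end;

var
  dr, dg, db, d: TChan8;
  i: integer;
begin
  // Made-up test data: differences between two images, per channel.
  for i := 0 to 7 do
  begin
    dr[i] := AbsDiff(i * 30, 100);
    dg[i] := AbsDiff(100, i * 30);
    db[i] := i * 20;
  end;

  d := Distance(dr, dg, db);

  for i := 0 to 7 do
    write(d[i], ' ');
  writeln;
end.
```

If wrap-around or widening to 16 bit is wanted instead of clamping, only AddSat has to change.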
The RGB to RGBA conversion is SSE2/3, using shuffles, shifts and ors.
These are all utility routines to be able to prototype at a sane speed with large images (say 10 fps). They are optimized, but still usually do just one transformation per pass (which is fast enough, and they are often SSE since that is more convenient).
For production, specialist routines do multiple steps per pass (debayer - abs(subtract) - color channel sorting - store)