I have a routine for flat-field correction (which corrects every column in an image for non-uniform lighting; lamp intensity is usually slightly lower at the sides), but I implement that in 16-bit fixed-point math to maximize register utilization.
If your input is 8-bit like mine, scaled/fixed-point math might also be a solution.
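
To illustrate the idea, here is a minimal scalar sketch in C, not the actual routine. It assumes a per-column gain table stored as unsigned Q1.7 (128 means a gain of 1.0, so gains run from 0 to just under 2.0); the function name `flat_field_correct`, the `gain_q7` table, and the Q1.7 format are all assumptions made for the example. With 8-bit pixels and 8-bit gains, the product plus rounding term never exceeds 65089, so every intermediate fits in 16 bits.

```c
#include <stddef.h>
#include <stdint.h>

#define GAIN_SHIFT 7 /* assumed Q1.7 format: 128 represents a gain of 1.0 */

/* Hypothetical helper: apply a per-column gain to an 8-bit image in place.
 * gain_q7 holds one Q1.7 gain per column (width entries). */
static void flat_field_correct(uint8_t *image, size_t width, size_t height,
                               const uint8_t *gain_q7)
{
    for (size_t y = 0; y < height; ++y) {
        uint8_t *row = image + y * width;
        for (size_t x = 0; x < width; ++x) {
            /* Worst case 255 * 255 + 64 = 65089, which fits in 16 bits. */
            uint16_t scaled = (uint16_t)(row[x] * gain_q7[x]
                                         + (1u << (GAIN_SHIFT - 1)));
            uint16_t value = (uint16_t)(scaled >> GAIN_SHIFT);
            /* Saturate instead of wrapping when the gain pushes past 255. */
            row[x] = (uint8_t)(value > 255 ? 255 : value);
        }
    }
}
```

Because nothing leaves the 16-bit range, a vectorized version can keep twice as many pixels per register as it could with 32-bit intermediates, which is the point of staying in 16-bit fixed point.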