I always forget about that VRAM wait state. But anyway, copying 64K of data is hardly negligible. Of course, it depends on how well the other parts are optimized. I remember that back in the old days, when I was trying to write my own game for a 12 MHz 286, copying the video page took all the free CPU cycles and left nothing for the game logic.
Yeah, unchained mode is a little bit harder, but it comes down to selecting one plane, writing every 4th pixel, and then repeating that for the other 3 bit planes.
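
To make the "every 4th pixel" idea concrete, here is a rough sketch in C. It is not real VGA code: the four planes are simulated as plain arrays so it runs anywhere, and `blit_unchained` and `vram` are names I made up for illustration. On actual hardware all four planes sit at the same addresses, and you would pick the target plane through the Sequencer Map Mask register (index 2 at port 0x3C4, data at 0x3C5) before each pass, as the comment shows.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SCREEN_W   320
#define SCREEN_H   200
#define NUM_PLANES 4
#define PLANE_SIZE (SCREEN_W / NUM_PLANES * SCREEN_H) /* 16000 bytes per plane */

/* Simulated video memory (assumption for this sketch): one array per plane.
   On a real VGA the planes overlap at A000:0000 and the Map Mask selects one. */
static uint8_t vram[NUM_PLANES][PLANE_SIZE];

/* Hypothetical helper: copy a linear 320x200 8-bit buffer into the 4 planes. */
static void blit_unchained(const uint8_t *src)
{
    assert(src != NULL);
    for (int plane = 0; plane < NUM_PLANES; plane++) {
        /* Real hardware would select the plane here, e.g. with a DOS
           compiler's port-output routine:
           write 0x02 to port 0x3C4, then (1 << plane) to port 0x3C5. */
        uint8_t *dst = vram[plane];
        for (int y = 0; y < SCREEN_H; y++) {
            const uint8_t *row = src + (size_t)y * SCREEN_W;
            /* Pixels plane, plane+4, plane+8, ... of this row go to this plane. */
            for (int x = plane; x < SCREEN_W; x += NUM_PLANES) {
                *dst++ = row[x];
            }
        }
    }
}
```

Doing it plane by plane like this means only 4 plane switches per frame instead of one per pixel, which matters because each switch is a slow port write.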