Using multiple threads could improve latency if you do any I/O while polling, such as writing to an SD card. However, the polling threads should share the same TFastGPIO instance. Reading each pin should be pretty fast, since you are essentially just reading a value from a mapped memory location. If you are using a Raspberry Pi 2 or later, make sure to compile with the "RPi2" define enabled so that better instructions are used for GPIO.
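To illustrate the threading idea, here is a rough sketch in Python rather than Pascal, just to show the structure: one thread polls the pin and pushes level changes into a queue, and a second thread does the slow writing. `FakeGpio`, `read_pin`, `poll`, `write_log` and `run` are all made-up names for this example and are not part of TFastGPIO; the fake object replays a fixed list of samples so the sketch runs without hardware.

```python
import queue
import threading


class FakeGpio:
    """Hypothetical stand-in for the single shared GPIO object (not the TFastGPIO API)."""

    def __init__(self, samples):
        self._samples = iter(samples)
        self._lock = threading.Lock()

    def read_pin(self):
        # Returns the next simulated level, or None when the samples run out.
        with self._lock:
            return next(self._samples, None)


def poll(gpio, events):
    """Polling thread: only samples the pin, never does slow I/O itself."""
    last = None
    while True:
        level = gpio.read_pin()
        if level is None:
            events.put(None)  # sentinel: tell the writer to stop
            return
        if level != last:
            events.put(level)
            last = level


def write_log(events, sink):
    """Writer thread: drains the queue; the append stands in for an SD card write."""
    while True:
        item = events.get()
        if item is None:
            return
        sink.append(item)


def run(samples):
    gpio = FakeGpio(samples)  # one instance; any extra pollers would get this same object
    events = queue.Queue()
    sink = []
    poller = threading.Thread(target=poll, args=(gpio, events))
    writer = threading.Thread(target=write_log, args=(events, sink))
    poller.start()
    writer.start()
    poller.join()
    writer.join()
    return sink
```

Calling `run([0, 0, 1, 1, 0])` records only the level changes. The point of the design is that the polling loop never blocks on the storage device, so a slow write delays the log but not the sampling.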
As for interrupts, I'm not sure you can use them directly from the application. The BCM chip does expose some information about the pins, such as which pin has changed its value recently, but I haven't looked into that.
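For what it's worth, the "which pin has changed" information appears to correspond to the event detect status registers in the BCM2835 peripherals datasheet. Below is an untested-on-hardware sketch in Python that decodes such a register from a simulated register block. The offsets are taken from the datasheet; `read_reg`, `pins_with_events` and the `regs` buffer are made-up names for this example, and on a real board the buffer would instead be a mapping of the GPIO registers.

```python
import struct

# Offsets within the GPIO register block, per the BCM2835 peripherals datasheet.
GPLEV0 = 0x34  # pin level, GPIO 0-31
GPEDS0 = 0x40  # event detect status, GPIO 0-31


def read_reg(regs, offset):
    """Read one 32-bit little-endian register from a buffer-like register block."""
    return struct.unpack_from("<I", regs, offset)[0]


def pins_with_events(regs):
    """Return the GPIO numbers (0-31) whose event detect status bit is set."""
    status = read_reg(regs, GPEDS0)
    return [pin for pin in range(32) if status & (1 << pin)]


# Simulated register block (assumption: stands in for the mapped GPIO registers).
regs = bytearray(0xB4)
struct.pack_into("<I", regs, GPEDS0, (1 << 4) | (1 << 17))  # pretend pins 4 and 17 fired
```

On real hardware, edge detection for a pin would first have to be enabled through the rising or falling edge enable registers, and a status bit is cleared by writing 1 to it. The simulated buffer above does not model either of those behaviours.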