When you have multiple threads that read/modify the same things and have to wait for completion, it is often faster to do it serial, in one thread or the main program. You need to do things as asynchronous as possible.
The optimal way to use threads is by giving them all self-contained tasks. You need (at least) two queues: one to post jobs and one for the results. You could spawn a new thread for each task, but there is an optimal number and with a queue, not only the main program can post jobs, but threads can do that as well. That way, you can do searches and rendering by partitioning the data.
Look at
the PS3 CPU. You use DMA to copy the data (and optimally the code as well) needed to a CPU, have it run the task and use DMA to move the data back to main memory. Unfortunately, most multi-core CPU's (except microcontrollers) don't have local storage (although they partially use their cache for that and have shadow registers), so keeping the memory regions separate is important. Basically: don't use pointers, but make a copy of all the data.
That sounds like a lot of extra work, but it is definitely the best way to utilize your CPU for 100%.