Btw, ironically enough, most current CPUs spend the vast majority of their time either sleeping or waiting on the comparatively glacial main memory. On your average i7, a load that misses every level of cache stalls for roughly 100 ns — a few hundred clock cycles — and in that time a wide out-of-order core could in principle execute a couple of thousand instructions...
Then again, it is very unlikely it could commit the results of those instructions in any meaningful way, and fetching further bytes sequentially is much faster than fetching the first one, since DRAM delivers whole bursts and the prefetchers kick in.
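You can glimpse the streaming-vs-pointer-chasing gap even from Python — a rough illustration only, since CPython's interpreter overhead dwarfs the raw hardware effect and absolute numbers depend entirely on the machine:

```python
import random
import time

N = 2_000_000
data = list(range(N))

# Sequential walk: consecutive elements were allocated near each other,
# so cache lines and hardware prefetchers do most of the work.
seq_order = range(N)

# Random walk: successive accesses land all over memory, so a large
# fraction of them miss the caches.
rand_order = list(range(N))
random.shuffle(rand_order)

def walk(order):
    t0 = time.perf_counter()
    total = 0
    for i in order:
        total += data[i]
    return total, time.perf_counter() - t0

seq_total, seq_time = walk(seq_order)
rand_total, rand_time = walk(rand_order)

# Same work, same answer — only the access pattern differs.
print(f"sequential: {seq_time:.3f}s  random: {rand_time:.3f}s")
```

On most machines the random walk comes out measurably slower even through the interpreter; in C the ratio is far more dramatic.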
On the other hand, the really fast memory backing all of that is tiny: the L1 data cache is typically just 32 kB per core, and every process/thread scheduled there has to share it — shared memory or not...
And the most important thing you want in cache memory is your page tables! Which have become huge!
A double page fault — one where the relevant part of the page table has itself been paged out — requires reloading that part of the page table from disk before the data page can even be found, which is so excessively slow that it takes many millions of cycles.
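The "double" cost is easy to see in a toy model: if the page-table page is itself on disk, you pay two full disk round-trips instead of one. Everything below (names, the 10-bit split, the latency constant) is invented for illustration, not how any real MMU works in detail:

```python
DISK_LATENCY_CYCLES = 10_000_000  # order of magnitude for one disk access

class ToyVM:
    def __init__(self):
        self.resident_table_pages = {}    # top-level index -> 2nd-level table
        self.resident_data_pages = set()  # page numbers present in RAM
        self.cycles = 0

    def load_table_page_from_disk(self, top):
        self.cycles += DISK_LATENCY_CYCLES
        self.resident_table_pages[top] = {}

    def load_data_page_from_disk(self, page):
        self.cycles += DISK_LATENCY_CYCLES
        self.resident_data_pages.add(page)

    def touch(self, page):
        top = page >> 10  # upper bits index the top-level table
        if top not in self.resident_table_pages:
            # Double fault: the page-table page is on disk too.
            self.load_table_page_from_disk(top)
        if page not in self.resident_data_pages:
            self.load_data_page_from_disk(page)

vm = ToyVM()
vm.touch(0x12345)                 # double fault: table page + data page
cost_double = vm.cycles
vm.touch(0x12346)                 # same table page resident: single fault
cost_single = vm.cycles - cost_double
print(cost_double, cost_single)   # the double fault costs twice as much
```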
Then again, cache memory already fills a large share of the die of a CPU — figures of half or more are routinely quoted — and it suffers seriously from diminishing returns.
In short, the best ways to speed up processing are:
1. Faster and/or wider memory access.
2. Using an SSD.
But, structurally, you want to stop the slow memory from dictating overall speed, and put the die area occupied by cache memory to more efficient use.
The best way to do that is to regard hard disks as really slow memory, SSDs as very slow memory, RAM as slow memory, and to use the cache memory as your main memory.
To explain: instead of just a few very fat and starved cores, you want many lean and mean ones. Get rid of registers as we know them, and give each core multiple kilobytes of local storage that is at least as fast as L2 cache memory, or even L1.
But you also want to be able to swap processes/threads, and to send all the other processes/threads extensive messages, both of which require you to be able to swap that local storage out fast.
Or, in other words: treat main memory and anything slower as a slow block device (map global page 087243658 to local page 24), and use paging to give all CPU cores a nice block of really fast local storage (previously cache memory).
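That global-to-local mapping is just demand paging with the (former) cache SRAM playing the role of RAM. A minimal sketch of a software-managed local store with LRU eviction — frame count, names, and policy all invented for illustration:

```python
from collections import OrderedDict

LOCAL_FRAMES = 8  # how many pages this core's fast local storage holds

class LocalStore:
    def __init__(self):
        self.map = OrderedDict()          # global page -> local frame, LRU order
        self.free = list(range(LOCAL_FRAMES))
        self.misses = 0                   # each miss = one trip to slow RAM

    def translate(self, global_page):
        if global_page in self.map:
            self.map.move_to_end(global_page)   # mark as recently used
            return self.map[global_page]
        self.misses += 1                        # must fetch from slow RAM
        if self.free:
            frame = self.free.pop()
        else:
            # Evict the least recently used global page, recycle its frame.
            _victim, frame = self.map.popitem(last=False)
        self.map[global_page] = frame
        return frame

store = LocalStore()
frame = store.translate(87243658)   # the example mapping from above
print(f"global page 87243658 -> local frame {frame}")
```

A second `translate(87243658)` hits the map and costs nothing; only when the working set outgrows `LOCAL_FRAMES` does the slow memory get involved again — which is exactly the property you want.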
And for synchronizing stuff, you build messaging in at the system and chip level. Just a few words per message. Alerts.
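A few-words-per-message mailbox could look like the toy below — per-core queues with a fixed message size and depth, where a send either lands or reports the receiver as busy. The layout and constants are entirely hypothetical:

```python
from collections import deque

WORDS_PER_MESSAGE = 4   # "just a few words"
MAILBOX_DEPTH = 2       # tiny hardware queue per core

class Mailbox:
    def __init__(self):
        self.slots = deque()

    def send(self, words):
        # Reject oversized messages; this channel is for alerts, not data.
        if len(words) > WORDS_PER_MESSAGE:
            raise ValueError("message too big")
        if len(self.slots) >= MAILBOX_DEPTH:
            return False        # receiver hasn't drained its mailbox yet
        self.slots.append(tuple(words))
        return True

    def receive(self):
        return self.slots.popleft() if self.slots else None

mailboxes = {core: Mailbox() for core in range(4)}

# Core 0 alerts core 2 that a shared page changed.
ok = mailboxes[2].send(("DIRTY", 87243658))
msg = mailboxes[2].receive()
print(ok, msg)
```

Bulk data still goes through the paged local storage; the mailbox only carries the alert that tells the other core to go look.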
But, in the meantime, we have to make do and use main memory as the best we have.