Ok, to explain this a bit more:
1. It is very hard to get threads to run at full speed, because they probably spend a long time waiting on reads from and writes to slow memory.
2. Intel tells us that hyper-threading effectively doubles the number of threads you can run.
3. The Windows threading model and the *nix one are vastly different.
A long time ago (in a galaxy far away?) we had Unix, which was multi-user and multi-tasking, and Windows 95, which only pretended to be those things. And starting a process in Windows takes a long time. So Microsoft took the path of least resistance and made it possible to just start threads at random, without any support structure whatsoever. "Now we are multi-whatever as well!!!"
The *nix guys had to laugh at that: they had passed that stage more than twenty years earlier. It wouldn't work. You need separation and messages to make it work. And they were right.
Then again, when Microsoft spends many millions on promoting it, who are you going to believe? Those old nerds? Nah, of course not! Microsoft is the winning team! Jump on the bandwagon!
So, for twenty years, people believed Microsoft and tried to get it to work. And when they did succeed, profiling was usually a bad idea, as it showed that their multi-threaded application was actually slower than the single-threaded one. WTF!!!
And, of course, Linux now has threads and all the other crap as well: we have to copy the winning team!
Nowadays, most programmers who use multi-whatever a lot have taken a step back, and they don't use most of the things Microsoft tells them to. Like shared memory. Or keeping pointers to other threads. Or, actually, the whole programming model as they learned it in school. Because it doesn't work!
Then again, how do you do it in a way that actually works? Can I get an education for that? Not yet, probably. Or not any more.
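For what it's worth, here is a rough sketch of the "separation and messages" style in C++17. It is not taken from any particular library: `Channel` and `sum_via_messages` are hypothetical names made up for this illustration. The idea is that the two threads never hold pointers to each other's data; the only thing they both touch is the queue, and the locking is hidden inside it.

```cpp
#include <cassert>
#include <condition_variable>
#include <future>
#include <mutex>
#include <optional>
#include <queue>
#include <utility>

// Hypothetical minimal channel: the only object both threads touch.
template <typename T>
class Channel {
public:
    // Hand a message over to whoever is receiving.
    void send(T value) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            queue_.push(std::move(value));
        }
        ready_.notify_one();
    }

    // Signal that no more messages will be sent.
    void close() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            closed_ = true;
        }
        ready_.notify_all();
    }

    // Blocks until a message arrives. Returns std::nullopt once the
    // channel has been closed and every queued message was delivered.
    std::optional<T> receive() {
        std::unique_lock<std::mutex> lock(mutex_);
        ready_.wait(lock, [this] { return !queue_.empty() || closed_; });
        if (queue_.empty()) return std::nullopt;
        T value = std::move(queue_.front());
        queue_.pop();
        return value;
    }

private:
    std::mutex mutex_;
    std::condition_variable ready_;
    std::queue<T> queue_;
    bool closed_ = false;
};

// The calling thread sends 1..n; a second thread sums what it receives.
// The running total lives only in the consumer and comes back as a
// return value, not through a shared variable.
long sum_via_messages(int n) {
    assert(n >= 0);
    Channel<int> channel;
    std::future<long> result = std::async(std::launch::async, [&channel] {
        long total = 0;
        while (auto msg = channel.receive()) total += *msg;
        return total;
    });
    for (int i = 1; i <= n; ++i) channel.send(i);
    channel.close();
    return result.get();
}
```

Calling `sum_via_messages(100)` sends the numbers 1 to 100 across and returns their sum. A real library would presumably want bounded queues, timeouts and error handling on top of this.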
So, when I made a fast, working, multi-platform threading library a few years ago (in C++, unfortunately), I had to figure it all out myself. And test it as well. My testing showed that on both ARM and Intel processors, in multiple variants and on multiple platforms (Windows, Linux, OS X, iOS and Android), two threads per core that are designed to run at full speed (they use very little memory) are fine, while three will crash the whole box after a while.
But, as stated, it is very hard to write threads/processes that run at full speed and do something more useful than benchmarking. Most of them spend the majority of their time idle, suspended, or waiting for memory or I/O.
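The two-threads-per-core figure above could be turned into a sizing rule along these lines. This is only a sketch under assumptions, not code from the library mentioned: the function names are hypothetical, and the workers are toy number-crunchers that keep all their state local. One caveat: `std::thread::hardware_concurrency()` reports hardware threads, which on a hyper-threaded CPU is typically already twice the number of physical cores, so the core count is passed in explicitly here instead of being guessed.

```cpp
#include <cassert>
#include <future>
#include <thread>
#include <vector>

// Number of hardware threads the runtime reports. The standard allows
// hardware_concurrency() to return 0 when it cannot tell, so fall back to 1.
unsigned detected_hardware_threads() {
    unsigned n = std::thread::hardware_concurrency();
    return n == 0 ? 1 : n;
}

// Sizing rule: a fixed number of workers per core. The default of 2 is
// the figure from the post, taken here as an assumption.
unsigned worker_count(unsigned cores, unsigned threads_per_core = 2) {
    assert(threads_per_core > 0);
    return cores * threads_per_core;
}

// Starts `count` CPU-bound workers. Each one works on private data only
// and hands its result back as a return value.
std::vector<unsigned long long> run_workers(unsigned count, unsigned iterations) {
    std::vector<std::future<unsigned long long>> pending;
    for (unsigned id = 0; id < count; ++id) {
        pending.push_back(std::async(std::launch::async, [id, iterations] {
            unsigned long long acc = id;
            // Arbitrary arithmetic to keep the core busy without touching memory.
            for (unsigned i = 0; i < iterations; ++i) {
                acc = acc * 6364136223846793005ULL + 1442695040888963407ULL;
            }
            return acc;
        }));
    }
    std::vector<unsigned long long> results;
    for (auto& f : pending) results.push_back(f.get());
    return results;
}
```

On a machine with four physical cores, `run_workers(worker_count(4), 1000000)` would start eight workers. Whether a third worker per core really causes trouble is something you would have to measure on your own hardware; this sketch does not demonstrate it.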