First of all: the hashes are NOT a must if you write your own highlighters.
As for the built-in highlighters, there may be equally fast or even faster ways. But changing the current implementation would have to be very well considered and tested in various ways, including the full test suite (unit tests), so that (hopefully) no functionality gets broken.
"WHY the hash?"
Don't know; that was done by someone maybe a decade ago.
---------------
The speed of finding tokens depends on many factors. Hashes are fast as far as the algorithm goes (big-O notation), true. But on very small data the constant overhead (computing the hash value) can dominate, and they may even be slower than a plain scan.
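A minimal sketch of that overhead (toy code with made-up names, NOT the actual SynEdit implementation): with only a handful of keywords, a plain scan can reject most tokens after a character or two, while the hash variant has to walk the whole token just to compute the hash value, before any lookup even starts.

program SmallDataCost;
{$mode objfpc}{$H+}

const
  Keywords: array[0..2] of string = ('if', 'then', 'else');

{ Plain scan: the string compare bails out early on a mismatch. }
function IsKeywordScan(const S: string): Boolean;
var
  I: Integer;
begin
  for I := Low(Keywords) to High(Keywords) do
    if S = Keywords[I] then Exit(True);
  Result := False;
end;

{ Hash: must read every character of S before any table lookup.
  (Wrap-around on overflow is fine for a hash.) }
function TokenHash(const S: string): Cardinal;
var
  I: Integer;
begin
  Result := 0;
  for I := 1 to Length(S) do
    Result := Result * 31 + Ord(S[I]);
end;

begin
  WriteLn(IsKeywordScan('then'));  { TRUE }
  WriteLn(TokenHash('then'));      { hash value; a lookup would still follow }
end.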
Next to the algorithm, the implementation has to be considered. Code and data size (as well as data distribution in memory) can make huge differences: can the CPU fit everything into the cache (into as few cache lines as possible [
http://en.wikipedia.org/wiki/CPU_cache#Cache_entries ])? Though with today's cache technologies, a few hundred bytes will again not matter.
Looking at your comparison:
Well, the HL uses a hash, but it is not a "perfect hash". So for every hit, the keyword is compared again.
I would *assume* that on a text that has a high amount of keywords [1] (99% or more), your HL may be faster, while if there are no keywords the hash will do better (see the sketch below the footnote).
[1] anything that the HL will match.
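To make that concrete, here is a minimal sketch of such a non-perfect hash lookup (toy hash, made-up names, NOT the actual SynEdit code). The table can only propose a candidate keyword; the final string compare is needed on every hit, because unrelated words may land in the same bucket. On keyword-dense text those verify compares add up, while on keyword-free text most tokens are rejected by the table alone.

program HashThenCompare;
{$mode objfpc}{$H+}

const
  Keywords: array[0..3] of string = ('begin', 'end', 'if', 'then');
  BucketCount = 101;

var
  Buckets: array[0..BucketCount - 1] of Integer;  { keyword index or -1 }

function Hash(const S: string): Integer;
var
  I: Integer;
begin
  Result := 0;
  for I := 1 to Length(S) do
    Result := (Result * 31 + Ord(S[I])) mod BucketCount;
end;

procedure BuildTable;
var
  I: Integer;
begin
  for I := Low(Buckets) to High(Buckets) do
    Buckets[I] := -1;
  for I := Low(Keywords) to High(Keywords) do
    Buckets[Hash(Keywords[I])] := I;  { this keyword set has no bucket collisions }
end;

function IsKeyword(const S: string): Boolean;
var
  Idx: Integer;
begin
  Idx := Buckets[Hash(S)];
  { the verify compare: unavoidable without a perfect hash }
  Result := (Idx >= 0) and (S = Keywords[Idx]);
end;

begin
  BuildTable;
  WriteLn(IsKeyword('then'));  { TRUE: hash hit plus verify compare }
  WriteLn(IsKeyword('foo'));   { FALSE: rejected by the table alone }
end.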
Another option would be a trie (google: Aho-Corasick), but that can use lots of memory, and then the cache issue may make it slower. Though it would be good for the SynAnySyn.
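For illustration, a minimal trie sketch (a plain trie only; real Aho-Corasick adds failure links on top so it can scan running text without restarting; names are made up, and nodes are never freed here). The per-node child array is where the memory, and with it the cache pressure, comes from:

program TinyTrie;
{$mode objfpc}{$H+}

type
  PTrieNode = ^TTrieNode;
  TTrieNode = record
    Children: array['a'..'z'] of PTrieNode;  { 26 pointers in EVERY node }
    IsKeyword: Boolean;
  end;

var
  Root: PTrieNode;

function NewNode: PTrieNode;
begin
  New(Result);
  FillChar(Result^, SizeOf(TTrieNode), 0);  { nil children, IsKeyword = False }
end;

procedure AddKeyword(const S: string);  { keywords must be lowercase a..z }
var
  Node: PTrieNode;
  I: Integer;
begin
  Node := Root;
  for I := 1 to Length(S) do
  begin
    if Node^.Children[S[I]] = nil then
      Node^.Children[S[I]] := NewNode;
    Node := Node^.Children[S[I]];
  end;
  Node^.IsKeyword := True;
end;

function IsKeyword(const S: string): Boolean;
var
  Node: PTrieNode;
  I: Integer;
begin
  Node := Root;
  for I := 1 to Length(S) do
  begin
    if (S[I] < 'a') or (S[I] > 'z') then Exit(False);
    Node := Node^.Children[S[I]];
    if Node = nil then Exit(False);
  end;
  Result := Node^.IsKeyword;
end;

begin
  Root := NewNode;
  AddKeyword('begin');
  AddKeyword('end');
  WriteLn(IsKeyword('begin'));  { TRUE }
  WriteLn(IsKeyword('bee'));    { FALSE: prefix of nothing marked as keyword }
end.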
There are also things like pplex (IIRC there is one that does Pascal). They can generate lexers, so there is no need to worry about spaghetti code with long ifdef, case or other constructs.
---------------
BTW, a good tool to measure speed is callgrind (part of valgrind), with kcachegrind to view the results.
But it is Linux only.
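Typical usage looks like this (the program name is a placeholder; callgrind writes its data to a file named callgrind.out.<pid>):

valgrind --tool=callgrind ./yourprogram
kcachegrind callgrind.out.<pid>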