Lazarus

Programming => Packages and Libraries => SynEdit => Topic started by: Edson on September 11, 2013, 08:45:45 pm

Title: Checking Speed of SynEdit Highlighters
Post by: Edson on September 11, 2013, 08:45:45 pm

I have been analizing the implementation of several languages highlighters for SynEdit and I wondered if it is necessary to use a hash-function method for to identify tokens. I know that method is very fast, but complicated and no readable, moreover it is difficult for configurate from an external file.

I have used a simpler method for implementing a highlighter (http://blog.pucp.edu.pe/media/4946/20130905-coloreado_de_sintaxis_con_synedit.pdf - 1.3.6) without using hash functions and I discovered it was, the same of fast that the hash function. It is based on using the first character of tokens for searching.

I made some tests for getting the real speed of the highlighters. I used the PHP highlighter like a sample(because it is simple). This is my result:

Time for proccesing a sample PHP file for 100000 times:

* SynAnySyn highlighter configured as PHP:	67 seg, 67 seg, 67 seg.
* PHP highlighter (with hash functions):	11seg, 11 seg, 11 seg.
* PHP highlighters with alternative method:	11 seg, 11 seg, 12 seg.

Well, we all knew the SynAnySyn, was slow. But, I surprised that the common PHP highlighter with hash functions, is the same fast as a no-hash functions highlighter.

Probably I should make more tests, but by now, I ask:

Why we complicate using hash functions?

And my conclusion is:

"It is possible to implement a fast highlighter without using the cryptic hash functions."

I think this could help for improving the SynAnySyn highlighter and using an external file for syntax of SynEdit.

Do you know of some alternative method for implementing fast highlighters?

Title: Re: Checking Speed of SynEdit Highlighters
Post by: Martin_fr on September 11, 2013, 10:24:31 pm

First of all. The hashes are NOT a must, if you do your own highlighters.

As for the build in highlighters, there may be equally fast or even faster ways. But changing the current implementation will have to be very well considered, and tested in various ways. Including full test-suite (unit tests), so (hopefully) no functionality will be broken.

"WHY the hash?"
Don't know, that was done by someone maybe a decade ago.

---------------

The speed of finding tokens depends on many factors. Hashes are fast as far as algorithm goes (big O notation), true. But on very small data they may have more overhead, and even be slower.

Next to considering the algorithm will be the implementation. Code and data size (as well as data distribution in memory) can make huge differences, Can the CPU fit all into the cache (into as little cache lines as possible [ http://en.wikipedia.org/wiki/CPU_cache#Cache_entries ]) ? Though with todays cache technologies, again a few 100 bytes will not matter.

Looking at your comparison:

Well the HL use a hash, but it is not a "perfect hash". So for every hit, the keyword is again compared.

I would *assume* that in a text that as a high amount of keywords [1] (99% or more), your HL may be faster. While if there are no keywords the hash will be better.

[1] anything that the HL will match.

Another option would be a trie (google: aho corasick), but that can use lots of memory, and then the cache issue may make it slower. Though it would be good for the SynAnySyn.

There are also thinks like pplex (IIRC there is one that does pascal). They can generate lexers. So there is no need to worry about spaghetti code with long ifdef, case or other stuff.

---------------

BTW, a good tool to measure speed is callgrind (part of valgrind) and kcachegrind to view results.

But it is linux only.

Title: Re: Checking Speed of SynEdit Highlighters
Post by: Edson on September 12, 2013, 04:08:45 am

Hi Martin,

I don't pretend to change the current highlighters. You are right. It will be a risk by now. It's necessary to make more tests. And if it do, it will be a lot of work.

Probably, new highlighters must consider not to use the hash method.

I point to have an alternative method (and simpler) for implementing highlighters. And specially, have a Fast SynAnySyn Highlighter, that can use an external file for the sintax. For all I Know, it doesn't exist.

Of course I also, wanted to show statistics about highlighters speed.

Title: Re: Checking Speed of SynEdit Highlighters
Post by: Leledumbo on September 12, 2013, 07:14:12 am

Prefix tree is the fastest of all, as it compares the string exactly only once (O(n) where n is the string length). The space complexity is not so big either (compared to hash table), but unless you're implementing lexers for languages with A LOT of keywords (C++, COBOL, etc.) it would be somewhat overkill.

Title: Re: Checking Speed of SynEdit Highlighters
Post by: Edson on September 13, 2013, 05:06:27 am

That's right. I have used a Prefix Tree, but implemented for only the first level. That's why it's easy to implement, because full implemented should be complicated. Of course this method could be slower if the sintax has many words starting with the same character. In that case, it could be improved using a second level of the tree.

I discovered an error in my test, correcting , I have nice a surprise:

* SynAnySyn highlighter configured as PHP:	67 seg, 67 seg, 67 seg.
* PHP highlighter (with hash functions):	11seg, 11 seg, 11 seg.
* PHP highlighters with alternative method:	10 seg, 10 seg, 10 seg.

A "One level Prefix Tree" method is not only simpler. It is even faster than the hash method.

I will try to make a highlighter using an external file and test the speed.

Title: Re: Checking Speed of SynEdit Highlighters
Post by: Edson on September 22, 2013, 08:39:09 pm

Finally I have implemented a simple Highlighter configurated with a External file syntax.

I made some new comparative tests (always using the PHP language as "guinea pig"):

* SynAnySyn highlighter configured as PHP:	6.12 seg, 6.22 seg, 6.12 seg.
* PHP highlighter (with hash functions):	1.04 seg, 1.02 seg, 1.03 seg.
* New highlighters with external PHP file:	1.03 seg, 1.02 seg, 1.02 seg.

I still have to implement some features, but until now, it's faster enough.

Title: Re: Checking Speed of SynEdit Highlighters
Post by: Morton on October 13, 2013, 04:57:39 pm

Quote from: Edson on September 11, 2013, 08:45:45 pm

Why we complicate using hash functions?

The speed of an algorithm depends on CPU, compiler and available memory.
In this case also from the graphic routines and system.

I designed mwEdit for Delphi not FPC and to run fast on CPUs below 100MHz and memory in single number MB space.
mwEdit was just intended to be an example on how to create a fast and versatile highlighting editor.
The version only written by myself was significantly faster than Delphis own,
and the only available behaving like it.
Once it was published lots of other editors emerged.

BTW it is even possible to create highlighters in C# or any other NET language
without much speed penalty, provided one knows how.

Title: Re: Checking Speed of SynEdit Highlighters
Post by: Edson on October 13, 2013, 06:11:44 pm

Hi Norton,

I didn't know you were the designer of the SynEdit.

Quote

The speed of an algorithm depends on CPU, compiler and available memory.

It's true. I'm aware of that.

I don't say that Hash method is the worse, but probably in this case it isn't the best.
It's always good to question the efficiency of the algoritms we are using. I have been testing the speed of a highlighter and have made many tests. But, like I said. it's necessary to make more tests. I haven´t tested many cases.

When I was designing my highlighter I was aware of the size of the code too. No only the speed. And other objetive was to be easy for maintaining.

Quote

BTW it is even possible to create highlighters in C# or any other NET language
without much speed penalty, provided one knows how.

It's curious because I have implemented a Editor with syntax highlighter, autocompletion (that responde to the context), and "Column Mode Edit", without external dll's nor ocx's in VB6:

http://blog.pucp.edu.pe/item/178457/editor-con-sintaxis-coloreada-en-visual-basic-6

Title: Re: Checking Speed of SynEdit Highlighters
Post by: Morton on October 13, 2013, 06:37:34 pm

Quote from: Edson on October 13, 2013, 06:11:44 pm

I don't say that Hash method is the worse, but probably in this case it isn't the best.

Maybe this simple hash is not the best for FPC and modern CPUs,
but for the version of Delphi and CPUs available at this time it was.
Last time (years ago) I have tested it Delphi outperformed FPC by two times with this algorithm.

BTW even preselection based on just the first char is a hash.

Title: Re: Checking Speed of SynEdit Highlighters
Post by: Martin_fr on October 13, 2013, 07:11:31 pm

It is known that in many cases Delphi can optimize code better than FPC. (Though I have not checked very recently).
But comparing algorithms, we should compare with the same compiler.

The hash is a well performing and cheap to implement way.

The speed will also depend on:
* the data:
- the hash can perform well, if the percentage of keywords is low, because then the hash value already decides that there is ne match. (and the computing of the hash, is a very small bit of code, optimal for making use of cpu cache, and probably other optimizations performed by the cpu.
- using a prefix will become very slow, if a pascal source contain many identifiers starting with a keyword, like "WhileFoo", "BeginField", ... because they all must be compared to the first after the keyword.

In other words, the prefix may perform better on best, and maybe average case, but probably it has a far worse worst case.

* implementation.
The current hash base does a lot of subroutine calls. that slows stuff down.
Maybe the new prefix based implementation has less function calls / more code in a single procedure?
- inline some (e.g. KeyHash, IdentKind)
- skip creation of exception frames ({$IMPLICITEXCEPTIONS off})
(do have a fair compare, add this to BOTH)

The difference measured between the 2, may be caused by the algorithm, but it may as well be by the implementation.

Title: Re: Checking Speed of SynEdit Highlighters
Post by: Edson on October 13, 2013, 08:15:27 pm

So, without consider the implementation, we could say:

* If your syntax have many "keywords" starting the same char, like "while1", "while2", ... it's better using hash function (assuming your hash function can distribute well the keywords, other way it have no sense)
* If your syntax have his "keywords" well alphabetic distributed, use First-char-prefix.

Title: Re: Checking Speed of SynEdit Highlighters
Post by: Martin_fr on October 13, 2013, 08:37:51 pm

Well both algorithm use an additional string compare, if the first step says that there may be a match.

I have not gone and calculated the exact cost of the 2 algorithm in full (including this string compare).
My statement is just an assumption.

But since the string compare is the same in both, it simply depends, on how often the string compare is triggered for none keywords.

Title: Re: Checking Speed of SynEdit Highlighters
Post by: Edson on October 13, 2013, 09:14:52 pm

But there is a difference:

* Hash functions:
Start with alphabetic -> pass to funtion to capture identifiers.
Capture identifier, and calculate Hash funtion.
With hash function value, call to function to process.
Make comparisons.

* First-char-prefix:
Start with alphabetic -> pass to prefix-function.
Capture identifier, and make comparisons.

There is an extra load for hash method.

Title: Re: Checking Speed of SynEdit Highlighters
Post by: Martin_fr on October 13, 2013, 10:09:25 pm

That is what I wrote. But that is a difference in the implementation, and NOT in the algorithm.

You can implement the hash with less function calls.

Title: Re: Checking Speed of SynEdit Highlighters
Post by: Edson on October 13, 2013, 10:15:39 pm

But always, you'll need to do one more step (calculate the hush function). It's algorithm.

There is another difference:

* Hash functions:
Make comparisons with n characters.

* First-char-prefix:
Make comparisons with n-1 characters.

Title: Re: Checking Speed of SynEdit Highlighters
Post by: Martin_fr on October 13, 2013, 10:32:56 pm

Quote from: Edson on October 13, 2013, 10:15:39 pm

But always, you'll need to do one more step (calculate the hush function). It's algorithm.

There is another difference:

* Hash functions:
Make comparisons with n characters.

* First-char-prefix:
Make comparisons with n-1 characters.

Yes, but depending on the hash algorithm used, you may have less false positives. (less string compares, for none keywords).

Title: Re: Checking Speed of SynEdit Highlighters
Post by: Martin_fr on October 13, 2013, 10:47:48 pm

The current hash returns 0..191 (that are 192 values) of which 89 have a function assigned

So over half of the calculated hashes are immediately resulting in "not a token". Originally that ratio was better, because less keywords existed (fpc was extended)

In your case there are 26 letters in the alphabet (case does not matter, all identifiers in pascal start with a..z).
I do not know how many of them have a keyword? If I counted correct about 20.

That means 76% of the letters, will require a compare of the string. That is way more than when using the hash.

And on top, the hash could be modified to use a bigger result range, improving its efficiency.

--
Note: I do not say that the hash is indeed better. But it may possible be. (That is comparing the average case / and assuming a better implementation)

Title: Re: Checking Speed of SynEdit Highlighters
Post by: Edson on October 13, 2013, 11:43:27 pm

You are right Martin. :)

That is the only point, where I see, the hash function could be faster, because of the false positives.

I can see now, that we know very well the advantages and disadvantages of both methods. (I will document this on my article about SynEdit.)

But don't forget we have at least one test. I will put on this way:

* PHP highlighter (with hash-functions and 46 functions): 11seg, 11 seg, 11 seg.
* PHP highlighter (with first-char-prefix and 26 functions): 10 seg, 10 seg, 10 seg.

(Probably they have not the best implementation)

PD: Which highlighter hash returns 192 values and have 89 function assigned?

Title: Re: Checking Speed of SynEdit Highlighters
Post by: Martin_fr on October 14, 2013, 12:26:46 am

Quote from: Edson on October 13, 2013, 11:43:27 pm

PD: Which highlighter hash returns 192 values and have 89 function assigned?

Pascal

Title: Re: Checking Speed of SynEdit Highlighters
Post by: Morton on October 14, 2013, 01:18:50 pm

Quote from: Martin_fr on October 13, 2013, 07:11:31 pm

* implementation.
The current hash base does a lot of subroutine calls. that slows stuff down.
Maybe the new prefix based implementation has less function calls / more code in a single procedure?
- inline some (e.g. KeyHash, IdentKind)
- skip creation of exception frames ({$IMPLICITEXCEPTIONS off})
(do have a fair compare, add this to BOTH)

The call tables are part of the algorithm, which performs extremely well with highly optimizing compilers.
Ported to NET with a few changes to prevent to much safety guarding it will still outperform FPC.
The cross platform nature of FPC makes highly optimizing a lot
harder than for single platform ones.
Maybe the newer FPC compiler could optimize better if the right directives are used.

Title: Re: Checking Speed of SynEdit Highlighters
Post by: Edson on October 14, 2013, 07:48:53 pm

So probably, our FPC is not so fast as we believe.

Any way, if the "call tables" is slow, it will affect both algorithm, especially to the hash-functions, because it use two "call tables" (the First-char-prefix just one).

Title: Re: Checking Speed of SynEdit Highlighters
Post by: marcov on October 14, 2013, 08:16:48 pm

Quote from: Edson on October 14, 2013, 07:48:53 pm

So probably, our FPC is not so fast as we believe.

I don't know, I couldn't find Morton's report nor could I reproduce his findings.

Title: Re: Checking Speed of SynEdit Highlighters
Post by: Martin_fr on October 14, 2013, 08:59:57 pm

It is not the call tables that are slow ( I guess that was somehow derived from an earlier statement of mine, but that is not what I said.)

If code that is called very very very often, is spread over more subroutines, and has to make calls to those, then that code may be slower then the same code all inlined.

Note: it "may". It is not in all cases. I never said, it was always.

The entire point I made was, that the tiny speed difference that was reported *can* same as good be due to implementation, as it can be due to algorithm.

Title: Re: Checking Speed of SynEdit Highlighters
Post by: Morton on October 15, 2013, 03:41:09 pm

Quote from: Edson on October 14, 2013, 07:48:53 pm

So probably, our FPC is not so fast as we believe.

Just NET is not as slow as many believe, ideed it can be very fast.
Net may be slowed by safety guarding.
There is much less safety guarding within the current scope.
Last time I have checked simply assign something within the current method eg. a string and then accessing the new assignment improved speed significantly if accessed more than one or two times eg. by scanning a string.

Title: Re: Checking Speed of SynEdit Highlighters
Post by: Morton on October 16, 2013, 02:03:24 pm

For the Pascal highlighter.
With standard optimization:
{$inline on}
{$IMPLICITEXCEPTIONS off}
and inlining KeyHash, KeyComp, Next and GetEol gives about 18% more speed.
Reassign the aKey as PChar in KeyComp gives another percent.
Inlining the table calls seems not to be possible.
With the highest optimization the difference is about 40%.
Thats about 70MB sec. instead of about 40MB sec. on my PC.

BTW I have noticed some needles string copying,
which should not occur in a high performance tool.

Title: Re: Checking Speed of SynEdit Highlighters
Post by: Martin_fr on October 16, 2013, 02:54:51 pm

Quote from: Morton on October 16, 2013, 02:03:24 pm

BTW I have noticed some needles string copying,
which should not occur in a high performance tool.

Can you point them out please?

Title: Re: Checking Speed of SynEdit Highlighters
Post by: Morton on October 16, 2013, 03:16:08 pm

Quote from: Martin_fr on October 16, 2013, 02:54:51 pm

Quote from: Morton on October 16, 2013, 02:03:24 pm
BTW I have noticed some needles string copying,
which should not occur in a high performance tool.

Can you point them out please?

Func89
Txt := copy(fLine, Run + 7, length(fLine));

Not the fastest possible too.
(AnsiStrComp(PChar(copy(Txt, i+1, 6)), PChar('rivate')) = 0) )
or ( (i+8 <= l) and
((i+8 = l) or (Txt[i+9] in [#1..#32])) and
(AnsiStrComp(PChar(copy(Txt, i+1, 8)), PChar('rotected')) = 0) )

BraceOpenProc;
Txt := copy(fLine, Run, length(fLine));

(AnsiStrComp(PChar(copy(Txt, i+1, 4)), PChar('fold')) = 0)

Title: Re: Checking Speed of SynEdit Highlighters
Post by: Martin_fr on October 16, 2013, 04:20:09 pm

Fixed...

Well: the first block ("strict private") of code can probably be replaced by something far simpler. I need to review that...

Title: Re: Checking Speed of SynEdit Highlighters
Post by: Morton on October 16, 2013, 06:51:29 pm

From playing a bit with the highlighters I would reccoment just placing {$OPTIMIZATION ON} in highlighter units. Will enhance speed greatly. The effect of {$IMPLICIT EXCEPTIONS off} is then neglible and inlining may have even a negative effect if the project does not use at least level 2 optimization.

Title: Re: Checking Speed of SynEdit Highlighters
Post by: Martin_fr on October 16, 2013, 07:13:09 pm

{$IMPLICIT EXCEPTIONS off} are mainly in combinations with strings (except const param). So using more pchar can help here.

Applications can add callbacks via HookAttrChangeEvent.

When this callback is triggered, and user code throws an exception, the any HL method that is in the callstack at that time will need the implicit exceptions (or leak memory).

This will be only a very few methods, but they need to be identified. Also this should not include any of the methods involved in the actual scan.
But the directive should only cover parts of the unit.

IMHO optimization should follow the project/package settings.

Title: Re: Checking Speed of SynEdit Highlighters
Post by: Morton on October 16, 2013, 08:03:14 pm

Quote from: Martin_fr on October 16, 2013, 07:13:09 pm

IMHO optimization should follow the project/package settings.

IMHO well tested units, especially high performance ones, should use the highest save optimization
and not depend on project/package settings.

Title: Re: Checking Speed of SynEdit Highlighters
Post by: taazz on October 16, 2013, 08:16:39 pm

Quote from: Morton on October 16, 2013, 08:03:14 pm

Quote from: Martin_fr on October 16, 2013, 07:13:09 pm
IMHO optimization should follow the project/package settings.

IMHO well tested units, especially high performance ones, should use the highest save optimization
and not depend on project/package settings.

as long as it doesn't interfere with debug info, stepping into and variable watch I agree that it is imperative to add the needed optimizations.

Title: Re: Checking Speed of SynEdit Highlighters
Post by: Martin_fr on October 16, 2013, 08:40:09 pm

IMHO the question is, if the entire package should be shipped -O2

Or better: If it should follow the project on this. So buildmodes would take care of everything.

It is possible to force buildmode settings from the project to the package, but it needs to be set up in the project, and it is not as trivial as it should be.

Title: Re: Checking Speed of SynEdit Highlighters
Post by: taazz on October 16, 2013, 09:06:46 pm

optimizations that do not interfere with debugging can be set at the unit level otherwise they should always follow the projects settings (not the package all the way down to fpc packages), that will make things easier for settings like optimization, dynamic linking etc.

Title: Re: Checking Speed of SynEdit Highlighters
Post by: Edson on October 17, 2013, 07:03:15 am

After some tests I have new comparisons of speed. Now I have choosen the Lazarus Perl Highlighter:

* Perl highlighter (hash-functions: 238 functions of 2167): 1.84seg 1.82seg 1.83seg
* Perl highlighter (with first-char-prefix: 26 functions): 1.75seg 1.76seg 1.75seg

Now I have taken the Perl highlighter (hash) and just replaced the hash algorithm to the prefix algorithm, keeping the others functions. So the comparison is very fair.

I hope someone could verify my results.

Attached: the prefix-highlighter used and the perl file of test.

Title: Re: Checking Speed of SynEdit Highlighters
Post by: Morton on October 17, 2013, 11:02:19 am

Edson you could try another version with a case table instead of the call table.
Maybe casetables allow inlining.
Ordering your "if" cascades by statistical frequency may help to.
Some of your "if" cascades could likely be improved by a case table.

To get the last bit of of performance:
With any compiler I have ever used it may sometimes be helpful to insert a never used method to improve alignment.

Title: Re: Checking Speed of SynEdit Highlighters
Post by: Morton on October 17, 2013, 12:29:05 pm

Edson with level two or three optimization yours is slower.
As I said the hashed call tables works best with high optimization.
Linux 64Bit.

Title: Re: Checking Speed of SynEdit Highlighters
Post by: Morton on October 17, 2013, 03:43:02 pm

Edson your Perl HL produces a different amout of tokens than the one included in the newest Lazarus.

To all:

I have replaced the "fIdentFuncTable[HashKey]" with a "Case HashKey of"
in the Pascal highlighter.

Newer FPC compilers have changed nothing.
in contrary to Delphi:
"fIdentFuncTable[HashKey]"
has no significant speed advantage to
calling the methods with a "Case HashKey of"

Title: Re: Checking Speed of SynEdit Highlighters
Post by: Martin_fr on October 17, 2013, 04:02:27 pm

Not sure if fpc optimizes "case of" by using jump tables, or if they always are cascaded "if". Could find out with -al , looking at the asm generated.

Have you tried adding "nostackframe;" to the "func123" procedures?

Title: Re: Checking Speed of SynEdit Highlighters
Post by: Edson on October 17, 2013, 04:14:47 pm

Thanks Morton for checking my results.

Quote

Edson you could try another version with a case table instead of the call table.
Maybe casetables allow inlining.
Ordering your "if" cascades by statistical frequency may help to.
Some of your "if" cascades could likely be improved by a case table.

OK.
Probably this is not a fair comparison, because the prefix-algorithm is bad implemented. I just wanted to show to even in this way, the prefix algorithm could be faster. I'm sure that including just one more case table, it will be faster.

Quote

Edson your Perl HL produces a different amout of tokens than the one included in the newest Lazarus.

I'm comparing with the Perl Highlighter of Lazarus 1.0.12. I checked the tokens for a small sample, and it was the same.
Moreover, this prefix-highlighter is just for testing, no for production.

I'll try to do a better comparison.

PD. I haven't compared the Pascal Highlighter, because it has foldind implemented.

Title: Re: Checking Speed of SynEdit Highlighters
Post by: Morton on October 17, 2013, 04:31:08 pm

nostackframe above level 1 optimization causes access violations with call tables but not
with Case tables.

No noticeable speed effect.
Tested with Edsons Perl HL which has not so much methods.

Title: Re: Checking Speed of SynEdit Highlighters
Post by: Morton on October 17, 2013, 04:46:20 pm

Quote from: Edson on October 17, 2013, 04:14:47 pm

I just wanted to show to even in this way, the prefix algorithm could be faster.

As noted in my first post: It may depend on the compiler what works best.

Title: Re: Checking Speed of SynEdit Highlighters
Post by: Edson on October 17, 2013, 08:21:03 pm

Just adding a simple case and interchanging one if (like Morton suggested), these are the new results:

* Perl highlighter (hash-functions: 238 functions of 2167): 1.84seg 1.82seg 1.83seg
* Perl highlighter (with first-char-prefix: 26 functions): 1.68seg 1.68seg 1.68seg

And this is, having still, a bad implementation of the prefix algorithm. It could be easily improved.

I really reconsider using the hash-algorithm for highlighters. IMHO, this is an issue of algorithm more than of compiler.

Title: Re: Checking Speed of SynEdit Highlighters
Post by: Morton on October 18, 2013, 11:51:24 am

Quote from: Edson on October 17, 2013, 08:21:03 pm

IMHO, this is an issue of algorithm more than of compiler.

Some things will not work with another compiler or not as expected.

BTW the fastest possible Algorithm would be to use a multi dimensional matrix.
The length of the longest key gives the number of required dimensions.
The required amount of memory is prohibitive.
A sparse version implemented with dynamic arrays may be feasible on todays PCs.
with many GB of RAM.
Trees are much slower and became obsolete with 32Bit.

Title: Re: Checking Speed of SynEdit Highlighters
Post by: Morton on October 18, 2013, 02:32:28 pm

The fastest highlighter ever for Delphi: SynUniHighlighter by Fantasist.
Partly it was so fast because it leaked memory like hell in old versions.
Maybe fixed in newer versions.

It is criptable.
There have been hundreds of scripts for it.

Used here: http://sourceforge.net/projects/mystix/

Lazarus port: https://bitbucket.org/etrusco/lazarus-ide/src/22ba3d2ea4b7/components/synunihighlighter

Latest version: http://rghost.ru/46404542
Scripts: http://rghost.net/46404543

Title: Re: Checking Speed of SynEdit Highlighters
Post by: Martin_fr on October 20, 2013, 02:13:16 pm

With some refactoring, the speed of the hash bashed implementation can probably be increased (maybe even drastically):

1) Ideally choose a hash that is "perfect" for the set of all keywords. So no hash value has more than one matching keyword

2) the important part:

Instead of a hash table containing function-refs, have a hash table with the keywords. (well keyword + function ref)

Then the compare with the keyword can be made before calling the function. That may saves a lot of unnecessary calls.

And if (1) the hash is perfect for the list of keywords, then the hash table only needs one entry per hash value. Otherwise a sorted list or prefix tree were needed.

Further more: the entries in the hash table could contain:
- keyword
- function (may be nil)
- type of ident (e.g. number, keyword, ...)

for keywords which only return a specific attribute, but do nothing else, no function needs to be called.

Title: Re: Checking Speed of SynEdit Highlighters
Post by: Morton on October 20, 2013, 03:42:23 pm

A hash is simply a kind of sieve, a preselection to avoid many comparisons.
Which sieve is the cheapest in computing time depends on the string keys to process
and may vary a lot not only by implementation, but also from compiler to compiler.

An algorithm using character selection for every character of the string keys would be the fastest possible as it reduces the needed comparisons to zero.

Initializing boolean tables for every sieve level needed to process a given keylist
and store them in a level table to access them by a given level is feasible on todays PCs.
The amount of boolean tables ( sieve levels ) is determined by the longest key.
If the boolean table in all seeve levels needed for a given string lengt return true
then the result is true without the need of any comparisons.

Title: Re: Checking Speed of SynEdit Highlighters
Post by: Morton on October 20, 2013, 04:49:02 pm

A dynamic two dimensional boolean table accessed by level and character could be used too.
Don't know if the level table with boolean tables or the two dimensional boolean table
would be cheaper in computing time for FPC.

The tables will have to indicate the possible token kinds for each level.
Two dimensions may not be enough to avoid comparision completely.

Title: Re: Checking Speed of SynEdit Highlighters
Post by: Edson on October 20, 2013, 05:21:36 pm

Quote

An algorithm using character selection for every character of the string keys would be the fastest possible as it reduces the needed comparisons to zero..

That's is the objetive of the algorithm I proposed. The first-char-prefix is just the first level. I have indicated before:

"That's right. I have used a Prefix Tree, but implemented for only the first level. ... Of course this method could be slower if the sintax has many words starting with the same character. In that case, it could be improved using a second level of the tree."

I mean, the algorithm can be implemented for the second char, the third, ... Just add the branch you need in the necessary points (Could be a simple case of...).

Doing that way, there is no comparison. The hash will be always lose.

Title: Re: Checking Speed of SynEdit Highlighters
Post by: Martin_fr on October 20, 2013, 05:41:35 pm

Quote from: Edson on October 20, 2013, 05:21:36 pm

Quote
An algorithm using character selection for every character of the string keys would be the fastest possible as it reduces the needed comparisons to zero..

That's is the objetive of the algorithm I proposed. The first-char-prefix is just the first level. I have indicated before:

Quote

Doing that way, there is no comparison. The hash will be always lose.

Depends:

How well optimized is your prefix tree? Memory is ont at issue. Cache is. If it need a lot of memory access with cach misses, then it looses.
Depends on architecture and implementation.

Also code flow.
As far as I can see, the hash implementation is very good for cpu feature such as branch prediction (only one branch for end-of-word, almost always false (only true once per word)) and as a consequence data pre-fetching.
Again: Depends on architecture and implementation.

---
That does not mean that a prefix can not beat a hash (on implementation level). It may well do.

Title: Re: Checking Speed of SynEdit Highlighters
Post by: Morton on October 20, 2013, 06:04:07 pm

Quote from: Edson on October 20, 2013, 05:21:36 pm

Doing that way, there is no comparison. The hash will be always lose.

Obviously not true, as the prefix is a hash too.

Title: Re: Checking Speed of SynEdit Highlighters
Post by: Morton on October 24, 2013, 05:13:43 pm

Attached is an example implementation and test project of a key matcher, which can perform about twenty million matches per second on my AMD A8-5500, without doing any comparisons.

The principle can be used to match any kind of string.

Title: Re: Checking Speed of SynEdit Highlighters
Post by: Morton on October 24, 2013, 05:24:24 pm

Implemented in a HighLighter the macher would perform even faster than by maching against a string list as in the example project.

The number of keys, even a huge one, hast little influence on the speed.

Title: Re: Checking Speed of SynEdit Highlighters
Post by: Morton on October 24, 2013, 06:13:12 pm

Made a little mistake:
It is twenty million matches per second, not just two million. :D

Title: Re: Checking Speed of SynEdit Highlighters
Post by: Martin_fr on October 24, 2013, 06:26:14 pm

And it matches the word "cabel".

Because it starts like "case" and ends like "label"

Title: Re: Checking Speed of SynEdit Highlighters
Post by: Morton on October 24, 2013, 06:33:13 pm

Quote from: Martin_fr on October 24, 2013, 06:26:14 pm

And it matches the word "cabel".

Because it starts like "case" and ends like "label"

Thanks for testing.
I think I have a solution.

Title: Re: Checking Speed of SynEdit Highlighters
Post by: Morton on October 24, 2013, 06:52:13 pm

Quote from: Martin_fr on October 24, 2013, 06:26:14 pm

And it matches the word "cabel".

Because it starts like "case" and ends like "label"

Sorry, I cannot repeat that.

Title: Re: Checking Speed of SynEdit Highlighters
Post by: Martin_fr on October 24, 2013, 07:00:28 pm

I dont have the modified highlighter. But adding to your app:

procedure TForm1.Button1Click(Sender: TObject);
begin
if IsKeyWord('label') then Caption := Caption + '1';
if IsKeyWord('xabel') then Caption := Caption + '2';
if IsKeyWord('cabel') then Caption := Caption + '3';
end;

and press that button, will give 13

Title: Re: Checking Speed of SynEdit Highlighters
Post by: Morton on October 24, 2013, 07:22:32 pm

Quote from: Martin_fr on October 24, 2013, 07:00:28 pm

I dont have the modified highlighter. But adding to your app:

procedure TForm1.Button1Click(Sender: TObject);
begin
if IsKeyWord('label') then Caption := Caption + '1';
if IsKeyWord('xabel') then Caption := Caption + '2';
if IsKeyWord('cabel') then Caption := Caption + '3';
end;

and press that button, will give 13

Fixed with a bigger matcher.
A sparse implementation would be possible.

Title: Re: Checking Speed of SynEdit Highlighters
Post by: Martin_fr on October 24, 2013, 07:25:24 pm

For pascal you need at least a sequence of 6
operator + property = operty

Title: Re: Checking Speed of SynEdit Highlighters
Post by: Morton on October 24, 2013, 07:36:01 pm

Quote from: Martin_fr on October 24, 2013, 07:25:24 pm

For pascal you need at least a sequence of 6
operator + property = operty

Dont know what you mean with that;

Title: Re: Checking Speed of SynEdit Highlighters
Post by: Martin_fr on October 24, 2013, 08:05:41 pm

Ok, I had not looked at your changes.

You added the length into the compare... (I thought, you may have added another 0..25 to the array.

Try

if IsKeyWord('object') then Caption := Caption + '1';
if IsKeyWord('except') then Caption := Caption + '2';
if IsKeyWord('objept') then Caption := Caption + '3';

Title: Re: Checking Speed of SynEdit Highlighters
Post by: Morton on October 24, 2013, 08:17:55 pm

Quote from: Martin_fr on October 24, 2013, 08:05:41 pm

Ok, I had not looked at your changes.

You added the length into the compare... (I thought, you may have added another 0..25 to the array.

Try

if IsKeyWord('object') then Caption := Caption + '1';
if IsKeyWord('except') then Caption := Caption + '2';
if IsKeyWord('objept') then Caption := Caption + '3';

Thanks for testing.
More dimensions are a possible solution.

Title: Re: Checking Speed of SynEdit Highlighters
Post by: Martin_fr on October 24, 2013, 08:26:36 pm

It can be solved, if all keywords are known (llike pascal).

But if I can add my own lists of keywords, I can probably always add 2 keywords, that allow a 3rd none keyword.

Instead a proper prefix tree should be used.

Title: Re: Checking Speed of SynEdit Highlighters
Post by: Morton on October 24, 2013, 08:40:36 pm

Quote from: Martin_fr on October 24, 2013, 08:26:36 pm

It can be solved, if all keywords are known (llike pascal).

But if I can add my own lists of keywords, I can probably always add 2 keywords, that allow a 3rd none keyword.

Instead a proper prefix tree should be used.

Another try.
A tree would save memory, but slow down.
A sparce matrix would be faster.

Title: Re: Checking Speed of SynEdit Highlighters
Post by: howardpc on October 25, 2013, 12:34:25 am

Thanks for sharing your code.
There's no denying it is lightning fast.
However the algorithm is flawed. Even your latest revision (3) matches all kinds of words (am, an, any, ask, ox, them) and non-words (ia, nia) which are not Pascal keywords.
I prefer a slower algorithm that correctly matches what it says on the tin.

Title: Re: Checking Speed of SynEdit Highlighters
Post by: Martin_fr on October 25, 2013, 01:02:38 am

if IsKeyWord('interface') then Caption := Caption + '1';
if IsKeyWord('inherited') then Caption := Caption + '2';
if IsKeyWord('interited') then Caption := Caption + '3';

A prefix tree (trie) is like a matrix.

It is not a binary tree. each node can have a child for each letter. So each node holds an array[0..25] with poithers to the next node (if there is)

http://en.wikipedia.org/wiki/Trie

Only potential slow down is cache misses due to mem usage.

Such a trie exists in SYnEdit (MarkupHighlightAll).

Title: Re: Checking Speed of SynEdit Highlighters
Post by: sam707 on October 25, 2013, 01:07:45 am

Hi

I remember 1st time when I saw hash codes and tables : It was in 1979 when I disassembled microsoft level2 basic rom on my TRS-80 machine. These days, it was clever because :

1) the kewords were transformed to 1 byte in small ram
2) the original program listing was reconstituted, doing the opposite way
3) apple computers' basic interpreter did not have such assembler tokenization, they lost

If you want to implement a compression tool, a dictionary or a hash table, even a boolean or huffman like tree can be, for sure, helpful.

but in the case of a highlighter, compression is not the goal. I suppose that there could be faster ways, maybe a sorted list of all allowed keyword. As it is a sorted constant, a kind of 'prediction' -proceeding by elimination- can be make with a character scanner that abandon the seach instant when match fails.

This way of Highlight may break the "tradition" but also may have a chance to be the faster one, since nowadays the original text will stay in memory (no need to shrink like in 1980 when ram was so damn expensive, and when programmers used/invented tokens/hash tables for that shrinking purpose).

If you still believe in hash codes, ask yourself a few questions :

- I plan to spare memory, like around 1980 when I had 16Kb of ram only? no
- hash codes will be used to keep coordinates of font changing? no
- hash codes are going to help for a dichotomic seach on an orderd constant list? no, it is ordered !
... and so on

and the Light will come to you O:-)

Title: Re: Checking Speed of SynEdit Highlighters
Post by: Morton on October 25, 2013, 02:19:53 am

Quote from: howardpc on October 25, 2013, 12:34:25 am

I prefer a slower algorithm that correctly matches what it says on the tin.

Agreed.
Its a bit callow now, will have to loock for a sparse and dynamic solution.

Title: Re: Checking Speed of SynEdit Highlighters
Post by: Morton on October 25, 2013, 02:30:16 am

Quote from: Martin_fr on October 25, 2013, 01:02:38 am

Such a trie exists in SYnEdit (MarkupHighlightAll).

Thanks.
A trie is not a Tree in the usual meaning.

Title: Re: Checking Speed of SynEdit Highlighters
Post by: Morton on October 25, 2013, 02:55:18 am

@sam707
Thanks.
I have been teached to program a 4 bit computer in machine language,
not assembler as it did not had enoug memory for an assembler.
It had no display only lamps.
1979 I got my fist with a single line display and a keyboard with letters.
My first PC 1983 was a Toshiba T300, with CPM and DOS on floppies,
without a disk.
prices had been grazy at this time.
The list prices of the t300 and a little softwares summed up to over 20000 DM.
I remember those tricks needed to do something with extremely little memory.

Title: Re: Checking Speed of SynEdit Highlighters
Post by: marcov on October 25, 2013, 01:43:25 pm

Quote from: Morton on October 25, 2013, 02:55:18 am

I remember those tricks needed to do something with extremely little memory.

(If you feel the need to practice, FPC/trunk now has a 16-bit tiny memorymodel)

Title: Re: Checking Speed of SynEdit Highlighters
Post by: Morton on October 25, 2013, 03:26:22 pm

Quote from: marcov on October 25, 2013, 01:43:25 pm

(If you feel the need to practice, FPC/trunk now has a 16-bit tiny memorymodel)

At the moment I squeeze 256 bytes plus some more info in 48 bytes;
Lets see how much it will slow down.

An array[byte] of BitBool occupying only 32byte would some times still be handy.
Canot find my libraries for this kind of stuff anymore;

Title: Re: Checking Speed of SynEdit Highlighters
Post by: sam707 on October 25, 2013, 04:58:51 pm

Quote from: Morton on October 25, 2013, 02:55:18 am

@sam707
Thanks.
I have been teached to program a 4 bit computer in machine language,
not assembler as it did not had enoug memory for an assembler.
It had no display only lamps.
1979 I got my fist with a single line display and a keyboard with letters.
My first PC 1983 was a Toshiba T300, with CPM and DOS on floppies,
without a disk.
prices had been grazy at this time.
The list prices of the t300 and a little softwares summed up to over 20000 DM.
I remember those tricks needed to do something with extremely little memory.

me too , it was an UC-EMR 4bits with 6 digits display.

I was so proud, after I worked a whole afternoon with octal tables as machine codes references on it, to make it display the word 'CHAIRS' on its lamps display, ancestor of LCD... hahahah

Even before that I learnt to program on a Texas Instrument SR-52 , with 223 steps of program only (I can call them by their names , see ;) ) and on that SR-52 with 223 steps (which wern't bytes) possible of programmation I played a "Lunar Landing game".

http://www.rskey.org/CMS/index.php/7?manufacturer=Texas+Instruments&model=SR-52 (http://www.rskey.org/CMS/index.php/7?manufacturer=Texas+Instruments&model=SR-52)

thx @Morton, we should please end, because I'm now feeling old and I hate to remember when I discovered an abacus between the knees of Cleopatra Queen :D

what I wanted to explain into my last postS, is the fact that in computing science, if you don't break traditions, you won't go ahead. My idea of a 'prediction" algorithm might be useful : it opens the door to a future code completion tool, that is cool to a highlighter.

combining dichotomic search on a constant ordered list that gives up instant when no more matching, plus keeping predictive track if you need future code completion, might be a cool clue.

I agree with you (and the others on that post) upon the fact that SynEdit was written years ago, following a tradition with hashcodes that sound useless to a highlighter.

Cheers

Title: Re: Checking Speed of SynEdit Highlighters
Post by: Morton on October 25, 2013, 06:42:04 pm

Quote from: sam707 on October 25, 2013, 04:58:51 pm

I agree with you (and the others on that post) upon the fact that SynEdit was written years ago, following a tradition with hashcodes that sound useless to a highlighter.

Thanks for your post.
For Delphi I could not find anything better.
Delphi is extremely good in inlining small methods even in in the method call tables.
In acknowledgement of my work, while I published regularly something,
I used to get free copies of professional tools.
Some with regular prices in thousands of $.
Among these was AQTime a very good binary profiler.
With it I got the CPU time spent on any piece of code with extreme precision.

BTH I think I will implement the macher with sets, just 4 bytes
instead of 256 for each dimension.
Thus lots more dimensions can be used.
with records or objects additional information can be provided vor each cell
and the matrix still keept much smaller
On my PC FPC can perform about 128 million insertions or deletions on sets per second and just accessing sets is even faster.

For small keys with few chars comparision could maybe not avoided.
The keys could be assigned to the cell for them, but maybe it can be avoided
by other means.
We will see.
If i was seriously enough behind something I could always find a solution.

Title: Re: Checking Speed of SynEdit Highlighters
Post by: Morton on October 26, 2013, 05:53:59 am

Sets will make pretty good hashes.
Very fast to create a set for each string.
Store keyword and set in a list and sort it by two keys:
Key 1:
The set.
Key 2:
The string compared from the last char to the first,
which is often faster than the standard.
That allows very fast binary searches.
Strings need only to be compared if the sets are equal.
Divide the list internaly into sublists by length and last char.
would be the simplest possible solution, no need for a hash.
OK, thats a kind of hash too.
The matrix matcher implemented with sets would be small and allow
to extremely reduce the amount of comparisions.
Simple, clean and extremely fast solution, just as I like.

Title: Re: Checking Speed of SynEdit Highlighters
Post by: Edson on October 26, 2013, 06:36:59 am

It sounds a very creative way for fast comparisons. I haven´t tested the sets for comparison, I suspect they are faster, but no idea.

I think you will need a fast way for transform a identifier to a Set, for no affect the performance.

Title: Re: Checking Speed of SynEdit Highlighters
Post by: Morton on October 26, 2013, 07:09:15 am

Quote from: Edson on October 26, 2013, 06:36:59 am

I think you will need a fast way for transform a identifier to a Set, for no affect the performance.

As I noted in another post about 128 million insertions per second on my PC.
With an average length of 4 bytes that is thirty millions per second, with six still 20 million.

BTW a virtual subdivision of a sorted list can be done by additional comparisions levels.
Eg. First compare the lenght and only if equal compare the string

Title: Re: Checking Speed of SynEdit Highlighters
Post by: Morton on October 26, 2013, 09:22:07 am

Provided the sublists are short, the keywordlist of a highlighter that is realy divided into sublists and not just divided by sorting could be made selfoptimizing by counting the first 10000 accesses and storing the access count with the keys.
Then the sublists can be sorted by access counts.
Sublists will have to be searched linear.
Resorted whole list could be streamed to files,
for a scriptable HL into the scripts.
The aproach can be adjusted or lists just divided by sorting too.
After resorting for optimized access:
Use binary search to get the sublist, scann for a boundary, and then
search linear to the other boundary.

Title: Re: Checking Speed of SynEdit Highlighters
Post by: Morton on October 26, 2013, 07:35:04 pm

Seems to be fixed now.
I had a nasty little logical bug.
Still about 20 million matches per second.

The match matrix is a memory hog as it is now.

Would be nice if there would be a BitBool only occupying a single Bit,
per member instead of 8.
Sets for 'A'..'Z' use 32 byte which is still far to much as only 4 would be required.
With bitwise manipulations 'A'..'Z' will fit into LongWord.
The Matrix could be much smaller.
Even smaller with a sparse implementation.

Title: Re: Checking Speed of SynEdit Highlighters
Post by: marcov on October 26, 2013, 11:00:36 pm

Quote from: Morton on October 26, 2013, 07:35:04 pm

Sets for 'A'..'Z' use 32 byte which is still far to much as only 4 would be required.

That is not because sets use bytes, but because sets use from 0 to ord(high).

Playing with $packset should be able to cut it down to (64+26/8) rounded up. (probably delphi mode does too).

That the lower bound is always zero afaik can't be helped atm. Sparse sets help size, but make operations more expensive.

Title: Re: Checking Speed of SynEdit Highlighters
Post by: Martin_fr on October 26, 2013, 11:40:36 pm

Did you test the Matcher2 with

if IsKeyWord('interface') then Caption := Caption + '1';
if IsKeyWord('inherited') then Caption := Caption + '2';
if IsKeyWord('interited') then Caption := Caption + '3';

because it reports all 3 for me.

Title: Re: Checking Speed of SynEdit Highlighters
Post by: Morton on October 27, 2013, 04:17:33 pm

Each letter in a keyword is initialized by index, letter at index and letter at index div 2.
Each letter in a string is checket by index, letter at index and letter at index div 2.
24 million matches per second.

Title: Re: Checking Speed of SynEdit Highlighters
Post by: Edson on October 27, 2013, 05:06:15 pm

It give me "123" with:

if IsKeyWord('interface') then Caption := Caption + '1';
if IsKeyWord('inherited') then Caption := Caption + '2';
if IsKeyWord('interfaed') then Caption := Caption + '3';

Title: Re: Checking Speed of SynEdit Highlighters
Post by: Martin_fr on October 27, 2013, 08:04:48 pm

or

if IsKeyWord('usit') then Caption := Caption + 'X';

Morton: The algorithm is flawed, however much you change it.
And even if you get a version that works for the current list of pascal identifiers (either which is well possible, the ones you use, or the ones in the SynPasSyn), it will not work for other lists.
Take a scriptable HL, and it will always be possible to find a set of 2 keywords that allow for a 3rd none keyword to be detected.

Which also means I will stop testing them....

Also the speed of an actual highlighter is not just the matching of keywords...

Besides:
- the hash is pretty good as it is
- the hash could be improved (e.g storing the keywords on the hashtable, together with the function ref, and compare before calling the function)
- a trie (prefix tree) could be used
- ...

But all that is only a part of the time used.

Scanning is only needed to the last visible line, after that it could be done on idle.

Scanning a big pascal file create 1500 ranges and more. They are in a tree, and are searched for each line.
Searching a tree with 1500 entries is of course nothing...But doing so 100-thousand times (for 100,000 lines) or more, and it gets noticeable.

This is because of some words (like deprecated) are only highlighted in very specific context. So ranges contain a lot of information, and therefore there are plenty of different ranges.
Yet a highlighter could be subdivided:
1) general structure:
unit interface implementation procedure begin end ....
This needs to be scanned from the start of file, to the last visible line
2) all the other details
only needed for the visible part. Find the start of a suitable range, just before the first visible line. And scan only what is needed.
Cache a certain amount around the visible area, for faster scrolling

You can do a simplified version of the above spilt:

Keywords like and,or,not => they do not change the state of the range
So you can have:
- a hashtable without them for scanning the entire file
- a hashtable with them, when scanning a line for getting the tokens to display.

Title: Re: Checking Speed of SynEdit Highlighters
Post by: Edson on October 27, 2013, 09:17:47 pm

Morton: But we really appreciate your attemp.

Title: Re: Checking Speed of SynEdit Highlighters
Post by: Morton on October 28, 2013, 02:13:58 pm

type

{ MatchItem }

MatchItem = object
MaybeTrue: Boolean;
Indexes: array of Smallint;
procedure SetTrue(const KeywordIndex: Smallint);
function Check: Boolean;
end;
var
KeyMatch: array of array of array[0..25,0..25] of MatchItem;
ListOfKeys: TStringList;

Each member of ListOfKeys has to be set if fails in the matrix are possible.
For most languages the list of possibly failing Keys is much shorter than
the count of ListOfKeys.

Only Keys with a minimum lenght can have fails under some conditions.

Title: Re: Checking Speed of SynEdit Highlighters
Post by: Morton on October 28, 2013, 09:15:32 pm

Still 14 million matches per second.
List could be optimized as described in former posts.
Meybe even by just using a cheaper hash.

Title: Re: Checking Speed of SynEdit Highlighters
Post by: Martin_fr on October 28, 2013, 09:58:22 pm

How many matches per time does the original hash have?

That is:
- If you put it standalone, in a unit like the matcher
- If you put the actual keywords in the hashtable, so if the hash matches, they can be compared *without* (or before) calling a subroutine?

Title: Re: Checking Speed of SynEdit Highlighters
Post by: Morton on October 29, 2013, 12:30:38 am

@Martin_fr it is just an example.
I have already outlined some means.
BTW by an average length of just about 3 chars the 14 million matches
translate to processing over 100 million bytes per second.
Included in a highlighter and optimized comparision for the keywords
it will be a lot faster.
computing a hash in the list for just the last 2-4 chars would be likely sufficient.
But other means outlined by me are more efficient.
Eg. sublists which contain keys with the same length.
Storing indexes in the matrix.
There are many ways to reduce comparisions.

I have a nasty cold since a week and I am a bit rusty as I havent coded for
11 years.

Title: Re: Checking Speed of SynEdit Highlighters
Post by: howardpc on October 29, 2013, 02:32:06 pm

Morton gave me some ideas.
I've hacked up a keyword match function that appears to be 3 to 4 times faster than the current synedit implementation, at least for bulk Pascal code as used in the attached timed comparison which locates files at Windows locations. (Is there a cross-platform routine to locate installed Lazarus directories like $(LazarusDir) that works outside the IDE?)
I've not tried 'profiling' before, so the test app may be flawed, nevertheless initial results look promising.
I've used sets (basically an array of booleans at the bit level) to reject non-keywords, combined with an over-simple hash to match possible keywords which are confirmed by comparing two actual characters.

Edit
I've removed the faulty attachment. See later message.

Title: Re: Checking Speed of SynEdit Highlighters
Post by: Morton on October 29, 2013, 05:59:52 pm

@howardpc
I can not load the project.
Lazarus 1.0.12 Linux 64Bit.

Title: Re: Checking Speed of SynEdit Highlighters
Post by: howardpc on October 29, 2013, 06:55:46 pm

You can extract the zip OK?
On Linux you'll have to replace the hard-coded file locations which are correct only for Windows.
I just picked several large source files at random really.
Or add a TOpenDialog and locate a pas/pp file yourself to feed to the comparison routine as you did in your own example.

Title: Re: Checking Speed of SynEdit Highlighters
Post by: Edson on October 29, 2013, 07:07:09 pm

I can compile on Windows, but it give me error on executing:

Title: Re: Checking Speed of SynEdit Highlighters
Post by: howardpc on October 29, 2013, 09:48:14 pm

Sorry, previous project was developed on 64-bit non-release Lazarus version and it seems incompatible with release version.
I don't know why.
Delete the first download, and try the attached update, which I hope will not only compile but run for you.

Title: Re: Checking Speed of SynEdit Highlighters
Post by: engkin on October 30, 2013, 12:01:05 am

If it did not work for you, then change the hard-coded path ;D
main_test.pp - line: 75

Code: Text [Select][+]

procedure TForm1.LoadAndParseTestFiles;
const Path = 'c:\lazarus\ide\';
 

Thanks Howardpc for sharing your code. Did not check it, yet.

Title: Re: Checking Speed of SynEdit Highlighters
Post by: Edson on October 30, 2013, 02:02:39 am

It compile and run correctly for me, now.

Good job howardpc.

At first sight it seem to be very fast. I like the algorithm focus on rejects.

But if you use TSynPasSyn.IsKeyword() of "SynHighlighterPas", it doesn't use the tipical hash-algorithm. It use a simple "Find" on a "TStringList". It seem to be a extra utility for searching Keywords.

Would you compare with TSynPasSyn.IdentProc()? It is used for the highlighter, for searching KeyWords.

I don't know what Martin thinks.

Title: Re: Checking Speed of SynEdit Highlighters
Post by: sam707 on October 30, 2013, 02:32:27 am

Quote from: Edson on October 30, 2013, 02:02:39 am

I like the algorithm focus on rejects.

Great ! ;) TY

Title: Re: Checking Speed of SynEdit Highlighters
Post by: Martin_fr on October 30, 2013, 02:33:18 am

Well interesting.

But the compare is useless. SInce you are not comparing with what the highligther does in a normal scan.

(
And if you did, as I wrote before, the current hash, has an implementation flaw. If a hash is matched it should first compare with the potential word, and then call the sub. Yet it calls the sub and compares there. Thats wasted time.
So for a fair compare, you need to rewrite the hash in the HL first.
)

Also You compare words, that already know the len. But the HL does not have that luxury.

If the HL would know the wordlen upfront, it could use that.
But parsing the text into words is part of the time a HL needs to spent.

---------------
I am not going to see, if I can find a word that fails it (since that would be a brute force search... / and for the few pascal keywords it will likely not exist.)

But in a highlighter with user configured keywords, it will be possible to have a set of keywords, that allow a none keyword to be matched.
After all: Each of the individual checks is just a partial check.

Title: Re: Checking Speed of SynEdit Highlighters
Post by: howardpc on October 30, 2013, 08:27:59 am

Quote from: Martin_fr on October 30, 2013, 02:33:18 am

But the compare is useless. SInce you are not comparing with what the highligther does in a normal scan.
...
So for a fair compare, you need to rewrite the hash in the HL first.

Point taken. I'll look more carefully at the HL implementation to try to understand what it actually has to do, and see if I can offer any measureable performance improvement in that.

Two things I noticed in a quick scan of the relevant units. The keyword 'on' is listed correctly as a TP word, but then repeated in the Delphi list. The current docs (presumably the source of this) also have this same duplication, but it is clearly a typo.
Since FPC allows '@' as the first character of an identifier, I also wonder if TValidStringChars and the protected GetIdentChars method need to be updated.

Title: Re: Checking Speed of SynEdit Highlighters
Post by: Morton on October 30, 2013, 10:02:31 am

Quote from: Edson on October 30, 2013, 02:02:39 am

But if you use TSynPasSyn.IsKeyword() of "SynHighlighterPas", it doesn't use the tipical hash-algorithm. It use a simple "Find" on a "TStringList".

A hashed list like my mwkeylist.pas for MatcherTest would do it more than 20 times faster on average.

Title: Re: Checking Speed of SynEdit Highlighters
Post by: Martin_fr on October 30, 2013, 10:41:50 am

Quote from: howardpc on October 30, 2013, 08:27:59 am

Two things I noticed in a quick scan of the relevant units. The keyword 'on' is listed correctly as a TP word, but then repeated in the Delphi list. The current docs (presumably the source of this) also have this same duplication, but it is clearly a typo.

Would you point me to more details? e.g. The documentation. The doc is fpc? So best to discuss on the fpc-mail list?

Quote

Since FPC allows '@' as the first character of an identifier, I also wonder if TValidStringChars and the protected GetIdentChars method need to be updated.

Does it? "@" is an operator. It can appear in front of an identifier. (With or without a space: OnClick := @ Foo; )

However the "&" is not yet handled

Code: [Select]

var
   &type: pchar;

type should not highlight.

Title: Re: Checking Speed of SynEdit Highlighters
Post by: Martin_fr on October 30, 2013, 11:29:00 am

Ok, after all, I took a look myself. Using a Prefix tree.

I have not compared it against the original hash. So no idea what is better.

Only compared against the code by HowardPC
On my system:
HowardPC 1295
Prefix 795

And the Prefix is a full compare.

Title: Re: Checking Speed of SynEdit Highlighters
Post by: Morton on October 30, 2013, 11:56:40 am

Quote from: Martin_fr on October 30, 2013, 11:29:00 am

Only compared against the code by HowardPC
On my system:
HowardPC 1295
Prefix 795

The matcher with a full compare of the matches by a hashed list 14.6 million times per second.
The matcher with just checking the matches by the hashes in the hashed list 15.6 million times per second.
The matcher checket by a Tstringlist 0.7 million times per second.

The hasches are unique enough to omit a full compare, and because of this omitting will save only 7%-10%.

A hashed list with sublists would be much faster.

Title: Re: Checking Speed of SynEdit Highlighters
Post by: Martin_fr on October 30, 2013, 12:03:04 pm

Unfortunately the figures of HowardPC's test are in a different unit...

So I can not compare that...

Title: Re: Checking Speed of SynEdit Highlighters
Post by: howardpc on October 30, 2013, 01:24:01 pm

Quote from: Martin_fr on October 30, 2013, 10:41:50 am

Would you point me to more details? e.g. The documentation. The doc is fpc? So best to discuss on the fpc-mail list?

I've already pointed it out to Michael van Canneyt.
See the two lists in the FPLanguageRef.pdf at sections 1.3.1 and 1.3.3 where 'on' is duplicated.

Quote

Does it? "@" is an operator. It can appear in front of an identifier. (With or without a space: OnClick := @ Foo; )
However the "&" is not yet handled

My mistake. As you surmised I meant '&' not '@'. I never use this facility and remembered the wrong prefix character.

Title: Re: Checking Speed of SynEdit Highlighters
Post by: Martin_fr on October 30, 2013, 01:53:55 pm

*r43342 SynEdit: Pas HighLighter, recognize new &keyword for identifier style

Title: Re: Checking Speed of SynEdit Highlighters
Post by: Edson on October 30, 2013, 04:47:48 pm

Quote

the current hash, has an implementation flaw

By now, we agree this, but we can use the current hash highlighter like a base, for evaluate our algorithms. It's supposed, we must be faster at the same conditions.

For testing, I suggest to implement a simple highlighter, and compare with a Lazarus highlighter. Differents computers have different speed. And like Martin said, string comparisons are just a part of the job that a highlighter must do.

You have to be carefull if using the current Lazarus Pascal highlighter. It includes folding too. So, comparing with a no-folding highlighter could be no so fair.

Title: Re: Checking Speed of SynEdit Highlighters
Post by: Morton on November 02, 2013, 10:53:40 am

Optimized version of the matcher.
With full comparision it is 60% faster than the current Pas HL.
By comparing only the hashes in the list 75% faster.

Scanning idetifiers and checking for keywords is the slowest part of a HL.
Checking for folding doe not consume much time.

In the HL KeyHash can be improved too, to save needles comparisions.
about 20%-22% more speed.

function TcSynPasSyn.KeyHash: Integer; inline;
var
Start, ToHash: PChar;
begin
Result := 0;
if (fToIdent<fLineLen) then begin
Start := fLine + fToIdent;
ToHash := Start;
if IsLetterChar[ToHash^] then
begin
inc(Result, mHashTable[ToHash^]);
inc(ToHash);
{while (IsLetterChar[ToHash^] or IsUnderScoreOrNumberChar[ToHash^]) do
begin
inc(Result, mHashTable[ToHash^]);
inc(ToHash);
end;}
end;
if IsUnderScoreOrNumberChar[ToHash^] then
begin
Result := 0; { cannot be a Key}
inc(ToHash);
while (IsLetterChar[ToHash^] or IsUnderScoreOrNumberChar[ToHash^]) do
begin
inc(ToHash);
end;
end;
fStringLen := PtrUInt(ToHash) - PtrUInt(Start);
//if CompareText(copy(fLineStr,fToIdent+1,fStringLen),'varargs')=0 then debugln('TSynPasSyn.KeyHash '+copy(fLineStr,fToIdent+1,fStringLen)+'='+dbgs(Result));
end else begin
fStringLen := 0;
end;
end; { KeyHash }

Title: Re: Checking Speed of SynEdit Highlighters
Post by: Martin_fr on November 02, 2013, 12:19:44 pm

Not sure what you mean by "optimized"? (well apparently "faster")

But not sure how it should be working?

You commented out the hash loop. So essentially it is a first char only compare?

So it is a different matcher again, not at all an optimized original...

Title: Re: Checking Speed of SynEdit Highlighters
Post by: Morton on November 02, 2013, 12:36:55 pm

@Martin_fr

IsKeywordbyHash compares only the hash values, but not the strings.
The computed hash is rather expensive, but creates unique hashes.
Because of the unique hashes the string comparision can be omitted.

Checking Keywords is fast despide the expensive hash,
because the matcher eliminates almost all comparisions.
A hash needs only to be computed for the small rest.

A hashed search array is a lot faster than search trees (trie).

Title: Re: Checking Speed of SynEdit Highlighters
Post by: howardpc on November 02, 2013, 12:46:44 pm

Quote from: Morton on November 02, 2013, 12:36:55 pm

A hashed search array is a lot faster than search trees (trie).

However computing a hash is slow, so unless you present a routine that combines all the operations (hash computation and searching) to give a working IsKeyword function there is nothing concrete to compare beyond opinions.

Title: Re: Checking Speed of SynEdit Highlighters
Post by: Martin_fr on November 02, 2013, 12:52:38 pm

Do we talk about the piece of code you embedded in your post?

Or something else , hidden in your attachment? (I havent read the attachment, since I assume the quote in the past shows the relevant code?

Quote

but creates unique hashes.

You mean you found a perfect hash that works with an not-limited input? That can simple not be.

Some none pascal word will always give the same hash as one of the pascal tokens.

Title: Re: Checking Speed of SynEdit Highlighters
Post by: Morton on November 02, 2013, 12:56:35 pm

Quote from: howardpc on November 02, 2013, 12:46:44 pm

Quote from: Morton on November 02, 2013, 12:36:55 pm
A hashed search array is a lot faster than search trees (trie).

However computing a hash is slow, so unless you present a routine that combines all the operations (hash computation and searching) to give a working IsKeyword function there is nothing concrete to compare beyond opinions.

The matcher eliminates needless comparisions close to 100%, a hash needs only to be computed for the very small rest.
The matcher is very cheap with regard to the hash in the search list.

Title: Re: Checking Speed of SynEdit Highlighters
Post by: Martin_fr on November 02, 2013, 01:23:02 pm

Quote from: Morton on November 02, 2013, 12:56:35 pm

A hashed search array is a lot faster than search trees (trie).

Most likely yes. So long as the hash is cheap.

We need to scan for end-of-word anyway. So the cpu needs to fetch the data anyway, and just adding up (and maybe doing a bitshift) the values in a cpu register (assumes really good optimization) will not add much time (maybe even no time at all, since the cpu can use the time of the add, to pre-fetch the next bit of data)

So given how modern cpu work, yes a good hash implementation can be the fastest choice.

The trie needs to access more data, and the cost of fetching this, may be bigger than the cost of the hash+compare.

Quote

Quote from: howardpc on November 02, 2013, 12:46:44 pm
However computing a hash is slow, so unless you present a routine that combines all the operations (hash computation and searching) to give a working IsKeyword function there is nothing concrete to compare beyond opinions.

Yes, we are still missing any good comparison. So far we got a lot of impressive, but meaningless times.

And yes the final compare of the keyword must be done in the same function that did the hash (I wrote that before, but it was apparently ignored)

Title: Re: Checking Speed of SynEdit Highlighters
Post by: Morton on November 02, 2013, 01:43:37 pm

Quote from: Martin_fr on November 02, 2013, 12:52:38 pm

Do we talk about the piece of code you embedded in your post?

I was sloppy. ( rendered myself a foo l)
But some time could be saved.

Quote

Quote
but creates unique hashes.
You mean you found a perfect hash that works with an not-limited input? That can simple not be.

Some none pascal word will always give the same hash as one of the pascal tokens.

If the hashes are equal then the comparision has never failed in my tests for pascal.
Not that much time to save by omitting the full comparision.
The list is derived from a List I wrote once to do fast searches in millions or even billions of items.
The list is scanning for double hashes if the comparisions fails.
Even in lists with hundred million of randomly generated strings a double hash occured only one time in
a huge lot of tests and it was for a doubled entry, not a double hash for different words.

Title: Re: Checking Speed of SynEdit Highlighters
Post by: Morton on November 02, 2013, 01:49:33 pm

Quote from: Martin_fr on November 02, 2013, 01:23:02 pm

Yes, we are still missing any good comparison. So far we got a lot of impressive, but meaningless times.

The Matcher does a full comparision too, by calling the lists apropriate function.
Hashes are only computed inside the list.

Title: Re: Checking Speed of SynEdit Highlighters
Post by: Martin_fr on November 02, 2013, 02:02:07 pm

Ok, sounds good.

I need to find some time to read the attached code.

Working on other items now

Title: Re: Checking Speed of SynEdit Highlighters
Post by: howardpc on November 03, 2013, 08:09:58 pm

Here's an alternative keyword matcher which does lookup using arrays of sets.
It's not faster than the synhighlighter tree-based algorithm (on my computer it's consistently about 2% slower), however it is simpler to implement, with source code less than half the length. I estimate its TDictionary structure occupies about 1.5 MB.
I'm not sure how to gauge the memory footprint of the synedit tree. Possibly it is less than 1MB?

Edit
I removed this attachment. See the improved version (speedwise) posted 4 Nov 2013

Title: Re: Checking Speed of SynEdit Highlighters
Post by: Martin_fr on November 03, 2013, 08:48:13 pm

I am not sure how much mem the current tree uses.

Also the treedata itself (and the builder) are more complex than needed. The implement the aho corasick algorithm.

Just the search loop was reduced to ignore the extra.

The tree can be packed in various ways (none implemented), but then the search loop will be more complex, and take extra time.

I only added it, because everyone kept adding all kind of stuff.

I still think, that the hash (if done correctly) will be fastest.
That is in generic. Depends on a lot of factors. The hash will especially be faster, if there are many keywords.

If you have very very few keywords, then other algorithm may be as good as the hash, maybe even beat it by a tiny bit. (Just my guess).

Title: Re: Checking Speed of SynEdit Highlighters
Post by: Edson on November 03, 2013, 10:52:06 pm

howardpc:

Are you still comparing with TSynPasSyn.IsKeyword()?

It's not the hash rutine of the Lazarus Pascal Highlighter.

Quote

It's not faster than the synhighlighter tree-based algorithm (on my computer it's consistently about 2% slower)

Sorry, I'm lost. Which synhighlighter tree-based algorithm are you comparing?

Quote

however it is simpler to implement, with source code less than half the length

It's nice. A small code is good.

PD. I can't compile your code. Is it 64 bits?

Title: Re: Checking Speed of SynEdit Highlighters
Post by: howardpc on November 04, 2013, 02:50:05 pm

Quote from: Edson on November 03, 2013, 10:52:06 pm

Are you still comparing with TSynPasSyn.IsKeyword()?
It's not the hash rutine of the Lazarus Pascal Highlighter.

No, I do read Martin's posts! He indicated a while ago that such a comparison was useless.
The attached project compares the set-based matcher I came up with against the Aho algorithm tree-matcher implementation in the synedit code Martin adapted.
I've made a few minor optimisations to the code I'm offering, and the set-based matcher is now slightly faster than tree-matcher, at least using multiple iterations of the Lazarus source files used in the test.
I tried comparing the matchers using text with a much lower proportion of Pascal keywords (a chunk of the bible), and the performance of the two routines was almost identical.
For a Pascal keyword matcher you of course want an algorithm that performs well with a low keyword count, since Object Pascal is such a well-designed language (with only 73 keywords currently; and most programs use less than half of these).
The proportion of non-keywords in the data the routines process will cause variations in comparative speed. I'm assuming for our purposes that the data to be processed will normally have a fair proportion of the sought-after keywords.

Quote

I can't compile your code. Is it 64 bits?

You have to use a Lazarus compiled from SVN trunk (or a recent snapshot) otherwise the tree-matcher does not compile since it is based on fairly recent Synedit commits.

Title: Re: Checking Speed of SynEdit Highlighters
Post by: Morton on November 04, 2013, 02:56:40 pm

Quote from: Martin_fr on November 03, 2013, 08:48:13 pm

I still think, that the hash (if done correctly) will be fastest.
That is in generic. Depends on a lot of factors. The hash will especially be faster, if there are many keywords.

The hash used in my hashed keylist is rather expensive.
Effects of cheaper hashes tends to be negative, as cheaper hashes will likely produce identical hashes for different strings.
A binary search for Pascal keys in my hashed keylist is about twenty times faster then the binary search of a TStringList.
The list of keys would have to be very very short to make the binary search of a TStringList equal or faster.

Title: Re: Checking Speed of SynEdit Highlighters
Post by: Morton on November 04, 2013, 03:10:09 pm

Attached is a newer slightly optimized version of the matcher.
I have tested it against a lot sources.
If the hashes are equal then the comparision has never failed.

So far my assumption that a save key check without any real comparison is possible stays true.

Title: Re: Checking Speed of SynEdit Highlighters
Post by: howardpc on November 04, 2013, 07:05:39 pm

Quote from: Morton on November 04, 2013, 03:10:09 pm

Attached is a newer slightly optimized version of the matcher.
I have tested it against a lot sources.
If the hashes are equal then the comparision has never failed.

Testing against the 9 Lazarus source files used in the project attached to my earlier post shows that this matcher incorrectly identifies the words - ID, Now, OI, dit, val, Val, dir - as keywords.

Quote

So far my assumption that a save key check without any real comparison is possible stays true.

I understand neither your assumption nor why it should be true (whatever it means).

Title: Re: Checking Speed of SynEdit Highlighters
Post by: Morton on November 05, 2013, 12:29:49 pm

@howardpc

procedure TForm1.Button1Click(Sender: TObject);
begin
if IsKeyWord('ID') then
Caption:= Caption + ' 1';
if IsKeyWord('Now') then
Caption:= Caption + ' 2';
if IsKeyWord('OI') then
Caption:= Caption + ' 3';
if IsKeyWord('dit') then
Caption:= Caption + ' 4';
if IsKeyWord('val') then
Caption:= Caption + ' 5';
if IsKeyWord('dir') then
Caption:= Caption + ' 6';
if IsKeyWordByHash('ID') then
Caption:= Caption + ' 1';
if IsKeyWordByHash('Now') then
Caption:= Caption + ' 2';
if IsKeyWordByHash('OI') then
Caption:= Caption + ' 3';
if IsKeyWordByHash('dit') then
Caption:= Caption + ' 4';
if IsKeyWordByHash('val') then
Caption:= Caption + ' 5';
if IsKeyWordByHash('dir') then
Caption:= Caption + ' 6';
end;

Your statement is an obvious lie.

Title: Re: Checking Speed of SynEdit Highlighters
Post by: Martin_fr on November 05, 2013, 12:40:35 pm

If you rely only on a hash, comparing only the hash value and not compare the word, or anything else, then there will always be some word (word = combination of random letters) that will fail it.

This should be obvious:
- The hash has a fixed length. There is a fixed number of values itcan have (even if this is a very large number).
- Words in the text can be of any length. The amount of words is endless.

mapping an endless number to a fixed range, leads to double allocations. It has to.

For a given set of keywords, there may be a hash, that if compared together with the length, will not allocate any other words (of the same length) to any of the hash values of any keyword.

To prove you have such a hash, you must try ALL combinations of letters of the length of each keyword.

And if you ever have to add another keyword, well....

Title: Re: Checking Speed of SynEdit Highlighters
Post by: Morton on November 05, 2013, 01:17:42 pm

Quote from: Martin_fr on November 05, 2013, 12:40:35 pm

If you rely only on a hash, comparing only the hash value and not compare the word, or anything else, then there will always be some word (word = combination of random letters) that will fail it.

function TForm1.RndLetter: Char;
var B: byte;
begin
Result:= Char(Random(25) + 97);
end;

function TForm1.RndLength: Integer;
begin
Result:= 0;
while Result < 2 do
Result:= Random(14);
end;

function TForm1.RndString: String;
var I, L: Integer;
begin
L:= RndLength;
SetLength(Result, L);
for I:= 1 to L do
Result:= RndLetter;
end;

procedure TForm1.Button1Click(Sender: TObject);
var I: Integer;
Stop: PChar;
S: String;
begin
aList.Clear;
for I:= 1 to 10000000 do
begin
S:= RndString;
if IsKeyWordByHash(PChar(S), Stop) then
aList.Add(S);
end;
SynEdit1.Lines:= aList;
end;

Title: Re: Checking Speed of SynEdit Highlighters
Post by: Martin_fr on November 05, 2013, 02:01:14 pm

And the point of this? Did you actually read what I wrote. And (since you did not ask questions) Did you understand it?

But first of all, I am writing this under the assumptions, you still try to "proove" that comparing the hash value, but skipping the string compare) is a solution ?

If that is not the case, then please make it clear.

I wrote

Quote

For a given set of keywords, there may be a hash, that if compared together with the length, will not allocate any other words (of the same length) to any of the hash values of any keyword.

To prove you have such a hash, you must try ALL combinations of letters of the length of each keyword.

Quote

And if you ever have to add another keyword, well....

So even if you ignore the last sentence (but why ignore it?)

What good is your code, that tests a subset of combinations ?

And if you did NOT ignore the last sentence of my quote:
What good would it be, to show it works with the current set of keywords, if tomorrow there will be a new keyword added (e.g. fpc adds a new modifier, like "deprecated", that gets highlighted, so must be added).
And then that new keyword may break it, and you need a new hash algorithm?

What good is all that?

Title: Re: Checking Speed of SynEdit Highlighters
Post by: Morton on November 05, 2013, 02:39:16 pm

Quote from: Martin_fr on November 05, 2013, 02:01:14 pm

And the point of this? Did you actually read what I wrote. And (since you did not ask questions) Did you understand it?

Looks rather like you do not understand.
It is plain and simple mathematics.

The maximal amount of different strings for a given range of lenths:
(number of different chars) * (maximal length) * (medium length).

For any given set of keywords:
The maximal amount of different strings determined by max and min length will be just a few thousands.

One has just to select a hash method that will never create the same hasch
for just a few thousands possible strings.

Every mathematician will confirm this is not an impossible task.

The principles of genetic programming could be used to compute the cheapest
possible hash method that can fulfill the requirment.

Title: Re: Checking Speed of SynEdit Highlighters
Post by: Martin_fr on November 05, 2013, 02:59:56 pm

Quote from: Morton on November 05, 2013, 02:39:16 pm

The maximal amount of different strings for a given range of lenths:
(number of different chars) * (maximal length) * (medium length).

For any given set of keywords:
The maximal amount of different strings determined by max and min length will be just a few thousands.

So you also compare the length, in addition to the hash value?

Then why did you not answer to:

Quote

But first of all, I am writing this under the assumptions, you still try to "proove" that comparing the hash value, but skipping the string compare) is a solution ?

Well yes, I did not explicitly exclude the length. But I would still think that this was a clear "hash value only, nothing else")

And just to be clear, adding the length to the hash value, by putting it for example in the upper byte (that is left empty by the rest of the hash => that counts still as 2 values in this case)

Yes if you do compare the length too, it is possible.

Yet (As I already wrote) leaves you to the danger, of having to change the hash algorithm, each time a keyword is added.

If you honour speed that much over maintainability

Title: Re: Checking Speed of SynEdit Highlighters
Post by: Morton on November 05, 2013, 03:26:05 pm

Quote from: Martin_fr on November 05, 2013, 02:59:56 pm

So you also compare the length, in addition to the hash value?

NO.

Quote

Then why did you not answer to:
Quote
But first of all, I am writing this under the assumptions, you still try to "proove" that comparing the hash value, but skipping the string compare) is a solution ?

I did.

Quote

If you honour speed that much over maintainability

It is maintainable.

Assumed one would be stupid enough to create a language with keys of up to 100 letters,
that could contain every of the 256 chars.
The result would be about 1.3 million possible strings.

The principles of genetic programming could be used to compute the cheapest
possible hash method that can fulfill the requirment .

Title: Re: Checking Speed of SynEdit Highlighters
Post by: Martin_fr on November 05, 2013, 03:37:20 pm

Quote from: Morton on November 05, 2013, 03:26:05 pm

Quote from: Martin_fr on November 05, 2013, 02:59:56 pm

So you also compare the length, in addition to the hash value?

NO.

Then I don't understand.

Quote

The maximal amount of different strings for a given range of lenths:
(number of different chars) * (maximal length) * (medium length).

But if you do not check the length....

The keywords may have a maximum length.

But other words are limited only by memory and disk space

Procedure NameWithABout500CharsOfLengthFOOFOFFBARxxxxxxxxxxx....

and longer. Those can happen. How do you know, they will not create a hash value, equal to that of a keyword?

Title: Re: Checking Speed of SynEdit Highlighters
Post by: Morton on November 05, 2013, 03:50:26 pm

Quote from: Martin_fr on November 05, 2013, 03:37:20 pm

But other words are limited only by memory and disk space

Does not matter.
Hashes will only be created for strings within the min and max lenths of the keywords.
The matcher will exclude close to 100% of the posible non keyword strings in this range before hashes are generated.

Title: Re: Checking Speed of SynEdit Highlighters
Post by: Martin_fr on November 05, 2013, 04:02:58 pm

Quote

Hashes will only be created for strings within the min and max lenths of the keywords.

Quote from: Morton on November 05, 2013, 03:26:05 pm

Quote from: Martin_fr on November 05, 2013, 02:59:56 pm

So you also compare the length, in addition to the hash value?
NO.

Those 2 statements directly contradict each other.

You do compare the length (compare is not limited to equality, it can also include range.)

Title: Re: Checking Speed of SynEdit Highlighters
Post by: Morton on November 05, 2013, 04:12:46 pm

Quote from: Martin_fr on November 05, 2013, 04:02:58 pm

Quote
Hashes will only be created for strings within the min and max lenths of the keywords.

Quote from: Morton on November 05, 2013, 03:26:05 pm
Quote from: Martin_fr on November 05, 2013, 02:59:56 pm

So you also compare the length, in addition to the hash value?
NO.

Those 2 statements directly contradict each other.

No they do not, as the length is only used for exclusion, not for comparing.

Title: Re: Checking Speed of SynEdit Highlighters
Post by: Martin_fr on November 05, 2013, 04:19:03 pm

Lets agree to disagree.

All else seems to lead nowhere.

Title: Re: Checking Speed of SynEdit Highlighters
Post by: Morton on November 05, 2013, 04:31:31 pm

Quote from: Martin_fr on November 05, 2013, 04:19:03 pm

Lets agree to disagree.

All else seems to lead nowhere.

It is plain and simple mathematics.
If you cannot do the mathematics yourself then
admit it and ask a mathematician if my claim is true.

Title: Re: Checking Speed of SynEdit Highlighters
Post by: Martin_fr on November 05, 2013, 04:52:20 pm

Ok, I have tried to keep this professional. If I have failed at any point up to here, I shall apologize.

As for your last statement (and the professional part ends here, no apologies for the following will be given):

I do not need to defend my skills. Those who know me (and I) will have their own judgement about my abilities.

As for you, if you are unable to close an argument without malicious insinuation about the other, then you have done well.

Please feel free to add any further reply. I shall disregard it.

Title: Re: Checking Speed of SynEdit Highlighters
Post by: Morton on November 05, 2013, 05:02:35 pm

@Martin_fr
So far you have not made any serious attempt to prove me wrong.
Does not look professional for me.

Title: Re: Checking Speed of SynEdit Highlighters
Post by: Edson on November 05, 2013, 05:07:37 pm

@Morton

Can you show a little of humilty?

We are trying to be patient with you.

For all the time I know, Martin is very professional.

Title: Re: Checking Speed of SynEdit Highlighters
Post by: Edson on November 05, 2013, 05:16:13 pm

By the way, I'm good on mathematics (but not in English) and your formula is wrong:

The maximal amount of different strings for a given range of lenths is not:

(number of different chars) * (maximal length) * (medium length).

It is:

(number of different chars)^n + (number of different chars)^(n+1) + ... + (number of different chars) ^m

Where:
n -> minimal length
m -> maximal length

And it grows to the infinite with just a little "n" and "m".

Title: Re: Checking Speed of SynEdit Highlighters
Post by: Morton on November 05, 2013, 05:39:49 pm

Quote from: Edson on November 05, 2013, 05:16:13 pm

By the way, I'm good on mathematics (but not in English) and your formula is wrong:

At last one has noticed my mistake, which I did not by accident.
It clearly showed that there was not made any attempt to follow my arguments,
eg. by checking my formula.
Yours is wrong too.

Title: Re: Checking Speed of SynEdit Highlighters
Post by: Martin_fr on November 05, 2013, 06:13:45 pm

Lets ignore the question, if "excluding words of certain length" is "taking the length into comparison" or not. It does not matter.

What matters is, that it lead to a misunderstanding, and that some of my statements while this misunderstanding lasted, where based on that words of any length would be hashed.

Yes I did not read the formula. For all I can see, the exact formula does not matter.

What matters is, for any hash that results in a hash value in a finite range (e.g. the hash value is a 32 or 64 bit integer):

1) If words of arbitrary length (and that would mean an infinite amount of possible words) would be hashed, then to any chosen word, there may be another word with the same value.

2) If (as now clarified) only words within a specified range of lengths are used, then the amount of words is finite [1] too. Within such are finite list of words (even if the list has more entries, than there are hash values), a hash can be found, that will for a given subset of those words (the pascal identifiers) produce hash values, that are not produced by any other word in the full list.

[1] and finite is all that matters, hence I did not bother with the formula, and still do not.

I have not tested your hash, if it fulfils the 2nd point. It may well do.
Your test with using "random" is not a guarantee, but could be modified to iterate all possible words.

----------------
Remains the point, what happens if a new keyword is added.

I assume you use an int32 or int64? In which case the total amount of words is bigger than the amount of hash values, so outside the current keyword-list there are clashes (but they do not harm).

If the new keyword is found to have a clash with any other random word from the full list, then you need to find a new hash algorithm (or enable comparing the word itself, but that was what you try to avoid).

To me, the chance that I may have to redo the hash algorithm, falls under undesired maintenance cost.
On that you may disagree, but then everyone has to make there own decision what they see as maintainable (so long as it is ones own code / in a team different rules apply)

Title: Re: Checking Speed of SynEdit Highlighters
Post by: Morton on November 05, 2013, 06:26:36 pm

Sorry Edson, but again you are not nearly as smart as you belive.

Example:
Lenght 2: (number of different chars) * 2 eg. 26*2 = 52.
Lenght 3: (number of different chars) * 3 eg. 26*3 = 78
Lenght 4: (number of different chars) * 4 eg. 26*4 = 104
Lenght 5: (number of different chars) * 5 eg. 26*5 = 130
Lenght 6: (number of different chars) * 6 eg. 26*6 = 156

Sumary 520

(number of different chars) * (maximal length) * (medium length).
Eg. 26 * 6*4 = 624.

The result tends to be rather to large.
which makes the formula a safe approximation :D

Title: Re: Checking Speed of SynEdit Highlighters
Post by: Morton on November 05, 2013, 06:32:57 pm

Quote from: Martin_fr on November 05, 2013, 06:13:45 pm

Your test with using "random" is not a guarantee, but could be modified to iterate all possible words.

Would be a bit less than (number of different chars) * (maximal length) * (medium length).
For Pascal less than 2548 different strings, including the keywords.

Title: Re: Checking Speed of SynEdit Highlighters
Post by: Edson on November 05, 2013, 06:39:34 pm

@Morton,

Review the formula:

Quote

(number of different chars)^n + (number of different chars)^(n+1) + ... + (number of different chars) ^m

The symbol "^" mean exponent, "powered at".

I wanted to express that like a summatory of a serie, but it would have more confused.

Title: Re: Checking Speed of SynEdit Highlighters
Post by: Edson on November 05, 2013, 06:45:03 pm

Example:

Lenght 2: (number of different chars) ^ 2 eg. 26^2 = 676.
Lenght 3: (number of different chars) ^ 3 eg. 26^3 = 17576
Lenght 4: (number of different chars) ^ 4 eg. 26^4 = 456976
Lenght 5: (number of different chars) ^ 5 eg. 26^5 = 11881376
Lenght 6: (number of different chars) ^ 6 eg. 26^6 = 308915776

Sumary A LOT!!!!!

Title: Re: Checking Speed of SynEdit Highlighters
Post by: Martin_fr on November 05, 2013, 07:04:38 pm

Quote from: Morton on November 05, 2013, 06:32:57 pm

Quote from: Martin_fr on November 05, 2013, 06:13:45 pm
Your test with using "random" is not a guarantee, but could be modified to iterate all possible words.

Would be a bit less than (number of different chars) * (maximal length) * (medium length).
For Pascal less than 2548 different strings, including the keywords.

Please see Edson's formula.
implementation = 14 char long

26^14 = 64509974703297150976 (if I copied it correctly from my calculator)

+ 26^13 + 26 ^ 12 ....

Title: Re: Checking Speed of SynEdit Highlighters
Post by: Morton on November 05, 2013, 07:14:26 pm

Quote from: Edson on November 05, 2013, 06:39:34 pm

Lenght 2: (number of different chars) ^ 2 eg. 26^2 = 676.
Lenght 3: (number of different chars) ^ 3 eg. 26^3 = 17576
Lenght 4: (number of different chars) ^ 4 eg. 26^4 = 456976
Lenght 5: (number of different chars) ^ 5 eg. 26^5 = 11881376
Lenght 6: (number of different chars) ^ 6 eg. 26^6 = 308915776

Sumary A LOT!!!!!

ROFL .
That realy hurts. :'(

Did you never have learned to use your brain and not to rely on things you have swallowed once without reflection.

Close to 100% (more than 99.99% ) can be excluded before a hasch needs to be computed.

The correct formula for the amount of those who cannot be as easily excluded is:
(number of different chars) * (maximal length) * (((maximal length) div 2)+1) -
(number of different chars) * (minimal length-1) * (((minimal length-1) div 2)+1) .

But this amount can be reduced significantly too.

Edson: I would reccoment you to use your brain and not simply repeat formulas without thinking, which someone has trown towards you at school.

Title: Re: Checking Speed of SynEdit Highlighters
Post by: Morton on November 05, 2013, 07:20:48 pm

Quote from: Martin_fr on November 05, 2013, 07:04:38 pm

Quote from: Morton on November 05, 2013, 06:32:57 pm
Quote from: Martin_fr on November 05, 2013, 06:13:45 pm
Your test with using "random" is not a guarantee, but could be modified to iterate all possible words.

Would be a bit less than (number of different chars) * (maximal length) * (medium length).
For Pascal less than 2548 different strings, including the keywords.

Please see Edson's formula.
implementation = 14 char long

26^14 = 64509974703297150976 (if I copied it correctly from my calculator)

+ 26^13 + 26 ^ 12 ....

I would realy have expected more from someone insisting to be a professional.

That is a moronic repetition of formulas once swallowed without understanding.
Oh Herr schmeiß hirn ra.
Oh Lord throw some brain down here.

Is it too much to ask, first to reflect by using some real brainpower.
I cannot see any professional approach here.
What I see is plainly childish behaviour.

Title: Re: Checking Speed of SynEdit Highlighters
Post by: Edson on November 05, 2013, 07:43:23 pm

Morton, When are you going to accept you are wrong.?

Failed in your code (several times), failed in your formula, and still insult us.
For respect, I am not going to insult you.

I prefer to ignore all your attacks, and no waste my time checking your code.

Title: Re: Checking Speed of SynEdit Highlighters
Post by: Morton on November 05, 2013, 08:20:42 pm

Quote from: Edson on November 05, 2013, 07:43:23 pm

Failed in your code (several times)

I was just hoping someone would see the obvious solution.
I am a bit lazy.
If you do not check, then you may miss an opportunity to learn something.
But that is not my problem.

With the 1 million random strings:
On average ItemOfByHash is only called about three times more as keywords have been found.
Pretty good ratio, which would be better for real code.
Obviously my formula does not fail,
the computed amount is rather much to large.

Title: Re: Checking Speed of SynEdit Highlighters
Post by: Morton on November 06, 2013, 01:16:41 pm

On real code ItemOf or ItemOfByHash are called only a few times more
than keywords are found.
For a file containing at least one occurrence of every keyword there will likely be
never more than 100 different hashes created, including the hashes for the keywords.

For readability identifiers have to follow the rules of human languages,
which reduces the amount of possible identifiers a huge lot.
The list of keywords creates another set of rules for identifiers which could be possibly keywords.
Finally there will be only slightly more identifiers which could be possibly keywords,
than keywords.

For such a small amount it is possible to create unique hashes. :D

For all this very obvious reasons 26^14 = 64509974703297150976 is just ridiculous.
I cannot understand how someone could ever come up with this.

Sorry, should have been ItemOf or ItemOfByHash instead of IsKeyword or IsKeywordByHash.
Fixed.

Title: Re: Checking Speed of SynEdit Highlighters
Post by: Morton on November 07, 2013, 04:52:39 pm

No comments ?
Has anyone had a close look at the last version of the matcher ?

Title: Re: Checking Speed of SynEdit Highlighters
Post by: Morton on November 11, 2013, 07:49:51 pm

I have modified the pascal highlighter to use the matcher and the keylist
instead.
Somewhere I made a mistake which prevents correct folding of subroutines.
The code becomes more readable, as only one key is handled per event.

Title: Re: Checking Speed of SynEdit Highlighters
Post by: irfanbagus on November 12, 2013, 07:20:42 am

@Morton

this is what i did :

1. reduce your hash function output to smallint.
reason : my pc not fast enough to search all combination with 32bit (cardinal) output.
but i also did :
2. use maximum keyword length 5 chars. pascal keyword need 14 char
3. find hash for keyword 'end' (12251)
4. try to find hash for all 2-5 chars ('a'-'z') combination which equal to 12251.

result :
find 185 combination

uqdc=12251
wwje=12251
....
....
....
azlrz=12251
dfzwz=12251

with your original hash function (cardinal output) did not make any difference because all keyword combinations (14 char) bigger than cardinal capacity.

Title: Re: Checking Speed of SynEdit Highlighters
Post by: Morton on November 12, 2013, 07:58:39 am

Quote from: irfanbagus on November 12, 2013, 07:20:42 am

@Morton

this is what i did :

1. reduce your hash function output to smallint.
reason : my pc not fast enough to search all combination with 32bit (cardinal) output.
but i also did :
2. use maximum keyword length 5 chars. pascal keyword need 14 char
3. find hash for keyword 'end' (12251)
4. try to find hash for all 2-5 chars ('a'-'z') combination which equal to 12251.

result :
find 185 combination

uqdc=12251
wwje=12251
....
....
....
azlrz=12251
dfzwz=12251

with your original hash function (cardinal output) did not make any difference because all keyword combinations (14 char) bigger than cardinal capacity.

None of these combinations will pass the matrix
and thus there will never be a hash generated for them.
Never learned to read?

Title: Re: Checking Speed of SynEdit Highlighters
Post by: irfanbagus on November 12, 2013, 08:52:42 am

Quote from: Morton on November 12, 2013, 07:58:39 am

None of these combinations will pass the matrix
and thus there will never be a hash generated for them.
Never learned to read?

your code said otherwise. i read your code and you should did too.

i am done. bye.

Title: Re: Checking Speed of SynEdit Highlighters
Post by: Morton on November 12, 2013, 09:01:04 am

Gods themselves struggle in vain against stupidity.
Life is to short for such a waste of time.

Title: Re: Checking Speed of SynEdit Highlighters
Post by: Morton on November 12, 2013, 09:44:51 am

Quote from: irfanbagus on November 12, 2013, 08:52:42 am

Quote from: Morton on November 12, 2013, 07:58:39 am
None of these combinations will pass the matrix
and thus there will never be a hash generated for them.
Never learned to read?

your code said otherwise. i read your code and you should did too.

i am done. bye.

you know, and can nothing.
Learn programming first, but likely you will fail.
How can one dare to show one's lack of everything in the public ?

procedure TForm1.Button1Click(Sender: TObject);
begin
Label1.Caption:= IntToStr(HashSmallInt(Edit1.Text));
Label2.Caption:= IntToStr(HashCardinal(Edit1.Text));
Label3.Caption:= IntToStr(HashQWord(Edit1.Text));
end;

function TForm1.HashSmallInt(const S: String): SmallInt;
var
P, P2: PChar;
begin
P:= PChar(S);
P2:= P + Length(S);
Result := 0;
while P < P2 do
begin
Result := Result shl 10 - Result shl 5 - Result + Ord(P^);
inc(P);
end;
end;

function TForm1.HashCardinal(const S: String): Cardinal;
var
P, P2: PChar;
begin
P:= PChar(S);
P2:= P + Length(S);
Result := 0;
while P < P2 do
begin
Result := Result shl 10 - Result shl 5 - Result + Ord(P^);
inc(P);
end;
end;

function TForm1.HashQWord(const S: String): QWord;
var
P, P2: PChar;
begin
P:= PChar(S);
P2:= P + Length(S);
Result := 0;
while P < P2 do
begin
Result := Result shl 10 - Result shl 5 - Result + Ord(P^);
inc(P);
end;
end;

BTW for QWord the hash could be made much faster by first simply moving up to 8 chars
into the QWord and starting bitwise from the ninth.
A set could take even 32 chars.
Would work best for case sensitive languages,
but insensitive would be no problem too.

Title: Re: Checking Speed of SynEdit Highlighters
Post by: JanRoza on November 12, 2013, 12:44:31 pm

Maybe it's time a moderator took a look at this topic as I think some people really show they have no manners at all.
Is it so hard to discuss something without insulting the other?
Morton, please learn some manners and try to behave in this forum.
The way you behave now I think more and more people will just ignore you as they have had enough of your insults.

Title: Re: Checking Speed of SynEdit Highlighters
Post by: Morton on November 12, 2013, 01:01:20 pm

Quote from: JanRoza on November 12, 2013, 12:44:31 pm

Maybe it's time a moderator took a look at this topic as I think some people really show they have no manners at all.
Is it so hard to discuss something without insulting the other?
Morton, please learn some manners and try to behave in this forum.
The way you behave now I think more annd more people will just ignore you as they have had enough of your insults.

The code above shows that irfanbagus has insulted me and himself.
Obviously he does not know that the result of bitwise manipulation of data types depends on the data types.
Thats very very basic knowledge.
irfanbagus has in very clear words told me that I am a moron,
but without being aware he has told something not flattering about himself.

I have only stated the obvious.
That is by no means an insult.

Title: Re: Checking Speed of SynEdit Highlighters
Post by: BBaz on November 12, 2013, 06:53:43 pm

I would not like to invalidate 20 pages of discussions in a forum but I think that you forgot 1 thing:
- lexing/parsing in an text editor (to get the lexical token) is not the same as lexing/parsing in a compiler (to get the grammatical syntax tree...)

1: in a text editor: when you change a char, only a page is re-lexed/re-parsed, in synEdit it'll also depend on the ranges
2: in a compiler: the whole stuff is lexed..tokenized...parsed...

So in synEdit, performances are not so important...many times when you change a line the whole document will not be rescanned...
For example, if you are editing a 50 000 linee sourcecode file, and if you change a char, do you expect the whole stuff to be rescanned ?
>:( ...
No of course... only the screen range will be rescanned and maybe a bit of code before...
::)

Title: Re: Checking Speed of SynEdit Highlighters
Post by: Morton on November 12, 2013, 07:04:30 pm

Quote from: BBaz on November 12, 2013, 06:53:43 pm

>:( ...
No of course... only the screen range will be rescanned and maybe a bit of code before...
::)

Usually only a single line will be scanned,
but some chars will cause a scan from the modified line to the end of the whole code.

The thread was started by one who felt urged to show how smart he is.

Title: Re: Checking Speed of SynEdit Highlighters
Post by: BBaz on November 12, 2013, 07:25:54 pm

Quote from: Morton on November 12, 2013, 07:04:30 pm

Quote from: BBaz on November 12, 2013, 06:53:43 pm
>:( ...
No of course... only the screen range will be rescanned and maybe a bit of code before...
::)

Usually only a single line will be scanned,
but some chars will cause a scan from the modified line to the end of the whole code.

The thread was started by one who felt urged to show how smart he is.

It's not right for example if I type {*, the whole doc doesn't need to be analyzed, if the text displayed on my screen doesn't include the *}...
Because idealy the highlighter will have set the flags for: "scanStringEnd, scanComment End etc...)...

Title: Re: Checking Speed of SynEdit Highlighters
Post by: BBaz on November 12, 2013, 08:08:04 pm

Ok...I've read this:

"The thread was started by one who felt urged to show how smart he is.".

It's ok ok.what I mean is that he makes some obstruse tests on a whole document
while the methods used are only designed to work on 20, 40 or 80 lines let's say.
A syn HL is not designed to scan the whole doc...each time yout type... 8)

Title: Re: Checking Speed of SynEdit Highlighters
Post by: Morton on November 12, 2013, 08:10:03 pm

Quote from: BBaz on November 12, 2013, 07:25:54 pm

Quote from: Morton on November 12, 2013, 07:04:30 pm
Quote from: BBaz on November 12, 2013, 06:53:43 pm
>:( ...
No of course... only the screen range will be rescanned and maybe a bit of code before...
::)

Usually only a single line will be scanned,
but some chars will cause a scan from the modified line to the end of the whole code.

The thread was started by one who felt urged to show how smart he is.

It's not right for example if I type {*, the whole doc doesn't need to be analyzed, if the text displayed on my screen doesn't include the *}...
Because idealy the highlighter will have set the flags for: "scanStringEnd, scanComment End etc...)...

Could be done, but the scan does not take nearly as much CPU resources
as just querying for mouse and keyboard events, while processing eg. selections.
Most code can be scanned thousand times per second.
Even 1 MB code can be scanned 30 - 70 times per second,
depending on the average token length.

Title: Re: Checking Speed of SynEdit Highlighters
Post by: Morton on November 13, 2013, 07:03:13 pm

Regardless how many times I have been told to be a moron
from people who have rendered themself as clueless and usually
completely incompetent too: The code just works.
I gave you something that allows to write much cleaner
and easier to maintain code for highlighters.

Take it or let it, I do not care.
It was allways my stance people could do with my code everything they are capable of,
which wasn't usually that much with few exceptions

Title: Re: Checking Speed of SynEdit Highlighters
Post by: JanRoza on November 14, 2013, 12:34:12 am

Quote

It was allways my stance people could do with my code everything they are capable of,
which wasn't usually that much with few exceptions

I think your tone is very arrogant.
I do not wish to be called incompetent by someone who does not know anything about my capabilities.
This kind of remarks tells more about your manners than it does about other forum members.
You'll probably disagree as you seem to think your the smartest person in the world but just forget about programming for a while and work on your manners.
Maybe then you'll get more positive response from other forum members.

These will be the last words I spend on you, there is more pleasant reading elsewhere on this forum.

Title: Re: Checking Speed of SynEdit Highlighters
Post by: Morton on November 14, 2013, 04:14:48 pm

Quote from: JanRoza on November 14, 2013, 12:34:12 am

[I think your tone is very arrogant.

ROFLMAO.

You do not even know the meaning of "arrogant".
I have just expressed confidence.
I do not adorn myself with borrowed plumes.