Recent

Author Topic: fyi benchmarks game k-nucleotide  (Read 16060 times)

AlexK

  • Jr. Member
  • **
  • Posts: 55
Re: fyi benchmarks game k-nucleotide
« Reply #15 on: May 31, 2016, 06:29:07 am »
I changed hash functor to a simple:

Code: Pascal  [Select][+][-]
  1. class function hashfunctor.hash(key : hashkey; buckets : sizeUint) : sizeUint; inline;
  2. begin
  3.    result := key.data mod buckets;
  4. end;

But the code is still very slow.

Reading a 25 Mb file from stdin into an ansistring takes 0.6 secs on my machine.
C++ does it almost immediately with std::string, 0.08 secs.

C++ memory management and data structures are optimized to astounding levels.

Now I'm checking if refcounting of ansistring slows things down in a function that is called many million times.


Also I couldn't understand why does THashMap return 1 for the key that doesn't exist in hashmap.
I tried to find something about this undefined behaviour in GetData function of THashMap:

Code: Pascal  [Select][+][-]
  1. function THashmap.GetData(key:TKey):TValue;inline;
  2. var i,h,bs:longint;
  3. begin
  4.    h:=Thash.hash(key,FData.size);
  5.    bs:=(FData[h]).size;
  6.    for i:=0 to bs-1 do begin
  7.       if (((FData[h])[i]).Key=key) then exit(((FData[h])[i]).Value);
  8.    end;
  9. end;

If bs is 0, then the for loop is not executed, so what is the return value of this function? What is the Result? Couldn't find in any docs for fpc or delphi.
« Last Edit: May 31, 2016, 06:46:22 am by AlexK »

molly

  • Hero Member
  • *****
  • Posts: 2330
Re: fyi benchmarks game k-nucleotide
« Reply #16 on: May 31, 2016, 07:02:58 am »
That is supposed to be a library unit with a generic type. A library doesn't know about your unit. You are using that library, not library uses your unit.
Fair enough.

Quote
I explained it in initial comments.
It was not clear to me that the specialization was not allowed to be part of the unit containing the generic (which is another possible solution). Although i realize that would work restricting. It was not clear for me from reading the assignment descriptions alone.

Quote
You can only specialize library generic with your type in your unit. For operator overloading to work, it must be a part of the type you specialize library generic with, not a part of a unit's interface.
As mentioned above, the specialization can also take part in the generics unit itself but that does work restricting in that you can't do additional specialization with using global operators (unless you incorporate those operators again). It is the nature of how globals work: you are required to include them to be able to use them.

That's why i wrote that i don't recommend using it but that it is possible to do (depending on the situation).

Apologies for my mix-up and the corresponding erroneous follow ups.

AlexK

  • Jr. Member
  • **
  • Posts: 55
Re: fyi benchmarks game k-nucleotide
« Reply #17 on: May 31, 2016, 07:11:46 am »
I doubt that THashMap will be able to cope with such aggressive key insertion.

But in regard to the input reading, is this how it's usually done:

Code: Pascal  [Select][+][-]
  1. procedure main;
  2. var
  3.    buf : shortstring;
  4.    inStr : ansistring;
  5.    //i : byte;
  6. begin
  7.    //for i := 0 to length(tochr)-1 do begin tonum[ord(tochr[i])] := i; end;
  8.    repeat readln(buf) until AnsiStartsStr('>THREE', buf); // seek
  9.    repeat
  10.       readln(buf);
  11.       if buf[0] <> ';' then inStr := inStr + buf;
  12.    until eof(input);
  13.    inStr := upcase(inStr);
  14. end;
  15.  
  16. begin
  17.    main;
  18. end.

From C++:

Code: C  [Select][+][-]
  1. int main()
  2. {
  3.   init();
  4.   std::string input;
  5.   char buffer[256];
  6.   while (fgets(buffer,100,stdin) && memcmp(">THREE",buffer,6)!=0);
  7.   while (fgets(buffer,100,stdin) && buffer[0] != '>')
  8.     {
  9.       if (buffer[0] != ';')
  10.         {
  11.           input.append(buffer,strlen(buffer)-1);
  12.         }
  13.     }
  14.   std::transform(input.begin(),input.end(),input.begin(),::toupper);
  15. }


UPD:

Looked at reverse-complement benchmark, optimization is to use pchar buffers with getmem and reallocmem.
« Last Edit: May 31, 2016, 09:48:21 pm by AlexK »

AlexK

  • Jr. Member
  • **
  • Posts: 55
Re: fyi benchmarks game k-nucleotide
« Reply #18 on: May 31, 2016, 08:16:32 am »
It is the nature of how globals work: you are required to include them to be able to use them.
That's why i wrote that i don't recommend using it but that it is possible to do (depending on the situation).

Overloaded operators for records are standalone objects that can be exported in interface section, independent from the record types they operate on. Something like that.
As I noted, official documentation omits this idiosyncrasy.
But.. On the other hand records are not OOP, they don't receive messages, they are operated on. So this approach with operator overloading for records is explainable.

{$modeswitch advancedrecords} does it the right way. Operator overloading is defined by a record's type(like classes).
User can specialize imported library generic by such record and when those library generic methods are compiled in their separate unit, they "see" that the record type overrides operators.
« Last Edit: May 31, 2016, 09:49:56 pm by AlexK »

serbod

  • Full Member
  • ***
  • Posts: 142
Re: fyi benchmarks game k-nucleotide
« Reply #19 on: May 31, 2016, 09:22:39 am »
Try TStringHash from IniFiles unit.

Memory allocation take time, allocate memory for 1000 records at once is much faster, than do 1000 allocations for each record. Do not resize arrays, strings or memory chunks - it cause data move.

Use SmallString or String[nn] type, they stack-based and don't use reference counting.

Use 'const' or 'var' prefix for string or record parameter, it give pointer to data instead of copy.
« Last Edit: May 31, 2016, 09:28:20 am by serbod »

serbod

  • Full Member
  • ***
  • Posts: 142
Re: fyi benchmarks game k-nucleotide
« Reply #20 on: May 31, 2016, 02:35:44 pm »
My implementation from scratch:
Code: Pascal  [Select][+][-]
  1. program knucleotide;
  2.  
  3. uses SysUtils, Classes, IniFiles;
  4.  
  5. type
  6.   TNuclCountRec = record
  7.     Name: string;
  8.     Count: Integer;
  9.   end;
  10.   TNuclCountArr = array of TNuclCountRec;
  11.  
  12. var
  13.   shNucl: TStringHash;
  14.   NCArr: TNuclCountArr;
  15.   NCArrLen: Integer;
  16.   Total1, Total2: Integer;
  17.  
  18. procedure CountSeq(const sSeq: String);
  19. var
  20.   n: integer;
  21. begin
  22.   n := shNucl.ValueOf(sSeq);
  23.   if n = -1 then
  24.   begin
  25.     n := NCArrLen;
  26.     Inc(NCArrLen);
  27.     SetLength(NCArr, NCArrLen);
  28.     NCArr[n].Name := sSeq;
  29.     NCArr[n].Count := 1;
  30.     shNucl.Add(sSeq, n);
  31.   end
  32.   else
  33.   begin
  34.     Inc(NCArr[n].Count);
  35.   end;
  36. end;
  37.  
  38. procedure CountFrame(Len: Integer; const sFrame: ShortString);
  39. var
  40.   i, n: integer;
  41.   s: string;
  42. begin
  43.   n := Length(sFrame);
  44.   for i := 1 to Len do
  45.   begin
  46.     s := Copy(sFrame, i, 1);
  47.     CountSeq(s);
  48.     Inc(Total1);
  49.  
  50.     if i < n then // last pair
  51.     begin
  52.       s := Copy(sFrame, i, 2);
  53.       CountSeq(s);
  54.       Inc(Total2);
  55.     end;
  56.  
  57.     s := Copy(sFrame, i, 18);
  58.     if Pos('GGT', s) = 1 then CountSeq('GGT');
  59.     if Pos('GGTA', s) = 1 then CountSeq('GGTA');
  60.     if Pos('GGTATT', s) = 1 then CountSeq('GGTATT');
  61.     if Pos('GGTATTTTAATT', s) = 1 then CountSeq('GGTATTTTAATT');
  62.     if Pos('GGTATTTTAATTTATAGT', s) = 1 then CountSeq('GGTATTTTAATTTATAGT');
  63.   end;
  64. end;
  65.  
  66. function SortItems(Item1, Item2: Pointer): Integer;
  67. begin
  68.   Result := TNuclCountRec(Item2^).Count - TNuclCountRec(Item1^).Count;
  69. end;
  70.  
  71. procedure ExportFreq(n, TotalCount: Integer);
  72. var
  73.   i: Integer;
  74.   d: Double;
  75.   lst: TList;
  76. begin
  77.   lst := TList.Create();
  78.   try
  79.     for i := 0 to NCArrLen-1 do
  80.     begin
  81.       if Length(NCArr[i].Name) <> n then Continue;
  82.       lst.Add(@NCArr[i]);
  83.     end;
  84.  
  85.     lst.Sort(@SortItems);
  86.  
  87.     for i := 0 to lst.Count-1 do
  88.     begin
  89.       d := TNuclCountRec(lst[i]^).Count / (TotalCount / 100);
  90.       WriteLn(TNuclCountRec(lst[i]^).Name+' '+FormatFloat('0.000', d));
  91.     end;
  92.   finally
  93.     lst.Free();
  94.   end;
  95. end;
  96.  
  97. procedure ExportCount(sSeq: string);
  98. var
  99.   i, n: Integer;
  100. begin
  101.   n := 0;
  102.   for i := 0 to NCArrLen-1 do
  103.   begin
  104.     if NCArr[i].Name = sSeq then
  105.     begin
  106.       n := NCArr[i].Count;
  107.       Break;
  108.     end;
  109.   end;
  110.   WriteLn(IntToStr(n) + #$09 + sSeq);
  111. end;
  112.  
  113. procedure Main();
  114. var
  115.   s, sPrevFrame: ShortString;
  116.   bSkip: Boolean;
  117.   iFrameLen: Integer;
  118. begin
  119.   shNucl := TStringHash.Create(16);
  120.   NCArrLen := 0;
  121.   SetLength(NCArr, NCArrLen);
  122.   bSkip := True;
  123.   Total1 := 0;
  124.   Total2 := 0;
  125.   sPrevFrame := '';
  126.   iFrameLen := 0;
  127.   FormatSettings.DecimalSeparator := '.';
  128.  
  129.   while True do
  130.   begin
  131.     ReadLn(s);
  132.     if Length(s) = 0 then Break;
  133.  
  134.     if bSkip then
  135.     begin
  136.       if Pos('THREE', s) > 0 then
  137.       begin
  138.         bSkip := False;
  139.       end;
  140.       Continue;
  141.     end;
  142.  
  143.     // prev frame + 18
  144.     if iFrameLen > 0 then
  145.     begin
  146.       if Length(sPrevFrame) <> (iFrameLen + 18) then
  147.         SetLength(sPrevFrame, iFrameLen + 18);
  148.       Move(s[1], sPrevFrame[iFrameLen+1], 18);
  149.       CountFrame(iFrameLen, UpperCase(sPrevFrame));
  150.     end
  151.     else
  152.     begin
  153.       SetLength(sPrevFrame, Length(s) + 18);
  154.     end;
  155.     iFrameLen := Length(s);
  156.     Move(s[1], sPrevFrame[1], iFrameLen);
  157.   end;
  158.  
  159.   SetLength(sPrevFrame, iFrameLen);
  160.   CountFrame(iFrameLen, UpperCase(sPrevFrame));
  161.  
  162.   shNucl.Free();
  163.  
  164.   // export result
  165.   ExportFreq(1, Total1);
  166.   WriteLn();
  167.   ExportFreq(2, Total2);
  168.   WriteLn();
  169.   ExportCount('GGT');
  170.   ExportCount('GGTA');
  171.   ExportCount('GGTATT');
  172.   ExportCount('GGTATTTTAATT');
  173.   ExportCount('GGTATTTTAATTTATAGT');
  174. end;
  175.  
  176. begin
  177.   Main();
  178. end.
  179.  

AlexK

  • Jr. Member
  • **
  • Posts: 55
Re: fyi benchmarks game k-nucleotide
« Reply #21 on: May 31, 2016, 08:30:01 pm »
Do not resize arrays, strings or memory chunks - it cause data move.

THashMap uses dynamic arrays for its key/value buckets(default memory manager).
When data grows, THashMap constantly expands or contracts key/value arrays during keys rehashing.
C++ libraries more likely optimize those things by direct management of memory buffers, strategy of allocation, etc.

It would be interesting to see in free time how THashMap could be optimized with direct memory handling.
« Last Edit: May 31, 2016, 09:37:56 pm by AlexK »

AlexK

  • Jr. Member
  • **
  • Posts: 55
Re: fyi benchmarks game k-nucleotide
« Reply #22 on: May 31, 2016, 08:55:00 pm »
My implementation from scratch:

With threading(if doable) it would be close to C++ version, but still ~2 times slower. Different algorithms.

On 25Mb FASTA file:
C++   0m1.41s  (2 cores)
FPC    0m5.38s  (1 core)
« Last Edit: May 31, 2016, 09:44:17 pm by AlexK »

serbod

  • Full Member
  • ***
  • Posts: 142
Re: fyi benchmarks game k-nucleotide
« Reply #23 on: June 01, 2016, 10:55:48 am »
Same with some optimizations and multithreading:

Code: Pascal  [Select][+][-]
  1. { Sergey Bodrov (serbod@gmail.com) 2016-06-01 }
  2. program knucleotide;
  3.  
  4. uses SysUtils, Classes, IniFiles, syncobjs;
  5.  
  6. const THREADS_NUM = 4; // set max number of parallel threads
  7.  
  8. type
  9.   TNuclCountRec = record
  10.     Name: string;
  11.     Count: Integer;
  12.   end;
  13.   TNuclCountArr = array of TNuclCountRec;
  14.  
  15.   { TWorker }
  16.  
  17.   TWorker = class(TThread)
  18.   protected
  19.     procedure Execute(); override;
  20.   public
  21.     sFrame: ShortString;
  22.     Len: Integer;
  23.     Done: Boolean;
  24.   end;
  25.  
  26. var
  27.   shNucl: TStringHash;
  28.   NCArr: TNuclCountArr;
  29.   NCArrLen: Integer;
  30.   Total1, Total2: Integer;
  31.   WorkersCount: Integer;
  32.   lock: TCriticalSection;
  33.   WorkersList: TList;
  34.  
  35. procedure CountSeq(const sSeq: String);
  36. var
  37.   n: integer;
  38. begin
  39.   n := shNucl.ValueOf(sSeq);
  40.   if n = -1 then
  41.   begin
  42.     lock.Acquire();
  43.     n := NCArrLen;
  44.     Inc(NCArrLen);
  45.     SetLength(NCArr, NCArrLen);
  46.     NCArr[n].Name := sSeq;
  47.     NCArr[n].Count := 1;
  48.     shNucl.Add(sSeq, n);
  49.     lock.Release();
  50.   end
  51.   else
  52.   begin
  53.     InterLockedIncrement(NCArr[n].Count);
  54.   end;
  55. end;
  56.  
  57. procedure CountFrame(Len: Integer; const sFrame: ShortString);
  58. var
  59.   i, n: integer;
  60.   s: string;
  61. begin
  62.   n := Length(sFrame);
  63.   for i := 1 to Len do
  64.   begin
  65.     s := Copy(sFrame, i, 1);
  66.     CountSeq(s);
  67.     InterLockedIncrement(Total1);
  68.  
  69.     if i < n then // last pair
  70.     begin
  71.       s := Copy(sFrame, i, 2);
  72.       CountSeq(s);
  73.       InterLockedIncrement(Total2);
  74.     end;
  75.  
  76.     s := Copy(sFrame, i, 18);
  77.     if Pos('ggt', s) = 1 then
  78.     begin
  79.       CountSeq('ggt');
  80.       if Pos('ggta', s) = 1 then
  81.       begin
  82.         CountSeq('ggta');
  83.         if Pos('ggtatt', s) = 1 then
  84.         begin
  85.           CountSeq('ggtatt');
  86.           if Pos('ggtattttaatt', s) = 1 then
  87.           begin
  88.             CountSeq('ggtattttaatt');
  89.             if Pos('ggtattttaatttatagt', s) = 1 then CountSeq('ggtattttaatttatagt');
  90.           end;
  91.         end;
  92.       end;
  93.     end;
  94.   end;
  95. end;
  96.  
  97. function SortItems(Item1, Item2: Pointer): Integer;
  98. begin
  99.   Result := TNuclCountRec(Item2^).Count - TNuclCountRec(Item1^).Count;
  100. end;
  101.  
  102. procedure ExportFreq(n, TotalCount: Integer);
  103. var
  104.   i: Integer;
  105.   d: Double;
  106.   lst: TList;
  107. begin
  108.   lst := TList.Create();
  109.   try
  110.     for i := 0 to NCArrLen-1 do
  111.     begin
  112.       if Length(NCArr[i].Name) <> n then Continue;
  113.       lst.Add(@NCArr[i]);
  114.     end;
  115.  
  116.     lst.Sort(@SortItems);
  117.  
  118.     for i := 0 to lst.Count-1 do
  119.     begin
  120.       d := TNuclCountRec(lst[i]^).Count / (TotalCount / 100);
  121.       WriteLn(UpperCase(TNuclCountRec(lst[i]^).Name)+' '+FormatFloat('0.000', d));
  122.     end;
  123.   finally
  124.     lst.Free();
  125.   end;
  126. end;
  127.  
  128. procedure ExportCount(sSeq: string);
  129. var
  130.   i, n: Integer;
  131. begin
  132.   n := 0;
  133.   for i := 0 to NCArrLen-1 do
  134.   begin
  135.     if NCArr[i].Name = sSeq then
  136.     begin
  137.       n := NCArr[i].Count;
  138.       Break;
  139.     end;
  140.   end;
  141.   WriteLn(IntToStr(n) + #$09 + UpperCase(sSeq));
  142. end;
  143.  
  144. procedure SpawnWorker(Len: Integer; const sFrame: ShortString);
  145. var
  146.   Worker: TWorker;
  147.   i: Integer;
  148. begin
  149.  
  150.   if WorkersList.Count < THREADS_NUM then
  151.   begin
  152.     Worker := TWorker.Create(True);
  153.     Worker.FreeOnTerminate := False;
  154.     WorkersList.Add(Worker);
  155.   end
  156.   else
  157.   begin
  158.     Worker := nil;
  159.     while WorkersCount >= THREADS_NUM do Sleep(0);
  160.     while not Assigned(Worker) do
  161.     begin
  162.       for i := 0 to WorkersList.Count-1 do
  163.       begin
  164.         if TWorker(WorkersList[i]).Done then
  165.         begin
  166.           Worker := TWorker(WorkersList[i]);
  167.         end;
  168.       end;
  169.     end;
  170.   end;
  171.  
  172.   Worker.Len := Len;
  173.   Worker.sFrame := sFrame;
  174.   Worker.Done := False;
  175.   Worker.Suspended := False;
  176.  
  177. end;
  178.  
  179. procedure Main();
  180. var
  181.   s, sPrevFrame: ShortString;
  182.   bSkip: Boolean;
  183.   iFrameLen: Integer;
  184.   ii: Integer;
  185. begin
  186.   shNucl := TStringHash.Create(16);
  187.   lock := TCriticalSection.Create();
  188.   WorkersList := TList.Create();
  189.  
  190.   NCArrLen := 0;
  191.   SetLength(NCArr, NCArrLen);
  192.   bSkip := True;
  193.   Total1 := 0;
  194.   Total2 := 0;
  195.   sPrevFrame := '';
  196.   iFrameLen := 0;
  197.   FormatSettings.DecimalSeparator := '.';
  198.   WorkersCount := 0;
  199.   ii := 0;
  200.  
  201.   while True do
  202.   begin
  203.     ReadLn(s);
  204.     if Length(s) = 0 then Break;
  205.  
  206.     if bSkip then
  207.     begin
  208.       if Pos('>THREE', s) > 0 then
  209.       begin
  210.         bSkip := False;
  211.       end;
  212.       Continue;
  213.     end;
  214.  
  215.     Inc(ii);
  216.  
  217.     // prev frame + 18
  218.     if iFrameLen > 0 then
  219.     begin
  220.       if Length(sPrevFrame) <> (iFrameLen + 18) then
  221.         SetLength(sPrevFrame, iFrameLen + 18);
  222.       Move(s[1], sPrevFrame[iFrameLen+1], 18);
  223.       if THREADS_NUM > 1 then
  224.         SpawnWorker(iFrameLen, sPrevFrame)
  225.       else
  226.         CountFrame(iFrameLen, sPrevFrame);
  227.     end
  228.     else
  229.     begin
  230.       SetLength(sPrevFrame, Length(s) + 18);
  231.     end;
  232.     iFrameLen := Length(s);
  233.     Move(s[1], sPrevFrame[1], iFrameLen);
  234.   end;
  235.  
  236.   SetLength(sPrevFrame, iFrameLen);
  237.   CountFrame(iFrameLen, sPrevFrame);
  238.  
  239.   while WorkersCount > 0 do Sleep(1);
  240.  
  241.   WorkersList.Free();
  242.   lock.Free();
  243.   shNucl.Free();
  244.  
  245.   // export result
  246.   ExportFreq(1, Total1);
  247.   WriteLn();
  248.   ExportFreq(2, Total2);
  249.   WriteLn();
  250.   ExportCount('ggt');
  251.   ExportCount('ggta');
  252.   ExportCount('ggtatt');
  253.   ExportCount('ggtattttaatt');
  254.   ExportCount('ggtattttaatttatagt');
  255. end;
  256.  
  257. { TWorker }
  258.  
  259. procedure TWorker.Execute;
  260. begin
  261.   while not Terminated do
  262.   begin
  263.     if not Done then
  264.     begin
  265.       InterlockedIncrement(WorkersCount);
  266.       CountFrame(Len, sFrame);
  267.       InterlockedDecrement(WorkersCount);
  268.       Done := True;
  269.     end;
  270.     Sleep(0);
  271.   end;
  272. end;
  273.  
  274. begin
  275.   Main();
  276. end.
  277.  

AlexK

  • Jr. Member
  • **
  • Posts: 55
Re: fyi benchmarks game k-nucleotide
« Reply #24 on: June 01, 2016, 10:43:13 pm »
Same with some optimizations and multithreading

It's 2.1 times slower than your first sequential version, for some reason. I added cthreads unit in the uses clause to compile it.
But faster than my port of the C++ version.

It gives slightly distorted results for long strings due to a critical bug in ghashmap unit, that I fixed locally(one line).
It's slow in memory management, tens of thousands of dynamic arrays are constantly resized in library implementation. Would be interesting to optimize ghashmap some time.
Code: Pascal  [Select][+][-]
  1. {Alex Karbivnichiy}
  2.  
  3. {$mode objfpc}
  4. {$modeswitch advancedrecords}
  5.  
  6. uses sysutils, strutils, ghashmap, gmap, gutil;
  7.  
  8. const tochr : array[0..3] of char = ('A', 'C', 'T', 'G');
  9. var   tonum : array[0..255] of byte;
  10.  
  11. type
  12.    hashkey = record
  13.       data : qword; size : byte;
  14.       class operator = (lkey, rkey : hashkey) : boolean;
  15.    end;
  16.    hashfunctor = class
  17.    public class function hash(key : hashkey; buckets : sizeUint) : sizeUint; inline;
  18.    end;
  19.    hashtable = specialize THashMap<hashkey, longint, hashfunctor>;
  20.  
  21. class operator hashkey.= (lkey, rkey : hashkey) : boolean;
  22. begin
  23.    result := lkey.data = rkey.data;
  24. end;
  25.  
  26. procedure reset(var hk : hashkey; inStr : ansistring; beg : longword; fin : longword); inline;
  27. begin
  28.    hk.size := fin - beg;
  29.    hk.data := 0;
  30.    repeat
  31.       hk.data := hk.data shl 2;
  32.       hk.data := hk.data or qword(tonum[ord(inStr[beg+1])]);
  33.       inc(beg);
  34.    until beg = fin;
  35. end;
  36.  
  37. function keyToStr(hk : hashkey) : ansistring;
  38. var
  39.    tmp : ansistring = '';
  40.    tmp1 : qword;
  41.    i : longword = 0;
  42. begin
  43.    tmp1 := hk.data;
  44.    while i <> hk.size do begin
  45.       tmp := tmp + tochr[tmp1 and 3];
  46.       tmp1 := tmp1 shr 2;
  47.       inc(i);
  48.    end;
  49.    result := tmp;
  50. end;
  51.  
  52. class function hashfunctor.hash(key : hashkey; buckets : sizeUint) : sizeUint; inline;
  53. begin
  54.    result := key.data mod buckets;
  55. end;
  56.  
  57. function calculate(inStr : ansistring; size : byte; beg : byte = 0) : hashtable;
  58. var
  59.    frequencies : hashtable;
  60.    k : hashkey;
  61.    i : integer;
  62. begin
  63.    frequencies := hashtable.create;
  64.    for i:=beg to length(inStr)+1-size do begin
  65.       reset(k, inStr, i, i+size);
  66.       frequencies[k] := frequencies[k] + 1;
  67.    end;
  68.    result := frequencies;
  69. end;
  70.  
  71. procedure writeFrequencies(inStr : ansistring; size : byte);
  72. type
  73.    stgreater = specialize TGreater<longword>;
  74.    stmap = specialize TMap<longword, string, stgreater>;
  75. var
  76.    sum : longword;
  77.    frequencies: hashtable;
  78.    freq : stmap;
  79.    iter1 : hashtable.TIterator;
  80.    iter2 : stmap.TIterator;
  81.    psum : double;
  82. begin
  83.    sum := length(inStr) + 1 - size;
  84.    frequencies := calculate(inStr, size);
  85.    freq := stmap.create;
  86.    iter1 := frequencies.iterator;
  87.    repeat
  88.       freq.insert(iter1.value, keyToStr(iter1.key));
  89.    until not iter1.next;
  90.    iter2 := freq.min;
  91.    repeat
  92.       if sum <> 0 then psum := double(100 * iter2.key) / sum else psum := 0.0;
  93.       writeln(format('%s  %-2.3f', [iter2.value, psum]));
  94.    until not iter2.next;
  95.    writeln;
  96. end;
  97.  
  98. procedure writeCount(inStr : ansistring; str : ansistring);
  99. var
  100.    size : byte;
  101.    frequencies : hashtable;
  102.    k : hashkey;
  103. begin
  104.    size := length(str);
  105.    frequencies := calculate(inStr, size);
  106.    reset(k, str, 0, size);
  107.  
  108.    writeln(format('%-7d  %s', [frequencies[k], str]));
  109. end;
  110.  
  111. procedure main;
  112. var
  113.    buf : shortstring;
  114.    inStr : ansistring;
  115.    i : byte;
  116. begin
  117.    for i := 0 to length(tochr)-1 do begin tonum[ord(tochr[i])] := i; end;
  118.    repeat readln(buf) until AnsiStartsStr('>THREE', buf); // seek
  119.    repeat
  120.       readln(buf);
  121.       if buf[1] <> ';' then inStr := inStr + buf;
  122.    until eof(input);
  123.    
  124.    inStr := upcase(inStr);
  125.  
  126.    writeFrequencies(inStr, 1);
  127.    writeFrequencies(inStr, 2);
  128.    writeCount(inStr, 'GGT');
  129.    writeCount(inStr, 'GGTA');
  130.    writeCount(inStr, 'GGTATT');
  131.    writeCount(inStr, 'GGTATTTTAATT');
  132.    writeCount(inStr, 'GGTATTTTAATTTATAGT');
  133. end;
  134.  
  135. begin
  136.    main;
  137. end.
  138.  

Critical bug in ghashmap, added line 9:

Code: Pascal  [Select][+][-]
  1. function THashmap.GetData(key:TKey):TValue;inline;
  2. var i,h,bs:longint;
  3. begin
  4.    h:=Thash.hash(key,FData.size);
  5.    bs:=(FData[h]).size;
  6.    for i:=0 to bs-1 do begin
  7.       if (((FData[h])[i]).Key=key) then exit(((FData[h])[i]).Value);
  8.    end;
  9.    result := default(TValue); // added this line
  10. end;
« Last Edit: June 01, 2016, 11:40:38 pm by AlexK »

AlexK

  • Jr. Member
  • **
  • Posts: 55
Re: fyi benchmarks game k-nucleotide
« Reply #25 on: June 07, 2016, 05:32:59 am »
I rewrote THashMap, and now the test shows same performance as C++. I still have to refactor some code.

One interesting discovery, -gv option gives additional ~11% speedup to the -O4 option.

UPD:

-gv enforces cmem unit, so that's why.
« Last Edit: June 07, 2016, 06:25:27 am by AlexK »

serbod

  • Full Member
  • ***
  • Posts: 142
Re: fyi benchmarks game k-nucleotide
« Reply #26 on: June 07, 2016, 10:39:43 am »
Quote
The work is to use the ordinary built-in or library hash table implementation to accumulate count values - retrieve the count value for a key, update the count value for that key. Don't optimize away the work.

 :(

Leledumbo

  • Hero Member
  • *****
  • Posts: 8757
  • Programming + Glam Metal + Tae Kwon Do = Me
Re: fyi benchmarks game k-nucleotide
« Reply #27 on: June 07, 2016, 11:14:07 am »
Quote
The work is to use the ordinary built-in or library hash table implementation to accumulate count values - retrieve the count value for a key, update the count value for that key. Don't optimize away the work.

 :(
Doesn't say that optimizing the underlying library is not allowed ;)
If AlexK submit his THashMap optimization as patches, then we will get the optimization for free.

AlexK

  • Jr. Member
  • **
  • Posts: 55
Re: fyi benchmarks game k-nucleotide
« Reply #28 on: June 20, 2016, 07:11:07 pm »
C++ version virtually creates a binary Trie(https://en.wikipedia.org/wiki/Trie). Uses an optimized collisionless hash function for this particular use case(only 4 letters), and uses GCC's hashmap library as an expandable array(FPC has dynamic arrays for that).
I switched to creating a more general character Trie in FPC, will come back to it later.

 

TinyPortal © 2005-2018