Recent

Author Topic: [SOLVED] How to speed up this looping cycle  (Read 33385 times)

BigChimp

  • Hero Member
  • *****
  • Posts: 5740
  • Add to the wiki - it's free ;)
    • FPCUp, PaperTiger scanning and other open source projects
Re: [SOLVED] How to speed up this looping cycle
« Reply #45 on: July 25, 2014, 07:40:31 am »
I don't see any optimization. Could you try it with:
-dMD5OPTIMIZE -OoREGVAR -O3 -OoUSEEBP
Attached, using *cough* -OoUSERBP, and the rest on x64. Speed still about the same as the unoptimized version...
Want quicker answers to your questions? Read http://wiki.lazarus.freepascal.org/Lazarus_Faq#What_is_the_correct_way_to_ask_questions_in_the_forum.3F

Open source including papertiger OCR/PDF scanning:
https://bitbucket.org/reiniero

Lazarus trunk+FPC trunk x86, Windows x64 unless otherwise specified

marcov

  • Administrator
  • Hero Member
  • *
  • Posts: 11383
  • FPC developer.
Re: [SOLVED] How to speed up this looping cycle
« Reply #46 on: July 25, 2014, 09:45:27 am »
(in the source this can be done use {$optimization useebp} and {$optimization userbp}

Note userbp<>useebp.

BigChimp

  • Hero Member
  • *****
  • Posts: 5740
  • Add to the wiki - it's free ;)
    • FPCUp, PaperTiger scanning and other open source projects
Re: [SOLVED] How to speed up this looping cycle
« Reply #47 on: July 25, 2014, 10:20:09 am »
Thanks!

Note userbp<>useebp.
Ehm yes, I think that that is sufficiently clear going by the thread above ;)
Want quicker answers to your questions? Read http://wiki.lazarus.freepascal.org/Lazarus_Faq#What_is_the_correct_way_to_ask_questions_in_the_forum.3F

Open source including papertiger OCR/PDF scanning:
https://bitbucket.org/reiniero

Lazarus trunk+FPC trunk x86, Windows x64 unless otherwise specified

BigChimp

  • Hero Member
  • *****
  • Posts: 5740
  • Add to the wiki - it's free ;)
    • FPCUp, PaperTiger scanning and other open source projects
Re: [SOLVED] How to speed up this looping cycle
« Reply #48 on: July 25, 2014, 10:34:06 am »
(in the source this can be done use {$optimization useebp} and {$optimization userbp}

Note userbp<>useebp.
Implemented; committed together with test programs to
https://bitbucket.org/reiniero/flocate
This project uses md5 calcs; temporarily added the sha1/md5 test files.

It seems unlikely Gizmo is ever going to provide a patch to the bugtracker; I'm happy to do it... especially as engkin is doing all the work & if engkin comes up with a SHA1 assembler version ;)
Want quicker answers to your questions? Read http://wiki.lazarus.freepascal.org/Lazarus_Faq#What_is_the_correct_way_to_ask_questions_in_the_forum.3F

Open source including papertiger OCR/PDF scanning:
https://bitbucket.org/reiniero

Lazarus trunk+FPC trunk x86, Windows x64 unless otherwise specified

Gizmo

  • Hero Member
  • *****
  • Posts: 831
Re: [SOLVED] How to speed up this looping cycle
« Reply #49 on: July 25, 2014, 10:42:34 am »
Only for fear of messing something up big chimp. Not laziness.

BigChimp

  • Hero Member
  • *****
  • Posts: 5740
  • Add to the wiki - it's free ;)
    • FPCUp, PaperTiger scanning and other open source projects
Re: [SOLVED] How to speed up this looping cycle
« Reply #50 on: July 25, 2014, 10:48:38 am »
Well, patches are always reviewed by the developer looking at them before committing...
Want quicker answers to your questions? Read http://wiki.lazarus.freepascal.org/Lazarus_Faq#What_is_the_correct_way_to_ask_questions_in_the_forum.3F

Open source including papertiger OCR/PDF scanning:
https://bitbucket.org/reiniero

Lazarus trunk+FPC trunk x86, Windows x64 unless otherwise specified

engkin

  • Hero Member
  • *****
  • Posts: 3112
Re: [SOLVED] How to speed up this looping cycle
« Reply #51 on: July 25, 2014, 01:10:06 pm »
... & if engkin comes up with a SHA1 assembler version ;)
I assume you mean MD5 assembler version. Here it is:
Code: [Select]
procedure MD5Transform(var Context: TMDContext; Buffer: Pointer);assembler;
var
  pContext: ^TMDContext;
  pBuffer: Pointer;
  a, b, c, d: Cardinal;
  Block: array[0..15] of Cardinal;
{$I md5transform.inc}  //<---- in the attachment

It did not provide much gain on my system. Let's see if it does any better on x64 systems.

BigChimp

  • Hero Member
  • *****
  • Posts: 5740
  • Add to the wiki - it's free ;)
    • FPCUp, PaperTiger scanning and other open source projects
Re: [SOLVED] How to speed up this looping cycle
« Reply #52 on: July 25, 2014, 01:31:17 pm »
Whoops, yes md5. Thanks. Will test later & get back..
Want quicker answers to your questions? Read http://wiki.lazarus.freepascal.org/Lazarus_Faq#What_is_the_correct_way_to_ask_questions_in_the_forum.3F

Open source including papertiger OCR/PDF scanning:
https://bitbucket.org/reiniero

Lazarus trunk+FPC trunk x86, Windows x64 unless otherwise specified

BigChimp

  • Hero Member
  • *****
  • Posts: 5740
  • Add to the wiki - it's free ;)
    • FPCUp, PaperTiger scanning and other open source projects
Re: [SOLVED] How to speed up this looping cycle
« Reply #53 on: July 27, 2014, 10:11:59 am »
It did not provide much gain on my system. Let's see if it does any better on x64 systems.
1. Compiled with i386, it gives 11 seconds versus 13 seconds original code... however
2. The RFC1321 (md5) test suite fails - all hashes are different... have I screwed something up? (See attachment)
Want quicker answers to your questions? Read http://wiki.lazarus.freepascal.org/Lazarus_Faq#What_is_the_correct_way_to_ask_questions_in_the_forum.3F

Open source including papertiger OCR/PDF scanning:
https://bitbucket.org/reiniero

Lazarus trunk+FPC trunk x86, Windows x64 unless otherwise specified

engkin

  • Hero Member
  • *****
  • Posts: 3112
Re: [SOLVED] How to speed up this looping cycle
« Reply #54 on: July 28, 2014, 05:54:41 pm »
Sorry, it seems that I uploaded a wrong file. But I think I have some good news:
1- I assumed Invert procedure in md5 unit is identical to the one used in sha1 unit. That is not true:
Code: Text  [Select][+][-]
  1. // inverts the bytes of (Count div 4) cardinals from source to target.
  2. procedure Invert(Source, Dest: Pointer; Count: PtrUInt);
  3. ...
  4.     T^ := S[0] or (S[1] shl 8) or (S[2] shl 16) or (S[3] shl 24);
  5.  

Code: Text  [Select][+][-]
  1. // inverts the bytes of (Count div 4) cardinals from source to target.
  2. procedure Invert(Source, Dest: Pointer; Count: PtrUInt);
  3. ...
  4.     T^ := S[3] or (S[2] shl 8) or (S[1] shl 16) or (S[0] shl 24);
  5.  
On Intel CPUs md5 version of Invert does nothing useful, which means we can get rid of it and also get rid of Block and turn the pascal code to:
Code: [Select]
{$OPTIMIZATION USEEBP} //PEEPHOLE
procedure MD5Transform_NoCalls(var Context: TMDContext; Buffer: Pointer);
type
  TBlock = array[0..15] of Cardinal;
  PBlock = ^TBlock;
var
  a, b, c, d: Cardinal;
  //Block: array[0..15] of Cardinal absolute Buffer;
  Block: PBlock absolute Buffer;
begin
  //Invert(Buffer, @Block, 64);
  a := Context.State[0];
  b := Context.State[1];
  c := Context.State[2];
  d := Context.State[3];

{$push}
{$r-,q-}

{$I md5nocalls.inc}  //<--- in the attachment

  inc(Context.State[0],a);
  inc(Context.State[1],b);
  inc(Context.State[2],c);
  inc(Context.State[3],d);
{$pop}
  inc(Context.Length,64);
end;
{$OPTIMIZATION DEFAULT}

2-The assembler version that does the same is:
Code: [Select]
procedure MD5Transform_Assembler3(var Context: TMDContext; Buffer: Pointer);assembler;
var
  pContext: ^TMDContext;
  pBuffer: Pointer;
  a, b, c, d: Cardinal;
  //Block: array[0..15] of Cardinal;
{$I md5transform3.inc}  //<---- in the attachment

3-Working on MD5Transform will not boost the speed if MD5Transform is not used extensively. The speed tests used by BigChimp does not use big data, unlike his tests on sha1, and MD5Transform is not a bottleneck here. Based on that I suggest testing something like "MD5 of a million 'a' symbols":
Code: [Select]
var
  StartTime: TDateTime;
  EndTime: TDateTime;
  i: integer;
  s,ss: string;
...
begin
  // MD5 of a million 'a' symbols
  SetLength(s, 1000000);
  for i := 1 to 1000000 do s[i] := 'a';

  StartTime:=now;
  for i := 0 to 1000 do
    ss := LowerCase(MDPrint(MDString(s, MD_VERSION_5)));
  EndTime:=now;
  Log('Performance test finished. Elapsed time:',[]);
  Log(TimeToStr(EndTime-StartTime),[]);

Gizmo's tests on HDs are more realistic and these optimizations were proposed for that kind of usage specifically. I assume BigChimp's flocate should also benefit from it.

On my system I have a little bit over three times boost in speed using the pascal/assembler versions. Both seem to be close to each other in speed. Hopefully I did not upload another wrong file. Waiting your tests.
« Last Edit: July 28, 2014, 05:58:47 pm by engkin »

Gizmo

  • Hero Member
  • *****
  • Posts: 831
Re: [SOLVED] How to speed up this looping cycle
« Reply #55 on: July 28, 2014, 06:31:26 pm »
If I get chance tonight, I will try to implement and test. Well done Engkin!

BigChimp

  • Hero Member
  • *****
  • Posts: 5740
  • Add to the wiki - it's free ;)
    • FPCUp, PaperTiger scanning and other open source projects
Re: [SOLVED] How to speed up this looping cycle
« Reply #56 on: July 29, 2014, 11:07:46 am »
Thanks for all the work!

3-Working on MD5Transform will not boost the speed if MD5Transform is not used extensively. The speed tests used by BigChimp does not use big data, unlike his tests on sha1, and MD5Transform is not a bottleneck here. Based on that I suggest testing something like "MD5 of a million 'a' symbols":
Thanks; committed your test; 8 seconds versus 3 seconds (Pascal version) on my system.

Based on the above I think submitting an assembler version for md5 will not be very useful as presumably both x64_64 and i386 can profit from the Pascal version..

Edit: tested with x64 compiler - 9 seconds in both cases.
relevant added code is
Code: [Select]
{$IFDEF CPUI386}
{$OPTIMIZATION USEEBP} //PEEPHOLE
{$ENDIF}
{$IFDEF CPUX86_64}
{$OPTIMIZATION USERBP} //PEEPHOLE
{$ENDIF}
Strange...
« Last Edit: July 29, 2014, 11:23:50 am by BigChimp »
Want quicker answers to your questions? Read http://wiki.lazarus.freepascal.org/Lazarus_Faq#What_is_the_correct_way_to_ask_questions_in_the_forum.3F

Open source including papertiger OCR/PDF scanning:
https://bitbucket.org/reiniero

Lazarus trunk+FPC trunk x86, Windows x64 unless otherwise specified

engkin

  • Hero Member
  • *****
  • Posts: 3112
Re: [SOLVED] How to speed up this looping cycle
« Reply #57 on: July 30, 2014, 05:26:53 am »
8 seconds versus 3 seconds (Pascal version) on my system.
Are you saying that the modified Pascal version is faster than the assembly version on your system? If so, then I assume the version of the compiler you are using is doing a better job than the older one I'm using.

Based on the above I think submitting an assembler version for md5 will not be very useful as presumably both x64_64 and i386 can profit from the Pascal version..
I agree.

Edit: tested with x64 compiler - 9 seconds in both cases.
relevant added code is
Code: [Select]
{$IFDEF CPUI386}
{$OPTIMIZATION USEEBP} //PEEPHOLE
{$ENDIF}
{$IFDEF CPUX86_64}
{$OPTIMIZATION USERBP} //PEEPHOLE
{$ENDIF}
Strange...
It seems that the compiler is not doing as good on x64 as it did on i386. Thanks to you, I did modify the assembly code a little bit and the speed exceeded four times on my system. Sorry, could not resist trying:
Code: [Select]
procedure MD5Transform_Assembler4(var Context: TMDContext; Buffer: Pointer);assembler;
var
  pContext: ^TMDContext;
  pBuffer: Pointer;
  a, b, c, d: Cardinal;
  //Block: array[0..15] of Cardinal;
{$I md5transform4.inc}  //<--- again, in the attachment

Basically, it is identical code to the previous one. Just changed the order of instructions so that next instruction is not dependent on current one, if possible.

Gizmo

  • Hero Member
  • *****
  • Posts: 831
Re: [SOLVED] How to speed up this looping cycle
« Reply #58 on: July 31, 2014, 12:24:16 am »
OK, I have managed to implement the SHA1 assemler version (thanks to EngKin for the {$ASMMODE INTEL} compiler directive tip!)

SHA1
For files, running on a 32-bit Windows virtual machine hosted on a 64-bit AMD Linux workstation I saw :

An average of 13 to 14 seconds using the Pascal enhanced SHA1Transform for a 1Gb file
An average of 6-8 seconds using the new assembly compiled SHA1Transform for a 1Gb file

For disks, running on the same platform :

Well, I need to check my time computations. They seem to average about 7Gb per minute (which is still very fast), but from time to time, it leaps to nearly 14gb per minute (which, if true, is mind blowingly fast considering my workstation is about 6 years old!)!! I'm not sure if that is due to disk activity, disk caching etc, or if it';s the way my program is computing the time. I will need to do more tests on native hardware rather than through virtualisation.

For MD5
An average of 10 seconds using the Pascal enhanced MDTransform for a 1Gb file
An average of 5 seconds using the new assembly compileded MDTransform for a 1Gb file.

So, on my system, with this architecture, both look very promising. 50% for files with MD5. And around 40% for files with SHA1!

EngKins new paragraph to his CV (or 'Resume' if he's American) : Developed the fastest open source transform procedure on Earth using assembly!
« Last Edit: July 31, 2014, 12:58:54 am by Gizmo »

BigChimp

  • Hero Member
  • *****
  • Posts: 5740
  • Add to the wiki - it's free ;)
    • FPCUp, PaperTiger scanning and other open source projects
Re: [SOLVED] How to speed up this looping cycle
« Reply #59 on: July 31, 2014, 10:06:56 am »
8 seconds versus 3 seconds (Pascal version) on my system.
Are you saying that the modified Pascal version is faster than the assembly version on your system? If so, then I assume the version of the compiler you are using is doing a better job than the older one I'm using.
No, I was comparing the original pascal version with your optimized Pascal version - because you posted i386 only machine code that you said was about as fast as the optimized pascal code. Using the optimized Pascal code could not only be useful for i386 but also x64 and perhaps(?) other processors with enough registers as well...

so (as we stand now) giving the assembler for i386 and the original code for all other platforms/if MD5SLOW is defined could be a good idea, right?
(Unless other x64 users with different procs do see significant savings...)
Want quicker answers to your questions? Read http://wiki.lazarus.freepascal.org/Lazarus_Faq#What_is_the_correct_way_to_ask_questions_in_the_forum.3F

Open source including papertiger OCR/PDF scanning:
https://bitbucket.org/reiniero

Lazarus trunk+FPC trunk x86, Windows x64 unless otherwise specified

 

TinyPortal © 2005-2018