Recent

Author Topic: [SOLVED] How to speed up this looping cycle  (Read 33387 times)

engkin

  • Hero Member
  • *****
  • Posts: 3112
Re: [SOLVED] How to speed up this looping cycle
« Reply #30 on: July 17, 2014, 09:07:07 pm »
1) There are definate speed enhancements to SHA1 between 20%-35% in my tests. I've also added the same transform function to the file hashing angle of QuickHash and speed improvements are gained there too of around the same amount, filesystem issues permitting.
Just for testing, you might want to try the same compiler I'm using. It's in a package by TrueTom called Laz4Android used to be here:  (laz4android1.1-41139). It seems the compiler optimizations are getting changed and you will not see the same results I am getting using a different version of the compiler.

2) Further to our private message, I have incorporated the MD5 code for MD5Transform, too. Again, files that took 2m 27s in earlier versions of QuickHash now compute is 1m 36s after a reboot! So that's an astounding improvement with no filesystem caching involved. Clearly the Transform functions in the FPC hashing libraries have been in need of some tender love and care.
Glad you didn't mind switching back to public thread. I believe you should get better results. Just try a small project with nothing more than MD5 hashing using laz4android1.1-41139. As you mentioned previously, your compiler does not recognize USEEBP which has a great impact on MD5Transform.

3) You and I both agree that for the benefit of Freepascal itself and the other forum members, it would be useful if you post the MD5 implementation too, just as you did for SHA1. Then we have a public record of our fixes and tests. So we can then submit patches for both MD5 and SHA1 at the same time. I am happy to contribute the completed and commented customised modules from QuickHash if it helps.
MD5Transform code:
Code: [Select]
procedure MD5Transform(var Context: TMDContext; Buffer: Pointer);
var
  a, b, c, d: Cardinal;
  Block: array[0..15] of Cardinal;
begin
  Invert(Buffer, @Block, 64);
  a := Context.State[0];
  b := Context.State[1];
  c := Context.State[2];
  d := Context.State[3];

{$push}
{$r-,q-}

{$I md5replaced.txt}  //<---- from the attachment

  inc(Context.State[0],a);
  inc(Context.State[1],b);
  inc(Context.State[2],c);
  inc(Context.State[3],d);
{$pop}
  inc(Context.Length,64);
end;   
md5replaced.inc is in the attachment. I got a major speed boost when I used -OoUSEEBP

4) So come on then...spill the beans! Help us all understand exactly why and how your code works so much faster? I am intrigued and must confess to not fully understanding it at the moment. I undertsadn about rotation in cryptographic ciphers and stuff, and I think this is what teh Transform functions do. But I don't understand why your structures work so much better than the currenft FPC 2.6.4 implementation.
The current implementation is fine, readable and maintainable and I would not be able to do "beans" without it. The changes I did were inspired by Avra's post (Thanks Avra)
What I did is already documented on Wikipedia and elsewhere. For sha1 was mainly "loop unrolling", while for md5 simply canceling calls and replacing them with the code being called, similar to using "inline".

Adding -OoUSEEBP here boosted the speed a lot because Intel CPUs have a limited number of registers (32 bit CPUs have EAX, EBX, ECX, EDX, ESI, EDI, EBP, ESP) last two are usually reserved (by the compiler) to deal with the stack, and to refer to parameters and variables. That leaves only six registers. If a code segment needs more than six registers, the compiler generates extra code to save registers to temporary memory locations which slows down the speed. -OoUSEEBP instructs the compiler to use EBP as a general register which enhances the speed by increasing the number of available registers from six to seven. In the case of MD5Transform the enhancement was huge (at least on my system  :P ) because the code that was generated by the compiler needed seven registers precisely.

BigChimp

  • Hero Member
  • *****
  • Posts: 5740
  • Add to the wiki - it's free ;)
    • FPCUp, PaperTiger scanning and other open source projects
Re: How to speed up this looping cycle
« Reply #31 on: July 19, 2014, 12:58:47 pm »
So the question is, can Engkins suggestion be incorporated into the Freepascal trunk for the SHA1 Unit, and probably also the MD5 one (adjusted for the algorithm of course)? It results in the same hashes but computes them much faster.
Contribute by providing a patch, don't forget to rerun existing tests (if available).
Any news on this? Even posting the existing code+instructions on the bugtracker would help...
Want quicker answers to your questions? Read http://wiki.lazarus.freepascal.org/Lazarus_Faq#What_is_the_correct_way_to_ask_questions_in_the_forum.3F

Open source including papertiger OCR/PDF scanning:
https://bitbucket.org/reiniero

Lazarus trunk+FPC trunk x86, Windows x64 unless otherwise specified

engkin

  • Hero Member
  • *****
  • Posts: 3112
Re: [SOLVED] How to speed up this looping cycle
« Reply #32 on: July 23, 2014, 11:48:27 pm »
@Gizmo, for more speed and to get rid of optimization options, I tried SHA1Transform in assembly code:
Code: [Select]
procedure SHA1Transform(var ctx: TSHA1Context; const Buf: Pointer);assembler;
var
  pctx:^TSHA1Context;
  pbuf:pointer;
  A, B, C, D, E, T: Cardinal;
  Data: array[0..15] of Cardinal;
{$I sha1transform.txt}  //<---- in the attachment
It is close to three times faster on my system.

So the question is, can Engkins suggestion be incorporated into the Freepascal trunk for the SHA1 Unit, and probably also the MD5 one (adjusted for the algorithm of course)? It results in the same hashes but computes them much faster.
Contribute by providing a patch, don't forget to rerun existing tests (if available).
Any news on this? Even posting the existing code+instructions on the bugtracker would help...
Good question.

avra

  • Hero Member
  • *****
  • Posts: 2514
    • Additional info
Re: [SOLVED] How to speed up this looping cycle
« Reply #33 on: July 24, 2014, 08:56:52 am »
It is close to three times faster on my system.
:)  8-)  :)
ct2laz - Conversion between Lazarus and CodeTyphon
bithelpers - Bit manipulation for standard types
pasettimino - Siemens S7 PLC lib

BigChimp

  • Hero Member
  • *****
  • Posts: 5740
  • Add to the wiki - it's free ;)
    • FPCUp, PaperTiger scanning and other open source projects
Re: [SOLVED] How to speed up this looping cycle
« Reply #34 on: July 24, 2014, 12:34:58 pm »
It is close to three times faster on my system.
14 seconds versus 1 minutes 1 second on my system - about 4 times ;)
Test program attached.

Regarding the md5 version - I'm not seeing any performance improvement with fpc trunk x86, Windows- or probably doing something wrong?
No difference if I compile mdtest.lpi with and without -dMD5OPTIMIZE -OoUSEEBP in extra options...!?!?
Test program attached

Another question: the md5 optimization probably works on i386 but not x64, right?
Want quicker answers to your questions? Read http://wiki.lazarus.freepascal.org/Lazarus_Faq#What_is_the_correct_way_to_ask_questions_in_the_forum.3F

Open source including papertiger OCR/PDF scanning:
https://bitbucket.org/reiniero

Lazarus trunk+FPC trunk x86, Windows x64 unless otherwise specified

avra

  • Hero Member
  • *****
  • Posts: 2514
    • Additional info
Re: [SOLVED] How to speed up this looping cycle
« Reply #35 on: July 24, 2014, 03:30:56 pm »
Maybe someone will find interesting this Pascal Hashing:
http://www.partow.net/programming/hashfunctions/index.html
ct2laz - Conversion between Lazarus and CodeTyphon
bithelpers - Bit manipulation for standard types
pasettimino - Siemens S7 PLC lib

engkin

  • Hero Member
  • *****
  • Posts: 3112
Re: [SOLVED] How to speed up this looping cycle
« Reply #36 on: July 24, 2014, 05:38:22 pm »
It is close to three times faster on my system.
14 seconds versus 1 minutes 1 second on my system - about 4 times ;)
Test program attached.
Thanks for testing.

Regarding the md5 version - I'm not seeing any performance improvement with fpc trunk x86, Windows- or probably doing something wrong?
No difference if I compile mdtest.lpi with and without -dMD5OPTIMIZE -OoUSEEBP in extra options...!?!?
Test program attached
That's good news. Either the compiler is doing the same "inlining" and/or the performance of your CPU is not affected by the big number of push/call instructions.

Another question: the md5 optimization probably works on i386 but not x64, right?
Yes, this is for i386. Since x64 has plenty of registers, using -OoUSEEBP would not boost performance in this case.

Leledumbo

  • Hero Member
  • *****
  • Posts: 8747
  • Programming + Glam Metal + Tae Kwon Do = Me
Re: [SOLVED] How to speed up this looping cycle
« Reply #37 on: July 24, 2014, 05:48:00 pm »
Another question: the md5 optimization probably works on i386 but not x64, right?
For x64, use -OoUSERBP

BigChimp

  • Hero Member
  • *****
  • Posts: 5740
  • Add to the wiki - it's free ;)
    • FPCUp, PaperTiger scanning and other open source projects
Re: [SOLVED] How to speed up this looping cycle
« Reply #38 on: July 24, 2014, 06:00:42 pm »
Thanks, used -OoUSERBP and not surprisingly got perhaps 1 second (13 secs versus 14) with the test program.
Using a fairly old Intel i7 processor.

Looks like the sha1 assembler implementation is a good candidate for a patch while the md5 one less so..
Want quicker answers to your questions? Read http://wiki.lazarus.freepascal.org/Lazarus_Faq#What_is_the_correct_way_to_ask_questions_in_the_forum.3F

Open source including papertiger OCR/PDF scanning:
https://bitbucket.org/reiniero

Lazarus trunk+FPC trunk x86, Windows x64 unless otherwise specified

engkin

  • Hero Member
  • *****
  • Posts: 3112
Re: [SOLVED] How to speed up this looping cycle
« Reply #39 on: July 24, 2014, 06:16:37 pm »
Thanks, used -OoUSERBP and not surprisingly got perhaps 1 second (13 secs versus 14) with the test program.
Using a fairly old Intel i7 processor.
The reason might be hiding in the assembly files. I would take a look if you don't mind sharing them (-al).

Looks like the sha1 assembler implementation is a good candidate for a patch while the md5 one less so..
I'll work on an assembly version for md5 and see if it boosts the speed.

BigChimp

  • Hero Member
  • *****
  • Posts: 5740
  • Add to the wiki - it's free ;)
    • FPCUp, PaperTiger scanning and other open source projects
Re: [SOLVED] How to speed up this looping cycle
« Reply #40 on: July 24, 2014, 07:43:18 pm »
Thanks, used -OoUSERBP and not surprisingly got perhaps 1 second (13 secs versus 14) with the test program.
Using a fairly old Intel i7 processor.
The reason might be hiding in the assembly files. I would take a look if you don't mind sharing them (-al).
Ehm -OoUSERBP was the right option to use for x64 as per Leledumbo's post, right?
fpc -i confirms that option is available.

Attached the files

I'll work on an assembly version for md5 and see if it boosts the speed.
Thanks.
Want quicker answers to your questions? Read http://wiki.lazarus.freepascal.org/Lazarus_Faq#What_is_the_correct_way_to_ask_questions_in_the_forum.3F

Open source including papertiger OCR/PDF scanning:
https://bitbucket.org/reiniero

Lazarus trunk+FPC trunk x86, Windows x64 unless otherwise specified

engkin

  • Hero Member
  • *****
  • Posts: 3112
Re: [SOLVED] How to speed up this looping cycle
« Reply #41 on: July 24, 2014, 08:39:55 pm »
Ehm -OoUSERBP was the right option to use for x64 as per Leledumbo's post, right?
fpc -i confirms that option is available.
Lol, the red color was for me. Sorry for the confusion.

Attached the files
Thanks

Edit:
I don't see any optimization. Could you try it with:
-dMD5OPTIMIZE -OoREGVAR -O3 -OoUSEEBP
« Last Edit: July 24, 2014, 08:55:23 pm by engkin »

Gizmo

  • Hero Member
  • *****
  • Posts: 831
Re: [SOLVED] How to speed up this looping cycle
« Reply #42 on: July 24, 2014, 09:53:28 pm »
Guys

This is great stuff!

I looked into assembly options (see page one of thread) but having made some enquiries two points were made : maintainability and multi-system architectures. I was led to believe that an assembly coded solution is usually coded for a very specific platform, i.e. 32 or 64 bit, Intel or AMD, etc etc. So, given that Engkin FP solution was so good, I didn't pursue it further.

I tried your SHA1 assembly on a 64-bit Intel system and and to tweak some compiler directives (for AT&T I think it was) but then got further errors which, having googled it, seemed to suggest that something else was needed to compile it on that system.

So straight away I'm thinking : will we have problems if we incorporate an assembly version into the main FPC trunk? I'm sure I am talking nonsense and you guys know what you're talking about so any explanation would be awesome. But my thoughts were to incorporate Engkins FPC solution into the trunk.

Also, I too didn't notice a huge speed difference for MD5 using Engkins Freepascal coded improvement to to the transform on 64-bit AMD systems. On 32-bit systems though, it was about 30% faster.

engkin

  • Hero Member
  • *****
  • Posts: 3112
Re: [SOLVED] How to speed up this looping cycle
« Reply #43 on: July 24, 2014, 11:49:41 pm »
In general I agree. Assembly coded alternative might not be that suitable, unlike the Pascal version. Eventually the compiler should produce comparable speeds. Meanwhile, when there is a need for speed, like in the case of quickhash, this might be one possible alternative prepared for 32bit FPC.

BigChimp used conditional defines to make it optional on 32bit systems:
Code: [Select]
{$IF (NOT(DEFINED(SHA1SLOW))) and (DEFINED(CPU386)) }
// Use assembler version if we have a suitable CPU as well
// Define SHA1SLOW to force use of original reference code
procedure SHA1Transform(var ctx: TSHA1Context; const Buf: Pointer);assembler;
..

I can't provide much support for 64bit systems, so I leave that to someone else and if anyone interested I'll share the small code that generates the pascal/assembly version which might help developing x64 assembly version.

BigChimp

  • Hero Member
  • *****
  • Posts: 5740
  • Add to the wiki - it's free ;)
    • FPCUp, PaperTiger scanning and other open source projects
Re: [SOLVED] How to speed up this looping cycle
« Reply #44 on: July 25, 2014, 07:24:18 am »
Yes, an assembly version is CPU-specific and yes, it requires maintenance as well.
(Edit: and the maintenance argument applies to the optimized Pascal version just as well... though there are more people here able to write Pascal versus assembler of course)

However, IMO it's not a problem to e.g. only provide an i386 implementation as it's one of the most used CPUs for FPC and the algorithms themselves aren't going to change.

When/if the compiler catches up and generates code that is comparable speedwise, it's easy enough to disable the assembler version.
« Last Edit: July 25, 2014, 07:28:37 am by BigChimp »
Want quicker answers to your questions? Read http://wiki.lazarus.freepascal.org/Lazarus_Faq#What_is_the_correct_way_to_ask_questions_in_the_forum.3F

Open source including papertiger OCR/PDF scanning:
https://bitbucket.org/reiniero

Lazarus trunk+FPC trunk x86, Windows x64 unless otherwise specified

 

TinyPortal © 2005-2018