Recent

Author Topic: UTF8Length: A Faster Implementation.  (Read 29976 times)

shobits1

  • Sr. Member
  • ****
  • Posts: 271
  • .
Re: UTF8Length: A Faster Implementation.
« Reply #45 on: August 24, 2016, 09:54:42 am »
@Shobits1, impressive!  Would you mind testing UTF8Len_SIMD:
x2 performance of myUTF8Length

1uc
UTF8Len_SIMD         : 31457409   :   21258898 ticks
2uc
UTF8Len_SIMD         : 15730689   :   22455450 ticks
3uc
UTF8Len_SIMD         : 10551297   :   21823395 ticks
4uc
UTF8Len_SIMD         :  8388609   :   23045015 ticks
mixed
UTF8Len_SIMD         : 14880187   :   21405075 ticks
« Last Edit: August 24, 2016, 09:56:15 am by shobits1 »

MathMan

  • Sr. Member
  • ****
  • Posts: 325
Re: UTF8Length: A Faster Implementation.
« Reply #46 on: August 24, 2016, 09:59:36 am »
@MathMan, Good catch!


@Shobits1, impressive!  Would you mind testing UTF8Len_SIMD:
Code: Pascal  [Select][+][-]
  1. {$asmmode intel}
  2.  
  3. function UTF8Len_aligned16(p16: pchar; BlockCount: PtrInt):PtrInt;assembler;
  4. const
  5.   ZEROMASK   :array[0..15] of byte=($00,$00,$00,$00,$00,$00,$00,$00,$00,$00,$00,$00,$00,$00,$00,$00);
  6.   ONEMASK   :array[0..15] of byte=($01,$01,$01,$01,$01,$01,$01,$01,$01,$01,$01,$01,$01,$01,$01,$01);
  7.   ONEMASKx80:array[0..15] of byte=($80,$80,$80,$80,$80,$80,$80,$80,$80,$80,$80,$80,$80,$80,$80,$80);
  8.   ONEMASKxFF:array[0..15] of byte=($FF,$FF,$FF,$FF,$FF,$FF,$FF,$FF,$FF,$FF,$FF,$FF,$FF,$FF,$FF,$FF);
  9. Label
  10.  loop;
  11. asm
  12.   push ecx
  13.  
  14.   ...
  15.  
  16. Loop:
  17.   { get 16 bytes }
  18.   MOVDQA xmm6, [p16]
  19.  
  20.   { Invert 16 bytes }
  21.   MOVDQA xmm7, xmm6
  22.   ANDNPD xmm7, ONEMASKxFF { xmm7 = not xmm6}
  23.  
  24.   ...
  25.  
  26.   pop ecx
  27. end;
  28.  
  29. function UTF8Len_SIMD(p: PChar; ByteCount: PtrInt): PtrInt;
  30. ...
  31. begin
  32.   ...
  33. end;
  34.  

Regarding the asm function - this is not safe for Windows! Their ABI only allows XMM0-XMM5 as scratch registers iirc (at least this is the case for the 64bit ABI but I am quite certain it's the same for 32 bit). You should safe XMM6 and XMM7! Also I think that this is not the fastest variant possible but I'll give it a try. I am working on SSE and AVX routines - give me a couple of days to finish this and I'll post here.

The Pascal code now looks correct, thanks.

Kind regards,
MathMan

JuhaManninen

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 4459
  • I like bugs.
Re: UTF8Length: A Faster Implementation.
« Reply #47 on: August 24, 2016, 11:12:10 am »
All the previous remarks remains about the same here except my modified function (always first  :D ), and again the old UTF8Length is better than the new one when inlined, so @juha you may consider reverting to the old UTF8Length.
I did not touch the old UTF8Length. Mattias did in r52857. Now it is optimized and inlined for the most common 1-byte case but is slightly slower for others.
In practice the net result will be positive because online data always contains ASCII meta-data and markup tags even if text is non-European origin.
The whole function is way too big for inlining.
In the same r52857 Mattias improved also the inlined UTF8CharacterLengthFast. Please take a look.

Quote
the myUTF8Length (I hope there is nothing wrong)
Did you test with a 64-bit compiler?
You have 32-bit masks ($80808080) and you dec ByteCount in 4-byte (32-bit) chunks, yet the integers are potentially 64-bit PtrInt and PtrUInt.
What does this do in a 64-bit system?
Code: Pascal  [Select][+][-]
  1.   pu:PPtrUInt;
  2. ...
  3.   inc(pu);

This makes no sense:
Code: Pascal  [Select][+][-]
  1. case b of
  2.   0..127   : inc(Result);
  3.   192..223 : inc(Result);
  4.   224..239 : inc(Result);
  5.   240..247 : inc(Result);
  6. end;

Why not just:
Code: Pascal  [Select][+][-]
  1. if b in [0..127, 192..247] then inc(Result);
Mostly Lazarus trunk and FPC 3.2 on Manjaro Linux 64-bit.

shobits1

  • Sr. Member
  • ****
  • Posts: 271
  • .
Re: UTF8Length: A Faster Implementation.
« Reply #48 on: August 24, 2016, 11:54:34 am »
I did not touch the old UTF8Length. Mattias did in r52857.
sorry for the mix up  :-[.

in r52857.
we are talking about the same code

Now it is optimized and inlined for the most common 1-byte case but is slightly slower for others.
No the old one still perform better when inlined. (see benchmark)

In practice the net result will be positive because online data always contains ASCII meta-data and markup tags even if text is non-European origin.
That's why I made the utf8 mixed file : 3/8 uc1 (ascii), 2/8 uc2, 2/8 uc3, 1/8 uc4

The whole function is way too big for inlining.
when I say inlining, I don't really mean the UTF8Length itself (doesn't make sense for me) but the inlining of UTF8CharacterLength.

In the same r52857 Mattias improved also the inlined UTF8CharacterLengthFast. Please take a look.
I'll already benchmark it under the name UTF8LengthFast(alternative)

Did you test with a 64-bit compiler?
No, I never intended to load 64bit of data even in 64bit environment, so I guess PtrUInt should be changed to integer and PPtrUInt to PInteger.


This makes no sense:
Code: Pascal  [Select][+][-]
  1. case b of
  2.   0..127   : inc(Result);
  3.   192..223 : inc(Result);
  4.   224..239 : inc(Result);
  5.   240..247 : inc(Result);
  6. end;
  7.  

Why not just:
Code: Pascal  [Select][+][-]
  1. if b in [0..127, 192..247] then inc(Result);
:o
Well this is hilarious, I paused at this block many times and have feeling that something wrong but never gave it deeper though since wouldn't hurt the performance.  :D


marcov

  • Administrator
  • Hero Member
  • *
  • Posts: 11383
  • FPC developer.
Re: UTF8Length: A Faster Implementation.
« Reply #49 on: August 24, 2016, 01:11:00 pm »
Regarding the asm function - this is not safe for Windows! Their ABI only allows XMM0-XMM5 as scratch registers iirc (at least this is the case for the 64bit ABI but I am quite certain it's the same for 32 bit)

You are right for win64, but that ABI specifies SSE for a lot of floating point usage. The 32-bit one doesn't.

32-bit doesn't have an unified ABI, stdcall is what comes closest, and that doesn't say anything about XMM usage. xmm registers are newer than the "ABI', so are undefined, though some compilers might assume things internally, but save when they call other procedures.

engkin

  • Hero Member
  • *****
  • Posts: 3112
Re: UTF8Length: A Faster Implementation.
« Reply #50 on: August 24, 2016, 08:46:19 pm »
@Shobits1, impressive!  Would you mind testing UTF8Len_SIMD:
x2 performance of myUTF8Length

1uc
UTF8Len_SIMD         : 31457409   :   21258898 ticks
2uc
UTF8Len_SIMD         : 15730689   :   22455450 ticks
3uc
UTF8Len_SIMD         : 10551297   :   21823395 ticks
4uc
UTF8Len_SIMD         :  8388609   :   23045015 ticks
mixed
UTF8Len_SIMD         : 14880187   :   21405075 ticks


Thank you! Your test showed a bug in my code: I missed adjusting the left-over pointer. I corrected the code.

It seems to be 2.7 times faster than your code. But this one is more suitable for big text around 150 bytes or more.

engkin

  • Hero Member
  • *****
  • Posts: 3112
Re: UTF8Length: A Faster Implementation.
« Reply #51 on: August 24, 2016, 08:52:59 pm »
Regarding the asm function - this is not safe for Windows! Their ABI only allows XMM0-XMM5 as scratch registers iirc (at least this is the case for the 64bit ABI but I am quite certain it's the same for 32 bit). You should safe XMM6 and XMM7!
Thank you for the information. I'll switch to using XMM4/5 to make it usable on 64-bit.

Also I think that this is not the fastest variant possible but I'll give it a try. I am working on SSE and AVX routines - give me a couple of days to finish this and I'll post here.
I am eager to see your version.

The Pascal code now looks correct, thanks.
It had a bug in the left-over loop was visible through shobits1's test. Fixed.

MathMan

  • Sr. Member
  • ****
  • Posts: 325
Re: UTF8Length: A Faster Implementation.
« Reply #52 on: September 04, 2016, 03:22:18 pm »
Regarding the asm function - this is not safe for Windows! Their ABI only allows XMM0-XMM5 as scratch registers iirc (at least this is the case for the 64bit ABI but I am quite certain it's the same for 32 bit). You should safe XMM6 and XMM7!
Thank you for the information. I'll switch to using XMM4/5 to make it usable on 64-bit.

Also I think that this is not the fastest variant possible but I'll give it a try. I am working on SSE and AVX routines - give me a couple of days to finish this and I'll post here.
I am eager to see your version.

The Pascal code now looks correct, thanks.
It had a bug in the left-over loop was visible through shobits1's test. Fixed.

engkin, marcov, all - sorry that I was silent for so long. I was on a business trip without access to my development environment.

@engkin, marcov - attached pls find a small console app for benchmarking an SSE and AVX version (source of course). I have the core loops encoded for 64bit systems. The SSE variant should run on all Intel core from Nehalem onwards, while the AVX version should run on all Haswell and onwards. It would help me finalize the code (32/64 and SSE/AVX) if you could do some benchmarking and send me the generated output (a sample output is attached too). And of course feel free to study the asm-code - these are the fastest implementations I could come up with.

Kind regards,
MathMan

marcov

  • Administrator
  • Hero Member
  • *
  • Posts: 11383
  • FPC developer.
Re: UTF8Length: A Faster Implementation.
« Reply #53 on: September 04, 2016, 05:01:29 pm »
An includefile longcalc.inc seems to be missing.

engkin

  • Hero Member
  • *****
  • Posts: 3112
Re: UTF8Length: A Faster Implementation.
« Reply #54 on: September 05, 2016, 04:45:20 am »
@MathMan, Thank you, it looks awesome! Will have to dig further into the code. It will not work on my usual test computer. Will try it later on the other one. I doubt the AVX part will work.

@marcov, The missing include file is needed for HALFVAL, replacing HALFVAL with high(tHalfVal) might work?

engkin

  • Hero Member
  • *****
  • Posts: 3112
Re: UTF8Length: A Faster Implementation.
« Reply #55 on: September 05, 2016, 09:36:51 pm »
A small typo, the amount of memory allocated needs to be changed from:
Code: Pascal  [Select][+][-]
  1.     GetMem( Op1, 2*Stop );

to:
Code: Pascal  [Select][+][-]
  1.     GetMem( Op1, SizeOf(tLimb)*Stop );

I attached the results.

MathMan

  • Sr. Member
  • ****
  • Posts: 325
Re: UTF8Length: A Faster Implementation.
« Reply #56 on: September 06, 2016, 07:12:41 pm »
@MathMan, Thank you, it looks awesome! Will have to dig further into the code. It will not work on my usual test computer. Will try it later on the other one. I doubt the AVX part will work.

@marcov, The missing include file is needed for HALFVAL, replacing HALFVAL with high(tHalfVal) might work?

Sorry for the omission - please see attached. Engkins remark was nearly correct though.

MathMan

  • Sr. Member
  • ****
  • Posts: 325
Re: UTF8Length: A Faster Implementation.
« Reply #57 on: September 06, 2016, 07:17:21 pm »
A small typo, the amount of memory allocated needs to be changed from:
Code: Pascal  [Select][+][-]
  1.     GetMem( Op1, 2*Stop );

to:
Code: Pascal  [Select][+][-]
  1.     GetMem( Op1, SizeOf(tLimb)*Stop );

I attached the results.

Thanks for the measurement! What type of core is this? There is a 10-20% penalty for non-aligned access as far as I can see ...

In fact GetMem( Op1, 2*Stop ) is correct, as we are working on byte counts - however I think I messed up the data-preset routine, as this is still running on limb ... oh well, one shouldn't  do things fast as one has to do them twice then :-)
« Last Edit: September 06, 2016, 07:23:59 pm by MathMan »

marcov

  • Administrator
  • Hero Member
  • *
  • Posts: 11383
  • FPC developer.
Re: UTF8Length: A Faster Implementation.
« Reply #58 on: September 06, 2016, 08:12:34 pm »
here some of my machines. All machines were set to "high performance" profile before running

engkin

  • Hero Member
  • *****
  • Posts: 3112
Re: UTF8Length: A Faster Implementation.
« Reply #59 on: September 06, 2016, 09:10:54 pm »
What type of core is this? There is a 10-20% penalty for non-aligned access as far as I can see ...

I believe these results came from an i5.

Quote
In fact GetMem( Op1, 2*Stop ) is correct, as we are working on byte counts - however I think I messed up the data-preset routine, as this is still running on limb ... oh well, one shouldn't  do things fast as one has to do them twice then :-)

That explains it.

 

TinyPortal © 2005-2018