Recent

Author Topic: UTF8Length: A Faster Implementation.  (Read 32500 times)

JuhaManninen

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 4599
  • I like bugs.
Re: UTF8Length: A Faster Implementation.
« Reply #15 on: August 21, 2016, 09:52:27 am »
Any non-trivial test of UTF8Len in my 64-bit FPC gives an arithmetic overflow, coming from this: u * ONEMASK.
Example test: 'مAرBحCبD'
Please test with all checks enabled. My ONEMASK is now:
Code: Pascal  [Select][+][-]
  1. const
  2.   {$IfDef CPU64}
  3.   ONEMASK=$0101010101010101;
  4.   {$Else}
  5.   ONEMASK=$01010101;
  6.   {$EndIf}
Mostly Lazarus trunk and FPC 3.2 on Manjaro Linux 64-bit.

MathMan

  • Sr. Member
  • ****
  • Posts: 411
Re: UTF8Length: A Faster Implementation.
« Reply #16 on: August 21, 2016, 10:10:28 am »
If that is the case it really should not be that difficult to create an efficient UTF8Length routine..

Yes, the current UTF8Length is also relatively efficient. It iterates byte by byte and calls UTF8CharacterLength which also checks the validity.
The UTF8Len proposed by engkin is extremely fast because it assumes the data is valid (no check) and it uses the CPU's native data size. In a 64-bit CPU it means 8 bytes at once. The logic operations required for it are rather difficult to follow.
:)
It would be fascinating to have an IFDEFed assembly version with SIMD instructions for CPUs that support it. How much faster would it be?

I'll give it a try - separate versions for SSE and AVX (I assume yes but pls confirm)?

Kind regards,
MathMan

marcov

  • Administrator
  • Hero Member
  • *
  • Posts: 12314
  • FPC developer.
Re: UTF8Length: A Faster Implementation.
« Reply #17 on: August 21, 2016, 03:10:27 pm »
utf8len seems to me only to be needed for rendering, so I don't really understand why it is so important that it is fast?

Awkward

  • Full Member
  • ***
  • Posts: 150
Re: UTF8Length: A Faster Implementation.
« Reply #18 on: August 21, 2016, 03:24:05 pm »
utf8len seems to me only to be needed for rendering, so I don't really understand why it is so important that it is fast?
Perfectionism? Why not?

JuhaManninen

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 4599
  • I like bugs.
Re: UTF8Length: A Faster Implementation.
« Reply #19 on: August 21, 2016, 04:48:19 pm »
utf8len seems to me only to be needed for rendering, so I don't really understand why it is so important that it is fast?

True, it can be seen as just an interesting exercise. However in the rare cases where counting codepoints is needed, it can make a big difference.
Optimizing UTF8CharacterLength will benefit more code. It is used by many algorithms. I will add an inlined version that uses a single case and has no validity checks.
Mostly Lazarus trunk and FPC 3.2 on Manjaro Linux 64-bit.

JuhaManninen

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 4599
  • I like bugs.
Re: UTF8Length: A Faster Implementation.
« Reply #20 on: August 21, 2016, 06:45:37 pm »
I added function named UTF8LengthFast in r52853.
At the same go I added function UTF8CharacterLengthFast which is inlined.
Both of these functions are now used in unit LazUnicode.
Please test.
Mostly Lazarus trunk and FPC 3.2 on Manjaro Linux 64-bit.

Fungus

  • Sr. Member
  • ****
  • Posts: 354
Re: UTF8Length: A Faster Implementation.
« Reply #21 on: August 21, 2016, 07:23:39 pm »
I added function named UTF8LengthFast in r52853.
At the same go I added function UTF8CharacterLengthFast which is inlined.
Both of these functions are now used in unit LazUnicode.
Please test.

While you're at it; Wouldn't it be nice to have the XorEncodeBase64 / XorDecodeBase64 from here: http://forum.lazarus.freepascal.org/index.php/topic,33743.msg219149.html#msg219149 in unit "strutils" as an alternative to the hexadecimal XorEncode / XorDecode?

JuhaManninen

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 4599
  • I like bugs.
Re: UTF8Length: A Faster Implementation.
« Reply #22 on: August 21, 2016, 07:37:30 pm »
While you're at it; Wouldn't it be nice to have the XorEncodeBase64 / XorDecodeBase64 from here: http://forum.lazarus.freepascal.org/index.php/topic,33743.msg219149.html#msg219149 in unit "strutils" as an alternative to the hexadecimal XorEncode / XorDecode?

Unit "strutils" belongs to FPC project. You should offer the functions for them in the bug tracker as a feature request.
Lazarus project can also add helper functions if they are not found elsewhere and they are needed somehow for Lazarus development.
These Encode / Decode functions may not qualify.
Mostly Lazarus trunk and FPC 3.2 on Manjaro Linux 64-bit.

JuhaManninen

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 4599
  • I like bugs.
Re: UTF8Length: A Faster Implementation.
« Reply #23 on: August 21, 2016, 10:23:28 pm »
I'll give it a try - separate versions for SSE and AVX (I assume yes but pls confirm)?

You must use your own judgment. How much time you want to use for an optimization which will not be used often?
If you have very much spare time then you should consider implementing also other more urgent features.

I had to check AVX from the net. I remembered SSE and SSE2. ARM has its NEON SIMD extension. PowerPC has its own and I guess MIPS, too. If you want to do this experiment, it should be for the most commonly used SIMD extension, I guess SSE2.
Mostly Lazarus trunk and FPC 3.2 on Manjaro Linux 64-bit.

shobits1

  • Sr. Member
  • ****
  • Posts: 271
  • .
Re: UTF8Length: A Faster Implementation.
« Reply #24 on: August 22, 2016, 02:49:29 am »
I added function named UTF8LengthFast in r52853.
At the same go I added function UTF8CharacterLengthFast which is inlined.
Both of these functions are now used in unit LazUnicode.
Please test.
I have done new benchmark (using r52857) and the result as follow:



Function Used                                :Reported Len:   Ticks count     : Rank
file (~30MB) of 1 unit code.
UTF8LengthFast(the ported one)               : 30561000   :   23037459 ticks  :  1
UTF8LengthFast(using UTF8CharacterLengthFast): 30561000   :   69586175 ticks  :  4
UTF8Length(New one inlined)                  : 30561000   :   75414132 ticks  :  5
UTF8Length(old not inlined)                  : 30561000   :   82375610 ticks  :  6
UTF8Length(old with inline)                  : 30561000   :   63389867 ticks  :  3
myUTF8Length                                 : 30561000   :   50656775 ticks  :  2

file (~30MB) of 2 unit code.
UTF8LengthFast(the ported one)               : 15345000   :   22626584 ticks  :  1
UTF8LengthFast(using UTF8CharacterLengthFast): 15345000   :   36975480 ticks  :  3
UTF8Length(New one inlined)                  : 15345000   :   52764933 ticks  :  6
UTF8Length(old not inlined)                  : 15345000   :   48531697 ticks  :  5
UTF8Length(old with inline)                  : 15345000   :   39306504 ticks  :  4
myUTF8Length                                 : 15345000   :   30091740 ticks  :  2


file (~30MB) of 3 unit code.
UTF8LengthFast(the ported one)               : 10956000   :   24117156 ticks  :  1
UTF8LengthFast(using UTF8CharacterLengthFast): 10956000   :   26133029 ticks  :  3
UTF8Length(New one inlined)                  : 10956000   :   40581973 ticks  :  6
UTF8Length(old not inlined)                  : 10956000   :   39068618 ticks  :  5
UTF8Length(old with inline)                  : 10956000   :   33174983 ticks  :  4
myUTF8Length                                 : 10956000   :   25320944 ticks  :  2


file (~30MB) of 4 unit code.
UTF8LengthFast(the ported one)               : 8063880   :   23869860 ticks  :  3
UTF8LengthFast(using UTF8CharacterLengthFast): 8063880   :   21494261 ticks  :  2
UTF8Length(New one inlined)                  : 8063880   :   30391167 ticks  :  5
UTF8Length(old not inlined)                  : 8063880   :   34055401 ticks  :  6
UTF8Length(old with inline)                  : 8063880   :   27949814 ticks  :  4
myUTF8Length                                 : 8063880   :   19132335 ticks  :  1



file (~30MB) of mixed unit codes.
UTF8LengthFast(the ported one)               : 11664000   :   24735871 ticks  :  1
UTF8LengthFast(using UTF8CharacterLengthFast): 11664000   :   62310119 ticks  :  4
UTF8Length(New one inlined)                  : 11664000   :   71513291 ticks  :  6
UTF8Length(old not inlined)                  : 11664000   :   66776292 ticks  :  5
UTF8Length(old with inline)                  : 11664000   :   56256012 ticks  :  3
myUTF8Length                                 : 11664000   :   32851130 ticks  :  2




EDIT:
Those results were done with -O3 (good)
but I forgot to uncheck Checks and assertion and Heaptrc (so stupid ah) :-[  ( maybe it's the lack of sleep) ... so those results may not correct.
I'll redone the benchmark tomorrow (I mean today going to sleep now).
« Last Edit: August 22, 2016, 04:04:45 am by shobits1 »

shobits1

  • Sr. Member
  • ****
  • Posts: 271
  • .
Re: UTF8Length: A Faster Implementation.
« Reply #25 on: August 22, 2016, 02:49:53 am »
bench-marking methodology:
- I used 5 files with size of about 30MB,  each one but only type of character (1,2,3,4) unit code character expect the fifth which has mixed characters.
- using engkin HardwareTicks function to get the ticks; ( thank you engkin)
- First, I load the file with TMemoryStream then copy it to TStringStream.
- Looping 7 times (I ignore the first two) and calling each UTF8Len** passing (TStringStream.DataString) after that calculate the mean of each function ticks.
sorry for the blabla  :-[ ...... maybe the code will provide clearer description.

Code: Pascal  [Select][+][-]
  1.   ms := TMemoryStream.Create;
  2.   try
  3.     ms.LoadFromFile(p5);
  4.     //ss := TStringStream.Create('$¢€𐍈سé');
  5.     ss := TStringStream.Create('');
  6.     try
  7.       ss.CopyFrom(ms,0);
  8.  
  9.       hL0 := 0; hL1 := 0; hL2 := 0; hL3 := 0; hL4 := 0; hL5 := 0; hL6 := 0;
  10.       for i:= 0 to 6 do
  11.       begin
  12.         ht0 := HardwareTicks; L0 := UTF8LengthFast_Deamon(ss.DataString);
  13.         ht1 := HardwareTicks; L1 := UTF8LengthFast(ss.DataString);
  14.         ht2 := HardwareTicks; L2 := UTF8LengthN(ss.DataString);
  15.         ht3 := HardwareTicks; L3 := mebUTF8Length(ss.DataString);
  16.         ht4 := HardwareTicks; L4 := mUTF8Length(ss.DataString);
  17.         ht5 := HardwareTicks; L5 := myUTF8Len1(ss.DataString);
  18.         ht6 := HardwareTicks; L6 := myUTF8Len2(ss.DataString);
  19.         ht7 := HardwareTicks;
  20.         if i > 1 then
  21.         begin
  22.           hL0 := ht1 - ht0;
  23.           hL1 := ht2 - ht1;
  24.           hL2 := ht3 - ht2;
  25.           hL3 := ht4 - ht3;
  26.           hL4 := ht5 - ht4;
  27.           hL5 := ht6 - ht5;
  28.           hL6 := ht7 - ht6;
  29.         end;
  30.       end;
  31.       hL0 := hL0 div 5;
  32.       hL1 := hL1 div 5;
  33.       hL2 := hL2 div 5;
  34.       hL3 := hL3 div 5;
  35.       hL4 := hL4 div 5;
  36.       hL5 := hL5 div 5;
  37.       hL6 := hL6 div 5;
  38.  
  39.       Memo1.Lines.Add('UTF8LengthFast_Deamon: ' + IntToStr(L0) + '   :   ' + IntToStr(hL0) + ' ticks' );
  40.       Memo1.Lines.Add('UTF8LengthFast       : ' + IntToStr(L1) + '   :   ' + IntToStr(hL1) + ' ticks' );
  41.       Memo1.Lines.Add('UTF8LengthN          : ' + IntToStr(L2) + '   :   ' + IntToStr(hL2) + ' ticks' );
  42.       Memo1.Lines.Add('mebUTF8Length(inline): ' + IntToStr(L3) + '   :   ' + IntToStr(hL3) + ' ticks' );
  43.       Memo1.Lines.Add('mUTF8Length          : ' + IntToStr(L4) + '   :   ' + IntToStr(hL4) + ' ticks' );
  44.       Memo1.Lines.Add('myUTF8Len1           : ' + IntToStr(L5) + '   :   ' + IntToStr(hL5) + ' ticks' );
  45.       Memo1.Lines.Add('myUTF8Len2           : ' + IntToStr(L6) + '   :   ' + IntToStr(hL6) + ' ticks' );
  46.  
  47.     finally
  48.       ss.Free;
  49.     end;
  50.   finally
  51.     ms.Free;
  52.   end;
  53.  



myUTF8Length uses a very simple way to return char length (is this correct for valid utf-8 char)

Code: Pascal  [Select][+][-]
  1.  
  2. function myCharLen(p: PChar): integer;
  3. begin
  4.   Result := 1;
  5.   case Ord(p^) and %11110000 of
  6.     %11110000 : Result := 4;
  7.     %11100000 : Result := 3;
  8.     %11010000 : Result := 2;
  9.     %11000000 : Result := 2;
  10.   end;
  11. end;
  12.  

As you can see UTF8LengthFast(the ported one) is doing very well for 1uc and 2uc and almost the same as my function for 3uc and loses for 4uc (given the clearity of my function I'd say it is doing well).  :D

The UTF8CharacterLengthFast is not that fast and loses to the old function UTF8CharacterLength (inlined) for 1uc and mixed chars;  :o

The new UTF8Length is worst that the old one with UTF8CharacterLength (inlined); I think reverting to old code is better here;  >:(
« Last Edit: August 22, 2016, 02:55:31 am by shobits1 »

MathMan

  • Sr. Member
  • ****
  • Posts: 411
Re: UTF8Length: A Faster Implementation.
« Reply #26 on: August 22, 2016, 07:37:27 am »
I'll give it a try - separate versions for SSE and AVX (I assume yes but pls confirm)?

You must use your own judgment. How much time you want to use for an optimization which will not be used often?
If you have very much spare time then you should consider implementing also other more urgent features.

I had to check AVX from the net. I remembered SSE and SSE2. ARM has its NEON SIMD extension. PowerPC has its own and I guess MIPS, too. If you want to do this experiment, it should be for the most commonly used SIMD extension, I guess SSE2.

It's simple curiosity on my side. While I don't have more spare time at hand than average Joe I'm mostly deep into asm land with my own projects.

So I gave the AVX variant a try yesterday and came up with a nice core Loop. The algorithm shown @ the original link is quite easy to understand - the tricky part is to count the flag bits. There is no AVX mnemonic doing this. I finally used a variant with bitmask extraction and separate popcount like the following

Code: Pascal  [Select][+][-]
  1.     align   16
  2.   .Loop:
  3.  
  4.     popcnt  R8, R8              ; 1 1 p1       3   1
  5.     popcnt  R9, R9              ; 1 1 p1       3   1
  6.     vpsllq  YMM3, YMM1, 1       ; 1 1 p01      1   0.5
  7.     vpsllq  YMM4, YMM2, 1       ; 1 1 p01      1   0.5
  8.     vpand   YMM5, YMM1, YMM0    ; 1 1 p015     1   0.33
  9.     vpand   YMM6, YMM2, YMM0    ; 1 1 p015     1   0.33
  10.     vpandn  YMM3, YMM3, YMM0    ; 1 1 p015     1   0.33
  11.     vpandn  YMM4, YMM4, YMM0    ; 1 1 p015     1   0.33
  12.     vmovdqa YMM1, [Op1]         ; 1 1 p23      3   0.5
  13.     vmovdqa YMM2, [Op1+32]      ; 1 1 p23      3   0.5
  14.     vpand   YMM5, YMM5, YMM3    ; 1 1 p015     1   0.33
  15.     vpand   YMM6, YMM6, YMM4    ; 1 1 p015     1   0.33
  16.     add     RAX, R8             ; 1 1 p0156    1   0.25
  17.     vpmovmskb R8, YMM5          ; 1 1 p0       2-3 1
  18.     add     RAX, R9             ; 1 1 p0156    1   0.25
  19.     vpmovmskb R9, YMM6          ; 1 1 p0       2-3 1
  20.  
  21.     add     Op1, 64             ; 1 1 p0156    1   0.25
  22.  
  23.   .LoopCheck:
  24.  
  25.     sub     Size1, 8            ; 1 1 p0156    1   0.25
  26.     jnc     .Loop
  27.  

It is not optimal wrt mixing SSE (popcnt) and AVX (the rest) instructions. This is no issue for my Skylake system but according to Agner Fog it can generate stalls on older archs. Also I had to separate the popcnt from the vpmovmskb as much as possible as they have high latencies. Nevertheless the loop runs at <5 cycles in L1D$ and <11 cycles in main memory.

While going through it I think I discovered a glitch in the Pascal Version from engkin - I think the initial alignment step is incorrect, as it is not considering the size handed over and it is not really aligning (if I understood the code correct).

Can you private mail me regarding the other more urgent features you are referring to?

Regards,
MathMan
« Last Edit: August 22, 2016, 08:18:44 am by MathMan »

MathMan

  • Sr. Member
  • ****
  • Posts: 411
Re: UTF8Length: A Faster Implementation.
« Reply #27 on: August 22, 2016, 09:45:59 am »
bench-marking methodology:
- I used 5 files with size of about 30MB,  each one but only type of character (1,2,3,4) unit code character expect the fifth which has mixed characters.
- using engkin HardwareTicks function to get the ticks; ( thank you engkin)
- First, I load the file with TMemoryStream then copy it to TStringStream.
- Looping 7 times (I ignore the first two) and calling each UTF8Len** passing (TStringStream.DataString) after that calculate the mean of each function ticks.
sorry for the blabla  :-[ ...... maybe the code will provide clearer description.

Code: Pascal  [Select][+][-]
  1. ...
  2.         if i > 1 then
  3.         begin
  4.           hL0 := ht1 - ht0;
  5.           hL1 := ht2 - ht1;
  6.           hL2 := ht3 - ht2;
  7.           hL3 := ht4 - ht3;
  8.           hL4 := ht5 - ht4;
  9.           hL5 := ht6 - ht5;
  10.           hL6 := ht7 - ht6;
  11.         end;
  12.       end;
  13.       hL0 := hL0 div 5;
  14.       hL1 := hL1 div 5;
  15.       hL2 := hL2 div 5;
  16.       hL3 := hL3 div 5;
  17.       hL4 := hL4 div 5;
  18.       hL5 := hL5 div 5;
  19.       hL6 := hL6 div 5;
  20. ...
  21.  


As you can see UTF8LengthFast(the ported one) is doing very well for 1uc and 2uc and almost the same as my function for 3uc and loses for 4uc (given the clearity of my function I'd say it is doing well).  :D

The UTF8CharacterLengthFast is not that fast and loses to the old function UTF8CharacterLength (inlined) for 1uc and mixed chars;  :o

The new UTF8Length is worst that the old one with UTF8CharacterLength (inlined); I think reverting to old code is better here;  >:(

I think the tick counts you stated in your previous post are plain wrong unfortunately. You should either sum the hLx numbers, or not divide by 5. The way you measure above simply yields a fifth of the last execution time. But as you still had debug on they would probably be underestimated anyway ...

Kind regards,
Jens

marcov

  • Administrator
  • Hero Member
  • *
  • Posts: 12314
  • FPC developer.
Re: UTF8Length: A Faster Implementation.
« Reply #28 on: August 22, 2016, 11:06:38 am »
It's simple curiosity on my side. While I don't have more spare time at hand than average Joe I'm mostly deep into asm land with my own projects.

I've also considered doing a SSE2/3 since I have been making simple SSE routines the last few weeks.  I simply thought to compare with <128 and >=192 with pcmpb (which delivers two registers that are either 00 or ff) and then OR those two results together and then AND with constant 1 to make it 0 or 1.  Accumulate that in another register, and once every 250 times or so add all elements vertically.

// align first

// this loop maximally 255 times
 pxor xmm7,xmm7
.Lsomething:
  movdqa xmm5,[rax] 
  movdaq xmm6,<128 in each byte>
  pcmpgtb xmm6,xmm5                        // if 128 is greater, we are smaller.
  pcmpgtb xmm5,<register filled with 191 in each byte>
  porub    xmm6,xmm5   // either ascii (<128) or start of sequence (11 bits, >=192)
  and   xmm6, <1 in each byte>
  add  xmm7,xmm6  //sum
  dec  r10
  jne r10

add them using http://stackoverflow.com/questions/10930595/sse-instructions-to-add-all-elements-of-an-array

  movdqa  xmm6,xmm7
  PUNPCKLBW xmm7,<all 0s>
  PUNPCKHBW xmm6,<all 0s>
  PMADDWD xmm7,<all 1>
  PMADDWD xmm6,<all 1>


The avx way would be slightly better because you could save some moves becuase of the 3 address code (but while new production is all AVX2, old machines might be core2, so I don't do much avx yet)

Is your code generated by some tool? What do the comments mean?




MathMan

  • Sr. Member
  • ****
  • Posts: 411
Re: UTF8Length: A Faster Implementation.
« Reply #29 on: August 22, 2016, 12:01:02 pm »
It's simple curiosity on my side. While I don't have more spare time at hand than average Joe I'm mostly deep into asm land with my own projects.

I've also considered doing a SSE2/3 since I have been making simple SSE routines the last few weeks.  I simply thought to compare with <128 and >=192 with pcmpb (which delivers two registers that are either 00 or ff) and then OR those two results together and then AND with constant 1 to make it 0 or 1.  Accumulate that in another register, and once every 250 times or so add all elements vertically.

....

Is your code generated by some tool? What do the comments mean?

I'll have a look at your approach - give me a bit of time to test it.

No, my code is hand crafted but in early development phases I usually comment the key figures from Agner Fogs mnemonics tables. It's the number of macro-ops, the number of micro-ops, the possible execution untits to which the opcode can be scheduled, the latency in clock cycles and the inverse throughput. Helps me finding better operand schedules ...

Kind regards,
MathMan

 

TinyPortal © 2005-2018