Recent

Author Topic: UTF8Length: A Faster Implementation.  (Read 32501 times)

marcov

  • Administrator
  • Hero Member
  • *
  • Posts: 12314
  • FPC developer.
Re: UTF8Length: A Faster Implementation.
« Reply #30 on: August 22, 2016, 02:43:47 pm »
Btw, I'd like if you do it in AVX. I have less experience there, so all input is welcome.


JuhaManninen

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 4599
  • I like bugs.
Re: UTF8Length: A Faster Implementation.
« Reply #31 on: August 22, 2016, 04:09:24 pm »
Can you private mail me regarding the other more urgent features you are referring to?

I answer here because I didn't have any specific issue in mind.
Anyway, bug tracker has > 2500 open issues for FPC and Lazarus projects together, both bug reports and feature requests.
Some widgetsets in Lazarus are in alpha or beta state (GTK3, CustomDrawn, Cocoa).
The Lazarus online package manager is still missing. Pascal debugger does not work well yet. And so on ...
Just pick your favorite bug / feature.
I believe some FPC tasks require assembly skills, too.
Mostly Lazarus trunk and FPC 3.2 on Manjaro Linux 64-bit.

MathMan

  • Sr. Member
  • ****
  • Posts: 411
Re: UTF8Length: A Faster Implementation.
« Reply #32 on: August 22, 2016, 05:18:35 pm »
Can you private mail me regarding the other more urgent features you are referring to?

I answer here because I didn't have any specific issue in mind.
Anyway, bug tracker has > 2500 open issues for FPC and Lazarus projects together, both bug reports and feature requests.
Some widgetsets in Lazarus are in alpha or beta state (GTK3, CustomDrawn, Cocoa).
The Lazarus online package manager is still missing. Pascal debugger does not work well yet. And so on ...
Just pick your favorite bug / feature.
I believe some FPC tasks require assembly skills, too.

Just scanned the first 200 (and some). What I found was mostly related to OOP tasks - an area which I am currently lacking as my own projects are still 100% procedural Pascal. Might be that I can support on the Pascal debugger issues. I'll have a closer look in the coming two weeks and see if I can be of assistance - unfortunately this is what I can offer as of now.

If you stumble across a task that relates to my skills - including some hard-core asm hacking - please let me know! I am willing to help - just need the right task to do.
« Last Edit: August 22, 2016, 05:22:09 pm by MathMan »

engkin

  • Hero Member
  • *****
  • Posts: 3112
Re: UTF8Length: A Faster Implementation.
« Reply #33 on: August 22, 2016, 11:42:55 pm »
@MathMan,
I am so delighted to see you participating here.

While going through it I think I discovered a glitch in the Pascal Version from engkin - I think the initial alignment step is incorrect, as it is not considering the size handed over and it is not really aligning (if I understood the code correct).
You are right about not considering the size in the initial loop. As for the alignment, the loop the handles blocks needs pointer-size boundaries alignment (4 or 8 depends on the pointer size), while SSE, for instance, needs 16-byte alignment.

engkin

  • Hero Member
  • *****
  • Posts: 3112
Re: UTF8Length: A Faster Implementation.
« Reply #34 on: August 22, 2016, 11:50:40 pm »
@shobits1 Great work. I'll have to work on it later. Thank you!

MathMan

  • Sr. Member
  • ****
  • Posts: 411
Re: UTF8Length: A Faster Implementation.
« Reply #35 on: August 23, 2016, 07:54:19 am »
@MathMan,
I am so delighted to see you participating here.

While going through it I think I discovered a glitch in the Pascal Version from engkin - I think the initial alignment step is incorrect, as it is not considering the size handed over and it is not really aligning (if I understood the code correct).
You are right about not considering the size in the initial loop. As for the alignment, the loop the handles blocks needs pointer-size boundaries alignment (4 or 8 depends on the pointer size), while SSE, for instance, needs 16-byte alignment.

Issue is, if you look at the specific portion of code

Code: Pascal  [Select][+][-]
  1. for i := 1 to ui and (sizeof(u) - 1) do
  2. begin
  3.     b := pb^;
  4.  
  5.     { Is this byte NOT the first byte of a character? }
  6.     Result += (b shr 7) and ((not b) shr 6);
  7.     inc(pb);
  8. end;
  9.  

In case UI has a value of $xx..xx7 with SizeOf( U )=8 the loop will run from 1 to 7 while it should only run from 1 to 1!? Or is some pointer address mingling done by the "absolute" statement in the variable declaration?

Regards,
MathMan


marcov

  • Administrator
  • Hero Member
  • *
  • Posts: 12314
  • FPC developer.
Re: UTF8Length: A Faster Implementation.
« Reply #36 on: August 23, 2016, 09:55:45 am »

I think something with inverse logic  (and with 192, then compare with 128 to find non lead bytes) might be cheaper

  pandb     xmm1,<192 in each byte>      // mask to get top two bits
  pcmpeqb xmm1,<128 in each byte>      // compare to "10xxxxxxxx"
  pandb     xmm1,<all 1s>                        // convert $FF equal value to 1's
  paddb     xmm2,xmm1                           // accumulate


MathMan

  • Sr. Member
  • ****
  • Posts: 411
Re: UTF8Length: A Faster Implementation.
« Reply #37 on: August 23, 2016, 10:11:12 am »

I think something with inverse logic  (and with 192, then compare with 128 to find non lead bytes) might be cheaper

  pandb     xmm1,<192 in each byte>      // mask to get top two bits
  pcmpeqb xmm1,<128 in each byte>      // compare to "10xxxxxxxx"
  pandb     xmm1,<all 1s>                        // convert $FF equal value to 1's
  paddb     xmm2,xmm1                           // accumulate

Yes! I came to the same conclusion yesterday and designed a new version of the core loop I presented earlier. On my Skylake system I now measure 870 cycles for 16384 byte strings. Point is on the Skylake architecture I can completely "hide" misaligned fetch penalties in my AVX code - but according to Agner Fog this may not be possible for older architectures like Sandy- or Ivy-Bridge or Has- and Broadwell. Do you have access to such systems? I would then send you some code for benchmarking. I would like to understand if I need an initial alignment phase or not - without it the resulting code will be much simpler.

Regards,
MathMan

JuhaManninen

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 4599
  • I like bugs.
Re: UTF8Length: A Faster Implementation.
« Reply #38 on: August 23, 2016, 10:46:09 am »
Code: Pascal  [Select][+][-]
  1. for i := 1 to ui and (sizeof(u) - 1) do
  2. begin
  3.     b := pb^;
  4.     { Is this byte NOT the first byte of a character? }
  5.     Result += (b shr 7) and ((not b) shr 6);
  6.     inc(pb);
  7. end;
In case UI has a value of $xx..xx7 with SizeOf( U )=8 the loop will run from 1 to 7 while it should only run from 1 to 1!? Or is some pointer address mingling done by the "absolute" statement in the variable declaration?

If I understand right the code gives correct results but the alignment of the main loop pointer is wrong. Intel CPUs allow it but it's slower. It means the code becomes still faster when the bug is fixed. Am I right?
Mostly Lazarus trunk and FPC 3.2 on Manjaro Linux 64-bit.

MathMan

  • Sr. Member
  • ****
  • Posts: 411
Re: UTF8Length: A Faster Implementation.
« Reply #39 on: August 23, 2016, 12:13:15 pm »
Code: Pascal  [Select][+][-]
  1. for i := 1 to ui and (sizeof(u) - 1) do
  2. begin
  3.     b := pb^;
  4.     { Is this byte NOT the first byte of a character? }
  5.     Result += (b shr 7) and ((not b) shr 6);
  6.     inc(pb);
  7. end;
In case UI has a value of $xx..xx7 with SizeOf( U )=8 the loop will run from 1 to 7 while it should only run from 1 to 1!? Or is some pointer address mingling done by the "absolute" statement in the variable declaration?

If I understand right the code gives correct results but the alignment of the main loop pointer is wrong. Intel CPUs allow it but it's slower. It means the code becomes still faster when the bug is fixed. Am I right?

No - my explanation was a bit short. For Intel CPUs one has the following requirements for aligned access

* BYTE - e.g. mov AL, [RSI] - RSI can be arbitrary
* WORD - e.g. mov AX, [RSI] - RSI must be divisible by 2
* DWORD - e.g. mov EAX, [RSI] - RSI must be divisible by 4
* QWORD - e.g. mov RAX, [RSI] - RSI must be divisible by 8

These are the standard operands for which every access not conforming to above will generate a seg-fault. It is a bit different for the SIMD operands because there are explicit mnemonics for aligned and non-aligned operand moves

* MMX and XMM - e.g. movdqa (aligned) and movdqu (unaligned)
* YMM - e.g. vmovdqa (aligned) and vmovdqu (unaligned)

There are penalties for mis-aligned reads & writes here - heavier for writes than for reads. And again trying to do an aligned read from a mis-aligned pointer will generate a seg-fault.

Engins alignment loop above only worked because it's actually never executed as the pointers handed over to the function were already aligned. So there is no speedup to be expected for Engkins Pascal implementation.

Different beast for the assembler version I am working on which utilizes SIMD instructions.

Hope this clarifies?

Regards,
MathMan

JuhaManninen

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 4599
  • I like bugs.
Re: UTF8Length: A Faster Implementation.
« Reply #40 on: August 23, 2016, 12:34:51 pm »
Engins alignment loop above only worked because it's actually never executed as the pointers handed over to the function were already aligned.

Oh! I didn't know Strings were aligned like that.
So, can the initial alignment loop be removed or is there a compiler switch that makes Strings non-aligned?
In the latter case the loop must be fixed instead.
Mostly Lazarus trunk and FPC 3.2 on Manjaro Linux 64-bit.

MathMan

  • Sr. Member
  • ****
  • Posts: 411
Re: UTF8Length: A Faster Implementation.
« Reply #41 on: August 23, 2016, 12:59:53 pm »
Engins alignment loop above only worked because it's actually never executed as the pointers handed over to the function were already aligned.

Oh! I didn't know Strings were aligned like that.
So, can the initial alignment loop be removed or is there a compiler switch that makes Strings non-aligned?
In the latter case the loop must be fixed instead.

One shouldn't assume strings to be aligned (e.g. if you have a packed record containing several strings in conjunction) - so I would advice to fix the alignment loop! In the specific case I think the strings were aligned as they originated from constants which I think the compiler aligns (even though not necessary) - but see above case of packed records.

marcov

  • Administrator
  • Hero Member
  • *
  • Posts: 12314
  • FPC developer.
Re: UTF8Length: A Faster Implementation.
« Reply #42 on: August 23, 2016, 03:48:53 pm »
but according to Agner Fog this may not be possible for older architectures like Sandy- or Ivy-Bridge or Has- and Broadwell. Do you have access to such systems? I would then send you some code for benchmarking. I would like to understand if I need an initial alignment phase or not - without it the resulting code will be much simpler.

Among work and home, I have everything except nehalem. So core2 (conroe, the 6000 series) core2 (wolfdale 8000 series), sandy bridge (2000), ivy bridge(3000), haswell(4000), broadwell(5000) and skylake(6000), and a AMD kaveri (a10-7850k), all windows 10.

But I code both at home and at work in Lazarus/Delphi, just more lazarus at home, and more delphi at work. Delphi doesn't support avx though, so avx(2) are FPC (DLLs) at work too.

Work where the last bit of performance matters can assume Skylake or Broadwell-E

I also have pentium D's (800 series), but those are harder to test (and XP 32-bit or various BSDs)

P.s. I'm currently working on erosion/dilation https://felix.abecassis.me/2012/03/sse-image-processing/  which seems to require unaligned loads.
« Last Edit: August 24, 2016, 01:04:42 pm by marcov »

shobits1

  • Sr. Member
  • ****
  • Posts: 271
  • .
Re: UTF8Length: A Faster Implementation.
« Reply #43 on: August 24, 2016, 06:40:35 am »
Sorry for the late response, but late  better than never ( literally here  ;) )
anyway, changes to the benchmark are:
1- each of the utf-8 files (4 files) contains every possible character then repeated the sequence to get about 30MB file, those files contain only one type of character (1,2,3,4 unit code) no mixing.
2- a fifth file (mixed) contains characters generated randomly but with ratio :  3/8 of 1uc | 2/8 of 2uc | 2/8 3uc | 1/8 4uc (this is also a 30 MB file).
3- the ticks counting is corrected (thanks to MathMan) now it sums up all ticks.
4- 8 times sampling  for more accuracy
5- obviously   ;), -O3 optimization and no Debug // tried ( -OpPENTIUM4 -CpPENTIUM4 -CfSSE2 -Sv ) didn't see any obvious improvement so removed them.

file (~30MB) of 1 unit code
UTF8LengthFast(ported one)       : 31457409   :    85430867 ticks   :  2
UTF8LengthFast(alternative)      : 31457409   :   195544232 ticks   :  4
UTF8Length(New one inlined)      : 31457409   :   259400705 ticks   :  5
UTF8Length(old with inline)      : 31457409   :   163843616 ticks   :  3
UTF8Length(old not inlined)      : 31457409   :   260854127 ticks   :  6
myUTF8Length                     : 31457409   :    54935845 ticks   :  1
myUTF8Length                     : 31457409   :    60023767 ticks   :  1

file (~30MB) of 2 unit code.
UTF8LengthFast(ported one)       : 15730689   :    85999892 ticks   :  2
UTF8LengthFast(alternative)      : 15730689   :   101618265 ticks   :  3
UTF8Length(New one inlined)      : 15730689   :   211503264 ticks   :  6
UTF8Length(old with inline)      : 15730689   :   115656393 ticks   :  4
UTF8Length(old not inlined)      : 15730689   :   179914878 ticks   :  5
myUTF8Length                     : 15730689   :    55063750 ticks   :  1
myUTF8Length                     : 15730689   :    61505693 ticks   :  1

file (~30MB) of 3 unit code.
UTF8LengthFast(ported one)       : 10551297   :    85969376 ticks   :  3
UTF8LengthFast(alternative)      : 10551297   :    81757048 ticks   :  2
UTF8Length(New one inlined)      : 10551297   :   154006562 ticks   :  5
UTF8Length(old with inline)      : 10551297   :   111099328 ticks   :  4
UTF8Length(old not inlined)      : 10551297   :   154160348 ticks   :  6
myUTF8Length                     : 10551297   :    55238042 ticks   :  1
myUTF8Length                     : 10551297   :    59994462 ticks   :  1

file (~30MB) of 4 unit code.
UTF8LengthFast(ported one)      :  8388609   :    91287074 ticks   :  3
UTF8LengthFast(alternative)     :  8388609   :    64810432 ticks   :  2
UTF8Length(New one inlined)     :  8388609   :   148215362 ticks   :  5
UTF8Length(old with inline)     :  8388609   :   114640392 ticks   :  4
UTF8Length(old not inlined)     :  8388609   :   148426924 ticks   :  6
myUTF8Length                    :  8388609   :    58317539 ticks   :  1
myUTF8Length                    :  8388609   :    66049761 ticks   :  1

file (~30MB) of mixed unit codes.
UTF8LengthFast(ported one)      : 14880186   :    86344084 ticks   :  2
UTF8LengthFast(alternative)     : 14880186   :   262257014 ticks   :  3
UTF8Length(New one inlined)     : 14880186   :   377428841 ticks   :  6
UTF8Length(old with inline)     : 14880186   :   281502349 ticks   :  4
UTF8Length(old not inlined)     : 14880186   :   342742638 ticks   :  5
myUTF8Length                    : 14880186   :    55384089 ticks   :  1
myUTF8Length                    : 14880186   :    60267912 ticks   :  1


All the previous remarks remains about the same here except my modified function (always first  :D ), and again the old UTF8Length is better than the new one when inlined, so @juha you may consider reverting to the old UTF8Length.

the myUTF8Length (I hope there is nothing wrong)
Code: Pascal  [Select][+][-]
  1. function myUTF8Len(p: PChar; ByteCount: PtrInt): PtrInt;
  2. var
  3. pu:PPtrUInt;
  4. pb: PBYTE absolute pu;
  5. p4:PtrUInt;
  6. b: BYTE;
  7.  
  8. begin
  9.   Result := 0;
  10.   pu := PPtrUInt(p);
  11.  
  12.   while (ByteCount>=4) do
  13.   begin
  14.     p4 := pu^;
  15.  
  16. //    p4 := ((p4 and $80808080) shr 7) and not( p4 shr 6);  << INCORRECT
  17.        p4 := (((p4 and $80808080) shr 7) and ((not p4) shr 6)) xor $01010101;
  18. //    p4 := (p4 * $01010101) shr 24;
  19.     Inc (Result, (p4 * $01010101) shr 24 );  // << Faster
  20.  
  21.     inc(pu);
  22.     dec(ByteCount, 4);
  23.   end;
  24.  
  25.   // HANDLE LEFT OVER BYTES
  26.   while (ByteCount>0) do
  27.   begin
  28.     b := pb^;
  29.     case b of
  30.       0..127   : inc(Result);
  31.       192..223 : inc(Result);
  32.       224..239 : inc(Result);
  33.       240..247 : inc(Result);
  34.     end;
  35.     inc(pb);
  36.     dec(ByteCount);
  37.   end;
  38. end;
  39.  

Thank you.

EDIT:
due to my negligence and some mix up (due to copy-paste), myUTF8Length now has been corrected in code and benchmark result,, sorry.
« Last Edit: August 24, 2016, 10:39:15 am by shobits1 »

engkin

  • Hero Member
  • *****
  • Posts: 3112
Re: UTF8Length: A Faster Implementation.
« Reply #44 on: August 24, 2016, 09:13:57 am »
@MathMan, Good catch!


@Shobits1, impressive!  Would you mind testing UTF8Len_SIMD:
Code: Pascal  [Select][+][-]
  1. {$asmmode intel}
  2.  
  3. function UTF8Len_aligned16(p16: pchar; BlockCount: PtrInt):PtrInt;assembler;
  4. const
  5.   ZEROMASK  :array[0..15] of byte=($00,$00,$00,$00,$00,$00,$00,$00,$00,$00,$00,$00,$00,$00,$00,$00);
  6.   ONEMASK   :array[0..15] of byte=($01,$01,$01,$01,$01,$01,$01,$01,$01,$01,$01,$01,$01,$01,$01,$01);
  7.   ONEMASKx80:array[0..15] of byte=($80,$80,$80,$80,$80,$80,$80,$80,$80,$80,$80,$80,$80,$80,$80,$80);
  8.   ONEMASKxFF:array[0..15] of byte=($FF,$FF,$FF,$FF,$FF,$FF,$FF,$FF,$FF,$FF,$FF,$FF,$FF,$FF,$FF,$FF);
  9. Label
  10.  loop;
  11. asm
  12.   push ecx
  13.   push edi
  14.   push ebx
  15.  
  16.   { tmp counter }
  17.   MOV ecx, 0
  18.  
  19.   { masks }
  20.   MOVDQA xmm0, ZEROMASK
  21.   MOVDQA xmm1, ONEMASK
  22.   MOVDQA xmm2, ONEMASKx80
  23.   MOVDQA xmm3, ONEMASKxFF
  24.  
  25. Loop:
  26.   { get 16 bytes }
  27.   MOVDQA xmm4, [p16]
  28.  
  29.   { Invert 16 bytes }
  30.   MOVDQA xmm5, xmm4
  31.   ANDNPD xmm5, xmm3 { xmm5 = not xmm4}
  32.  
  33.   { Shift the inverted bytes 6 bits to the right }
  34.   PSRLQ  xmm5, 6
  35.  
  36.   { Keep msb of each non-inverted byte }
  37.   PAND xmm4, ONEMASKx80
  38.  
  39.   { Shift them to right 7 bits }
  40.   PSRLQ  xmm4, 7 { Shift Right Logical QWord }
  41.  
  42.   { A one in the 1st bit means: NOT the first byte of a codepoint }
  43.   PAND xmm5, xmm4
  44.  
  45.   { Count them ;-) }
  46.   PSADBW xmm5, xmm0
  47.  
  48.   MOVD edi, xmm5
  49.   PEXTRW ebx, xmm5, 4
  50.  
  51.   ADD ecx, ebx
  52.   ADD ecx, edi
  53.  
  54.   { Next 16 bytes }
  55.   ADD p16, 16
  56.   DEC edx
  57.   JNZ Loop
  58.  
  59.   { Result }
  60.   Mov eax, ecx
  61.  
  62.   pop ebx
  63.   pop edi
  64.   pop ecx
  65. end;
  66.  
  67. function UTF8Len_SIMD(p: PChar; ByteCount: PtrInt): PtrInt;
  68. var
  69.   pn8:pint8 absolute p;
  70.   n8:int8;
  71.   ix:PtrInt absolute p;
  72.   i, cnt, BlockCount:PtrInt;
  73. begin
  74.   Result := 0;
  75.  
  76.   { Handle any initial misaligned bytes. }
  77.   cnt := (not (ix-1)) and $F;
  78.   if cnt>ByteCount then
  79.     cnt := ByteCount;
  80.   for i := 1 to cnt do
  81.   begin
  82.     n8 := pn8^;
  83.  
  84.     { Is this byte NOT the first byte of a character? }
  85.     Result += (n8 shr 7) and ((not n8) shr 6);
  86.     inc(pn8);
  87.   end;
  88.  
  89.   cnt := ByteCount - cnt;
  90.   if cnt=0 then
  91.     exit(ByteCount - Result);
  92.  
  93.   { Handle blocks of 16 bytes }
  94.   BlockCount := cnt div $10;
  95.   if BlockCount>0 then
  96.     Result += UTF8Len_aligned16(pchar(pn8), BlockCount);
  97.  
  98.   { Take care of any left-over bytes. }
  99.   inc(pn8, BlockCount shl 4);
  100.   for i := 1 to cnt and $F do
  101.   begin
  102.     n8 := pn8^;
  103.  
  104.     { Is this byte NOT the first byte of a character? }
  105.     Result += (n8 shr 7) and ((not n8) shr 6);
  106.     inc(pn8);
  107.   end;
  108.  
  109.   Result := (ByteCount - Result);
  110. end;
  111.  
  112. function UTF8Len_SIMD(const s: string): PtrInt;
  113. begin
  114.   Result := UTF8Len_SIMD(@s[1], Length(s));
  115. end;

1st Edit:
Thanks to shobits1, corrected the left-over loop pointer.

2nd Edit:
Thanks to MathMan, replaced xmm6/7 with xmm4/5.
« Last Edit: August 24, 2016, 09:21:42 pm by engkin »

 

TinyPortal © 2005-2018