Recent

Author Topic: UTF8Length: A Faster Implementation.  (Read 32502 times)

engkin

  • Hero Member
  • *****
  • Posts: 3112
UTF8Length: A Faster Implementation.
« on: August 20, 2016, 12:24:00 am »
I came across this page: Even faster UTF-8 character counting.

Could not resist trying it in Pascal:
Code: Pascal  [Select][+][-]
  1. //http://www.daemonology.net/blog/2008-06-05-faster-utf8-strlen.html
  2. function UTF8Len(p: PChar; ByteCount: PtrInt): PtrInt;
  3. const
  4.   ONEMASK=$01010101;
  5. var
  6.   b:byte;
  7.   pu:PPtrUInt;
  8.   pb:pbyte absolute pu;
  9.   ui:PtrUInt absolute pu;
  10.   u:PtrUInt;
  11.   i:integer;
  12. begin
  13.   pu := PPtrUInt(p);
  14.   Result := 0;
  15.  
  16.   { Handle any initial misaligned bytes. }
  17.   for i := 1 to ui and (sizeof(u) - 1) do
  18.   begin
  19.     b := pb^;
  20.  
  21.     { Is this byte NOT the first byte of a character? }
  22.     Result += (b shr 7) and ((not b) shr 6);
  23.     inc(pb);
  24.   end;
  25.  
  26.   { Handle complete blocks. }
  27.   for i := 1 to ByteCount div sizeof(u) do
  28.   begin
  29.     u := pu^;
  30.  
  31.     { Count bytes which are NOT the first byte of a character. }
  32.     u := ((u and (ONEMASK * $80)) shr 7) and ((not u) shr 6);
  33.     Result += (u * ONEMASK) >> ((sizeof(u) - 1) * 8);
  34.     inc(pu);
  35.   end;
  36.  
  37.   { Take care of any left-over bytes. }
  38.   for i := 1 to (PtrUInt(p)+ByteCount) and (sizeof(u) - 1) do
  39.   begin
  40.     b := pb^;
  41.  
  42.     { Exit if we hit a zero byte. }
  43.     {if (b = $00) then
  44.       break;//}
  45.  
  46.     { Is this byte NOT the first byte of a character? }
  47.     Result += (b shr 7) and ((not b) shr 6);
  48.     inc(pb);
  49.   end;
  50.  
  51.   Result := (ByteCount - Result);
  52. end;

The usual Pascal overload:
Code: Pascal  [Select][+][-]
  1. function UTF8Len(const s: string): PtrInt;
  2. begin
  3.   Result := UTF8Len(@s[1], Length(s));
  4. end;

Benchmark results on an old 32bit computer:
Quote
Length:                1 bytes
UTF8Length-1:          1 in       2374 ticks
UTF8Length-2:          1 in       1242 ticks, ratio: 1.91

Length:                2 bytes
UTF8Length-1:          2 in        319 ticks
UTF8Length-2:          2 in        219 ticks, ratio: 1.46

Length:                3 bytes
UTF8Length-1:          3 in        364 ticks
UTF8Length-2:          3 in        199 ticks, ratio: 1.83

Length:                4 bytes
UTF8Length-1:          4 in        309 ticks
UTF8Length-2:          4 in        182 ticks, ratio: 1.70

Length:                5 bytes
UTF8Length-1:          5 in        391 ticks
UTF8Length-2:          5 in        168 ticks, ratio: 2.33

Length:                6 bytes
UTF8Length-1:          3 in        317 ticks
UTF8Length-2:          3 in        190 ticks, ratio: 1.67

Length:               15 bytes
UTF8Length-1:          5 in        824 ticks
UTF8Length-2:          5 in        231 ticks, ratio: 3.57

Length:               26 bytes
UTF8Length-1:         13 in        857 ticks
UTF8Length-2:         13 in        324 ticks, ratio: 2.65

The ratio for 20 MB long string is more than five times. I would assume it should be faster on a 64bit system by around 3 to 11 times depending on the length of text.

The code I used for the test:
Code: Pascal  [Select][+][-]
  1.   TestUTF8LengthSpeed('a');
  2.   TestUTF8LengthSpeed('ab');
  3.   TestUTF8LengthSpeed('abc');
  4.   TestUTF8LengthSpeed('abcd');
  5.   TestUTF8LengthSpeed('abcde');
  6.   TestUTF8LengthSpeed('βγδ');
  7.   TestUTF8LengthSpeed('こんにちは');
  8.   TestUTF8LengthSpeed('βγδεζηθβγδβγδ');
  9.  
  10.   FileStream:=TFileStream.Create('big-utf8-file.txt',fmOpenRead);
  11.   try
  12.    SetLength(s,FileStream.Size);
  13.    FileStream.Read(s[1],FileStream.Size);
  14.   finally
  15.    FileStream.Free;
  16.   end;
  17.  
  18.   TestUTF8LengthSpeed(s);
  19.  

Where TestUTF8LengthSpeed is:
Code: Pascal  [Select][+][-]
  1. {$AsmMode intel}
  2. function HardwareTicks: Int64; assembler; asm DW 0310FH end;
  3.  
  4. procedure TestUTF8LengthSpeed(const s: string);
  5. var
  6.   l1, l2: PtrInt;
  7.   tt1, tt2, st, ed: Int64;
  8. begin
  9.   WriteLn('Length: ',Length(s):16,' bytes');
  10.  
  11.   st := HardwareTicks;
  12.   l1 := UTF8Length(s);
  13.   ed := HardwareTicks;
  14.   tt1 := (ed-st);
  15.   WriteLn('UTF8Length-1: ',l1:10,' in ',tt1:10,' ticks');
  16.  
  17.   st := HardwareTicks;
  18.   l2 := UTF8Len(s);
  19.   ed := HardwareTicks;
  20.   tt2 := (ed-st);
  21.   WriteLn('UTF8Length-2: ',l2:10,' in ',tt2:10,' ticks, ratio: ',tt1/tt2:1:2);
  22.   WriteLn();
  23. end;

Thaddy

  • Hero Member
  • *****
  • Posts: 17413
  • Ceterum censeo Trumpum esse delendum (Tnx Charlie)
Re: UTF8Length: A Faster Implementation.
« Reply #1 on: August 20, 2016, 08:56:57 am »
That code is wrong because as far as I can see it doesn't cover full-length 4 byte codepoints (as per UTF8 specification!) . Basically useless.
Due to censorship, I changed this to "Nelly the Elephant". Keeps the message clear.

JuhaManninen

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 4599
  • I like bugs.
Re: UTF8Length: A Faster Implementation.
« Reply #2 on: August 20, 2016, 11:25:41 am »
Code: Pascal  [Select][+][-]
  1. const
  2.   ONEMASK=$01010101;
Is this correct? ONEMASK is used with PtrUInt which matches the size of Pointer, potentially 64 bits.
Mostly Lazarus trunk and FPC 3.2 on Manjaro Linux 64-bit.

Thaddy

  • Hero Member
  • *****
  • Posts: 17413
  • Ceterum censeo Trumpum esse delendum (Tnx Charlie)
Re: UTF8Length: A Faster Implementation.
« Reply #3 on: August 20, 2016, 12:59:41 pm »
There's more wrong than just that.
Pascal's own UTF8String is MUCH faster. This code would only help for C style strings that depend on the stupid strlen function (yes, even if Pascal's length() - that calls out to strlen for c strings - is used it will run in O(n), not O(1) which is the case for UTF8String that stores its length at a negative offset. It is silly code made up for C programmers. Also the tests should test border cases, not normal cases. That is a false sense of proof.
« Last Edit: August 20, 2016, 01:05:31 pm by Thaddy »
Due to censorship, I changed this to "Nelly the Elephant". Keeps the message clear.

JuhaManninen

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 4599
  • I like bugs.
Re: UTF8Length: A Faster Implementation.
« Reply #4 on: August 20, 2016, 01:59:57 pm »
Pascal's own UTF8String is MUCH faster. This code would only help for C style strings that depend on the stupid strlen function (yes, even if Pascal's length() - that calls out to strlen for c strings - is used it will run in O(n), not O(1) which is the case for UTF8String that stores its length at a negative offset. It is silly code made up for C programmers. Also the tests should test border cases, not normal cases. That is a false sense of proof.

Thaddy, you have serious gaps in your knowledge about Unicode. Please learn its concepts first.
The proposed function, UTF8Len, counts codepoints just like UTF8Length but faster. UTF8String does not store the count of codepoints at a negative offset or anywhere else. Counting them requires iterating over the string. Please look at how UTF8Length does it.

Another issue is that counting codepoints of a long string is not very useful by itself. The author of the "Even faster UTF-8 character counting" has a valid comment in the end:
"Well, the first rule of optimization is to start by finding a good algorithm -- and any algorithm in which the critical path involves counting UTF-8 characters in a 32 megabyte NUL-terminated string is doing something wrong. This is very much a toy problem; but the lesson it teaches is worth remembering: Vectorization is good!"
He uses a term "UTF-8 characters" while he should use "UTF-8 codepoints" but that is a small issue.
He also presents a further idea of using SSE2 in modern Intel CPUs which could allow a super-super fast implementation.

For the reasons mentioned, optimizing UTF8CharacterLength is more useful than optimizing UTF8Length. Iterating codepoints is needed in many algorithm, for example for searching text in word boundaries and returning their codepoint positions.
I plan to apply the version with "case" presented elsewhere.
UTF8Length is also needed sometimes and a faster version can be applied. The optimized versions don't check the validity of UTF-8 which is OK in most use-cases, but it means the original functions must stay as they are.
« Last Edit: August 20, 2016, 10:44:34 pm by JuhaManninen »
Mostly Lazarus trunk and FPC 3.2 on Manjaro Linux 64-bit.

Thaddy

  • Hero Member
  • *****
  • Posts: 17413
  • Ceterum censeo Trumpum esse delendum (Tnx Charlie)
Re: UTF8Length: A Faster Implementation.
« Reply #5 on: August 20, 2016, 02:18:54 pm »
If UTF-8 length is implemented - perhaps inevitable - in such a way and does away with the whole concept of a Pascal string the choice for utf-8 is even worse than I thought..
But I indeed lack utf-8 knowledge. Though I know enough about the other unicode encodings like 16 and 32.
Due to censorship, I changed this to "Nelly the Elephant". Keeps the message clear.

JuhaManninen

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 4599
  • I like bugs.
Re: UTF8Length: A Faster Implementation.
« Reply #6 on: August 20, 2016, 03:01:31 pm »
If UTF-8 length is implemented - perhaps inevitable - in such a way and does away with the whole concept of a Pascal string the choice for utf-8 is even worse than I thought..
But I indeed lack utf-8 knowledge. Though I know enough about the other unicode encodings like 16 and 32.

Do you? UTF-16 requires iterating codepoints for counting them just like UTF-8 does. UTF-32 is a fixed-width encoding but how useful is that? As mentioned earlier, counting codepoints is not needed often. Most of the time codeunit offsets can be used. Strings are copied and passed around, sometimes throug AnsiLowercase() and so on. Then UTF-32 only hogs memory and slows things down, also by causing more cache misses.

Codepoints are easy to handle in any case. The true complexity of Unicode is in combined codepoints and their complex rules. They form the "Unicode characters". It does not depend on encoding at all.

There are certain types of applications that must care about Unicode more than others. Code for text editor and text layout are good examples.
There is still plenty of code out there that claims to support UTF-16 but does not. This forum SW is a good example. Just copy to your post any of the > 50000 characters that require a surrogate pair in UTF-16 and see what happens. It is broken, it does not support Unicode.
The reasons are in history. People still treat UTF-16 as a fixed width encoding. :(
Mostly Lazarus trunk and FPC 3.2 on Manjaro Linux 64-bit.

engkin

  • Hero Member
  • *****
  • Posts: 3112
Re: UTF8Length: A Faster Implementation.
« Reply #7 on: August 20, 2016, 03:25:26 pm »
That code is wrong because as far as I can see it doesn't cover full-length 4 byte codepoints (as per UTF8 specification!) . Basically useless.

Here is the thing, the text should follow the UTF8 specification then it is possible to count its codepoints using this code.

Do you have any UTF8 text that this code fails to count its codepoints?

engkin

  • Hero Member
  • *****
  • Posts: 3112
Re: UTF8Length: A Faster Implementation.
« Reply #8 on: August 20, 2016, 03:29:44 pm »
Code: Pascal  [Select][+][-]
  1. const
  2.   ONEMASK=$01010101;
Is this correct? ONEMASK is used with PtrUInt which matches the size of Pointer, potentially 64 bits.

You are right, Juha. I did not test it on a 64bit system. Maybe ONEMASK should be changed to?
Code: Pascal  [Select][+][-]
  1.   ONEMASK=$01010101{$IfDef CPU64}01010101{$EndIf};

engkin

  • Hero Member
  • *****
  • Posts: 3112
Re: UTF8Length: A Faster Implementation.
« Reply #9 on: August 20, 2016, 03:48:48 pm »
Another issue is that counting codepoints of a long string is not very useful by itself. The author of the "Even faster UTF-8 character counting" code has a valid comment in the end

Yes, that is true. I simply wanted to test the idea in Pascal and share it with everyone here. I was hoping to set some achievable speed goal for UTF8Length.

I know that code like the one presented here will not be accepted, usually. Readability, and maintainability are good but in rare occasions they form an obstacle to produce fast code like.

For the reasons mentioned, optimizing UTF8CharacterLength is more useful than optimizing UTF8Length.

No doubt about that.

UTF8Length is also needed sometimes and a faster version can be applied. The optimized versions don't check the validity of UTF-8 which is OK in most use-cases, but it means the original functions must stay as they are.

I agree.

Fungus

  • Sr. Member
  • ****
  • Posts: 354
Re: UTF8Length: A Faster Implementation.
« Reply #10 on: August 20, 2016, 07:36:54 pm »
I was just wondering; To determine the size of a UTF8 code point the only thing you have to do is to detect how many high-order bits are "1" before you hit a zero, right?

0xxxxxxx = 1 byte (basic 7 bit ascii)
110xxxxx = 2 byte
1110xxxx = 3 byte

and so on.. Like described here: https://en.wikipedia.org/wiki/UTF-8#Description

If that is the case it really should not be that difficult to create an efficient UTF8Length routine..
« Last Edit: August 20, 2016, 07:47:28 pm by Fungus »

JuhaManninen

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 4599
  • I like bugs.
Re: UTF8Length: A Faster Implementation.
« Reply #11 on: August 20, 2016, 11:07:34 pm »
If that is the case it really should not be that difficult to create an efficient UTF8Length routine..

Yes, the current UTF8Length is also relatively efficient. It iterates byte by byte and calls UTF8CharacterLength which also checks the validity.
The UTF8Len proposed by engkin is extremely fast because it assumes the data is valid (no check) and it uses the CPU's native data size. In a 64-bit CPU it means 8 bytes at once. The logic operations required for it are rather difficult to follow.
:)
It would be fascinating to have an IFDEFed assembly version with SIMD instructions for CPUs that support it. How much faster would it be?
Mostly Lazarus trunk and FPC 3.2 on Manjaro Linux 64-bit.

shobits1

  • Sr. Member
  • ****
  • Posts: 271
  • .
Re: UTF8Length: A Faster Implementation.
« Reply #12 on: August 21, 2016, 12:40:45 am »
I have been testing and found just by inlining (add inline;) to UTF8CharacterLength function would improve the performance by 40% that means the same function would use ~0.7 of the time needed without inline;

Is there any reason why not do this ??

engkin

  • Hero Member
  • *****
  • Posts: 3112
Re: UTF8Length: A Faster Implementation.
« Reply #13 on: August 21, 2016, 01:07:13 am »
I was just wondering; To determine the size of a UTF8 code point the only thing you have to do is to detect how many high-order bits are "1" before you hit a zero, right?

Yes, I tried to come up with a faster code here. If you read the rest of that thread, you will see that ondrejpokorny suggested a better solution.

If that is the case it really should not be that difficult to create an efficient UTF8Length routine..

That's what we would like to achieve. ;-)

I have been testing and found just by inlining (add inline;) to UTF8CharacterLength function would improve the performance by 40% that means the same function would use ~0.7 of the time needed without inline;

Is there any reason why not do this ??

For a function like UTF8CharacterLength, I don't see a reason not to use inlining. Could you repeat your tests with 3 or 4byte codepoint text? Simple 1byte codepoints are going to be fast, but it slows down with, lets say, Japanese text, I guess.

Edit:
On the other hand, inlining is not guaranteed, so what code did you test?
« Last Edit: August 21, 2016, 01:15:58 am by engkin »

shobits1

  • Sr. Member
  • ****
  • Posts: 271
  • .
Re: UTF8Length: A Faster Implementation.
« Reply #14 on: August 21, 2016, 03:18:33 am »
Here is the result when applying inline for different scenario:

Quote
  file of :                  time (compared to non inline)      performance improvement
  $  (1 code unit)          0.69                                         45%
  ¢  (2 code unit)          0.73                                         37%
  €  (3 code unit)          0.78                                         28%
  𐍈  (4 code unit)          0.75                                         32%
  $¢€𐍈¢𐍈𐍈€ (mixed)     0.85                                         18%

So, I think adding inline by default is a must.
« Last Edit: August 21, 2016, 03:20:35 am by shobits1 »

 

TinyPortal © 2005-2018