Author Topic: [SOLVED]How to get the position of different characters in UTF8 strings  (Read 11444 times)

tomitomy

  • Sr. Member
  • ****
  • Posts: 251
Re: [SOLVED]How to get the position of different characters in UTF8 strings
« Reply #15 on: November 11, 2017, 02:02:57 pm »
I added a return value; it is more convenient to decide inside the function whether a difference was found than outside it. I also added a safety check.

Code: Pascal
// Assumes SysUtils (for the Str.Length string helper) and LazUTF8
// (UTF8CharToByteIndex, UTF8Length) are in the uses clause.

// First index is 1, last index is Length(Str)
// Return value: True if a difference was found, False if not
function UTF8DiffBytePos(Str1, Str2: string; var Start1, Start2: integer; Reverse: boolean = False): boolean;

  procedure GoToCpStartStr1;
  var
    b: byte;
  begin  // Go to the beginning of the UTF-8 code point in Str1
    while Start1 > 0 do begin // Start1 ends up <= 0 if the UTF-8 encoding is invalid
      b := Ord(Str1[Start1]) shr 6;
      // b = 3: lead byte (11xxxxxx); b = 0 or 1: ASCII byte (0xxxxxxx)
      if (b = 3) or (b shr 1 = 0) then
        break;
      Dec(Start1);
    end;
  end;

  procedure GoToCpStartStr2;
  var
    b: byte;
  begin  // Go to the beginning of the UTF-8 code point in Str2
    while Start2 > 0 do begin // Start2 ends up <= 0 if the UTF-8 encoding is invalid
      b := Ord(Str2[Start2]) shr 6;
      if (b = 3) or (b shr 1 = 0) then
        break;
      Dec(Start2);
    end;
  end;

begin
  Result := False;
  // Safety check: both start positions must lie inside their strings
  if (Start1 <= 0) or (Start2 <= 0) or (Start1 > Str1.Length) or (Start2 > Str2.Length) then Exit;

  if Reverse then begin
    // Walk backwards while the bytes match
    while (Start1 >= 1) and (Start2 >= 1) and (Str1[Start1] = Str2[Start2]) do begin
      Dec(Start1);
      Dec(Start2);
    end;
    if Start1 > 1 then
      GoToCpStartStr1;
    if Start2 > 1 then
      GoToCpStartStr2;
    Result := (Start1 > 0) and (Start2 > 0);
  end else begin
    // Walk forwards while the bytes match
    while (Start1 <= Str1.Length) and (Start2 <= Str2.Length) and (Str1[Start1] = Str2[Start2]) do begin
      Inc(Start1);
      Inc(Start2);
    end;
    if Start1 <= Str1.Length then
      GoToCpStartStr1;
    if Start2 <= Str2.Length then
      GoToCpStartStr2;
    Result := (Start1 <= Str1.Length) and (Start2 <= Str2.Length);
  end;
end;

// First index is 1, last index is UTF8Length(Str)
// Return value: True if a difference was found, False if not
function UTF8Diff(Str1, Str2: string; var Start1, Start2: integer; Reverse: boolean = False): boolean;
begin
  // Convert the 1-based character indexes to 1-based byte indexes
  if not Reverse then begin
    Dec(Start1);
    Dec(Start2);
  end;

  Start1 := UTF8CharToByteIndex(PChar(Str1), Str1.Length, Start1);
  Start2 := UTF8CharToByteIndex(PChar(Str2), Str2.Length, Start2);

  if not Reverse then begin
    Inc(Start1);
    Inc(Start2);
  end;

  Result := UTF8DiffBytePos(Str1, Str2, Start1, Start2, Reverse);

  // Convert the resulting byte indexes back to character indexes
  if Start1 > 0 then Start1 := UTF8Length(PChar(Str1), Start1 - 1) + 1;
  if Start2 > 0 then Start2 := UTF8Length(PChar(Str2), Start2 - 1) + 1;
end;
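
The core idea of the byte-level search (compare bytes, then step back to the start of the code point) can be sketched as a small self-contained program. This is only an illustration under assumptions: FirstDiffBytePos and IsContinuation are hypothetical names, not part of the posted code or of LazUTF8, and the sketch covers the forward direction only and assumes valid UTF-8.

```pascal
program DiffDemo;
{$mode objfpc}{$H+}

// True if the byte is a UTF-8 continuation byte (bit pattern 10xxxxxx).
function IsContinuation(c: Char): Boolean;
begin
  Result := (Ord(c) and $C0) = $80;
end;

// Hypothetical helper: 1-based byte index of the first differing code point
// when scanning forward, or 0 if one string is a prefix of the other
// (which includes the case of equal strings).
function FirstDiffBytePos(const A, B: string): Integer;
var
  i: Integer;
begin
  i := 1;
  // Skip the common byte prefix.
  while (i <= Length(A)) and (i <= Length(B)) and (A[i] = B[i]) do
    Inc(i);
  if (i > Length(A)) or (i > Length(B)) then
    Exit(0);
  // Step back over continuation bytes to the lead byte of the code point.
  while (i > 1) and IsContinuation(A[i]) do
    Dec(i);
  Result := i;
end;

var
  S1, S2: string;
begin
  // 'a' followed by U+00E9 (bytes C3 A9) and by U+00E8 (bytes C3 A8),
  // written as raw bytes so the demo does not depend on the source encoding.
  S1 := 'a'#$C3#$A9;
  S2 := 'a'#$C3#$A8;
  WriteLn(FirstDiffBytePos(S1, S2)); // prints 2
end.
```

The bytes first differ at index 3, but that byte is a continuation byte, so the reported position is index 2, where the differing code point starts.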
« Last Edit: November 11, 2017, 03:01:16 pm by tomitomy »

totya

  • Hero Member
  • *****
  • Posts: 722
I added a return value; it is more convenient to decide inside the function whether a difference was found than outside it. I also added a safety check.

It sounds good, but your code doesn't work, while JuhaManninen's code does.

SymbolicFrank

  • Hero Member
  • *****
  • Posts: 1315
Re: How to get the position of different characters in UTF8 strings
« Reply #17 on: July 07, 2019, 03:31:16 pm »
Why is it that programmers usually are careful about their program logic and try to prevent bugs, but with Unicode serious bugs are OK?
Somebody please explain.

Because it is complex and they don't understand. Or better: they think they understand. It took me quite some time to understand it myself, mostly that you cannot fit all code points in a single UTF32 char. When I got that, it became clear.

And yes, multibyte char sets, and the different 16-bit Windows ones are a mess.

engkin

  • Hero Member
  • *****
  • Posts: 3112
Re: How to get the position of different characters in UTF8 strings
« Reply #18 on: July 07, 2019, 06:02:26 pm »
Why is it that programmers usually are careful about their program logic and try to prevent bugs, but with Unicode serious bugs are OK?
Somebody please explain.

Because it is complex and they don't understand. Or better: they think they understand. It took me quite some time to understand it myself, mostly that you cannot fit all code points in a single UTF32 char. When I got that, it became clear.

And yes, multibyte char sets, and the different 16-bit Windows ones are a mess.
Any example to show "that you cannot fit all code points in a single UTF32 char"?

Thaddy

  • Hero Member
  • *****
  • Posts: 18764
  • To Europe: simply sell USA bonds: dollar collapses
Re: How to get the position of different characters in UTF8 strings
« Reply #19 on: July 07, 2019, 06:30:39 pm »
Any example to show "that you cannot fit all code points in a single UTF32 char"?
Yes, I would be very interested to see that.... 8-)

https://en.wikipedia.org/wiki/UTF-32

Indexing a code point in UTF-32 is constant time, like ASCII and UCS-2, but unlike UTF-8 and UTF-16, which are variable-width and therefore not constant time.

I think it was a typo...
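
To illustrate the constant-time remark: in UTF-8 the byte position of the N-th code point can only be found by scanning from the start, whereas in UTF-32 it is simply element N. A minimal sketch, assuming valid UTF-8; ByteIndexOfCodePoint is a hypothetical helper written for this example, not an FPC or Lazarus function.

```pascal
program NthCodePoint;
{$mode objfpc}{$H+}

// Hypothetical helper: 1-based byte index of the N-th code point (N >= 1)
// in a UTF-8 string, or 0 if the string has fewer than N code points.
// The loop is O(Length(S)): every byte before the target must be inspected.
function ByteIndexOfCodePoint(const S: string; N: Integer): Integer;
var
  i, Seen: Integer;
begin
  Seen := 0;
  for i := 1 to Length(S) do
    // Every byte that is not a continuation byte (10xxxxxx) starts a code point.
    if (Ord(S[i]) and $C0) <> $80 then
    begin
      Inc(Seen);
      if Seen = N then
        Exit(i);
    end;
  Result := 0;
end;

var
  S: string;
begin
  // U+00E9 (C3 A9), 'a', U+20AC (E2 82 AC), 'b', written as raw bytes.
  S := #$C3#$A9'a'#$E2#$82#$AC'b';
  WriteLn(ByteIndexOfCodePoint(S, 4)); // prints 7
end.
```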
« Last Edit: July 07, 2019, 06:37:12 pm by Thaddy »

SymbolicFrank

  • Hero Member
  • *****
  • Posts: 1315
No. You can create your own glyphs, by adding any amount of diacritics to any other glyph. That won't fit, there is no single code for every combination.

I'll repeat: not every unicode glyph (code point) fits in a single UTF32 char (32-bit value).

Edit: it doesn't matter what Wikipedia says about it. That's merely the viewpoint of the page maintainer. It doesn't mean that the content is actually true.
« Last Edit: July 09, 2019, 02:47:34 pm by SymbolicFrank »

lucamar

  • Hero Member
  • *****
  • Posts: 4217
No. You can create your own glyphs, by adding any amount of diacritics to any other glyph. That won't fit, there is no single code for every combination.

I'll repeat: not every unicode glyph (code point) fits in a single UTF32 char (32-bit value).

You're making a very common mistake: confusing code points and characters (or "glyphs" as you call them). Each of those diacritics you talk about is a single code point, but they may combine with other code points to form a single character. In that respect, yes: to form some characters you may need more than 32 bits ... but that is because you need (or use) more than one code point.

Google "codepoint vs character": you'll find lots of references (some even correct :)) about this.
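
A small sketch of the difference, assuming valid UTF-8 input: the same visible character "á" can be stored as one code point (precomposed U+00E1) or as two (U+0061 followed by the combining acute accent U+0301). CountCodePoints is a hypothetical helper written for this example; in Lazarus, UTF8Length from LazUTF8 serves the same purpose.

```pascal
program CodePointCount;
{$mode objfpc}{$H+}

// Counts code points in a UTF-8 string by counting every byte that is
// not a continuation byte (10xxxxxx). Assumes the input is valid UTF-8.
function CountCodePoints(const S: string): Integer;
var
  i: Integer;
begin
  Result := 0;
  for i := 1 to Length(S) do
    if (Ord(S[i]) and $C0) <> $80 then
      Inc(Result);
end;

var
  Precomposed, Decomposed: string;
begin
  // Raw bytes are used so the demo does not depend on the source encoding.
  Precomposed := #$C3#$A1;     // U+00E1
  Decomposed := 'a'#$CC#$81;   // U+0061 + U+0301
  // Each line prints the byte count, then the code point count.
  WriteLn(Length(Precomposed), ' ', CountCodePoints(Precomposed)); // prints 2 1
  WriteLn(Length(Decomposed), ' ', CountCodePoints(Decomposed));   // prints 3 2
end.
```

Both strings display as a single character, yet neither the byte count nor the code point count is the same for the two.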
« Last Edit: July 09, 2019, 03:40:17 pm by lucamar »
Turbo Pascal 3 CP/M - Amstrad PCW 8256 (512 KB !!!) :P
Lazarus/FPC 2.0.8/3.0.4 & 2.0.12/3.2.0 - 32/64 bits on:
(K|L|X)Ubuntu 12..18, Windows XP, 7, 10 and various DOSes.

SymbolicFrank

  • Hero Member
  • *****
  • Posts: 1315
That's why I was talking about glyphs. As you can see, the language isn't keeping up with the technology, as we don't have the right words yet. And I'm pretty sure many people will see the base char, including all the added diacritics, as a single code point. Because it is.

Anyway, that glyph that takes up the space for a single, written char, doesn't have to fit in a single 32-bit value. Or, in other words: even UTF-32 is multi-byte (multi-char? multi-longword?): it might require multiple units to represent a single glyph.

lucamar

  • Hero Member
  • *****
  • Posts: 4217
And I'm pretty sure many people will see the base char, including all the added diacritics, as a single code point. Because it is.

No, it isn't. It's a combination of various code-points. Let's get back in time: when there was only ASCII (and EBCDIC, but let's forget that) to display "á" you had to send: "a"+#08+"'" (apostrophe). That's obviously one "character" or "glyph" but it needs three code-points. See the difference?

Quote
Anyway, that glyph that takes up the space for a single, written char, doesn't have to fit in a single 32-bit value. Or, in other words: even UTF-32 is multi-byte (multi-char? multi-longword?): it might require multiple units to represent a single glyph.

Yes, that's right: a single "glyph" may require more than one code-point. And one has to be very aware of what language is represented and what the combining rules for that language are.

Also note that in some languages a "single character" may in fact be itself composed of various glyphs; canonical example:
  ᄀᄀᄀ각ᆨᆨ
That's what Unicode calls a "grapheme cluster", very basically a multi-glyph single-character. Or vice versa. Or whatever ... the terminology starts failing once we come this far :)
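
To make the UTF-32 point concrete, here is a hedged sketch that decodes UTF-8 into 32-bit code points. DecodeUtf8 and TCodePoints are names invented for this example; the decoder assumes valid input and does no error checking. The decomposed "á" still needs two 32-bit units.

```pascal
program Utf32Units;
{$mode objfpc}{$H+}

type
  TCodePoints = array of LongWord;

// Decodes UTF-8 into 32-bit code points. A sketch only: it does not
// validate the input or reject overlong or truncated sequences.
function DecodeUtf8(const S: string): TCodePoints;
var
  i, Extra, Count: Integer;
  b: Byte;
  cp: LongWord;
begin
  Result := nil;
  SetLength(Result, Length(S)); // upper bound: one code point per byte
  Count := 0;
  i := 1;
  while i <= Length(S) do
  begin
    b := Ord(S[i]);
    // The lead byte gives the payload bits and the number of bytes that follow.
    if b < $80 then begin cp := b; Extra := 0; end
    else if b < $E0 then begin cp := b and $1F; Extra := 1; end
    else if b < $F0 then begin cp := b and $0F; Extra := 2; end
    else begin cp := b and $07; Extra := 3; end;
    Inc(i);
    // Each continuation byte contributes its low six bits.
    while (Extra > 0) and (i <= Length(S)) do
    begin
      cp := (cp shl 6) or (Ord(S[i]) and $3F);
      Inc(i);
      Dec(Extra);
    end;
    Result[Count] := cp;
    Inc(Count);
  end;
  SetLength(Result, Count);
end;

var
  Cps: TCodePoints;
begin
  Cps := DecodeUtf8('a'#$CC#$81);   // "a" + combining acute accent
  WriteLn(Length(Cps));             // prints 2
  WriteLn(Cps[0], ' ', Cps[1]);     // prints 97 769 (U+0061, U+0301)
end.
```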
« Last Edit: July 09, 2019, 04:00:06 pm by lucamar »

SymbolicFrank

  • Hero Member
  • *****
  • Posts: 1315
If you have a glyph with one or more diacritics added and you remove that glyph, all of a sudden those diacritics are added to the glyph before it. Ergo, whatever you call it, it's a single unit.

Otherwise, I agree.

engkin

  • Hero Member
  • *****
  • Posts: 3112
By the way, diacritics are glyphs as well. You can have separate diacritics without any other glyphs. You are aware of that, right?

SymbolicFrank

  • Hero Member
  • *****
  • Posts: 1315
No, it isn't. It's a combination of various code-points. Let's get back in time: when there was only ASCII (and EBCDIC, but let's forget that) to display "á" you had to send: "a"+#08+"'" (apostrophe). That's obviously one "character" or "glyph" but it needs three code-points. See the difference?

You're mixing up different things: keys you press on the keyboard, key events, code pages and the final resulting char.

Quote
Also note that in some languages a "single character" may in fact be itself composed of various glyphs; canonical example:
  ᄀᄀᄀ각ᆨᆨ
That's what Unicode calls a "grapheme cluster", very basically a multi-glyph single-character. Or vice versa. Or whatever ... the terminology starts failing once we come this far :)

Yes, it's a mess.

By the way, diacritics are glyphs as well. You can have separate diacritics without any other glyphs. You are aware of that, right?

Words fail me. Literally.

Everything that makes up a single glyph, chars, code points, other glyphs, all goes together to form a single HOWEVER YOU WANT TO CALL IT. As long as we agree, that those bytes make up a single symbol with a unique interpretation, and if you remove parts of that, the meaning of the sequence it is a part of changes.

Again, it is a big mess, designed by committees. No engineers were involved.

And the number of applications that handle Unicode correctly is 0 (zero). Simply because it is vast, ambiguous and a moving target. And the retainers don't know what they're doing.

I'm done.

lucamar

  • Hero Member
  • *****
  • Posts: 4217
You're mixing up different things: keys you press on the keyboard, key events, code pages and the final resulting char.

No. I was talking about literally having to do "write(Output, 'a'#08'''')" (with the terminal in "overwrite" mode) to produce an apparent "á". Nothing to do with keys, keyboards or codepages.

It was not exactly the same but it is a valid approximation: three codepoints (in the ASCII 0..$7F range) for one "glyph".

SymbolicFrank

  • Hero Member
  • *****
  • Posts: 1315
You forgot the underline and bright.

lucamar

  • Hero Member
  • *****
  • Posts: 4217
You forgot the underline and bright.

Ha. Ha. Good joke. :(
Let's leave it at that. "...the gods themselves..." etc.

 
