Topic: How should I manipulate UTF8 strings

hakelm

  • Full Member
  • ***
  • Posts: 153
How should I manipulate UTF8 strings
« on: December 20, 2017, 04:35:03 pm »
I want to manipulate strings containing two-byte UTF-8 characters, and I am at a loss as to how to proceed. Some of my problems are illustrated in the little program below.
The writeln statement produces no visible output.
Running hexdump on the two files the program writes gives:
hexdump utfbug0.txt
0000000 b6c3 b6c3 b6c3                         
hexdump utfbug1.txt
0000000 b6b6 b6c3 b6c3 
so sr[1]:=s[2]; doesn't work as expected,
and I can't do s[1]:='ö'; either, which produces a compiler error.
So it seems that a statement like s[n] only accesses the nth byte of the string.
What have I misunderstood, and what can I do to access and manipulate individual characters?
Thanks in advance for any help
H

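program utfbug;
{$mode objfpc}{$H+}

uses
  {$IFDEF UNIX}{$IFDEF UseCThreads}
  cthreads,
  {$ENDIF}{$ENDIF}
  Classes, SysUtils;

// Write the raw bytes of s to the file fname.
procedure SaveString(const fname, s: string);
var
  str: TFileStream;
begin
  str := TFileStream.Create(fname, fmCreate);
  try
    // Write Length(s) bytes rather than scanning for a terminating #0.
    if s <> '' then
      str.WriteBuffer(s[1], Length(s));
  finally
    str.Free;
  end;
end;

var
  s, sr: string;
begin
  s := 'ööö';      // each 'ö' is the two bytes C3 B6, as in the hexdump output
  WriteLn(s[1]);   // writes the lone byte C3, which is not valid UTF-8 on its own
  sr := s;
  sr[1] := s[2];   // copies the second byte (B6), not the second character
  SaveString('utfbug0.txt', s);
  SaveString('utfbug1.txt', sr);
end.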
             

howardpc

  • Hero Member
  • *****
  • Posts: 4144
Re: How should I manipulate UTF8 strings
« Reply #1 on: December 20, 2017, 05:19:38 pm »
Your UTF-8 "characters" need to be stored as strings, because each one can be up to 4 bytes long.
Something like this:

Code: Pascal
program project1;

{$ifdef windows}
  {$apptype console}
{$endif}
{$Mode objfpc}{$H+}

uses
  LazUTF8;

var
  s: String = 'äöä';
  tmp: String;
  pc: PChar;
  len, start: Integer;
  total: Integer = 0;

begin
  pc:=PChar(s);
  start:=1;
  repeat
    len:=UTF8CharacterLengthFast(pc);
    Inc(total, len);

    tmp:=Copy(s, start, len);
    WriteLn(tmp);

    Inc(pc, len);
    Inc(start, len);
  until total >= Length(s);
  WriteLn('program ended, press [Enter] to exit');
  ReadLn;
end.
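
For readers without LazUtils at hand, the "up to 4 bytes" rule can be sketched with the RTL alone: the lead byte of each UTF-8 sequence tells you how many bytes follow. The function names below (Utf8SeqLen, Utf8CodepointCount, Utf8CodepointAt) are hypothetical helpers made up for this illustration, not part of LazUTF8, and the sketch assumes the input is valid UTF-8.

```pascal
program cpsize;
{$mode objfpc}{$H+}

// Size in bytes of the UTF-8 sequence that starts with lead byte b.
// Stray continuation bytes and invalid bytes count as 1 so that a
// caller always makes progress.
function Utf8SeqLen(b: Byte): Integer;
begin
  if b < $80 then Result := 1                   // 0xxxxxxx: ASCII
  else if (b and $E0) = $C0 then Result := 2    // 110xxxxx
  else if (b and $F0) = $E0 then Result := 3    // 1110xxxx
  else if (b and $F8) = $F0 then Result := 4    // 11110xxx
  else Result := 1;                             // continuation or invalid
end;

// Number of codepoints in s (assumes valid UTF-8).
function Utf8CodepointCount(const s: String): Integer;
var
  i: Integer;
begin
  Result := 0;
  i := 1;
  while i <= Length(s) do
  begin
    Inc(i, Utf8SeqLen(Ord(s[i])));
    Inc(Result);
  end;
end;

// The n-th codepoint (1-based) as a string, or '' if n is out of range.
function Utf8CodepointAt(const s: String; n: Integer): String;
var
  i, len: Integer;
begin
  Result := '';
  i := 1;
  while i <= Length(s) do
  begin
    len := Utf8SeqLen(Ord(s[i]));
    Dec(n);
    if n = 0 then
      Exit(Copy(s, i, len));
    Inc(i, len);
  end;
end;

var
  s: String;
begin
  s := #$C3#$A4#$C3#$B6#$C3#$A4;     // 'äöä' written as explicit UTF-8 bytes
  WriteLn(Length(s));                // 6 bytes
  WriteLn(Utf8CodepointCount(s));    // 3 codepoints
  WriteLn(Utf8CodepointAt(s, 2));    // the two bytes C3 B6, i.e. 'ö'
end.
```

Note that finding the n-th codepoint means scanning from the start of the string, so indexed access is linear in the string length rather than constant time as with s[n].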

m.abudrais

  • Jr. Member
  • **
  • Posts: 52
Re: How should I manipulate UTF8 strings
« Reply #2 on: December 20, 2017, 05:36:23 pm »
« Last Edit: December 20, 2017, 05:39:00 pm by m.abudrais »

JuhaManninen

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 4459
  • I like bugs.
Re: How should I manipulate UTF8 strings
« Reply #3 on: December 20, 2017, 05:48:59 pm »
Yes, and don't forget unit LazUnicode. Instead of code from howardpc you can do:

Code: Pascal
program project1;
{$Mode objfpc}{$H+}
uses LazUnicode;

var
  S: String = 'äöå';
  tmp: String;
begin
  for tmp in S do
    WriteLn(tmp);
end.

Remember, the project must have LazUtils as a dependency. Otherwise it can access neither LazUTF8 nor LazUnicode.
BTW, UTF8CharacterLengthFast has been renamed in trunk to UTF8CodepointSizeFast. "Character" is a fuzzy term in Unicode and causes confusion.

Quote
so sr[1]:=s[2]; doesn't work as expected
Actually, it works as expected. It copies one byte (a CodeUnit in Unicode terms), which is still often useful.
See the link given by m.abudrais for examples.
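
As a rough illustration of why byte (code unit) access remains useful: in UTF-8, every byte of a multi-byte sequence is $80 or above, so an ASCII delimiter such as '=' can never appear inside one. The ordinary Pos and Copy routines are therefore safe for splitting on ASCII delimiters. KeyOf and ValueOf below are hypothetical helper names chosen for this sketch.

```pascal
program byteops;
{$mode objfpc}{$H+}

// Text before the first '=' (the whole line if there is none).
function KeyOf(const line: String): String;
var
  p: SizeInt;
begin
  p := Pos('=', line);
  if p = 0 then
    Result := line
  else
    Result := Copy(line, 1, p - 1);
end;

// Text after the first '=' ('' if there is none).
function ValueOf(const line: String): String;
var
  p: SizeInt;
begin
  p := Pos('=', line);
  if p = 0 then
    Result := ''
  else
    Result := Copy(line, p + 1, Length(line) - p);
end;

var
  line: String;
begin
  line := 'city=Malm'#$C3#$B6;        // 'city=Malmö' with the ö as explicit bytes
  WriteLn(KeyOf(line));               // city
  WriteLn(Length(ValueOf(line)));     // 6: 'Malm' is 4 bytes, 'ö' is 2
end.
```

The split never cuts a multi-byte sequence in half, because the cut positions are determined by an ASCII byte.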
Mostly Lazarus trunk and FPC 3.2 on Manjaro Linux 64-bit.

hakelm

  • Full Member
  • ***
  • Posts: 153
Re: How should I manipulate UTF8 strings
« Reply #4 on: December 20, 2017, 09:42:08 pm »
Thank you, both of you.
I had hoped that the intricacies of UTF8 had been wrapped up in the traditional string routines.
Replacement, insertion, deletion etc. of character symbols now is a bit more complicated and perhaps error prone.
An object that realises these tasks could be of benefit to irregular programmers like me.
Is there such an animal or is there any advice on best string programming around?
H

JuhaManninen

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 4459
  • I like bugs.
Re: How should I manipulate UTF8 strings
« Reply #5 on: December 21, 2017, 10:28:39 am »
Quote
I had hoped that the intricacies of UTF8 had been wrapped up in the traditional string routines.
I don't know what "traditional" means in this context but all intricacies of UTF-8 have indeed been wrapped into functions in unit LazUTF8.
It is not really very intricate. Only CodePoints are encoded. The real complexity of Unicode is elsewhere.
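
A minimal sketch of what that looks like in practice, assuming the project depends on LazUtils and the source file is saved as UTF-8: UTF8Length, UTF8Copy, UTF8Delete and UTF8Insert from LazUTF8 take 1-based codepoint positions instead of byte positions. The variable names are only for illustration.

```pascal
program utf8edit;
{$mode objfpc}{$H+}

uses
  LazUTF8;   // part of the LazUtils package

var
  s: String;
begin
  s := 'äöä';
  WriteLn(Length(s));          // byte count
  WriteLn(UTF8Length(s));      // codepoint count
  WriteLn(UTF8Copy(s, 2, 1));  // second codepoint: ö

  // Replace the first codepoint: delete it, then insert the new one.
  UTF8Delete(s, 1, 1);         // s is now 'öä'
  UTF8Insert('å', s, 1);       // s is now 'åöä'
  WriteLn(s);
end.
```

Each of these calls has to walk the string to convert a codepoint index into a byte offset, so code that edits at many positions is usually better off iterating once and working with byte offsets.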

Quote
Replacement, insertion, deletion etc. of character symbols now is a bit more complicated and perhaps error prone.
What do you mean by "character"?
The wiki link given to you shows that CodeUnit resolution is still very useful with Unicode.

Quote
An object that realises these tasks could be of benefit to irregular programmers like me.
Is there such an animal or is there any advice on best string programming around?
When working with Unicode you must understand at least its basics. There is no shortcut.
« Last Edit: December 21, 2017, 11:55:40 am by JuhaManninen »

 
