Unicode, and IntToStr and PosEx: How to make it work?

EganSolo

Sr. Member
Posts: 290

Unicode, and IntToStr and PosEx: How to make it work?

« on: February 19, 2017, 03:33:53 am »

So, I'm trying to use unicode... yeah...

I've set {$modeswitch UnicodeStrings} in my units, but I'm using IntToStr and the compiler complains about the implicit type conversion from AnsiString to UnicodeString.

I can get rid of this warning by explicit typecasting as in

Code: Pascal

UnicodeString(IntToStr())

Is this the right thing to do?
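
For reference, here is a minimal sketch of that cast inside a complete program. It assumes FPC 3.x and uses {$mode objfpc} only to show that the modeswitch does not depend on {$mode Delphi}; the program and variable names are made up for illustration. IntToStr only ever produces ASCII digits and an optional minus sign, so the explicit cast cannot lose information.

```pascal
program IntToStrCastDemo;
{$mode objfpc}{$H+}
{$modeswitch unicodestrings}   // in this unit only: string = UnicodeString

uses
  SysUtils;

var
  Count: Integer;
  U: UnicodeString;
begin
  Count := 290;
  // IntToStr yields an AnsiString; the explicit cast states the conversion
  // on purpose instead of leaving it to an implicit (warned-about) one.
  U := UnicodeString(IntToStr(Count));
  WriteLn('Length = ', Length(U));
end.
```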

More importantly, when I set {$modeswitch UnicodeStrings}, it turns string into UnicodeString in my unit only, not globally across all units.
Therefore SysUtils remains AnsiString.

Do I have to worry about every method in SysUtils now, such as Trim, TrimLeft, Pos, PosEx etc?

The other option would be to remove this switch and reinstate {$H+} and then convert UnicodeString into String...

Also, can I use {$modeswitch UnicodeStrings} without {$mode Delphi}?

This entire Unicode / ANSI code business smells like a messed-up marriage.

We need a counselor...

« Last Edit: February 19, 2017, 09:55:54 am by EganSolo »

EganSolo

Sr. Member
Posts: 290

Re: Unicode and IntToStr: Why the warning?

« Reply #1 on: February 19, 2017, 09:55:10 am »

OK, so I'm doing a bit more digging...
Consider the following simple program

Code: Pascal

program Project1;
{$modeswitch UnicodeStrings}
uses sysutils,Classes,strutils;
var S1, S2 : String;
    S3     : String;
    i,j    : integer;
begin
  S1 := 'Bonjour Sérénità';
  S3 := 'à';
  i  := Pos('Bonjour', S1);
  J  := PosEx(S3,S1,i);
  WriteLn('i = ', i, ' j = ', j);
  Readln();
end.
 

When you run it, j = 18, when in fact, it should be 15.

Also, consider this other bit of code:

Code: Pascal

program Project1;
{$modeswitch UnicodeStrings}
uses sysutils,Classes,strutils;
var S1, S2 : String;
    C      : Char  ;
    i,j    : integer;
begin
  S1 := 'Bonjour Sérénità';
  C  := WideChar('à');
  i  := Pos('Bonjour', S1);
  J  := PosEx(C,S1,i);
  WriteLn('i = ', i, ' j = ', j);
  Readln();
end.
 

This code does not compile. I get an error because the compiler can't find a suitable PosEx function for the widechar:
project1.lpr(11,9) Error: Can't determine which overloaded function to call

Any suggestions on how to fix this?

Thaddy

Hero Member
Posts: 14380
Sensorship about opinions does not belong here.

Re: Unicode, and IntToStr and PosEx: How to make it work?

« Reply #2 on: February 19, 2017, 10:30:14 am »

That code compiles with trunk.
I seem to remember that the particular posex was also backported to 3.0.2.
Which version of FPC are you using? Try upgrading to 3.0.2 first.

[edit]

I verified that it is indeed back-ported to 3.0.2.
So upgrade to 3.0.2 and your code works as expected.

« Last Edit: February 19, 2017, 10:50:53 am by Thaddy »

Object Pascal programmers should get rid of their "component fetish" especially with the non-visuals.

JuhaManninen

Global Moderator
Hero Member
Posts: 4474
I like bugs.

Re: Unicode and IntToStr: Why the warning?

« Reply #3 on: February 19, 2017, 12:08:49 pm »

Quote from: EganSolo on February 19, 2017, 09:55:10 am

When you run it, j = 18, when in fact, it should be 15.

No, Pos() returns a byte position, so 18 is correct. UTF8Pos() would return the character position, which is 16 for this string (not 15).
You can often work with byte positions in UTF-8 data as well. See examples here:
http://wiki.freepascal.org/UTF8_strings_and_characters

Please also consider the LazUnicode unit for truly portable string-handling code:
http://wiki.freepascal.org/Better_Unicode_Support_in_Lazarus#CodePoint_functions_for_encoding_agnostic_code
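
The difference between a byte position and a character position can be sketched without LazUTF8. The program below is only an illustration under stated assumptions: it uses a plain AnsiString holding UTF-8 bytes (no modeswitch), spells the accented letters out as bytes so the source file encoding does not matter, and BytePosToCodePointPos is a hypothetical helper, not an RTL or LazUtils function.

```pascal
program Utf8BytePosDemo;
{$mode objfpc}{$H+}

// Hypothetical helper: converts a 1-based byte index in a UTF-8 string into
// a 1-based code point index by counting every byte that is not a
// continuation byte (continuation bytes look like 10xxxxxx).
function BytePosToCodePointPos(const S: AnsiString; BytePos: Integer): Integer;
var
  k: Integer;
begin
  Result := 0;
  for k := 1 to BytePos do
    if (Ord(S[k]) and $C0) <> $80 then
      Inc(Result);
end;

var
  S, Needle: AnsiString;
  BytePos: Integer;
begin
  // 'Bonjour Sérénità' written as raw UTF-8 bytes
  S := 'Bonjour S'#$C3#$A9'r'#$C3#$A9'nit'#$C3#$A0;
  Needle := #$C3#$A0;                       // 'à' in UTF-8
  BytePos := Pos(Needle, S);
  WriteLn('byte position       = ', BytePos);                           // 18
  WriteLn('code point position = ', BytePosToCodePointPos(S, BytePos)); // 16
end.
```

Counting by hand gives the same numbers: the two 'é' before 'à' each take one extra byte, so the byte position is two higher than the character position.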

Mostly Lazarus trunk and FPC 3.2 on Manjaro Linux 64-bit.

wp

Hero Member
Posts: 11923

Re: Unicode, and IntToStr and PosEx: How to make it work?

« Reply #4 on: February 19, 2017, 12:42:38 pm »

Juha, but he has {$modeswitch unicodestrings}. Shouldn't this turn string into UnicodeString?

JuhaManninen

Global Moderator
Hero Member
Posts: 4474
I like bugs.

Re: Unicode, and IntToStr and PosEx: How to make it work?

« Reply #5 on: February 19, 2017, 01:47:54 pm »

Quote from: wp on February 19, 2017, 12:42:38 pm

Juha, but he has {$modeswitch unicodestrings}. Shouldn't this turn string into UnicodeString?

I think the first example was edited and {$modeswitch unicodestrings} was added. My answer is valid only for code without the modeswitch that uses the Lazarus (UTF-8) Unicode system.
For example, the LCL currently does not work well with {$modeswitch unicodestrings}. Pure FPC programs are fine with it.

Thaddy

Hero Member
Posts: 14380
Sensorship about opinions does not belong here.

Re: Unicode, and IntToStr and PosEx: How to make it work?

« Reply #6 on: February 19, 2017, 02:19:54 pm »

Juha, the provided examples ARE FPC only and I have verified that they only work with 3.0.2 or trunk and not with 3.0.0. Lazarus has nothing to do with it.
It's plain UTF16 here.

The only thing I did not check here is a run against Delphi 2010+ to see if the results are the same. (But I did so after the initial bug fix)
The index returned is 16, which is correct. It is a character index and NOT a byte index.
One minor point: the cast to WideChar is not necessary (although it compiles as well), because Char is already UnicodeChar here.

Code: Pascal

program untitled;
{$mode delphiunicode}  // I would do that...
uses sysutils,Classes,strutils;
var S1, S2 : String;
    C      : Char  ;
    i,j    : integer;
begin
  S1 := 'Bonjour Sérénità';
  C  := 'à';
  i  := Pos('Bonjour', S1);
  J  := PosEx(C,S1,i);
  WriteLn('i = ', i, ' j = ', j);
  Readln;
end.
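
The same search can be reproduced independently of the source file encoding. The sketch below is an illustration, not part of the original test: it declares UnicodeString explicitly and builds the text from numeric code points with WideChar(...) casts, so neither a modeswitch nor a {$codepage} directive is assumed. Length and Pos then count UTF-16 code units.

```pascal
program Utf16PosDemo;
{$mode objfpc}{$H+}

var
  S, Needle: UnicodeString;
begin
  // 'Bonjour Sérénità' built from numeric code points
  S := 'Bonjour S';
  S := S + WideChar($00E9) + 'r' + WideChar($00E9) + 'nit' + WideChar($00E0);
  Needle := WideChar($00E0);                // 'à'
  WriteLn('Length = ', Length(S));          // 16 UTF-16 code units
  WriteLn('Pos    = ', Pos(Needle, S));     // 16
end.
```

For text inside the Basic Multilingual Plane, such as these French letters, one code unit corresponds to one character. Characters outside it occupy two code units (a surrogate pair), so even here the index is strictly a code unit index.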

« Last Edit: February 19, 2017, 03:03:17 pm by Thaddy »

EganSolo

Sr. Member
Posts: 290

Re: Unicode, and IntToStr and PosEx: How to make it work?

« Reply #7 on: February 20, 2017, 12:01:52 am »

Thanks everyone for your feedback.
Downloaded fpc 3.0.2. Recompiled Laz 1.6.2 with it. Reran the following code:
I've extended my test code a bit more. You can copy this code and drop it inside a simple command line project, compile and run.

Code: Pascal

program Unicode_and_pos_ex;
{$modeswitch UnicodeStrings} //Switching String to UnicodeString
uses sysutils,Classes,strutils;
var S1,S2  : String;
    S3     : String;
    C1,C2  : Char  ;
    i,j,k,l: integer;
begin
  S1 := 'Bonjour Sérénitàa';
  S3 := 'à';
//  C  := 'à';       //<== Error: Incompatible types: got "Constant String" expected "WideChar"
  C1 := 'a';         //This compiles just fine. Why the difference?
  C2 := Char('à');   //this compiles.
  i  := Pos('Bonjour', S1);
  J  := PosEx(S3,S1,i);
  K  := PosEx(C1,S1,i);
  L  := PosEx(C2,S1,i);
  WriteLn('i = ', i, ', j = ', j, ', k = ' , k, ', l = ', l);
 
 
  S2 := '     ' + S1 + '    ';
  S2 := Trim(S2);
  If S1 = S2
  then Writeln('Trim works')
  else Writeln('Trim is broken');
 
 
  S2 := UpperCase(S1);
  Writeln('Uppercase S1 = ', S2);
  S2 := Lowercase(S2);
  Writeln('Lowercase S2 = ', S2);
  If S2 = S1
  then Writeln('Uppercase / Lowercase work')
  else Writeln('Uppercase / Lowercase broken');
 
 
  i   := Pos(' ',S1);
  S2  := Copy(S1,i+1,Length(S1));
  If S2 = 'Sérénitàa'
  then writeln('copy works')
  else writeln('copy broken');
  Readln();
 
 
 {returns:
    i = 1, j = 18, k = 20, l = 0
    Trim works
    Uppercase S1 = BONJOUR SAcRAcNITA A
    Lowercase S2 = bonjour sAcrAcnitA a
    Uppercase / Lowercase broken
    copy works
 }
end.
 

Observations:

The compiler chokes on C := 'à' but not on C1 := 'a'.
PosEx for S3 returns 16 (correctly).
PosEx for C1 returns 20 for the position of 'a', but there aren't 20 characters in the string.
PosEx for C2 returns 0 for the position of 'à'.
UpperCase seems to be broken. Perhaps I'm not using it right?

So, 3.0.2 is a definite improvement over 3.0.0 but I think the PosEx for Char is still broken. As a workaround, we should be using PosEx with strings only until the PosEx with Char is fixed. Also, UpperCase is having issues or maybe I'm not understanding how to use it appropriately?
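
A possible explanation for the UpperCase result, offered with some caution: the FPC documentation describes SysUtils.UpperCase as converting ASCII letters only, so accented letters are expected to pass through unchanged. The sketch below instead uses TCharacter.ToUpper from the character unit, the same call that appears later in this thread. It builds both strings from numeric code points so the outcome does not depend on how the source file is saved, and it prints whichever branch applies rather than assuming a result.

```pascal
program UpperCaseDemo;
{$mode objfpc}{$H+}

uses
  character;

var
  S, Expected, Actual: UnicodeString;
begin
  // 'sérénità' from numeric code points
  S := 's';
  S := S + WideChar($00E9) + 'r' + WideChar($00E9) + 'nit' + WideChar($00E0);
  // 'SÉRÉNITÀ' from numeric code points
  Expected := 'S';
  Expected := Expected + WideChar($00C9) + 'R' + WideChar($00C9) + 'NIT'
              + WideChar($00C0);
  Actual := TCharacter.ToUpper(S);
  if Actual = Expected then
    WriteLn('TCharacter.ToUpper converted the accented letters')
  else
    WriteLn('TCharacter.ToUpper did not produce the expected string');
end.
```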

EganSolo

Sr. Member
Posts: 290

Re: Unicode, and IntToStr and PosEx: How to make it work?

« Reply #8 on: February 20, 2017, 08:57:14 am »

Here is another update.
I've recompiled the RegExp unit with {$modeswitch UnicodeStrings}. To do this, I copied the unit and inserted the mode switch into the copy. The unit has a few defines that turn Unicode support on, such as FPS_OS_UNICODE and UNICODE, but even with these defines enabled I was still receiving warnings about implicit type conversions between AnsiString and WideString. With UnicodeStrings switched on, these warnings went away.

To run this test, include the RegExp unit first, then open it and save it as UnicodeRegExp. Remove RegExp and keep UnicodeRegExp. Add {$modeswitch UnicodeStrings} to UnicodeRegExp and then compile your program as below.

The code below works as expected with Unicode strings, which is great news.

Code: Pascal

program TryUnicode;
{$modeswitch UnicodeStrings}
{$codepage utf-8} //We need this otherwise the test will fail spectacularly.
uses
 UnicodeRegExp
 ;
 
var S1, S2: String;
    RegExp : TRegExpr;
    c      : Char;
    Success: integer;
    failure: integer;
    aCount : integer;
begin
  //First a straightforward test.
  S1 := '[o]+';
  S2 := 'book';
  RegExp := TREgExpr.Create(S1);
  RegExp.Compile;
  If RegExp.Exec(S2)
  then writeln('test works')
  else writeln('test fails');
  //A second, simple Unicode-enabled test
  S1 := '[à]+';
  S2 := 'bààk';
  RegExp.Expression := S1;
  RegExp.Compile;
  If RegExp.Exec(S2)
  then writeln('test works')
  else writeln('test fails');
 
  //A more elaborate unicode test
  {From the unicode table:
     U+00E0     à      c3 a0   LATIN SMALL LETTER A WITH GRAVE
     U+00E1     á      c3 a1   LATIN SMALL LETTER A WITH ACUTE
     U+00E2     â      c3 a2   LATIN SMALL LETTER A WITH CIRCUMFLEX
     U+00E3     ã      c3 a3   LATIN SMALL LETTER A WITH TILDE
     U+00E4     ä      c3 a4   LATIN SMALL LETTER A WITH DIAERESIS
     U+00E5     å      c3 a5   LATIN SMALL LETTER A WITH RING ABOVE
     U+00E6     æ      c3 a6   LATIN SMALL LETTER AE
     U+00E7     ç      c3 a7   LATIN SMALL LETTER C WITH CEDILLA
     U+00E8     è      c3 a8   LATIN SMALL LETTER E WITH GRAVE
     U+00E9     é      c3 a9   LATIN SMALL LETTER E WITH ACUTE
  }
  S1 := '[à-é]+';
  RegExp.Expression := S1;
  RegExp.Compile;
  Success := 0;
  failure := 0;
  aCount  := 0;
  For c := Char('à') to Char('è') do
  begin
     Write(c, ' -- ');
     System.Inc(aCount);
     If aCount mod 20 = 0
     then Writeln;
     S2 := 'b' + c + 'k';
     If RegExp.Exec(S2)
     then System.Inc(Success)
     else System.Inc(Failure);
  end;
  Writeln('Successful match(es): ' , success);
  Writeln('Failed matches: ', failure);
  RegExp.free;
  Readln();
end.
 

Using Unicode strings is far more involved than using plain ANSI strings. Coders who work in languages other than English may already be familiar with all of this, but for the rest of us, using Unicode will prove arduous until we figure out exactly what needs to be done. For instance, if you do not include {$codepage utf-8} in the code above, it will fail. Figuring out that the French accented characters I chose require {$codepage utf-8} is neither intuitive nor straightforward. If I could specify the encoding by language, say with a directive such as {$Unicodelanguage French}, that would be far easier to handle; as it is, it feels like a hit-or-miss experience.
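
What the directive changes can be seen with a single character literal. This is a minimal sketch, assuming the file really is saved as UTF-8. Without the directive the compiler has no way to know that, and it probably treats the two UTF-8 bytes of 'à' as two separate characters, which would fit the "Constant String" error reported earlier in this thread.

```pascal
program CharLiteralDemo;
{$mode objfpc}{$H+}
{$modeswitch unicodestrings}   // Char = WideChar, string = UnicodeString
{$codepage utf8}               // this file must really be saved as UTF-8

var
  C: Char;
begin
  C := 'à';                    // a single UTF-16 code unit under the directive
  WriteLn('SizeOf(C) = ', SizeOf(C));
  WriteLn('Ord(C)    = ', Ord(C), ' (U+00E0 is ', $00E0, ')');
end.
```

If the source encoding cannot be guaranteed, writing the character as WideChar($00E0) avoids the dependency on the directive altogether.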

Still, I'm happy this is working.

More to come as I continue to explore the art of the possible with unicode.

Thaddy

Hero Member
Posts: 14380
Sensorship about opinions does not belong here.

Re: Unicode, and IntToStr and PosEx: How to make it work?

« Reply #9 on: February 20, 2017, 09:35:58 am »

One remark: use {$mode delphiunicode} and NOT {$modeswitch unicodestrings}.
There is more involved: a {$mode} is a whole set of {$modeswitch} settings.

You want 16 bit unicode UTF16 (delphi unicode), not UTF8 (Lazarus unicode).

Once you do that, your problems disappear, mostly. (Like the C variable)

Also, you can read the documentation and the source code to check whether a UTF-16 overload is already available, as it is for Trim.
All of that will come in time. If it doesn't work, check the docs and check the source code.
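
As a small illustration of what the mode changes, the sketch below prints the size of Char and the storage used by a short string under {$mode delphiunicode}. The program name is made up, and the sketch only shows the 16-bit element size. It does not cover the other switches that the mode enables.

```pascal
program ModeDemo;
{$mode delphiunicode}

var
  S: string;   // UnicodeString in this mode
  C: Char;     // WideChar in this mode
begin
  S := 'abc';
  C := S[1];
  WriteLn('first char   = ', C);
  WriteLn('SizeOf(Char) = ', SizeOf(C));
  WriteLn('Length(S)    = ', Length(S));
  WriteLn('bytes in S   = ', Length(S) * SizeOf(Char));
end.
```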

EganSolo

Sr. Member
Posts: 290

Re: Unicode, and IntToStr and PosEx: How to make it work?

« Reply #10 on: February 20, 2017, 10:02:30 pm »

Thaddy,

Thanks for the feedback. Please run the program below. First, enable {$modeswitch UnicodeStrings} and {$codepage utf-8}; then comment them out and enable {$mode delphiunicode}. You will readily see that the result with delphiunicode is far worse than the result with UnicodeStrings and utf-8. I may still be doing something wrong, though.

Code: Pascal

program Unicode_and_pos_ex;
//First run this program with UnicodeStrings and utf-8 enabled, then comment them out and enable {$mode delphiunicode}
//and run again.
{$modeswitch UnicodeStrings}
{$codepage utf-8}
//{$mode delphiunicode}
uses sysutils,strutils;
 
var S1,S2  : String;
    S3     : String;
    C1,C2  : Char  ;
 
Procedure InitVars;
begin
          {0        1       }
          {12345678901234567}
   S1 := 'bonjour sérénitàa';
   S3 := 'à';
   //  C  := 'à';       //<== Error: Incompatible types: got "Constant String" expected "WideChar"
   C1 := 'a';         //This compiles just fine. Why the difference?
   C2 := Char('à');   //this compiles.
end;
 
Procedure TestUnicodePos;
var i : integer;
begin
   Writeln('searching for the substring ''jour'' in ''bonjour sérénitàa''');
   i  := Pos('jour', S1);
   If i = 4
   then writeln('Pos works for non-accented searches')
   else writeln('Pos broken for non-accented searches. Got ', i, ' when expecting 4');
 
   Writeln('searching for the substring ''à'' in ''bonjour sérénitàa''');
   i  := Pos(S3 , S1);
   If i = 16
   then writeln('Pos works for string accented searches')
   else writeln('Pos broken for string accented searches. Got ', i, ' when expecting 16');
 
   Writeln('searching for the char ''à'' in ''bonjour sérénitàa''');
   i  := Pos(C2 , S1);
   If i = 16
   then writeln('Pos works for accented character searches')
   else writeln('Pos broken for accented character searches. Got ', i, ' when expecting 16');
end;
 
Procedure TestUnicodePosEx;
var i : integer;
begin
   Writeln('searching for the substring ''à'' in ''bonjour sérénitàa'' after pos 4');
   i  := PosEx(S3 , S1, 4);
   If i = 16
   then writeln('PosEx works for string accented searches')
   else writeln('PosEx broken for string accented searches. Got ', i, ' when expecting 16');
 
   Writeln('searching for the char ''à'' in ''bonjour sérénitàa'' after pos 4.');
   i  := PosEx(C2 , S1, 4);
   If i = 16
   then writeln('PosEx works for accented character searches')
   else writeln('PosEx broken for accented character searches. Got ', i, ' when expecting 16');
end;
 
Procedure TestUnicodeTrimming;
const BlankPad = '      ';
begin
   S2 := BlankPad + S1 + BlankPad;
   S2 := Trim(S2);
   If S1 = S2
   then Writeln('Trim works')
   else Writeln('Trim is broken');
 
   S2 := BlankPad + S1;
   S2 := TrimLeft(S2);
   If S1 = S2
   then Writeln('TrimLeft works')
   else Writeln('TrimLeft is broken');
 
   S2 := S1 + BlankPad;
   S2 := TrimRight(S2);
   If S1 = S2
   then Writeln('RightTrim works')
   else Writeln('RightTrim is broken');
end;
 
Procedure TestUnicodeCopying;
var i: integer;
begin
   i   := Pos(' ',S1);
   S2  := Copy(S1,i+1,Length(S1));
   If S2 = 'sérénitàa'
   then writeln('copy works')
   else writeln('copy broken');
end;
 
Procedure TestUnicodeUpperLower;
begin
   Writeln('original string: ' , S1);
   S2 := UpperCase(S1);
   Writeln('Uppercase S1 = ', S2);
   If S2 = 'BONJOUR SÉRÉNITÀA'
   then Writeln('Uppercase works')
   else Writeln('Uppercase broken: got ', S2, ' instead of BONJOUR SÉRÉNITÀA');
   S2 := Lowercase(S2);
   Writeln('Lowercase S2 = ', S2);
   If S2 = S1
   then Writeln('Lowercase work')
   else Writeln('Lowercase broken: got ', S2, ' instead of bonjour sérénitàa');
end;
 
begin
 InitVars;
 TestUnicodePos;
 TestUnicodePosEx;
 TestUnicodeTrimming;
 TestUnicodeCopying;
 TestUnicodeUpperLower;
 Readln();
end.
 

« Last Edit: February 21, 2017, 12:49:53 am by EganSolo »

howardpc

Hero Member
Posts: 4144

Re: Unicode, and IntToStr and PosEx: How to make it work?

« Reply #11 on: February 20, 2017, 11:32:54 pm »

You're going to have to get used to the idea that Unicode-encoded AnsiStrings, such as the UTF-8 strings used by default in the Lazarus IDE editor, use a multi-byte encoding.
Your equation
1 character = 1 byte
is a wrong assumption, except for the ASCII characters (the first 128 code points), which UTF-8 encodes in one byte each.

Pos() is not broken at all. It copes with multibyte-encoded strings and returns the byte position of the given 'character', which is identical to its apparent 'character' position only if all the characters before it are one byte long.
For instance, if you check Length(S1) in your example, you'll see that it is not 17 bytes but 20 bytes, since the three accented characters occupy two bytes each in the string (not one byte like the other low-value characters). The visual representation of the string gives no clue as to the underlying storage requirement of the string encoding.
UTF-8-encoded codepoints may require 1, 2, 3 or 4 bytes for each 'character' displayed. You can't tell by looking at a displayed string how many bytes each codepoint needs; you have to use the functions provided in LazUTF8, such as UTF8CharacterLength().
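
To make the 17 versus 20 concrete without depending on LazUTF8, here is a rough sketch. Utf8SeqLen is a hypothetical stand-in for what a function like UTF8CharacterLength() determines, not its actual implementation. It assumes well-formed UTF-8, and the test string is written as raw bytes so the source encoding plays no role.

```pascal
program Utf8LengthDemo;
{$mode objfpc}{$H+}

// Hypothetical helper: number of bytes in the UTF-8 sequence that starts
// with the given lead byte. Assumes well-formed input.
function Utf8SeqLen(Lead: Byte): Integer;
begin
  if Lead < $80 then Result := 1                   // 0xxxxxxx: ASCII
  else if (Lead and $E0) = $C0 then Result := 2    // 110xxxxx
  else if (Lead and $F0) = $E0 then Result := 3    // 1110xxxx
  else if (Lead and $F8) = $F0 then Result := 4    // 11110xxx
  else Result := 1;  // stray continuation or invalid byte: skip one byte
end;

var
  S: AnsiString;
  i, CodePoints: Integer;
begin
  // 'bonjour sérénitàa' as raw UTF-8 bytes
  S := 'bonjour s'#$C3#$A9'r'#$C3#$A9'nit'#$C3#$A0'a';
  i := 1;
  CodePoints := 0;
  while i <= Length(S) do
  begin
    Inc(i, Utf8SeqLen(Ord(S[i])));
    Inc(CodePoints);
  end;
  WriteLn('bytes       = ', Length(S));     // 20
  WriteLn('code points = ', CodePoints);    // 17
end.
```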

EganSolo

Sr. Member
Posts: 290

Re: Unicode, and IntToStr and PosEx: How to make it work?

« Reply #12 on: February 21, 2017, 12:13:08 am »

Hi Howard,

Thanks for your reply. I amended my code to include a test for length. It actually works as expected, returning 17 and not 20.
You might want to run this little program below to see what it does. Did you run your code with or without {$codepage utf-8}? Please see the code below.

I get that we need to use different codepages. The theory is simple. The practice is not: Here's what I am struggling with as I go through this gyration:

Should I use {$modeswitch unicodestrings} or {$mode delphiunicode}, as Thaddy suggested?
Which string routines support Unicode out of the box in 3.0.2? Clearly UpperCase does not. Which other routines fail?
With either {$mode delphiunicode} or {$modeswitch unicodestrings}, can I rely on TCharacter to figure out whether a string is an identifier, a symbol, etc., regardless of the actual code page?
What about collations? Do I need to use them if I'm using UTF-8?

As you can see, the details are where the bumps in the road are, and I'd wager that I'm not the only one hitting them.

Code: Pascal

program Unicode_and_pos_ex;
{$modeswitch UnicodeStrings}
{$codepage utf-8}
//{$mode delphiunicode}
uses sysutils,strutils;
 
var S1,S2  : String;
    S3     : String;
    C1,C2  : Char  ;
 
Procedure InitVars;
begin
  {0        1       }
  {12345678901234567}
   S1 := 'bonjour sérénitàa';
   S3 := 'à';
   //  C  := 'à';       //<== Error: Incompatible types: got "Constant String" expected "WideChar"
   C1 := 'a';         //This compiles just fine. Why the difference?
   C2 := Char('à');   //this compiles.
end;
 
Procedure TestUnicodeLength;
var len : integer;
begin
  len := Length(S1);
  If len = 17
  then writeln('length works')
  else writeln('length is broken: got ', len, ' expected 17');
end;
 
Procedure TestUnicodePos;
var i : integer;
begin
   Writeln('searching for the substring ''jour'' in ''bonjour sérénitàa''');
   i  := Pos('jour', S1);
   If i = 4
   then writeln('Pos works for non-accented searches')
   else writeln('Pos broken for non-accented searches. Got ', i, ' when expecting 4');
 
   Writeln('searching for the substring ''à'' in ''bonjour sérénitàa''');
   i  := Pos(S3 , S1);
   If i = 16
   then writeln('Pos works for string accented searches')
   else writeln('Pos broken for string accented searches. Got ', i, ' when expecting 16');
 
   Writeln('searching for the char ''à'' in ''bonjour sérénitàa''');
   i  := Pos(C2 , S1);
   If i = 16
   then writeln('Pos works for accented character searches')
   else writeln('Pos broken for accented character searches. Got ', i, ' when expecting 16');
end;
 
Procedure TestUnicodePosEx;
var i : integer;
begin
   Writeln('searching for the substring ''à'' in ''bonjour sérénitàa'' after pos 4');
   i  := PosEx(S3 , S1, 4);
   If i = 16
   then writeln('PosEx works for string accented searches')
   else writeln('PosEx broken for string accented searches. Got ', i, ' when expecting 16');
 
   Writeln('searching for the char ''à'' in ''bonjour sérénitàa'' after pos 4.');
   i  := PosEx(C2 , S1, 4);
   If i = 16
   then writeln('PosEx works for accented character searches')
   else writeln('PosEx broken for accented character searches. Got ', i, ' when expecting 16');
end;
 
Procedure TestUnicodeTrimming;
const BlankPad = '      ';
begin
   S2 := BlankPad + S1 + BlankPad;
   S2 := Trim(S2);
   If S1 = S2
   then Writeln('Trim works')
   else Writeln('Trim is broken');
 
   S2 := BlankPad + S1;
   S2 := TrimLeft(S2);
   If S1 = S2
   then Writeln('TrimLeft works')
   else Writeln('TrimLeft is broken');
 
   S2 := S1 + BlankPad;
   S2 := TrimRight(S2);
   If S1 = S2
   then Writeln('RightTrim works')
   else Writeln('RightTrim is broken');
end;
 
Procedure TestUnicodeCopying;
var i: integer;
begin
   i   := Pos(' ',S1);
   S2  := Copy(S1,i+1,Length(S1));
   If S2 = 'sérénitàa'
   then writeln('copy works')
   else writeln('copy broken');
end;
 
Procedure TestUnicodeUpperLower;
begin
   Writeln('original string: ' , S1);
   S2 := UpperCase(S1);
   Writeln('Uppercase S1 = ', S2);
   If S2 = 'BONJOUR SÉRÉNITÀA'
   then Writeln('Uppercase works')
   else Writeln('Uppercase broken: got ', S2, ' instead of BONJOUR SÉRÉNITÀA');
   S2 := Lowercase(S2);
   Writeln('Lowercase S2 = ', S2);
   If S2 = S1
   then Writeln('Lowercase work')
   else Writeln('Lowercase broken: got ', S2, ' instead of bonjour sérénitàa');
end;
 
begin
 InitVars;
 TestUnicodeLength;
 TestUnicodePos;
 TestUnicodePosEx;
 TestUnicodeTrimming;
 TestUnicodeCopying;
 TestUnicodeUpperLower;
 Readln();
end.
 

« Last Edit: February 21, 2017, 12:50:51 am by EganSolo »

EganSolo

Sr. Member
Posts: 290

Re: Unicode, and IntToStr and PosEx: How to make it work?

« Reply #13 on: February 21, 2017, 03:41:42 am »

Alright, one more update

Using {$mode DelphiUnicode} does not work. In fact, I can't even find a way to assign a constant char to a char: C := 'à' does not work, nor does C := Char('à').
Using {$modeswitch UnicodeStrings} in conjunction with {$codepage utf-8} works... almost. If you run the code below, you will see that all the tests succeed, but...
Console display of UTF-8 is lacking. The console displays some of the accented letters but not all. I'm still trying to figure out why. I'm on Windows, by the way. Note from the code below that UTF8ToConsole, UTF8ToWinCP and UTF8ToSys add nothing beyond what the rest of the code does. To make this work, manually add the LazUtils package to your command-line project.
SetMultiByteConversionCodePage does nothing either, which is expected since the code page is already set to UTF-8. If you're hoping to replace the {$codepage utf-8} directive with the more dynamic call to SetMultiByteConversionCodePage, you will have to contend with compiler errors if you're using Char. To see what I mean, simply comment out {$codepage utf-8} at the start of the program and uncomment SetMultiByteConversionCodePage in the InitVars procedure. You won't be able to compile the program.
Another suggestion was to use SetConsoleOutputCP, which I have commented out in my code. It actually degrades the output. I may not be using it right, but it doesn't help. If you wish to understand why, please check this excellent explanation: http://forum.lazarus.freepascal.org/index.php?topic=26562.30
In fact, it doesn't seem possible to create a console application in Lazarus with full Unicode support. See the program below to understand what I mean.
I am hopeful, though, that for most string operations I need to perform, including parsing, hashing, and regexp search, there won't be any major issues. I will post back here what I find after I run a battery of regression tests to see if something is amiss.

Code: Pascal

program Unicode_and_pos_ex;
{$modeswitch UnicodeStrings}
{$codepage utf-8}
//{$mode delphiunicode}
uses Lazutf8, SysUtils, StrUtils, Windows, character;
 
var S1,S2  : String;
    S3     : String;
//  C      : WideChar; You will need to uncomment this line if you switch to delphiunicode.
    C      : Char;
 
Procedure InitVars;
begin
  {0        1       }
  {12345678901234567}
   S1 := 'bonjour sérénitàa';
   S3 := 'à';
   C  := 'à';      //Comment this out if you switch to delphiunicode.
// C := Char('à');   You will have to uncomment this line if you switch to delphi unicode.
 
   {
 
     None of these calls affect the console or get it to render Unicode appropriately.
 
     SetMultiByteConversionCodePage(CP_UTF8);
     SetMultiByteRTLFileSystemCodePage(CP_UTF8);
     SetConsoleOutputCP(CP_UTF8); Degrades output to console. Result is worse when this is invoked.
     SetTextCodePage(Output, CP_UTF8); //Degrades output as well.
 
   }
end;
 
Procedure TestUnicodeLength;
var len : integer;
begin
  len := Length(S1);
  If len = 17
  then writeln('length works')
  else writeln('length is broken: got ', len, ' expected 17');
end;
 
Procedure TestUnicodePos;
var i : integer;
begin
   Writeln('searching for the substring ''jour'' in ''bonjour sérénitàa''');
   i  := Pos('jour', S1);
   If i = 4
   then writeln('Pos works for non-accented searches')
   else writeln('Pos broken for non-accented searches. Got ', i, ' when expecting 4');
 
   Writeln('searching for the substring ''à'' in ''bonjour sérénitàa''');
   i  := Pos(S3 , S1);
   If i = 16
   then writeln('Pos works for string accented searches')
   else writeln('Pos broken for string accented searches. Got ', i, ' when expecting 16');
 
   Writeln('searching for the char ''à'' in ''bonjour sérénitàa''');
   i  := Pos(C , S1);
   If i = 16
   then writeln('Pos works for accented character searches')
   else writeln('Pos broken for accented character searches. Got ', i, ' when expecting 16');
end;
 
Procedure TestUnicodePosEx;
var i : integer;
begin
   Writeln('searching for the substring ''à'' in ''bonjour sérénitàa'' after pos 4');
   i  := PosEx(S3 , S1, 4);
   If i = 16
   then writeln('PosEx works for string accented searches')
   else writeln('PosEx broken for string accented searches. Got ', i, ' when expecting 16');
 
   Writeln('searching for the char ''à'' in ''bonjour sérénitàa'' after pos 4.');
   i  := PosEx(C , S1, 4);
   If i = 16
   then writeln('PosEx works for accented character searches')
   else writeln('PosEx broken for accented character searches. Got ', i, ' when expecting 16');
end;
 
Procedure TestUnicodeTrimming;
const BlankPad = '      ';
begin
   S2 := BlankPad + S1 + BlankPad;
   S2 := Trim(S2);
   If S1 = S2
   then Writeln('Trim works')
   else Writeln('Trim is broken');
 
   S2 := BlankPad + S1;
   S2 := TrimLeft(S2);
   If S1 = S2
   then Writeln('TrimLeft works')
   else Writeln('TrimLeft is broken');
 
   S2 := S1 + BlankPad;
   S2 := TrimRight(S2);
   If S1 = S2
   then Writeln('RightTrim works')
   else Writeln('RightTrim is broken');
end;
 
Procedure TestUnicodeCopying;
var i: integer;
begin
   i   := Pos(' ',S1);
   S2  := Copy(S1,i+1,Length(S1));
   If S2 = 'sérénitàa'
   then writeln('copy works')
   else writeln('copy broken');
end;
 
Procedure TestUnicodeUpperLower;
const lcAconst : WideChar = 'à';
      ucAconst : WideChar = 'À';
var
  ucA : char;
begin
   Writeln('original string: ' , S1);
   S2 := TCharacter.ToUpper(S1);
   Writeln('Uppercase S1 = ', S2);
   If S2 = 'BONJOUR SÉRÉNITÀA'
   then Writeln('Uppercase works')
   else Writeln('Uppercase broken: got ', S2, ' instead of BONJOUR SÉRÉNITÀA');
   S2 := TCharacter.ToLower(S2);
   Writeln('Lowercase S2 = ', S2);
   If S2 = S1
   then Writeln('Lowercase work')
   else Writeln('Lowercase broken: got ', S2, ' instead of sérénitàa');
   ucA := TCharacter.ToUpper(lcAconst);
   If ucA = ucAConst
   then writeln('character uppercase worked')
   else writeln('character uppercase broken');
end;
 
Procedure TestIdentifier(Const S: String);
var i : integer;
begin
   Write('String ' , S);
   If length(S) = 0 then exit;
   With TCharacter do
     For i := 1 to length(S) do
     If Not (IsLetterOrDigit(S[i]) or (S[i] = '_'))
     then begin
       Writeln(' is not an identifier');
       exit;
     end;
   Writeln(' is an identifier');
end;
 
Procedure TestIdentifiers;
const French_Id1   = '___Éternité133'   ;
      French_Id2   = 'Pérénial_Témérité';
      Croatian_Id1 = 'Vječnost'         ;
      Russian_Id1  = 'Вечность'         ;
      Arabic_Id1   = 'خلود123'          ;
      NonId        = 'J''usqu''à demain';
begin
   TestIdentifier(French_Id1);
   TestIdentifier(French_Id2);
   TestIdentifier(Croatian_Id1);
   TestIdentifier(Russian_Id1);
   TestIdentifier(Arabic_Id1);
   TestIdentifier(NonId);
end;
 
begin
 InitVars;
 TestUnicodeLength;
 TestUnicodePos;
 TestUnicodePosEx;
 TestUnicodeTrimming;
 TestUnicodeCopying;
 TestUnicodeUpperLower;
 TestIdentifiers;
 Readln();
end.
 
