Recent

Author Topic: TStrings.GetTextStr patch  (Read 1056 times)

ASerge

  • Hero Member
  • *****
  • Posts: 1395
TStrings.GetTextStr patch
« on: March 18, 2019, 07:04:57 pm »
Before posting to bugtracker, I decided to show up on the forum first, in case I missed something.

A simple project to show the bug in TStrings.GetTextStr:
Code: Pascal  [Select]
  1. program TestStringWithDiffCodePages;
  2.  
  3. {$APPTYPE CONSOLE}
  4. {$MODE OBJFPC}
  5. {$LONGSTRINGS ON}
  6. {$CODEPAGE UTF8}  // Assuming the file is saved in UTF 8 encoding
  7.  
  8. uses Windows, Classes;  
  9.  
  10. procedure MsgBox(const S: string);
  11. begin
  12.   MessageBoxW(0, PWideChar(UnicodeString(S)), 'Info', MB_ICONINFORMATION);
  13. end;
  14.  
  15. procedure ShowByStringList(const A: array of string);
  16. var
  17.   L: TStringList;
  18. begin
  19.   L := TStringList.Create;
  20.   try
  21.     L.AddStrings(A);
  22.     //L.LineBreak := 'à';
  23.     MsgBox(L.Text);
  24.   finally
  25.     L.Free;
  26.   end;
  27. end;
  28.  
  29. var
  30.   S: RawByteString;
  31.   UserInput: string;
  32. begin
  33.   //SetMultiByteFileSystemCodePage(CP_UTF8); // Imitation of the behavior of LCL
  34.   //ShowByStringList(['', '', '']);
  35.   S := 'It was about 5 hours';
  36.   //SetCodePage(S, 1250);
  37.   Write('Enter any string: ');
  38.   Readln(UserInput);
  39.   ShowByStringList([UserInput, S, 'Il était à peu près 5 heures', '大約5個小時']);
  40. end.
The problem is that GetTextStr simply ignores the code pages of the list strings.
The proposed solution works as fast as the old code, if the code pages of all lines are the same (more frequent case), slower, but correct when different.
It also takes into account the case of a non-standard line break in particular that its code page does not match those of the strings.
Add this to the beginning of the project above to check the fix:
Code: Pascal  [Select]
  1. type
  2.   TStringList = class(Classes.TStringList)
  3.   protected
  4.     function GetTextStr: string; override;
  5.   end;
  6.  
  7. function TStringList.GetTextStr: string;
  8.  
  9.   procedure ConvertToUtf8(var S: string);
  10.   begin
  11.     // Fans of inlining so complicated the AnsiToUtf8, that one this call
  12.     // generates a lot of code, so it is made as a separate procedure
  13.     S := AnsiToUtf8(S);
  14.   end;
  15.  
  16. var
  17.   LLineBreak: string;
  18.   CodePage: TSystemCodePage;
  19.  
  20.   // In most cases, the line break is an ASCII string and does not depend
  21.   // on the code page, but we must to check
  22.   function LineBreakDependsOnCodePage: Boolean;
  23.   var
  24.     ChangedLineBreak: RawByteString;
  25.   begin
  26.     ChangedLineBreak := LLineBreak;
  27.     SetCodePage(ChangedLineBreak, CodePage, True); // With conversion
  28.     Result := (LLineBreak <> ChangedLineBreak);
  29.   end;
  30.  
  31. const
  32.   CLineBreaks: array[TTextLineBreakStyle] of string = (
  33.     #10, // tlbsLF
  34.     #13#10, // tlbsCRLF
  35.     #13); // tlbsCR
  36. var
  37.   i, LCount: SizeInt;
  38.   DifferentCodePages, CodePageValid: Boolean;
  39.   CurrentStringLength, ResultLength, LineBreakLength: SizeInt;
  40.   CurrentString: string;
  41.   InsertPoint: PAnsiChar;
  42. begin
  43.   LCount := Self.Count;
  44.   if LCount <= 0 then
  45.     Exit('');
  46.   CheckSpecialChars;
  47.   LLineBreak := Self.LineBreak;
  48.   // When the LineBreak is default, the TextLineBreakStyle is used
  49.   if LLineBreak = sLineBreak then
  50.     LLineBreak := CLineBreaks[TextLineBreakStyle];
  51.   // Calculate the required place for the result
  52.   ResultLength := 0;
  53.   LineBreakLength := Length(LLineBreak);
  54.   // When scanning, check that the code pages of all strings are the same
  55.   // We do not use the default code page to avoid conversions when
  56.   // all code pages are the same but do not match the default
  57.   DifferentCodePages := False;
  58.   CodePageValid := False;
  59.   CodePage := 0; // Only to suppress the compiler warning
  60.   i := 0;
  61.   repeat
  62.     CurrentString := Strings[i];
  63.     if CurrentString <> '' then
  64.     begin
  65.       if CodePageValid then
  66.         DifferentCodePages := (StringCodePage(CurrentString) <> CodePage)
  67.       else
  68.       begin
  69.         CodePage := StringCodePage(CurrentString);
  70.         CodePageValid := True;
  71.         DifferentCodePages := LineBreakDependsOnCodePage;
  72.       end;
  73.       Inc(ResultLength, Length(CurrentString));
  74.     end;
  75.     Inc(ResultLength, LineBreakLength);
  76.     Inc(i);
  77.   until (i >= LCount) or DifferentCodePages;
  78.   if not CodePageValid then // All strings are empty, so we use the LineBreak code page
  79.   begin
  80.     if LLineBreak = '' then // Empty strings with an empty line break
  81.       Exit('');
  82.     CodePage := StringCodePage(LLineBreak);
  83.   end;
  84.   if DifferentCodePages then // Convert all strings to UTF8
  85.   begin
  86.     CodePage := CP_UTF8;
  87.     ConvertToUtf8(LLineBreak);
  88.     LineBreakLength := Length(LLineBreak);
  89.     ResultLength := 0;
  90.     i := 0;
  91.     repeat
  92.       CurrentString := Strings[i];
  93.       if CurrentString <> '' then
  94.       begin
  95.         ConvertToUtf8(CurrentString);
  96.         Inc(ResultLength, Length(CurrentString));
  97.       end;
  98.       Inc(ResultLength, LineBreakLength);
  99.       Inc(i);
  100.     until i >= LCount;
  101.   end;
  102.   // OK, new length defined, allocate and copy to it
  103.   SetLength(Result, ResultLength);
  104.   SetCodePage(RawByteString(Result), CodePage, False); // without conversion!
  105.   InsertPoint := Pointer(Result);
  106.   i := 0;
  107.   repeat
  108.     CurrentString := Strings[i];
  109.     if CurrentString <> '' then
  110.     begin
  111.       if DifferentCodePages then
  112.         ConvertToUtf8(CurrentString);
  113.       CurrentStringLength := Length(CurrentString);
  114.       System.Move(Pointer(CurrentString)^, InsertPoint^, CurrentStringLength);
  115.       Inc(InsertPoint, CurrentStringLength);
  116.     end;
  117.     System.Move(Pointer(LLineBreak)^, InsertPoint^, LineBreakLength);
  118.     Inc(InsertPoint, LineBreakLength);
  119.     Inc(i);
  120.   until i >= LCount;
  121.   if SkipLastLineBreak then
  122.     SetLength(Result, Length(Result) - LineBreakLength);
  123. end;

marcov

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 7362
Re: TStrings.GetTextStr patch
« Reply #1 on: March 18, 2019, 07:21:14 pm »
The simplest test is to test it with an encoding other than utf8, and a test set with strings that use the full range (0..255) of that encoding.

Anyway,  a conversion to utf8 that is not guarded by a check for a codepage (e.g. stringcodepage) is enormously doubtful.

engkin

  • Hero Member
  • *****
  • Posts: 2513
Re: TStrings.GetTextStr patch
« Reply #2 on: March 18, 2019, 08:01:56 pm »
Two notes:
1-AnsiToUtf8 converts to UnicodeString before it converts to UTF8.
2-When DifferentCodePages is true, the conversion to UTF8 occurs twice. Which makes it four coversions: codepage->Unicode->UTF8 twice.

Might be faster, when different code pages, to initially have 4x the total length of strings (to account for the worst case, and this total length should work for both UTF16 and UTF8), then convert to UnicodeString before converting to UTF8 at the end.

Thaddy

  • Hero Member
  • *****
  • Posts: 8681
Re: TStrings.GetTextStr patch
« Reply #3 on: March 18, 2019, 09:06:56 pm »
Two notes:
1-AnsiToUtf8 converts to UnicodeString before it converts to UTF8.
2-When DifferentCodePages is true, the conversion to UTF8 occurs twice. Which makes it four coversions: codepage->Unicode->UTF8 twice.

Might be faster, when different code pages, to initially have 4x the total length of strings (to account for the worst case, and this total length should work for both UTF16 and UTF8), then convert to UnicodeString before converting to UTF8 at the end.
3. Also note Tstrings is a FPC class and not part of Lazarus. The FPC compiler supports Ansi, UTF16 and ShortString. Be careful with UTF8 other than its codepage support for patches that are geared towards UTF8: doesn't belong in FPC..
4. The conversion to UTF8 from UTF16 is already transparent and assignment compatible.
5. The UTF16 lengths can indeed differ from UTF8: 2-4 and 1-2-3-4 bytes
« Last Edit: March 18, 2019, 09:09:08 pm by Thaddy »
Most people that want to use threading should learn to patch their jeans first: use a needle.

ASerge

  • Hero Member
  • *****
  • Posts: 1395
Re: TStrings.GetTextStr patch
« Reply #4 on: March 18, 2019, 09:28:18 pm »
The simplest test is to test it with an encoding other than utf8, and a test set with strings that use the full range (0..255) of that encoding.
Readln(UserInput);

ASerge

  • Hero Member
  • *****
  • Posts: 1395
Re: TStrings.GetTextStr patch
« Reply #5 on: March 18, 2019, 09:32:06 pm »
Two notes:
1-AnsiToUtf8 converts to UnicodeString before it converts to UTF8.
Of course, this is required for the correct conversion of different code pages.

Quote
2-When DifferentCodePages is true, the conversion to UTF8 occurs twice. Which makes it four coversions: codepage->Unicode->UTF8 twice.
As I said, this rarely happens, but if you offer a faster, but correct option, it will be great.

ASerge

  • Hero Member
  • *****
  • Posts: 1395
Re: TStrings.GetTextStr patch
« Reply #6 on: March 18, 2019, 09:36:33 pm »
3. Also note Tstrings is a FPC class and not part of Lazarus. The FPC compiler supports Ansi, UTF16 and ShortString. Be careful with UTF8 other than its codepage support for patches that are geared towards UTF8: doesn't belong in FPC..
Classes compiled with {$LONGSTRINGS ON}, so string is AnsiString.

ASerge

  • Hero Member
  • *****
  • Posts: 1395
Re: TStrings.GetTextStr patch
« Reply #7 on: March 31, 2019, 02:38:07 pm »
2-When DifferentCodePages is true, the conversion to UTF8 occurs twice. Which makes it four coversions: codepage->Unicode->UTF8 twice.
As I said, this rarely happens, but if you offer a faster, but correct option, it will be great.
Next version. With speed test.