
Author Topic: Nearly localized Synedit tab error  (Read 677 times)

Borneq

  • Full Member
  • ***
  • Posts: 238
Nearly localized Synedit tab error
« on: December 12, 2019, 08:17:45 am »
After I enabled word wrap, an error started to appear when opening binary files. Previously I could not localize it, because it trashed memory and caused errors far away from the actual bug. Range checking turned out to be a good thing here; it points to TSynEditStringTabExpander.ExpandedString:
Code: Pascal
    CharWidths := GetPhysicalCharWidths(Pchar(Line), length(Line), Index);
    l := 0;
    for i := 0 to length(CharWidths)-1 do
      l := l + (CharWidths[i] and PCWMask);
    SetLength(Result, l); // <-- here the length is set to 91

    l := 1;
    for i := 1 to length(CharWidths) do begin
      if Line[i] <> #9 then begin
        Result[l] := Line[i]; // <-- error when l=92, length(CharWidths)=118
        inc(l);
      end else begin
        for j := 1 to (CharWidths[i-1] and PCWMask) do begin
          Result[l] := ' ';
          inc(l);
        end;
      end;
    end;
CharWidths in debugger:
Code: Pascal
Len=118:
(1,
0,
1,
0,
0,
...)
What does CharWidths mean? Is 0 the width of an invalid UTF-8 sequence? How should this be corrected?

Regarding the "if Line[i] <> #9" check: is CharWidths only relevant for tabs? Would it be corrected by adding this:
Code: Pascal
if CharWidths[i]=0 then continue;

TSynEditStringTabExpander.DoGetPhysicalCharWidths contains:
Code: Pascal
  for i := 0 to LineLen - 1 do begin
    if (PWidths^ and PCWMask) <> 0 then begin
      if Line^ = #9 then begin
        PWidths^ := (FTabWidth - (j mod FTabWidth) and PCWMask) or (PWidths^  and (not PCWMask));
        HasTab := True;
      end;
      j := j + (PWidths^ and PCWMask);
    end;
    inc(Line);
    inc(PWidths);
  end;

It calls the inherited TSynEditStringBidiChars.DoGetPhysicalCharWidths, which contains:
Code: Pascal
      #$09, //Segment_Separator
      #$20, // White_Space
      #$21..#$22, #$26..#$2A, // Other_Neutral
      #$2B, #$2D,  // EN (European Separator)
      #$23..#$25,  // European Terminator
      #$2C, #$2E, #$3A, // Common Separator
      #$3B..#$40, #$5B..#$60, #$7B..#$7E // Other_Neutral
etc.

For normal tabbed lines CharWidths looks like this:
1 1 1 1 1 3 1 1 1
That is, only 1's, plus a value from 1 to tabWidth for each tab, and no zeros.
How should zeros be handled? Should 0's be avoided in the CharWidths table, or should ExpandedString deal with them?
« Last Edit: December 12, 2019, 08:54:03 am by Borneq »

Borneq

Re: Nearly localized Synedit tab error
« Reply #1 on: December 12, 2019, 09:35:18 am »
Fix:
Code: Pascal
    l := 1;
    for i := 1 to length(CharWidths) do begin
      //if CharWidths[i-1]=0 then continue; <-- this is now unneeded
      for j := 1 to (CharWidths[i-1] and PCWMask) do begin
        if Line[i] <> #9 then
          Result[l] := Line[i]
        else
          Result[l] := ' ';
        inc(l);
      end;
    end;
    Assert(l=ll+1);
This works for both the word-wrapped tabbed file and the binary file.
But there is also a second error, independent of my fix:
If we turn wrapping on for the tabbed file and then turn it off again, the whole text disappears. The text of the binary file does not disappear. I attach the "bad" binary file and the tabbed file.

« Last Edit: December 12, 2019, 09:56:28 am by Borneq »

Martin_fr

  • Administrator
  • Hero Member
  • *
  • Posts: 6592
    • wiki
Re: Nearly localized Synedit tab error
« Reply #2 on: December 12, 2019, 03:57:37 pm »
About char widths.

In Unicode there are so-called full-width chars, e.g. some Japanese Katakana chars: トシシ (I have no idea what those chars mean, if anything. I randomly hit keys.)
Even in strictly mono-spaced fonts, those are twice as wide as a normal char.
So they will return a "2" in char-width.

Tabs return anything from 1 to whatever size needs to be skipped.

0 can happen for two reasons:
- A utf8 continuation byte. Lots of Unicode codepoints are encoded (in utf8) using more than one codeunit (byte). So the entry for the first byte will hold the display width. The entries for the other bytes will be 0.
- The first (and all continuation bytes) of a zero-width char (zero-width space, RTL-marker, ......) or a "combining codepoint".
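
The rules above can be sketched as a small Free Pascal program. This is only an illustration under simplifying assumptions, not SynEdit's GetPhysicalCharWidths: the helper SimpleCharWidths and the type TWidthArray are hypothetical names, every non-tab codepoint is treated as width 1 (full-width and zero-width codepoints are ignored), and the tab formula is the one quoted earlier in this thread.

```pascal
program CharWidthSketch;
{$mode objfpc}{$H+}

type
  TWidthArray = array of Integer;

// Hypothetical helper, NOT SynEdit's GetPhysicalCharWidths.
// Assumption: every codepoint except tab is 1 column wide
// (full-width and zero-width codepoints are not handled).
function SimpleCharWidths(const Line: string; TabWidth: Integer): TWidthArray;
var
  i, Col: Integer;
begin
  Result := nil;
  SetLength(Result, Length(Line));
  Col := 0;
  for i := 1 to Length(Line) do begin
    if (Ord(Line[i]) and $C0) = $80 then
      Result[i-1] := 0                              // UTF-8 continuation byte
    else if Line[i] = #9 then
      Result[i-1] := TabWidth - (Col mod TabWidth)  // columns up to the next tab stop
    else
      Result[i-1] := 1;                             // first (or only) byte of a codepoint
    Col := Col + Result[i-1];
  end;
end;

var
  W: TWidthArray;
  i: Integer;
begin
  // 'a', tab, 'é' encoded as the two UTF-8 bytes #$C3 #$A9, 'b'
  W := SimpleCharWidths('a'#9#$C3#$A9'b', 4);
  for i := 0 to High(W) do
    Write(W[i], ' ');   // prints: 1 3 1 0 1
  WriteLn;
end.
```

Note that the array has one entry per byte, so the sum of the widths (6 here) is smaller than the number of bytes plus tab padding whenever a line contains multi-byte sequences.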


The entire Charwidth array concept is quite costly. It may get replaced someday.....

Borneq

Re: Nearly localized Synedit tab error
« Reply #3 on: December 13, 2019, 11:38:12 pm »
The code at https://github.com/User4martin/lazarus/blob/25a6f81cecd036e027ded2330e5c0359518f35ce/components/synedit/synedittexttabexpander.pas#L295L297
assumes that only #9 can have a CharWidth <> 1, but for a Line[i] <> #9 the CharWidth can even be zero.

Martin_fr

Re: Nearly localized Synedit tab error
« Reply #4 on: December 14, 2019, 01:05:17 am »
Other chars with width <> 1 are simply copied byte by byte.

Only tabs are replaced by space(s)

The only assumption made is that any tab will actually have a charwidth >= 1.
And if it does not, then it will be replaced by an empty string (which is actually correct, given the incorrect input).
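
To make the byte-by-byte behaviour concrete, here is a minimal Free Pascal sketch. It is not the SynEdit code: ExpandTabsByteWise is a hypothetical helper, and sizing the result from the bytes the loop will actually write (rather than from the sum of the char widths) is an assumption made for this illustration only.

```pascal
program TabExpandSketch;
{$mode objfpc}{$H+}

// Hypothetical helper, NOT the SynEdit implementation.
// Assumption: Widths[i-1] holds the display width belonging to byte Line[i].
function ExpandTabsByteWise(const Line: string;
  const Widths: array of Integer): string;
var
  i, j, Len, p: Integer;
begin
  // Size the result from what the copy loop below writes:
  // one byte per non-tab byte, Widths[i-1] spaces per tab.
  Len := 0;
  for i := 1 to Length(Line) do
    if Line[i] = #9 then
      Len := Len + Widths[i-1]
    else
      Inc(Len);
  Result := '';
  SetLength(Result, Len);

  p := 1;
  for i := 1 to Length(Line) do
    if Line[i] <> #9 then begin
      Result[p] := Line[i];        // any non-tab byte is copied as-is, whatever its width
      Inc(p);
    end else
      for j := 1 to Widths[i-1] do begin
        Result[p] := ' ';          // a tab becomes Widths[i-1] spaces (none if the width is 0)
        Inc(p);
      end;
end;

begin
  // 'a', tab (width 3), 'é' as bytes #$C3 #$A9 (widths 1 and 0), 'b'
  WriteLn(Length(ExpandTabsByteWise('a'#9#$C3#$A9'b', [1, 3, 1, 0, 1])));  // prints 7
end.
```

In this example the widths sum to 6 while the loop writes 7 bytes, which is the kind of mismatch that range checking reports.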
« Last Edit: December 14, 2019, 01:07:24 am by Martin_fr »

 
