Author Topic: UCS2xxToUTF8 should remove LE/BE marks  (Read 3820 times)

Zaher

  • Hero Member
  • *****
  • Posts: 680
    • parmaja.org
UCS2xxToUTF8 should remove LE/BE marks
« on: November 09, 2018, 11:38:58 pm »
If I pass a UCS2LE string to this function, directly or through the ConvertEncoding functions, it should remove the LE/BE marks at the beginning of the string.

The functions concerned are UCS2LEToUTF8 and UCS2BEToUTF8.

I fixed it like this. Is it good enough to submit as a patch, or would you rather fix it your own way?

Code: [Select]
function UCS2BEToUTF8(const s: string): string;
var
  len: Integer;
  Src: PWord;
  Dest: PChar;
  i: Integer;
  c: Word;
begin
  len:=length(s) div 2;
  if len=0 then
    exit('');
  SetLength(Result,len*3);// UTF-8 is at most three times the size
  Src:=PWord(Pointer(s));
  Dest:=PChar(Result);
  for i:=1 to len do begin
    c:=BEtoN(Src^);
    inc(Src);
    // Skip a byte order mark (U+FEFF), but only when it is the very first code unit
    if not ((i = 1) and (c = $feff)) then
    begin
      if ord(c)<128 then begin
        Dest^:=chr(c);
        inc(Dest);
      end else begin
        inc(Dest,UnicodeToUTF8SkipErrors(c,Dest));
      end;
    end;
  end;
  len:={%H-}PtrUInt(Dest)-PtrUInt(Result);
  if len>length(Result) then
    raise Exception.Create('');
  SetLength(Result,len);
end;

if not ((i = 1) and (c = $feff)) then
is used in both functions.
« Last Edit: November 10, 2018, 12:37:35 am by Zaher »

engkin

  • Hero Member
  • *****
  • Posts: 3112
Re: UCS2xxToUTF8 should remove LE/BE marks
« Reply #1 on: November 10, 2018, 03:46:25 am »
Quote
If I pass a UCS2LE string to this function, directly or through the ConvertEncoding functions, it should remove the LE/BE marks at the beginning of the string.

Why?

On one hand, the BOM is one of the Unicode characters (U+FEFF). Its UTF-8 representation is $EF $BB $BF. It can also occur in the middle of a string (for whatever reason):
Code: Pascal  [Select][+][-]
uses
  Windows;
var
  s: UnicodeString;
begin
  s := 'A'#$FEFF'B';  // U+FEFF sits between two ordinary characters
  MessageBoxW(0, PWideChar(s), PWideChar(s), 0);
end.

On the other hand, if you are calling UCS2LEToUTF8 (or UCS2BEToUTF8), an earlier stage must already have determined the encoding/endianness of the text/file/stream. That stage is the proper place to remove the BOM character, if you must.
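
The byte values quoted above can be checked by hand. U+FEFF lies in the range U+0800..U+FFFF, so UTF-8 spreads its 16 bits over the three-byte pattern 1110xxxx 10xxxxxx 10xxxxxx. A minimal sketch of that arithmetic follows; EncodeThreeByteUTF8 is a hypothetical helper written only for this illustration and is not part of LazUTF8 or LConvEncoding.

```pascal
program BomUtf8Bytes;
{$mode objfpc}{$H+}
uses
  SysUtils;

// Encode a code point in the range U+0800..U+FFFF as three UTF-8 bytes.
// Hypothetical helper for illustration only (no surrogate or range checks).
function EncodeThreeByteUTF8(CodePoint: Word): string;
begin
  Result := '';
  SetLength(Result, 3);
  Result[1] := Chr($E0 or (CodePoint shr 12));           // 1110xxxx: top 4 bits
  Result[2] := Chr($80 or ((CodePoint shr 6) and $3F));  // 10xxxxxx: middle 6 bits
  Result[3] := Chr($80 or (CodePoint and $3F));          // 10xxxxxx: low 6 bits
end;

var
  Bytes: string;
  i: Integer;
begin
  Bytes := EncodeThreeByteUTF8($FEFF);
  // Print each byte as two hex digits
  for i := 1 to Length(Bytes) do
    Write(IntToHex(Ord(Bytes[i]), 2), ' ');
  WriteLn;
end.
```

Run as is, this should print EF BB BF, which is the same sequence a UTF-8 BOM uses.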

Zaher

Re: UCS2xxToUTF8 should remove LE/BE marks
« Reply #2 on: November 10, 2018, 11:27:51 am »
Quote
On one hand, BOM is one of the Unicode characters (U+FEFF). Its UTF-8 representation is $EF $BB $BF. It could occur in the middle of the string as well (for whatever reason):
[...] encoding/endianness of the text/file/stream. That stage is the proper place to remove the BOM character, if you must.

That is convincing, but the BOM is then converted to a zero-width character: if I convert a UCS2 file to UTF-8 and load it into SynEdit, it shows a zero-width char at the start of the text.

I have to remove it manually before passing it to SynEdit
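
One way to do that manual removal is to strip the BOM from the UTF-8 result rather than from the UCS2 input. The sketch below assumes the converted string starts with the three bytes $EF $BB $BF when the source had a BOM; StripLeadingUtf8Bom is a hypothetical name, not an LConvEncoding function. Because Delete works on the string variable itself, this should not need a second full-size copy as long as the string is not shared with another variable.

```pascal
program StripUtf8Bom;
{$mode objfpc}{$H+}

const
  // U+FEFF encoded as UTF-8
  Utf8BomBytes = #$EF#$BB#$BF;

// Remove a single leading UTF-8 BOM from S, if present.
// Hypothetical helper for illustration; a BOM elsewhere in S is left alone.
procedure StripLeadingUtf8Bom(var S: string);
begin
  if (Length(S) >= 3) and (S[1] = #$EF) and (S[2] = #$BB) and (S[3] = #$BF) then
    Delete(S, 1, 3);
end;

var
  Converted: string;
begin
  // Stand-in for the result of UCS2LEToUTF8 on input that began with a BOM
  Converted := Utf8BomBytes + 'Hello';
  StripLeadingUtf8Bom(Converted);
  WriteLn(Converted);
end.
```

In real use, Converted would be the string returned by the conversion call, and the cleaned string would then be handed to SynEdit.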

munair

  • Hero Member
  • *****
  • Posts: 798
  • compiler developer @SharpBASIC
    • SharpBASIC
Re: UCS2xxToUTF8 should remove LE/BE marks
« Reply #3 on: November 11, 2018, 09:22:31 am »
The proper way to go is to first find out if the file has any BOM at the start. If so, which one is it? That should already tell you what encoding you're dealing with. Here is a code snippet from my FreeBasic project dealing with the same thing. Should be easy to understand:
Code: FreeBasic  [Select][+][-]
function TEncodings.BomUTF16BE(byref s as const string) as string
    return chr(&hFE, &hFF) + s
end function

function TEncodings.BomUTF16LE(byref s as const string) as string
    return chr(&hFF, &hFE) + s
end function

function TEncodings.BomUTF32BE(byref s as const string) as string
    return chr(&h0, &h0, &hFE, &hFF) + s
end function

function TEncodings.BomUTF32LE(byref s as const string) as string
    return chr(&hFF, &hFE, &h0, &h0) + s
end function

function TEncodings.Decode(byref s as const string) as string
    ' decode from some Unicode encoding to UTF-8
    if len(s) > 0 then
        if left(s, 3) = chr(&hEF, &hBB, &hBF) then
            return DecodeUTF8(mid(s, 4))
        elseif left(s, 4) = chr(&h0, &h0, &hFE, &hFF) then
            return DecodeUTF32BE(mid(s, 5))
        elseif left(s, 4) = chr(&hFF, &hFE, &h0, &h0) then
            return DecodeUTF32LE(mid(s, 5))
        elseif left(s, 2) = chr(&hFE, &hFF) then
            return DecodeUTF16BE(mid(s, 3))
        elseif left(s, 2) = chr(&hFF, &hFE) then
            return DecodeUTF16LE(mid(s, 3))
        else
            ' assume ASCII -> UTF-8
            return DecodeUTF8(s)
        end if
    end if
    return ""
end function
« Last Edit: November 11, 2018, 09:24:35 am by Munair »

Zaher

Re: UCS2xxToUTF8 should remove LE/BE marks
« Reply #4 on: November 11, 2018, 12:06:19 pm »
I am already using GuessEncoding() to detect the file/string encoding.

But if I want to remove the BOM bytes from the start of the string before calling the convert function, copying the string takes a large amount of extra memory when the file is huge. That is why I would like the convert function itself to skip the BOM.

engkin

Re: UCS2xxToUTF8 should remove LE/BE marks
« Reply #5 on: November 11, 2018, 09:01:59 pm »
Quote
I am already using GuessEncoding() to detect the file/string encoding.

But if I want to remove the BOM bytes from the start of the string before calling the convert function, copying the string takes a large amount of extra memory when the file is huge. That is why I would like the convert function itself to skip the BOM.

As Munair and I mentioned before:
Quote
The proper way to go is to first find out if the file has any BOM at the start.

GuessEncoding uses the BOM signature to detect UCS2xx/UTF8BOM. Simply read 4 bytes from the file to detect whether its encoding has a BOM. If it does, you already know the encoding and can read the file without the BOM (skip 2 bytes for UCS2 and 3 bytes for UTF8BOM). Otherwise, read the whole file and pass it to GuessEncoding again.
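
The approach described above can be sketched as follows: peek at the first bytes of the stream, and if they form a BOM, leave the stream positioned just past it so the BOM never enters the string at all. This is only a sketch under stated assumptions. TBomKind and SkipBom are hypothetical names, the check is done by hand instead of calling GuessEncoding, and UTF-32 is ignored (so a UTF-32LE file, which also starts with $FF $FE, would be misreported as UCS2LE).

```pascal
program SkipBomOnRead;
{$mode objfpc}{$H+}
uses
  Classes, SysUtils;

type
  // Hypothetical enumeration for this sketch
  TBomKind = (bkNone, bkUTF8, bkUCS2LE, bkUCS2BE);

// Report which BOM, if any, starts at the current stream position.
// On return the stream is positioned just after the BOM,
// or back at the starting position when no BOM was found.
function SkipBom(Stream: TStream): TBomKind;
var
  Head: array[0..2] of Byte;
  StartPos: Int64;
  Count: Integer;
begin
  FillChar(Head, SizeOf(Head), 0);
  StartPos := Stream.Position;
  Count := Stream.Read(Head, SizeOf(Head));
  Result := bkNone;
  if (Count >= 3) and (Head[0] = $EF) and (Head[1] = $BB) and (Head[2] = $BF) then
    Result := bkUTF8
  else if (Count >= 2) and (Head[0] = $FF) and (Head[1] = $FE) then
    Result := bkUCS2LE
  else if (Count >= 2) and (Head[0] = $FE) and (Head[1] = $FF) then
    Result := bkUCS2BE;
  case Result of
    bkUTF8:             Stream.Position := StartPos + 3;
    bkUCS2LE, bkUCS2BE: Stream.Position := StartPos + 2;
  else
    Stream.Position := StartPos;
  end;
end;

const
  // BOM followed by 'A' in UCS-2 little endian
  Sample: array[0..3] of Byte = ($FF, $FE, $41, $00);
var
  Stream: TMemoryStream;
  Kind: TBomKind;
  Body: string;
begin
  Body := '';
  Stream := TMemoryStream.Create;
  try
    Stream.WriteBuffer(Sample, SizeOf(Sample));
    Stream.Position := 0;
    Kind := SkipBom(Stream);
    // Read only the payload that follows the BOM
    SetLength(Body, Stream.Size - Stream.Position);
    if Length(Body) > 0 then
      Stream.ReadBuffer(Body[1], Length(Body));
    WriteLn('BOM kind: ', Ord(Kind), ', payload bytes: ', Length(Body));
  finally
    Stream.Free;
  end;
end.
```

With a TFileStream in place of the memory stream, Body would hold the file contents without the BOM and could be passed to UCS2LEToUTF8 or UCS2BEToUTF8 according to Kind, with no extra copy made just to drop two bytes.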

Note that GuessEncoding does not detect UTF32xx. Also, for UTF-8 *without* a BOM, it loops through the whole string.

Zaher

Re: UCS2xxToUTF8 should remove LE/BE marks
« Reply #6 on: April 14, 2022, 03:57:27 pm »
Back again

In LConvEncoding, the ConvertEncodingFromUTF8 function removes the BOM when the encoding is UTF8BOM. Why does UCS2LEToUTF8 not remove the LE/BE marks too?

Code: [Select]
  if AFrom=EncodingUTF8BOM then begin Result:=UTF8BOMToUTF8(s); exit; end;
....
  if AFrom=EncodingUCS2LE then begin Result:=UCS2LEToUTF8(s); exit; end;

Bart

  • Hero Member
  • *****
  • Posts: 5290
    • Bart en Mariska's Webstek
Re: UCS2xxToUTF8 should remove LE/BE marks
« Reply #7 on: April 14, 2022, 06:54:54 pm »
Please file a bug report.

Bart

Zaher

Re: UCS2xxToUTF8 should remove LE/BE marks
« Reply #8 on: April 14, 2022, 07:53:30 pm »
OK, I will make a patch too, thank you.

Zoran

  • Hero Member
  • *****
  • Posts: 1830
    • http://wiki.lazarus.freepascal.org/User:Zoran
Re: UCS2xxToUTF8 should remove LE/BE marks
« Reply #9 on: April 29, 2022, 08:59:01 am »
If the source ucs2le/ucs2be/utf16 text starts with a BOM, of course I expect the resulting utf8 string to have it as well.

A standard function should never assume that the programmer wants more than he actually wrote.
Ucs2LEToUtf8 is supposed to convert the string from Ucs2Le to Utf8. All regular Ucs2Le characters ought to be converted and copied.
Assuming that the programmer actually wants one more step is very wrong.

If I want to remove BOM, I should check the first character and skip it when calling the function.
If you need this behaviour often, then create your functions:

Code: Pascal  [Select][+][-]
uses
  ...
  LConvEncoding, ...

...

function UCS2LEToUtf8SkipBOM(const S: AnsiString): AnsiString;
begin
  if Copy(S, 1, 2) = UTF16LEBOM then
    Result := UCS2LEToUTF8(Copy(S, 3))
  else
    Result := UCS2LEToUTF8(S);
end;

function UTF8ToUCS2LESkipBOM(const S: AnsiString): AnsiString;
begin
  if Copy(S, 1, 3) = UTF8BOM then
    Result := UTF8ToUCS2LE(Copy(S, 4))
  else
    Result := UTF8ToUCS2LE(S);
end;
« Last Edit: April 29, 2022, 09:04:12 am by Zoran »
