
UCS2xxToUTF8 should remove LE/BE marks


Zaher:
If I pass a UCS2LE string to this function, directly or through the ConvertEncoding functions, it should remove the LE/BE mark (BOM) at the beginning of the string.

The functions are UCS2LEToUTF8 and UCS2BEToUTF8.

I fixed it like this. Is it good enough to make into a patch, or would you rather fix it your own way?


--- Code: ---function UCS2BEToUTF8(const s: string): string;
var
  len: Integer;
  Src: PWord;
  Dest: PChar;
  i: Integer;
  c: Word;
begin
  len:=length(s) div 2;
  if len=0 then
    exit('');
  SetLength(Result,len*3);// UTF-8 is at most three times the size
  Src:=PWord(Pointer(s));
  Dest:=PChar(Result);
  for i:=1 to len do begin
    c:=BEtoN(Src^);
    inc(Src);
    // proposed change: skip U+FEFF only when it is the very first code unit
    if not ((i = 1) and (c = $feff)) then
    begin
      if ord(c)<128 then begin
        Dest^:=chr(c);
        inc(Dest);
      end else begin
        inc(Dest,UnicodeToUTF8SkipErrors(c,Dest));
      end;
    end;
  end;
  len:={%H-}PtrUInt(Dest)-PtrUInt(Result);
  if len>length(Result) then
    raise Exception.Create('');
  SetLength(Result,len);
end;

--- End code ---

if not ((i = 1) and (c = $feff)) then
is used in both functions.

engkin:

--- Quote from: Zaher on November 09, 2018, 11:38:58 pm ---If I pass a UCS2LE string to this function, directly or through the ConvertEncoding functions, it should remove the LE/BE mark (BOM) at the beginning of the string.

--- End quote ---

Why?

On one hand, the BOM is one of the Unicode characters (U+FEFF). Its UTF-8 representation is $EF $BB $BF. It could occur in the middle of the string as well (for whatever reason):

--- Code: Pascal ---var
  s: UnicodeString;
begin
  s := 'A'#$FEFF'B';
  MessageBoxW(0,@s[1], @s[1], 0);
end;

--- End code ---

On the other hand, if you were to use UCS2LEToUTF8 (or UCS2BEToUTF8) then you had a previous stage to discover the encoding/endianness of the text/file/stream. That stage is the proper place to remove the BOM character, if you must.

Zaher:

--- Quote from: engkin on November 10, 2018, 03:46:25 am ---On one hand, BOM is one of the Unicode characters (U+FEFF). Its UTF-8 representation is $EF $BB $BF. It could occur in the middle of the string as well (for whatever reason).

On the other hand, if you were to use UCS2LEToUTF8 (or UCS2BEToUTF8) then you had a previous stage to discover the encoding/endianness of the text/file/stream. That stage is the proper place to remove the BOM character, if you must.

--- End quote ---

That is convincing, but the BOM then ends up as a zero-width character: if I load a UCS2 file into SynEdit after converting it to UTF-8, it shows a zero-width char at the start of the text.

I have to remove it manually before passing the text to SynEdit.
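
For illustration, the manual removal after conversion could look roughly like the sketch below. It assumes the converted text is a plain byte string holding UTF-8, and the name StripLeadingUtf8Bom is made up for this example (it is not an LCL function). Only a BOM at the very start is removed; a U+FEFF in the middle of the text is left alone, as engkin's point suggests.

```pascal
program StripUtf8Bom;
{$mode objfpc}{$H+}

// Hypothetical helper (not part of the LCL): removes one leading
// UTF-8 encoded BOM ($EF $BB $BF) from s, if present.
procedure StripLeadingUtf8Bom(var s: string);
begin
  if (Length(s) >= 3) and
     (s[1] = #$EF) and (s[2] = #$BB) and (s[3] = #$BF) then
    Delete(s, 1, 3);
end;

var
  s: string;
begin
  // UTF-8 BOM followed by five ASCII characters
  s := #$EF#$BB#$BF'Hello';
  StripLeadingUtf8Bom(s);
  WriteLn(Length(s)); // prints 5
  WriteLn(s);         // prints Hello
end.
```

The call would go between the conversion and the assignment to the editor's text.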

munair:
The proper way to go is to first find out if the file has any BOM at the start. If so, which one is it? That should already tell you what encoding you're dealing with. Here is a code snippet from my FreeBasic project dealing with the same thing. Should be easy to understand:

--- Code: FreeBasic ---function TEncodings.BomUTF16BE(byref s as const string) as string
        return chr(&hFE, &hFF) + s
end function

function TEncodings.BomUTF16LE(byref s as const string) as string
        return chr(&hFF, &hFE) + s
end function

function TEncodings.BomUTF32BE(byref s as const string) as string
        return chr(&h0, &h0, &hFE, &hFF) + s
end function

function TEncodings.BomUTF32LE(byref s as const string) as string
        return chr(&hFF, &hFE, &h0, &h0) + s
end function

function TEncodings.Decode(byref s as const string) as string
        ' decode from some Unicode encoding to UTF-8
        if len(s) > 0 then
                if left(s, 3) = chr(&hEF, &hBB, &hBF) then
                        return DecodeUTF8(mid(s, 4))
                elseif left(s, 4) = chr(&h0, &h0, &hFE, &hFF) then
                        return DecodeUTF32BE(mid(s, 5))
                elseif left(s, 4) = chr(&hFF, &hFE, &h0, &h0) then
                        return DecodeUTF32LE(mid(s, 5))
                elseif left(s, 2) = chr(&hFE, &hFF) then
                        return DecodeUTF16BE(mid(s, 3))
                elseif left(s, 2) = chr(&hFF, &hFE) then
                        return DecodeUTF16LE(mid(s, 3))
                else
                        ' assume ASCII -> UTF-8
                        return DecodeUTF8(s)
                end if
        end if
        return ""
end function

--- End code ---

Zaher:
I am already using GuessEncoding() to detect the file/string encoding.

But if I want to remove the BOM bytes from the start of the string before calling the convert function, the string has to be copied, which takes another large block of memory when the file is huge. That is why I would like to skip the BOM inside the convert function.
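
One way to avoid the extra copy without changing what the conversion itself does would be a converter that takes a pointer and a byte count, so the caller can simply start two bytes later. The sketch below is only an illustration under that assumption: UCS2LEBomSize and UCS2LEBufToUTF8 are hypothetical names, not LCL functions, and the converter is a minimal stand-in that handles plain 16-bit code units only (no surrogate pairs, no error handling).

```pascal
program SkipBomNoCopy;
{$mode objfpc}{$H+}

// Hypothetical helper: size in bytes of a leading UCS-2 LE BOM (0 or 2).
function UCS2LEBomSize(const s: string): Integer;
begin
  if (Length(s) >= 2) and (s[1] = #$FF) and (s[2] = #$FE) then
    Result := 2
  else
    Result := 0;
end;

// Hypothetical stand-in converter: reads UCS-2 LE code units from a raw
// buffer and writes UTF-8. A trailing odd byte is ignored.
function UCS2LEBufToUTF8(p: PByte; ByteCount: SizeInt): string;
var
  i, n: SizeInt;
  c: Word;
begin
  Result := '';
  SetLength(Result, (ByteCount div 2) * 3); // at most 3 bytes per code unit
  n := 0;
  for i := 0 to (ByteCount div 2) - 1 do
  begin
    c := p[2 * i] or (Word(p[2 * i + 1]) shl 8); // little endian
    if c < $80 then
    begin
      Inc(n); Result[n] := Chr(c);
    end
    else if c < $800 then
    begin
      Inc(n); Result[n] := Chr($C0 or (c shr 6));
      Inc(n); Result[n] := Chr($80 or (c and $3F));
    end
    else
    begin
      Inc(n); Result[n] := Chr($E0 or (c shr 12));
      Inc(n); Result[n] := Chr($80 or ((c shr 6) and $3F));
      Inc(n); Result[n] := Chr($80 or (c and $3F));
    end;
  end;
  SetLength(Result, n);
end;

var
  src, utf8: string;
  skip: Integer;
begin
  // 'Hi' as UCS-2 LE with a leading BOM: FF FE 48 00 69 00
  src := #$FF#$FE'H'#0'i'#0;
  skip := UCS2LEBomSize(src);
  // Pass an offset pointer instead of a trimmed copy of src
  utf8 := UCS2LEBufToUTF8(PByte(PChar(src)) + skip, Length(src) - skip);
  WriteLn(utf8); // prints Hi
end.
```

With this shape the decision to drop the BOM stays in the detection stage, and the only allocation is the result string.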
