UCS2xxToUTF8 should remove LE/BE marks

Zaher

« on: November 09, 2018, 11:38:58 pm »

If I pass a UCS2LE string to this function, directly or through the ConvertEncoding functions, it should remove the LE/BE mark (the BOM) at the beginning of the string.

This applies to the functions UCS2LEToUTF8 and UCS2BEToUTF8.

I fixed it like this. Is it good enough to submit as a patch, or would you rather fix it your own way?

Code: [Select]

function UCS2BEToUTF8(const s: string): string;
var
  len: Integer;
  Src: PWord;
  Dest: PChar;
  i: Integer;
  c: Word;
begin
  len:=length(s) div 2;
  if len=0 then
    exit('');
  SetLength(Result,len*3);// UTF-8 is at most three times the size
  Src:=PWord(Pointer(s));
  Dest:=PChar(Result);
  for i:=1 to len do begin
    c:=BEtoN(Src^);
    inc(Src);
    // Skip U+FEFF only when it is the very first code unit (the BOM)
    if not ((i = 1) and (c = $feff)) then
    begin
      if ord(c)<128 then begin
        Dest^:=chr(c);
        inc(Dest);
      end else begin
        inc(Dest,UnicodeToUTF8SkipErrors(c,Dest));
      end;
    end;
  end;
  len:={%H-}PtrUInt(Dest)-PtrUInt(Result);
  if len>length(Result) then
    raise Exception.Create('');
  SetLength(Result,len);
end;

The same check,
if not ((i = 1) and (c = $feff)) then
is used in both functions.

« Last Edit: November 10, 2018, 12:37:35 am by Zaher »

github.com/parmaja
github.com/zaher
https://codeberg.org/zaher/zaher

engkin

« Reply #1 on: November 10, 2018, 03:46:25 am »

Quote from: Zaher on November 09, 2018, 11:38:58 pm

If I pass a UCS2LE string to this function, directly or through the ConvertEncoding functions, it should remove the LE/BE mark (the BOM) at the beginning of the string.

Why?

On one hand, the BOM is an ordinary Unicode character (U+FEFF). Its UTF-8 representation is $EF $BB $BF. It can also occur in the middle of a string (for whatever reason):

Code: Pascal [Select][+]

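uses
  Windows;
var
  s: UnicodeString;
begin
  s := 'A'#$FEFF'B';
  // Shows "A" + U+FEFF + "B": the BOM character sits in the middle of the string
  MessageBoxW(0, PWideChar(s), PWideChar(s), 0);
end.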

On the other hand, if you are calling UCS2LEToUTF8 (or UCS2BEToUTF8), then an earlier stage has already discovered the encoding/endianness of the text/file/stream. That stage is the proper place to remove the BOM character, if you must.
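As a rough illustration of that detection stage, here is a minimal sketch in plain Free Pascal. DetectBom and TBomKind are hypothetical names made up for this example; they are not part of LConvEncoding. The sketch only looks at the UTF-8 and UTF-16 signatures and deliberately ignores UTF-32.

```pascal
program BomDemo;
{$mode objfpc}{$H+}

type
  TBomKind = (bkNone, bkUTF8, bkUTF16LE, bkUTF16BE);

// Hypothetical helper (not from LConvEncoding): reports which BOM, if any,
// starts the raw byte string S, and how many bytes that BOM occupies.
// UTF-32 signatures are not checked here.
function DetectBom(const S: RawByteString; out BomLen: Integer): TBomKind;
begin
  Result := bkNone;
  BomLen := 0;
  if (Length(S) >= 3) and (S[1] = #$EF) and (S[2] = #$BB) and (S[3] = #$BF) then
  begin
    Result := bkUTF8;      // U+FEFF encoded as UTF-8
    BomLen := 3;
  end
  else if (Length(S) >= 2) and (S[1] = #$FF) and (S[2] = #$FE) then
  begin
    Result := bkUTF16LE;   // U+FEFF stored little-endian
    BomLen := 2;
  end
  else if (Length(S) >= 2) and (S[1] = #$FE) and (S[2] = #$FF) then
  begin
    Result := bkUTF16BE;   // U+FEFF stored big-endian
    BomLen := 2;
  end;
end;

var
  Kind: TBomKind;
  N: Integer;
begin
  // BOM in little-endian order, followed by 'A' (U+0041) as UCS-2 LE
  Kind := DetectBom(#$FF#$FE#$41#$00, N);
  WriteLn('BOM kind index: ', Ord(Kind), ', BOM length: ', N);
end.
```

Once the length of the BOM is known, the caller can decide whether to drop those bytes before handing the rest to the matching conversion function.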

Zaher

« Reply #2 on: November 10, 2018, 11:27:51 am »

Quote from: engkin on November 10, 2018, 03:46:25 am

On one hand, BOM is one of the Unicode characters (U+FEFF). Its UTF8 representation is $EF $BB $BF. It could occur in the middle of the string as well (for whatever reason):

On the other hand, if you were to use UCS2LEToUTF8 (or UCS2BEToUTF8) then you had a previous stage to discover the encoding/endianness of the text/file/stream. That stage is the proper place to remove the BOM character, if you must.

That is convincing, but the BOM ends up as a zero-width character: if I load a UCS2 file into SynEdit after converting it to UTF-8, it shows a zero-width char at the start of the text.

I have to remove it manually before passing the text to SynEdit.
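For reference, that manual removal could look roughly like the sketch below. StripLeadingUtf8Bom is a hypothetical helper name, and the string built in the main block is only a stand-in for the output of UCS2LEToUTF8 on text that began with U+FEFF.

```pascal
program StripBomDemo;
{$mode objfpc}{$H+}

// Hypothetical helper: removes a leading UTF-8 BOM ($EF $BB $BF) from S.
// Only the start of the string is checked, so a U+FEFF that appears
// later in the text is left alone.
procedure StripLeadingUtf8Bom(var S: string);
begin
  if (Length(S) >= 3) and (S[1] = #$EF) and (S[2] = #$BB) and (S[3] = #$BF) then
    Delete(S, 1, 3);
end;

var
  Utf8Text: string;
begin
  // Stand-in for converted text that still carries the BOM
  Utf8Text := #$EF#$BB#$BF + 'Hello';
  StripLeadingUtf8Bom(Utf8Text);
  WriteLn('Length after stripping: ', Length(Utf8Text));
end.
```

Delete works on the existing string variable, so this avoids building a second string with Copy, although the runtime may still reallocate internally.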

munair

« Reply #3 on: November 11, 2018, 09:22:31 am »

The proper way to go is to first find out if the file has any BOM at the start. If so, which one is it? That should already tell you what encoding you're dealing with. Here is a code snippet from my FreeBasic project dealing with the same thing. Should be easy to understand:

Code: FreeBasic [Select][+]

function TEncodings.BomUTF16BE(byref s as const string) as string
        return chr(&hFE, &hFF) + s
end function
 
function TEncodings.BomUTF16LE(byref s as const string) as string
        return chr(&hFF, &hFE) + s
end function
 
function TEncodings.BomUTF32BE(byref s as const string) as string
        return chr(&h0, &h0, &hFE, &hFF) + s
end function
 
function TEncodings.BomUTF32LE(byref s as const string) as string
        return chr(&hFF, &hFE, &h0, &h0) + s
end function
 
function TEncodings.Decode(byref s as const string) as string
        ' decode from some Unicode encoding to UTF-8
        if len(s) > 0 then
                if left(s, 3) = chr(&hEF, &hBB, &hBF) then
                        return DecodeUTF8(mid(s, 4))
                elseif left(s, 4) = chr(&h0, &h0, &hFE, &hFF) then
                        return DecodeUTF32BE(mid(s, 5))
                elseif left(s, 4) = chr(&hFF, &hFE, &h0, &h0) then
                        return DecodeUTF32LE(mid(s, 5))
                elseif left(s, 2) = chr(&hFE, &hFF) then
                        return DecodeUTF16BE(mid(s, 3))
                elseif left(s, 2) = chr(&hFF, &hFE) then
                        return DecodeUTF16LE(mid(s, 3))
                else
                        ' assume ASCII -> UTF-8
                        return DecodeUTF8(s)
                end if
        end if
        return ""
end function

« Last Edit: November 11, 2018, 09:24:35 am by Munair »

keep it simple

Zaher

« Reply #4 on: November 11, 2018, 12:06:19 pm »

I am already using GuessEncoding() to detect the file/string encoding.

But if I want to remove the BOM bytes from the start of the string before calling the Convert function, copying the string takes another large block of memory when the file is huge. That is why I would like the convert function itself to skip the BOM.

engkin

« Reply #5 on: November 11, 2018, 09:01:59 pm »

Quote from: Zaher on November 11, 2018, 12:06:19 pm

I am already using GuessEncoding() to detect the file/string encoding.

But if I want to remove the BOM bytes from the start of the string before calling the Convert function, copying the string takes another large block of memory when the file is huge. That is why I would like the convert function itself to skip the BOM.

As Munair and I mentioned before:

Quote

The proper way to go is to first find out if the file has any BOM at the start.

GuessEncoding uses the BOM signature to detect UCS2xx/UTF8BOM. Simply read 4 bytes from the file to detect whether its encoding has a BOM. If it does, then you already know the encoding and you can read the file without the BOM (skip 2 bytes for UCS2 and 3 bytes for UTF8BOM). Otherwise, read the whole file and pass it to GuessEncoding.

Note that GuessEncoding does not detect UTF32xx. Also, for UTF8 *without* a BOM, it loops through the whole string.
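The read-then-skip approach described above might be sketched as follows, using only Classes and SysUtils. LoadWithoutBom is a hypothetical helper written for this example; a real version would probably also report which BOM was found so the caller can pick UCS2LEToUTF8 or UCS2BEToUTF8.

```pascal
program ReadSkippingBom;
{$mode objfpc}{$H+}

uses
  Classes, SysUtils;

// Hypothetical helper: loads a file into a byte string, leaving out a leading
// UTF-8 or UTF-16 BOM. BomLen reports how many bytes were skipped (0, 2 or 3).
function LoadWithoutBom(const FileName: string; out BomLen: Integer): RawByteString;
var
  Stream: TFileStream;
  Head: array[0..3] of Byte;
  Got: Integer;
begin
  Result := '';
  BomLen := 0;
  Stream := TFileStream.Create(FileName, fmOpenRead or fmShareDenyWrite);
  try
    // Peek at the first bytes only; the rest of the file is not touched yet
    FillChar(Head, SizeOf(Head), 0);
    Got := Stream.Read(Head, SizeOf(Head));
    if (Got >= 3) and (Head[0] = $EF) and (Head[1] = $BB) and (Head[2] = $BF) then
      BomLen := 3   // UTF8BOM
    else if (Got >= 2) and (((Head[0] = $FF) and (Head[1] = $FE)) or
                            ((Head[0] = $FE) and (Head[1] = $FF))) then
      BomLen := 2;  // UCS2LE or UCS2BE
    // Seek to just past the BOM and read the remainder into one buffer
    Stream.Position := BomLen;
    SetLength(Result, Stream.Size - BomLen);
    if Length(Result) > 0 then
      Stream.ReadBuffer(Result[1], Length(Result));
  finally
    Stream.Free;
  end;
end;

var
  Data: RawByteString;
  Skipped: Integer;
begin
  if ParamCount < 1 then
  begin
    WriteLn('usage: ReadSkippingBom <file>');
    Halt(1);
  end;
  Data := LoadWithoutBom(ParamStr(1), Skipped);
  WriteLn('skipped ', Skipped, ' BOM byte(s), read ', Length(Data), ' byte(s)');
end.
```

Because the BOM is skipped while reading, the file content is held in memory only once and no trimmed copy of the string is needed before conversion.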

Zaher

« Reply #6 on: April 14, 2022, 03:57:27 pm »

Back again.

In LConvEncoding, the ConvertEncodingToUTF8 function removes the BOM when the source encoding is UTF8BOM. Why does UCS2LEToUTF8 not remove the LE/BE mark too?

Code: [Select]

  if AFrom=EncodingUTF8BOM then begin Result:=UTF8BOMToUTF8(s); exit; end;
....
  if AFrom=EncodingUCS2LE then begin Result:=UCS2LEToUTF8(s); exit; end;

Bart

« Reply #7 on: April 14, 2022, 06:54:54 pm »

Please file a bug report.

Bart

Zaher

« Reply #8 on: April 14, 2022, 07:53:30 pm »

OK, I will make a patch too, thank you.

Zoran

« Reply #9 on: April 29, 2022, 08:59:01 am »

If the source ucs2le/ucs2be/utf16 text starts with a BOM, of course I expect the resulting utf8 string to have it as well.

The standard function should never assume that the programmer wants more than he actually writes.
Ucs2LEToUtf8 is supposed to convert the string from Ucs2Le to Utf8. All regular Ucs2Le characters ought to be converted and copied.
Assuming that the programmer actually wants to do one step more is very wrong.

If I want to remove the BOM, I should check the first character and skip it when calling the function.
If you need this behaviour often, then create your own functions:

Code: Pascal [Select][+]

uses
  ...
  LConvEncoding, ...
 
...
 
function UCS2LEToUtf8SkipBOM(const S: AnsiString): AnsiString;
begin
  if Copy(S, 1, 2) = UTF16LEBOM then
    Result := UCS2LEToUTF8(Copy(S, 3))
  else
    Result := UCS2LEToUTF8(S);
end;
 
function UTF8ToUCS2LESkipBOM(const S: AnsiString): AnsiString;
begin
  if Copy(S, 1, 3) = UTF8BOM then
    Result := UTF8ToUCS2LE(Copy(S, 4))
  else
    Result := UTF8ToUCS2LE(S);
end;
 
 

« Last Edit: April 29, 2022, 09:04:12 am by Zoran »
