Author Topic: HTML to UTF8 encoding  (Read 1281 times)

dbannon

  • Hero Member
  • *****
  • Posts: 2786
    • tomboy-ng, a rewrite of the classic Tomboy
HTML to UTF8 encoding
« on: January 15, 2021, 07:21:54 am »
I am being fed some text that has things like (but not limited to) &#911; that I want to convert to nice LCL-friendly UTF8 and have displayed as Omega.

The &#911; is an HTML entity; there are thousands of them!

I have tried calls like HTTPEncode(St), UnescapeHTML(St), UTF8Decode(St) and XMLValueToStr(St), and the best they can handle are the five 'basic' ones, <, >, ;, & and ", all of which are pretty easy conversions to UTF8!

Am I building a huge lookup table here, or can someone suggest a library function that I can use?

Davo

Lazarus 3, Linux (and reluctantly Win10/11, OSX Monterey)
My Project - https://github.com/tomboy-notes/tomboy-ng and my github - https://github.com/davidbannon

circular

  • Hero Member
  • *****
  • Posts: 4196
    • Personal webpage
Re: HTML to UTF8 encoding
« Reply #1 on: January 15, 2021, 07:37:03 am »
Those codes correspond to Unicode code points. See: https://en.wikipedia.org/wiki/Character_encodings_in_HTML#Character_references

So basically, you need to parse the text, look for '&#', decode the number and then make the UTF8 string from the code point (function LazUtf8.UnicodeToUTF8).

Note that the final ';' is optional and that there can be any number of digits. So '&#00033&#33' means '!!'
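
The steps above can be sketched as a single pass over the string. This is only a minimal sketch, not a tested library routine: DecodeNumericRefs and CodePointToUTF8 are hypothetical names, and the hand-written encoder is there only so the program compiles with plain FPC. In a Lazarus project LazUTF8.UnicodeToUTF8 could presumably take its place. It handles decimal references only and copies anything it does not recognise unchanged.

```pascal
program DecodeRefsDemo;
{$mode objfpc}{$H+}

// Hypothetical helper: encode one Unicode code point (up to $10FFFF) as UTF-8.
function CodePointToUTF8(CP: Cardinal): string;
begin
  if CP < $80 then
    Result := Chr(CP)
  else if CP < $800 then
    Result := Chr($C0 or (CP shr 6)) + Chr($80 or (CP and $3F))
  else if CP < $10000 then
    Result := Chr($E0 or (CP shr 12)) + Chr($80 or ((CP shr 6) and $3F)) +
              Chr($80 or (CP and $3F))
  else
    Result := Chr($F0 or (CP shr 18)) + Chr($80 or ((CP shr 12) and $3F)) +
              Chr($80 or ((CP shr 6) and $3F)) + Chr($80 or (CP and $3F));
end;

// Hypothetical name: decode decimal references such as &#911; in one pass.
function DecodeNumericRefs(const S: string): string;
var
  i, j: Integer;
  CP: Cardinal;
begin
  Result := '';
  i := 1;
  while i <= Length(S) do
  begin
    // A reference needs '&', '#' and at least one digit.
    if (S[i] = '&') and (i + 2 <= Length(S)) and (S[i + 1] = '#')
       and (S[i + 2] in ['0'..'9']) then
    begin
      j := i + 2;
      CP := 0;
      while (j <= Length(S)) and (S[j] in ['0'..'9']) do
      begin
        if CP <= $10FFFF then                  // stop accumulating once out of range
          CP := CP * 10 + Cardinal(Ord(S[j]) - Ord('0'));
        Inc(j);
      end;
      if (j <= Length(S)) and (S[j] = ';') then
        Inc(j);                                // the final ';' is optional
      if CP <= $10FFFF then
        Result := Result + CodePointToUTF8(CP)
      else
        Result := Result + Copy(S, i, j - i);  // out of range: keep the text as it was
      i := j;
    end
    else
    begin
      Result := Result + S[i];
      Inc(i);
    end;
  end;
end;

begin
  WriteLn(DecodeNumericRefs('&#00033&#33'));   // prints !!
  WriteLn(DecodeNumericRefs('some text &amp; and &#911 and &#946'));
end.
```

Building the result in a new string avoids repeated Delete/Insert calls on the input, and leading zeros need no special handling because the digits are accumulated as a number.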
Conscience is the debugger of the mind

dbannon

  • Hero Member
  • *****
  • Posts: 2786
    • tomboy-ng, a rewrite of the classic Tomboy
Re: HTML to UTF8 encoding
« Reply #2 on: January 15, 2021, 07:52:51 am »
Ahh, that's a much better approach, Circular, I will give it a try!

> can be any number of digits.

Oh, that could be fun testing for ....

Anyway, still sounds a lot better than a huge look up table, thank you !

Davo

circular

  • Hero Member
  • *****
  • Posts: 4196
    • Personal webpage
Re: HTML to UTF8 encoding
« Reply #3 on: January 15, 2021, 07:59:55 am »
You're welcome  :)

paweld

  • Hero Member
  • *****
  • Posts: 970
Re: HTML to UTF8 encoding
« Reply #4 on: January 15, 2021, 09:21:25 am »
function ResolveHTMLEntityReference in HTMLDefs unit
Best regards / Pozdrawiam
paweld

wp

  • Hero Member
  • *****
  • Posts: 11858
Re: HTML to UTF8 encoding
« Reply #5 on: January 15, 2021, 10:06:45 am »
> function ResolveHTMLEntityReference in HTMLDefs unit

Unfortunately this operates on WideStrings/WideChars, like the other HTML/XML tools of the FCL. This means unnecessary string conversions for the UTF8 used by Lazarus. I wonder why the FPC team focuses so much on WideStrings here.
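
To make the conversion step concrete, here is a small sketch that assumes only the RTL's UTF8Encode; the variable names are illustrative and no HTMLDefs routine is called. It turns a single WideChar, such as the one a WideString-based routine would hand back, into a UTF8 string.

```pascal
program WideToUtf8Demo;
{$mode objfpc}{$H+}

var
  W: WideString;
  U: UTF8String;
begin
  W := WideChar(911);        // U+038F, held as one UTF-16 code unit
  U := UTF8Encode(W);        // the extra conversion a UTF8-based program needs
  WriteLn(Length(W), ' ', Length(U));   // prints 1 2
end.
```

Each such call creates a new string, which is the overhead being pointed out when the rest of the program already works in UTF8.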

dbannon

  • Hero Member
  • *****
  • Posts: 2786
    • tomboy-ng, a rewrite of the classic Tomboy
Re: HTML to UTF8 encoding
« Reply #6 on: January 15, 2021, 11:45:03 am »
This works for me -

Code: Pascal
// Needs SysUtils (string helpers) and LazUTF8 (UnicodeToUTF8) in the uses clause.
// Replaces the first decimal reference found in St with its UTF8 character.
// Returns True if one was replaced, False when there are none left.
function TForm1.RemoveHTMLNumericCode(var St : string) : boolean;
var
    Target : integer = 1;
    Buff : string = '';
begin
    Result := False;                                    // In case the loop runs off the end
    repeat
        Target := Pos('&#', St, Target);               // that constant is a & and a #, the & is being escaped by renderer ....
        if Target = 0 then exit(False);                 // None left, lets get out of here.
        if Target + 2 > St.length then exit(false);     // No room for even one digit ....
        inc(Target, 2);
        if (St[Target] in ['0'..'9']) then begin        // Looks like we have one !
            while (Target <= St.Length) and (St[Target] in ['0'..'9']) do begin
                Buff := Buff + St[Target];
                delete(St, Target, 1);
            end;
            if (Target <= St.Length) and (St[Target] = ';') then
                delete(St, Target, 1);                  // The ';' is optional, and we may be at the end
            Dec(Target, 2);                             // Back to start of Entity
            delete(St, Target, 2);
            insert(UnicodeToUTF8(Buff.ToInteger), St, Target);
            exit(True);
        end;                                            // Oh, well, how sad, try again ?
    until Target > St.length;
end;

    // Usage, from inside a TForm1 method, St being a string variable
    St := 'some text &amp; and &#911 and &#946';
    Memo1.append('===== Starting with [' + St + ']');
    while RemoveHTMLNumericCode(St) do;
    Memo1.Append(St);

Thanks Folks !

Davo
« Last Edit: January 15, 2021, 11:47:55 am by dbannon »

marcov

  • Administrator
  • Hero Member
  • *
  • Posts: 11383
  • FPC developer.
Re: HTML to UTF8 encoding
« Reply #7 on: January 15, 2021, 12:13:34 pm »
> > function ResolveHTMLEntityReference in HTMLDefs unit
> Unfortunately this operates on WideStrings/WideChars, like the other HTML/XML tools of the FCL. This means unnecessary string conversions for the UTF8 used by Lazarus. I wonder why the FPC team focuses so much on WideStrings here.

I wonder why Lazarus uses UTF8 when FPC set up Delphi-compatible Unicode using WideStrings from the beginning :-)
« Last Edit: January 15, 2021, 12:22:52 pm by marcov »

BeniBela

  • Hero Member
  • *****
  • Posts: 905
    • homepage
Re: HTML to UTF8 encoding
« Reply #8 on: January 15, 2021, 12:20:55 pm »
I have one here: https://github.com/benibela/bbutils/blob/master/bbutils.pas#L5352

It is quite a bit faster and handles corner cases.

PascalDragon

  • Hero Member
  • *****
  • Posts: 5446
  • Compiler Developer
Re: HTML to UTF8 encoding
« Reply #9 on: January 15, 2021, 01:05:31 pm »
> > function ResolveHTMLEntityReference in HTMLDefs unit
> Unfortunately this operates on WideStrings/WideChars, like the other HTML/XML tools of the FCL. This means unnecessary string conversions for the UTF8 used by Lazarus. I wonder why the FPC team focuses so much on WideStrings here.

Because this code predates both the introduction of the codepage-aware AnsiString and the reference-counted UnicodeString.

 
