Author Topic: HTML to text  (Read 2788 times)

pcurtis

  • Hero Member
  • *****
  • Posts: 951
HTML to text
« on: May 20, 2022, 08:19:03 am »
Does anyone know how or have some snippet on how to remove all HTML tags from a file, but leave the text?

Thanks.
Windows 10 20H2
Laz 2.2.0
FPC 3.2.2

speter

  • Sr. Member
  • ****
  • Posts: 338
Re: HTML to text
« Reply #1 on: May 20, 2022, 08:50:00 am »
Included below is a quick and dirty version of what you want...

Code: Pascal
procedure clean(fn : string);
const
  magic = '!#%&';   // marks where each source line ended
var
  f : textfile;
  s,t : string;
  a,b : integer;
begin
  // read the whole file into s, appending the marker after every line
  assignfile(f,fn);
  reset(f);
  s := '';
  while not eof(f) do   // also safe for an empty file
    begin
      readln(f,t);
      s += t+' '+magic;
    end;
  closefile(f);

  // delete everything from each '<' up to the next '>'
  repeat
    a := pos('<',s);
    b := pos('>',s,a);
    if (a > 0) and (b > 0) then
      delete(s,a,b-a+1);
  until (a=0) or (b=0);   // b=0 means an unclosed '<': stop instead of looping forever

  // split at the markers and add one memo line per source line
  repeat
    a := pos(magic,s);
    if a > 0 then
      begin
        memo1.append(copy(s,1,a-1));
        delete(s,1,a+length(magic)-1);
      end;
  until (a=0);
end;

Note that this code preserves line-endings and things like tab characters in the original html file.

If you don't care about the line-endings you can leave that out by changing the line that appends the marker (s += t+' '+magic;) to
Code: Pascal
s += t+' ';
and the last loop to
Code: Pascal
memo1.append(s);
or similar. :)
« Last Edit: May 20, 2022, 08:58:49 am by speter »
I climbed mighty mountains, and saw that they were actually tiny foothills. :)

SymbolicFrank

  • Hero Member
  • *****
  • Posts: 1313
Re: HTML to text
« Reply #2 on: May 20, 2022, 09:11:04 am »
A good "magic" character sequence to use is #0. It's legal and won't be in the string.

wp

  • Hero Member
  • *****
  • Posts: 11830
Re: HTML to text
« Reply #3 on: May 20, 2022, 05:27:13 pm »
There's a ready-made function for this task in unit HTML2TextRender:
Code: Pascal
  function RenderHTML2Text(const AHTML: String): String;

pcurtis

Re: HTML to text
« Reply #4 on: May 20, 2022, 08:58:08 pm »
Thanks. How to use it?

wp

Re: HTML to text
« Reply #5 on: May 20, 2022, 09:44:28 pm »
I must apologize: this function is not yet included in Laz v2.2.2 or earlier. But the unit is self-contained and can also be used in older versions.

How to apply this function? Simply load the html file into a string and pass that to the RenderHTML2Text function. The returned string contains the text only.
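
The two steps just described (load the file into a string, pass it to RenderHTML2Text) can be sketched roughly as follows. This is only a sketch: it assumes the HTML2TextRender unit is on the unit search path (from a Lazarus version that ships it, or copied into the project), and HtmlFileToText is a made-up helper name, not part of that unit.

```pascal
program html2text_demo;
{$mode objfpc}{$H+}

uses
  Classes, SysUtils,
  HTML2TextRender;  // unit named in this thread; must be on the search path

// Hypothetical helper: loads a whole file into a string and
// returns the plain text produced by RenderHTML2Text.
function HtmlFileToText(const AFileName: string): string;
var
  sl: TStringList;
begin
  sl := TStringList.Create;
  try
    sl.LoadFromFile(AFileName);          // step 1: file -> string
    Result := RenderHTML2Text(sl.Text);  // step 2: string -> text only
  finally
    sl.Free;
  end;
end;

begin
  if ParamCount < 1 then
  begin
    WriteLn('usage: html2text_demo <file.html>');
    Halt(1);
  end;
  WriteLn(HtmlFileToText(ParamStr(1)));
end.
```

In a GUI project the same call would typically end with something like Memo1.Text := HtmlFileToText(FileName) instead of WriteLn.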

See attached demo.

Alternatively you can use the fasthtmlparser unit which comes with fpc. See my post at https://forum.lazarus.freepascal.org/index.php/topic,43090.msg301176.html#msg301176.

Jurassic Pork

  • Hero Member
  • *****
  • Posts: 1228
Re: HTML to text
« Reply #6 on: May 21, 2022, 01:20:29 pm »
hello,
pcurtis, do you have an example of an html file that you want to convert?
friendly, J.P
Jurassic computer : Sinclair ZX81 - Zilog Z80A à 3,25 MHz - RAM 1 Ko - ROM 8 Ko

pcurtis

Re: HTML to text
« Reply #7 on: May 22, 2022, 05:04:57 pm »
Thanks. Try this.

In a browser it looks like this (picture); of course the text is without formatting.
« Last Edit: May 22, 2022, 05:41:40 pm by pcurtis »

Jurassic Pork

Re: HTML to text
« Reply #8 on: May 22, 2022, 05:59:20 pm »
hello,
try wp's latest project. It seems to be OK with your html file.
Friendly, J.P

pcurtis

Re: HTML to text
« Reply #9 on: May 22, 2022, 06:12:33 pm »
More or less yes, but there are issues.

Each paragraph starts with a space
and there are too many blank lines.
« Last Edit: May 22, 2022, 06:19:58 pm by pcurtis »

wp

Re: HTML to text
« Reply #10 on: May 22, 2022, 11:00:03 pm »
I wonder which application writes such an ugly html file. Looks like MS Office from the identifiers...

Anyway, the empty lines are caused by the file content itself. Found pieces like this at several places:

Code: Text
<p class=MsoNormal><span style='font-family:"Arial","sans-serif";mso-fareast-font-family:
"Times New Roman"'><o:p>&nbsp;</o:p></span></p>

These are paragraphs containing a space, i.e. they look like empty lines.

The spaces added before each paragraph are caused by the RenderHTML2Text function, since it replaces line feeds and carriage returns with space characters, which IMHO is not always correct. If you don't like this, file a bug report; maybe Juha (who wrote this function) can have a look.
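
Until that is changed, both complaints (a leading space on each paragraph, too many blank lines) could be handled by post-processing the returned string. The following is only a sketch of one way to do that: TidyText is a made-up helper, not part of HTML2TextRender, and it assumes the text is UTF-8 so that a non-breaking space shows up as the two bytes $C2 $A0.

```pascal
program tidy_demo;
{$mode objfpc}{$H+}

uses
  Classes, SysUtils;

const
  NBSP_UTF8 = #$C2#$A0;  // UTF-8 bytes of U+00A0 (assumption: text is UTF-8)

// Hypothetical helper: trims every line and collapses runs of
// blank lines into a single blank line.
function TidyText(const AText: string): string;
var
  src, dst: TStringList;
  i: Integer;
  line: string;
  lastBlank: Boolean;
begin
  src := TStringList.Create;
  dst := TStringList.Create;
  try
    // treat non-breaking spaces like ordinary spaces
    src.Text := StringReplace(AText, NBSP_UTF8, ' ', [rfReplaceAll]);
    lastBlank := True;  // True at the start also drops leading blank lines
    for i := 0 to src.Count - 1 do
    begin
      line := Trim(src[i]);  // removes the space in front of each paragraph
      if line = '' then
      begin
        if not lastBlank then
          dst.Add('');       // keep only the first blank line of a run
        lastBlank := True;
      end
      else
      begin
        dst.Add(line);
        lastBlank := False;
      end;
    end;
    Result := dst.Text;
  finally
    dst.Free;
    src.Free;
  end;
end;

begin
  Write(TidyText(' First paragraph.' + LineEnding + LineEnding + ' ' +
    LineEnding + LineEnding + ' Second paragraph.' + LineEnding));
end.
```

The demo prints the two paragraphs without leading spaces and separated by exactly one blank line.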

maurobio

  • Hero Member
  • *****
  • Posts: 623
  • Ecology is everything.
    • GitHub
Re: HTML to text
« Reply #11 on: June 23, 2023, 01:20:51 pm »
Dear ALL,

Just for completeness, here are two functions that you may also find useful:

Code: Pascal
function StripHTML(S: string): string;
var
  TagBegin, TagEnd: integer;
begin
  TagBegin := Pos('<', S);                 // search position of first <

  while (TagBegin > 0) do begin            // while there is a < in S
    TagEnd := Pos('>', S, TagBegin + 1);   // find the matching >, searching only after the <
    if TagEnd = 0 then
      Break;                               // unclosed tag: stop instead of looping forever
    Delete(S, TagBegin, TagEnd - TagBegin + 1);  // delete the tag
    TagBegin := Pos('<', S, TagBegin);     // search for next <
  end;

  Result := S;                             // give the result
end;

Code: Pascal
function StripTags(const S: string): string;
var
  Len: Integer;

  function ReadUntil(const ReadFrom: Integer; const C: Char): Integer;
  var
    j: Integer;
  begin
    for j := ReadFrom to Len do
      if (s[j] = C) then
      begin
        Result := j;
        Exit;
      end;
    Result := Len+1;
  end;

var
  i, APos: Integer;
begin
  Len := Length(S);
  i := 0;
  Result := '';
  while (i <= Len) do
  begin
    Inc(i);
    APos := ReadUntil(i, '<');
    Result := Result + Copy(S, i, APos-i);
    i := ReadUntil(APos+1, '>');
  end;
end;
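
Both functions only remove tags, so character entities such as the &nbsp; in the sample quoted earlier in this thread remain in the result. A minimal sketch of decoding a handful of common entities afterwards is shown here. DecodeBasicEntities is a made-up name, and the list of entities is deliberately incomplete (an assumption that only these few matter).

```pascal
program entities_demo;
{$mode objfpc}{$H+}

uses
  SysUtils;

// Hypothetical helper: decodes a few common named entities.
// It is not a complete HTML entity decoder.
function DecodeBasicEntities(const S: string): string;
begin
  Result := StringReplace(S, '&nbsp;', ' ', [rfReplaceAll]);
  Result := StringReplace(Result, '&lt;', '<', [rfReplaceAll]);
  Result := StringReplace(Result, '&gt;', '>', [rfReplaceAll]);
  Result := StringReplace(Result, '&quot;', '"', [rfReplaceAll]);
  // '&amp;' is decoded last so that '&amp;lt;' becomes '&lt;' and not '<'
  Result := StringReplace(Result, '&amp;', '&', [rfReplaceAll]);
end;

begin
  WriteLn(DecodeBasicEntities('Tom &amp; Jerry&nbsp;&lt;3'));  // prints: Tom & Jerry <3
end.
```

It would be called on the output of StripHTML or StripTags, not on the raw HTML, because decoding &lt; first would create new "tags".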

Hope it helps!

With warmest regards,
UCSD Pascal / Burroughs 6700 / Master Control Program
Delphi 7.0 Personal Edition
Lazarus 2.0.12 - FPC 3.2.0 on GNU/Linux Mint 19.1, Lubuntu 18.04, Windows XP SP3, Windows 7 Professional, Windows 10 Home

Roland57

  • Sr. Member
  • ****
  • Posts: 416
    • msegui.net
Re: HTML to text
« Reply #12 on: June 23, 2023, 05:23:14 pm »
Quote from: wp
See attached demo.

Weird, but here (Linux, Lazarus 2.2.6), with the HTM sample provided by pcurtis, I get no text. With the sample included in your demo, it works fine.
My projects are on Gitlab and on Codeberg.

Roland57

Re: HTML to text
« Reply #13 on: June 23, 2023, 05:39:27 pm »
@pcurtis

For information, I converted your sample with pandoc (because I didn't get wp's demo to work for me):

$ pandoc -f html -t plain text2.htm -o text2.txt
[WARNING] text2.htm is not UTF-8 encoded: falling back to latin1.


After that, I opened text2.txt in a hex editor (see attached picture). We can see that the undesired lines are non-breaking spaces mixed with line endings.

I would use RegExpr to replace or delete the non-breaking spaces, like this:

Code: Pascal
s := ReplaceRegExpr('\xC2\xA0', s, '', FALSE);

Jurassic Pork

Re: HTML to text
« Reply #14 on: June 24, 2023, 01:21:46 am »
Hello Roland,
Quote from: Roland57
Weird, but here (Linux, Lazarus 2.2.6), with the HTM sample provided by pcurtis, I get no text. With the sample included in your demo, it works fine.
it is because the file is windows-1252 encoded :
Code: Text
<meta http-equiv=Content-Type content="text/html; charset=windows-1252">

If you re-encode the file as UTF-8, the conversion works (UTF-8 encoded file attached).
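
The re-encoding can also be done in code before the text is converted. The following is a sketch under assumptions: it uses CP1252ToUTF8 from the LConvEncoding unit of LazUtils (so LazUtils must be available to the project), it assumes the input really is windows-1252, and LoadCP1252AsUTF8 and SaveString are made-up helper names.

```pascal
program cp1252_demo;
{$mode objfpc}{$H+}

uses
  Classes, SysUtils,
  LConvEncoding;  // LazUtils unit providing CP1252ToUTF8

// Hypothetical helper: reads the raw bytes of a windows-1252 file
// and returns the content converted to UTF-8.
function LoadCP1252AsUTF8(const AFileName: string): string;
var
  ms: TMemoryStream;
  raw: string;
begin
  raw := '';
  ms := TMemoryStream.Create;
  try
    ms.LoadFromFile(AFileName);
    SetLength(raw, ms.Size);
    if ms.Size > 0 then
      ms.ReadBuffer(raw[1], ms.Size);  // copy the bytes unchanged
  finally
    ms.Free;
  end;
  Result := CP1252ToUTF8(raw);
end;

// Hypothetical helper: writes a string to a file byte for byte.
procedure SaveString(const AFileName, AData: string);
var
  fs: TFileStream;
begin
  fs := TFileStream.Create(AFileName, fmCreate);
  try
    if AData <> '' then
      fs.WriteBuffer(AData[1], Length(AData));
  finally
    fs.Free;
  end;
end;

begin
  if ParamCount < 2 then
  begin
    WriteLn('usage: cp1252_demo <input.htm> <output.htm>');
    Halt(1);
  end;
  SaveString(ParamStr(2), LoadCP1252AsUTF8(ParamStr(1)));
end.
```

The charset declared in the meta tag is not rewritten by this sketch; only the bytes of the file are converted.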

OK with Ubuntu 20.04  Lazarus 2.2.0

Friendly, J.P
