Author Topic: HTML to text  (Read 864 times)

pcurtis

  • Hero Member
  • *****
  • Posts: 939
HTML to text
« on: May 20, 2022, 08:19:03 am »
Does anyone know how or have some snippet on how to remove all HTML tags from a file, but leave the text?

Thanks.
Windows 10 20H2
Laz 2.2.0
FPC 3.2.2

speter

  • Sr. Member
  • ****
  • Posts: 276
Re: HTML to text
« Reply #1 on: May 20, 2022, 08:50:00 am »
Included below is a quick and dirty version of what you want...

Code: Pascal
  1. procedure clean(fn : string);
  2. const
  3.   magic = '!#%&';  // marks the end of each original line
  4. var
  5.   f : textfile;
  6.   s,t : string;
  7.   a,b : integer;
  8. begin
  9.   assignfile(f,fn);
  10.   reset(f);
  11.   s := '';
  12.   while not eof(f) do begin  // also copes with an empty file
  13.     readln(f,t);
  14.     s += t+' '+magic;
  15.   end;
  16.   closefile(f);
  17.  
  18.   repeat  // delete every <...> tag
  19.     a := pos('<',s);
  20.     b := pos('>',s,a);
  21.     if (a > 0) and (b > 0) then
  22.       delete(s,a,b-a+1);
  23.   until (a=0) or (b=0);  // b=0: unclosed '<', stop instead of looping forever
  24.  
  25.   repeat  // split at the markers, one memo line each
  26.     a := pos(magic,s);
  27.     if a > 0 then
  28.       begin
  29.         memo1.append(copy(s,1,a-1));
  30.         delete(s,1,a+3);
  31.       end;
  32.   until (a=0);
  33. end;

Note that this code preserves line-endings and things like tab characters in the original html file.

If you don't care about the line-endings you can leave that out by changing line #14 to
Code: Pascal
  s += t+' ';
and the last loop to
Code: Pascal
  memo1.append(s);
or similar. :)
« Last Edit: May 20, 2022, 08:58:49 am by speter »
I climbed mighty mountains, and saw that they were actually tiny foothills. :)

Laz 2.2.0 / FPC 3.2.2 / Windows 11 (64bit)

SymbolicFrank

  • Hero Member
  • *****
  • Posts: 968
Re: HTML to text
« Reply #2 on: May 20, 2022, 09:11:04 am »
A good "magic" character sequence to use is #0. It's legal and won't be in the string.
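
A minimal sketch of that idea, assuming the HTML itself contains no #0 characters. The names LineMark, StripTags and StripTagsKeepLines are made up for illustration; the tag removal is the same Pos/Delete approach as in the procedure above, only working on TStrings instead of a file and a memo.

```pascal
{$mode objfpc}{$H+}
uses
  Classes, SysUtils;

const
  LineMark = #0;  // assumption: this character never occurs in the HTML

// Removes everything from '<' to the next '>'.
// An unterminated '<' is left in place and ends the scan.
function StripTags(const S: string): string;
var
  a, b: SizeInt;
begin
  Result := S;
  a := Pos('<', Result);
  while a > 0 do
  begin
    b := Pos('>', Result, a);
    if b = 0 then
      Break;
    Delete(Result, a, b - a + 1);
    a := Pos('<', Result, a);  // continue where the tag was removed
  end;
end;

// Joins the Source lines with LineMark, strips the tags,
// then splits the result at the marks into Dest.
procedure StripTagsKeepLines(Source, Dest: TStrings);
var
  buf: string;
  i: Integer;
  p: SizeInt;
begin
  buf := '';
  for i := 0 to Source.Count - 1 do
    buf := buf + Source[i] + LineMark;
  buf := StripTags(buf);
  Dest.Clear;
  p := Pos(LineMark, buf);
  while p > 0 do
  begin
    Dest.Add(Copy(buf, 1, p - 1));
    Delete(buf, 1, p);
    p := Pos(LineMark, buf);
  end;
  if buf <> '' then
    Dest.Add(buf);  // remainder without a trailing mark
end;
```

With the HTML lines loaded into a TStringList, a call such as StripTagsKeepLines(HtmlLines, Memo1.Lines) would fill the memo. A tag that spans several lines swallows the marks inside it, so those lines end up merged into one.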

wp

  • Hero Member
  • *****
  • Posts: 9736
Re: HTML to text
« Reply #3 on: May 20, 2022, 05:27:13 pm »
There's a ready-made function for this task in unit HTML2TextRender:
Code: Pascal
  function RenderHTML2Text(const AHTML: String): String;
Mainly Lazarus trunk / fpc 3.2.0 / all 32-bit on Win-10, but many more...

pcurtis

  • Hero Member
  • *****
  • Posts: 939
Re: HTML to text
« Reply #4 on: May 20, 2022, 08:58:08 pm »
Thanks. How do I use it?

wp

  • Hero Member
  • *****
  • Posts: 9736
Re: HTML to text
« Reply #5 on: May 20, 2022, 09:44:28 pm »
I must apologize: this function is not yet included in Laz v2.2.2 or earlier. However, the unit is self-contained and can also be used in older versions.

How to apply this function? Simply load the html file into a string and pass that to the RenderHTML2Text function. The returned string contains the text only.
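
A minimal sketch of the loading step, assuming the file fits in memory and needs no encoding conversion. LoadFileToString and the file name page.html are hypothetical; the RenderHTML2Text call appears only as a comment because it needs the HTML2TextRender unit in the uses clause.

```pascal
{$mode objfpc}{$H+}
uses
  Classes, SysUtils;

// Hypothetical helper: reads a whole file into a string, byte for byte.
function LoadFileToString(const FileName: string): string;
var
  fs: TFileStream;
begin
  Result := '';
  fs := TFileStream.Create(FileName, fmOpenRead or fmShareDenyWrite);
  try
    SetLength(Result, fs.Size);
    if fs.Size > 0 then
      fs.ReadBuffer(Result[1], fs.Size);
  finally
    fs.Free;
  end;
end;

// With HTML2TextRender added to the uses clause, the conversion
// would then be a single call (file name is only an example):
//   PlainText := RenderHTML2Text(LoadFileToString('page.html'));
```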

See attached demo.

Alternatively you can use the fasthtmlparser unit which comes with fpc. See my post at https://forum.lazarus.freepascal.org/index.php/topic,43090.msg301176.html#msg301176.
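
A rough sketch of the fasthtmlparser route, assuming the THTMLParser class of FPC's FastHTMLParser unit with its OnFoundTag/OnFoundText events and Exec method as shipped with FPC 3.2.x. TTextCollector and HtmlToTextFast are made-up names. The sketch only concatenates the text pieces the parser reports and makes no attempt to decode entities such as &nbsp; or to skip script and style content.

```pascal
{$mode objfpc}{$H+}
uses
  SysUtils, FastHTMLParser;

type
  // Hypothetical helper that receives the parser callbacks.
  TTextCollector = class
  public
    Text: string;
    procedure FoundTag(NoCaseTag, ActualTag: string);
    procedure FoundText(AText: string);
  end;

procedure TTextCollector.FoundTag(NoCaseTag, ActualTag: string);
begin
  // Tags are ignored; only the text between them is kept.
end;

procedure TTextCollector.FoundText(AText: string);
begin
  Text := Text + AText;
end;

// Runs the parser over AHtml and returns the collected text.
function HtmlToTextFast(const AHtml: string): string;
var
  Parser: THTMLParser;
  Collector: TTextCollector;
begin
  Collector := TTextCollector.Create;
  Parser := THTMLParser.Create(AHtml);
  try
    Parser.OnFoundTag := @Collector.FoundTag;
    Parser.OnFoundText := @Collector.FoundText;
    Parser.Exec;
    Result := Collector.Text;
  finally
    Parser.Free;
    Collector.Free;
  end;
end;
```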

Jurassic Pork

  • Hero Member
  • *****
  • Posts: 1118
Re: HTML to text
« Reply #6 on: May 21, 2022, 01:20:29 pm »
Hello,
pcurtis, do you have an example of an HTML file that you want to convert?
Friendly, J.P
Jurassic computer : Sinclair ZX81 - Zilog Z80A at 3.25 MHz - RAM 1 KB - ROM 8 KB

pcurtis

  • Hero Member
  • *****
  • Posts: 939
Re: HTML to text
« Reply #7 on: May 22, 2022, 05:04:57 pm »
Thanks. Try this.

In the browser it looks like this (picture); of course, the text is without formatting.
« Last Edit: May 22, 2022, 05:41:40 pm by pcurtis »

Jurassic Pork

  • Hero Member
  • *****
  • Posts: 1118
Re: HTML to text
« Reply #8 on: May 22, 2022, 05:59:20 pm »
Hello,
try wp's latest project. It seems to work with your HTML file.
Friendly, J.P

pcurtis

  • Hero Member
  • *****
  • Posts: 939
Re: HTML to text
« Reply #9 on: May 22, 2022, 06:12:33 pm »
More or less, yes, but there are issues.

Each paragraph starts with a space
and there are too many blank lines.
« Last Edit: May 22, 2022, 06:19:58 pm by pcurtis »

wp

  • Hero Member
  • *****
  • Posts: 9736
Re: HTML to text
« Reply #10 on: May 22, 2022, 11:00:03 pm »
I wonder which application writes such an ugly html file. Looks like MS Office from the identifiers...

Anyway, the empty lines are caused by the file content itself. Found pieces like this at several places:

Code: Text
  <p class=MsoNormal><span style='font-family:"Arial","sans-serif";mso-fareast-font-family:
  "Times New Roman"'><o:p>&nbsp;</o:p></span></p>

These are paragraphs containing only a non-breaking space (&nbsp;), so they look like empty lines.

The spaces added before each paragraph are caused by the RenderHTML2Text function, since it replaces line feeds and carriage returns with space characters, which IMHO is not always correct. If you don't like this, file a bug report; maybe Juha (who wrote this function) can have a look.
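
Until that is settled, the two complaints can be worked around on the returned string. The following is a minimal sketch with a made-up function name, TidyText. It assumes the unwanted blanks are ordinary whitespace; if the renderer emits a real non-breaking space character for &nbsp;, Trim will not remove it and the sketch would need extending.

```pascal
{$mode objfpc}{$H+}
uses
  Classes, SysUtils;

// Hypothetical post-processing: trims every line and collapses
// runs of blank lines into a single blank line.
function TidyText(const AText: string): string;
var
  Lines: TStringList;
  i: Integer;
  Line: string;
  PrevBlank: Boolean;
begin
  Result := '';
  Lines := TStringList.Create;
  try
    Lines.Text := AText;
    PrevBlank := True;  // also drops blank lines at the very start
    for i := 0 to Lines.Count - 1 do
    begin
      Line := Trim(Lines[i]);
      if Line = '' then
      begin
        if not PrevBlank then
          Result := Result + LineEnding;  // keep one separator line
        PrevBlank := True;
      end
      else
      begin
        Result := Result + Line + LineEnding;
        PrevBlank := False;
      end;
    end;
  finally
    Lines.Free;
  end;
end;
```

It would be applied to the output of the conversion, e.g. Memo1.Text := TidyText(PlainText), where PlainText holds the string returned by RenderHTML2Text.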