HTML to text

pcurtis:
Does anyone know how to remove all HTML tags from a file but leave the text, or have a snippet that does it?

Thanks.

speter:
Included below is a quick and dirty version of what you want...


--- Code: Pascal ---
procedure clean(fn : string);
const
  magic = '!#%&';          // marks where each original line ended
var
  f : textfile;
  s,t : string;
  a,b : integer;
begin
  assignfile(f,fn);
  reset(f);
  s := '';
  repeat
    readln(f,t);
    s += t+' '+magic;
  until eof(f);
  closefile(f);

  // delete every <...> span
  repeat
    a := pos('<',s);
    b := pos('>',s,a);
    if (a > 0) and (b > 0) then
      delete(s,a,b-a+1);
  until (a=0) or (b=0);    // b=0 means an unclosed '<': stop rather than loop forever

  // split on the marker again, one memo line per original line
  repeat
    a := pos(magic,s);
    if a > 0 then
      begin
        memo1.append(copy(s,1,a-1));
        delete(s,1,a+3);   // a-1 characters of text plus the 4-character marker
      end;
  until (a=0);
end;
--- End code ---
Note that this code preserves line-endings and things like tab characters in the original html file.

If you don't care about the line endings, you can leave that out by changing line #14 (s += t+' '+magic;) to

--- Code: Pascal ---
s += t+' ';
--- End code ---
and the last loop to

--- Code: Pascal ---
memo1.append(s);
--- End code ---
or similar. :)

SymbolicFrank:
A good "magic" character sequence to use is #0. It's legal and won't be in the string.
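
To illustrate the #0 suggestion, here is a minimal console sketch of the same strip-then-split idea with a one-character NUL marker. The names StripTags and LineMark and the sample input are made up for the example, and PosEx from StrUtils stands in for the three-argument pos.

```pascal
program NulMarkerDemo;
{$mode objfpc}{$H+}

uses
  StrUtils;  // PosEx

const
  LineMark = #0;  // NUL: valid inside a Pascal string, not expected in HTML text

// Deletes every '<' ... '>' span; stops at an unclosed '<'.
function StripTags(const Html: string): string;
var
  a, b: SizeInt;
begin
  Result := Html;
  repeat
    a := Pos('<', Result);
    if a = 0 then
      Break;
    b := PosEx('>', Result, a);
    if b = 0 then
      Break;
    Delete(Result, a, b - a + 1);
  until False;
end;

var
  s: string;
  p: SizeInt;
begin
  // Hypothetical input: two lines, each followed by the marker.
  s := '<p>Hello <b>world</b></p>' + LineMark +
       '<p>second line</p>' + LineMark;
  s := StripTags(s);

  // Split on the marker again, one output line per original line.
  repeat
    p := Pos(LineMark, s);
    if p > 0 then
    begin
      WriteLn(Copy(s, 1, p - 1));
      Delete(s, 1, p);  // the marker is a single character
    end;
  until p = 0;
end.
```

With this input it prints "Hello world" and "second line". Because the marker is one character, the split only has to delete p characters, so the a+3 arithmetic tied to a four-character marker goes away.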

wp:
There's a ready-made function for this task in unit HTML2TextRender:

--- Code: Pascal ---
  function RenderHTML2Text(const AHTML: String): String;
--- End code ---

pcurtis:
Thanks. How do I use it?
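
A minimal usage sketch, going only by the signature quoted above: read the file into a string and pass it to the function. It assumes the project can see unit HTML2TextRender (in Lazarus that normally means adding the LazUtils package as a requirement). The file name page.html is a placeholder.

```pascal
program Html2TextDemo;
{$mode objfpc}{$H+}

uses
  Classes, SysUtils,
  HTML2TextRender;  // assumed to be on the unit path (LazUtils)

var
  Source: TStringList;
begin
  Source := TStringList.Create;
  try
    // 'page.html' is a placeholder; use your own file name.
    Source.LoadFromFile('page.html');
    // Signature as quoted in the thread: HTML string in, plain text out.
    WriteLn(RenderHTML2Text(Source.Text));
  finally
    Source.Free;
  end;
end.
```

In a form-based application the same call could fill a memo instead, for example Memo1.Text := RenderHTML2Text(Source.Text);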
