Author Topic: Component for stripping HTML?  (Read 3758 times)

HatForCat

Component for stripping HTML?
« on: April 21, 2017, 06:07:06 pm »
Subject says it all.

I just need to strip all the HTML from incoming data. Most incoming data is plain text, but sometimes I get an HTML page. I have no control over what comes in, so I'd like to strip out all the HTML and see only the text parts.

Short of including the "HtmlViewer-11.7" from GitHub, I have not yet found anything. I installed that HtmlViewer but it is too complicated for my simple needs. I do not want to spend a few days figuring it out just so I can strip out the HTML. :)

Helpful thoughts?

Thanks

Acer-i5, 2.6GHz, 6GB, 500GB-SSD, Mint-19.3, Cinnamon Desktop, Lazarus 2.0.6, SQLite3

Bart

Re: Component for stripping HTML?
« Reply #1 on: April 21, 2017, 06:12:57 pm »
Simple approach: accept data until '<', drop data until '>'. Repeat until end of data.
You'll end up with plain text.
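
A minimal sketch of that scan in Free Pascal, in case it helps. `StripTags` is a hypothetical helper name, not something from the RTL. It only drops whatever sits between '<' and '>', so it does not decode entities such as `&amp;`, and the contents of script or style blocks will still come through as text.

```pascal
program StripTagsDemo;

{$mode objfpc}{$H+}

// Hypothetical helper: copies characters outside <...> and drops the rest.
function StripTags(const S: string): string;
var
  i, Len: Integer;
  InTag: Boolean;
begin
  SetLength(Result, Length(S));   // output can never be longer than input
  Len := 0;
  InTag := False;
  for i := 1 to Length(S) do
    if InTag then
    begin
      if S[i] = '>' then
        InTag := False;           // tag ends, resume copying
    end
    else if S[i] = '<' then
      InTag := True               // tag starts, stop copying
    else
    begin
      Inc(Len);
      Result[Len] := S[i];
    end;
  SetLength(Result, Len);         // trim to the characters actually kept
end;

begin
  WriteLn(StripTags('<p>Hello <b>world</b>!</p>'));  // prints: Hello world!
end.
```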

You could also use fasthtmlparser (it comes with fpc) to parse the text and use its OnFoundTag and OnFoundText events to filter at will.

Bart

wp

Re: Component for stripping HTML?
« Reply #2 on: April 21, 2017, 06:13:44 pm »
The fasthtmlparser unit from fpc (packages\chm\src) will do the job. Write a handler for its OnFoundText event, which fires whenever text is found between tags.
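
A rough sketch of such a handler, under some assumptions. `TTextCollector` and its members are hypothetical names for illustration. The event signatures, and the assumption that `NoCaseTag` holds the uppercased tag including the leading '<', are from memory of the fasthtmlparser unit, so check them against the unit source in your FPC version. The tag handler is used here to skip text inside script and style blocks.

```pascal
program StripWithParser;

{$mode objfpc}{$H+}

uses
  fasthtmlparser;

type
  // Hypothetical collector; only THTMLParser and its events come from FPC.
  TTextCollector = class
  private
    FSkip: Boolean;   // True while inside <script> or <style>
  public
    Text: string;
    procedure FoundTag(NoCaseTag, ActualTag: string);
    procedure FoundText(AText: string);
  end;

procedure TTextCollector.FoundTag(NoCaseTag, ActualTag: string);
begin
  // Assumption: NoCaseTag is the uppercased tag, including the leading '<'.
  if (Pos('<SCRIPT', NoCaseTag) = 1) or (Pos('<STYLE', NoCaseTag) = 1) then
    FSkip := True
  else if (Pos('</SCRIPT', NoCaseTag) = 1) or (Pos('</STYLE', NoCaseTag) = 1) then
    FSkip := False;
end;

procedure TTextCollector.FoundText(AText: string);
begin
  // Keep text only when we are not inside a script or style block.
  if not FSkip then
    Text := Text + AText;
end;

var
  Html: string;
  Parser: THTMLParser;
  Collector: TTextCollector;
begin
  Html := '<p>Hi<script>// remark</script> there</p>';
  Collector := TTextCollector.Create;
  Parser := THTMLParser.Create(Html);
  try
    Parser.OnFoundTag := @Collector.FoundTag;
    Parser.OnFoundText := @Collector.FoundText;
    Parser.Exec;                 // walks the input and fires the events
    WriteLn(Collector.Text);
  finally
    Parser.Free;
    Collector.Free;
  end;
end.
```

The parser only looks for '<' and '>', so script code that itself contains those characters may still confuse it.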

I once posted sample code for this task on the forum; use the forum search to find it.

HatForCat

Re: Component for stripping HTML?
« Reply #3 on: April 22, 2017, 12:23:58 am »
Thanks, but I am also getting some other stuff in some of the HTML data. It shows C++-style remarks (comments) and the like. Probably F# or worse.

I have had to mess with the HtmlViewer from GitHub.


wp

Re: Component for stripping HTML?
« Reply #4 on: April 22, 2017, 12:26:15 am »
Can you zip one of these HTML files and upload it here so we can have a look?

mgear

Re: Component for stripping HTML?
« Reply #5 on: April 26, 2017, 01:50:42 am »
I am working on a similar task. It is currently written in Perl with the DOM/CSS parser Mojo::DOM. It works great, but I need an embedded browser with automated interaction, so I am moving to Pascal. There is a powerful DOM/XPath/XQuery/CSS parser available for Pascal. I haven't tried it yet, but it looks very promising.

The library is called Internet Tools.

For example:
Quote
At the lowest level you find the parseHTML function of the unit simplehtmlparser.

It just splits a html document into tags and text elements and calls a callback function for each of the elements.

« Last Edit: April 26, 2017, 01:52:54 am by mgear »
