
Author Topic: Component for stripping HTML?  (Read 546 times)

HatForCat

  • Sr. Member
  • ****
  • Posts: 260
Component for stripping HTML?
« on: April 21, 2017, 06:07:06 pm »
Subject says it all.

I just need to strip all the HTML from incoming data. Most of the incoming data is plain text, but sometimes I get an HTML page. I have no control over what is coming in, so I'd like to strip out all the HTML and see only the text parts.

Short of including the "HtmlViewer-11.7" from GitHub, I have not yet found anything. I installed that HtmlViewer but it is too complicated for my simple needs. I do not want to spend a few days figuring it out just so I can strip out the HTML. :)

Helpful thoughts?

Thanks

Acer-i5, 2.6GHz, 6GB, 256-SSD, Ubuntu 14.04-LTS, Unity Desktop, Lazarus 1.6.2, SQLite3 -- Retired: Programming for my own use for Ubuntu.

Bart

  • Hero Member
  • *****
  • Posts: 2654
    • Bart en Mariska's Webstek
Re: Component for stripping HTML?
« Reply #1 on: April 21, 2017, 06:12:57 pm »
Simple approach: accept data until '<', drop data until '>'. Repeat until end of data.
You'll end up with plain text.
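
A minimal sketch of that simple approach in Free Pascal, using only the standard string routines. The function name StripTags is just an illustrative choice, not something from a library:

```pascal
program StripTagsDemo;

{$mode objfpc}{$H+}

// Copy characters while outside a tag; skip everything from '<' up to
// and including the next '>'. StripTags is a hypothetical helper name.
function StripTags(const S: string): string;
var
  i, n: Integer;
  InTag: Boolean;
begin
  SetLength(Result, Length(S));  // the output is never longer than the input
  n := 0;
  InTag := False;
  for i := 1 to Length(S) do
    if InTag then
    begin
      if S[i] = '>' then
        InTag := False;          // end of tag: start accepting data again
    end
    else if S[i] = '<' then
      InTag := True              // start of tag: drop data until '>'
    else
    begin
      Inc(n);
      Result[n] := S[i];
    end;
  SetLength(Result, n);          // trim to the number of characters kept
end;

begin
  WriteLn(StripTags('<p>Hello <b>world</b>!</p>'));  // prints: Hello world!
end.
```

Note that this keeps whatever sits between tags, so the contents of script and style elements come through as text, entities such as &amp; are not decoded, and a '>' inside an attribute value ends the tag early.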

You could also use fasthtmlparser (it comes with fpc) to parse the text and use its OnFoundTag and OnFoundText events to filter at will.

Bart

wp

  • Hero Member
  • *****
  • Posts: 3444
Re: Component for stripping HTML?
« Reply #2 on: April 21, 2017, 06:13:44 pm »
The fasthtmlparser from fpc packages\chm\src will do the job. Write a handler for its OnFoundText event, which fires whenever the parser finds text between tags in the html.
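
A rough sketch of how such a handler might look, assuming the THTMLParser class from the fasthtmlparser unit with its Create(string) constructor, OnFoundText event and Exec method. TTextCollector and HtmlToText are hypothetical names made up for this example:

```pascal
program StripWithParser;

{$mode objfpc}{$H+}

uses
  fasthtmlparser;  // from the FPC chm package (packages/chm/src)

type
  // Hypothetical helper class: the parser's events are "of object",
  // so the handler has to be a method rather than a plain procedure.
  TTextCollector = class
  public
    Text: string;
    procedure FoundText(AText: string);
  end;

procedure TTextCollector.FoundText(AText: string);
begin
  // Called by the parser for every piece of text found between tags.
  Text := Text + AText;
end;

// Hypothetical wrapper: run the parser over Html and return only the text.
function HtmlToText(const Html: string): string;
var
  Parser: THTMLParser;
  Collector: TTextCollector;
begin
  Collector := TTextCollector.Create;
  try
    Parser := THTMLParser.Create(Html);
    try
      Parser.OnFoundText := @Collector.FoundText;
      Parser.Exec;               // walks the input and fires the events
    finally
      Parser.Free;
    end;
    Result := Collector.Text;
  finally
    Collector.Free;
  end;
end;

begin
  WriteLn(HtmlToText('<p>Hello <b>world</b>!</p>'));
end.
```

If some tags need special treatment, an OnFoundTag handler can be assigned in the same way before calling Exec.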

I once posted sample code for your task here; use the forum search to find it.
Lazarus trunk / fpc 3.0.0 / Win32

HatForCat

  • Sr. Member
  • ****
  • Posts: 260
Re: Component for stripping HTML?
« Reply #3 on: April 22, 2017, 12:23:58 am »
Thanks, but I am also getting some other stuff in some of the HTML data. It is showing C++-style remarks (comments) etc. Probably F# or worse.

I have had to mess with the HtmlViewer from GitHub.


wp

  • Hero Member
  • *****
  • Posts: 3444
Re: Component for stripping HTML?
« Reply #4 on: April 22, 2017, 12:26:15 am »
Can you zip one of these html files and upload it here so we can have a look?

mgear

  • Newbie
  • Posts: 4
Re: Component for stripping HTML?
« Reply #5 on: April 26, 2017, 01:50:42 am »
I am working on a similar task right now. At the moment it is written in Perl with the DOM/CSS parser Mojo::DOM. It works great, but I need an embedded browser with automated interaction, so I am moving to Pascal. There is a mighty DOM/XPath/XQuery/CSS parser available for it. I haven't tried it yet, but it looks very promising.

It is called Internet Tools and can be found here

For example:
Quote
At the lowest level you find the parseHTML function of the unit simplehtmlparser.

It just splits a html document into tags and text elements and calls a callback function for each of the elements.

« Last Edit: April 26, 2017, 01:52:54 am by mgear »

 
