@franzala
As you have been told, what you want to achieve isn't simple. But if you're already good at string manipulation and data storage in Pascal, you may be interested in trying a very simple demo posted by @wp:
https://forum.lazarus.freepascal.org/index.php/topic,35980.msg239199.html#msg239199
I followed the demo and later created my own web forum spam detection program. But because of my limited knowledge, it is only half-finished.
First, using the demo, I wrote a very simple web browser. It ran extremely slowly, so I added my own image caching routine. That was much better, but unfortunately I can't render CSS. By analyzing the website I wanted to scrape, I was able to make the program follow the links between pages automatically. But because the website has an enormous number of pages, I had to define a page range to scrape on each run. I also had to slow it down by adding a delay between pages, because some websites, or their web hosts' firewalls, will block visitors (bots) that request too many pages within a certain period of time.
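To give an idea of the "page range plus delay" part, here is a minimal sketch. It is not my actual program: BaseUrl, FirstPage, LastPage and DelayMs are made-up values, and I assume the pages can be fetched with TFPHTTPClient from the fphttpclient unit that ships with FPC. For https URLs on FPC 3.2 you would probably also need to add opensslsockets to the uses clause.

```pascal
program ThrottledFetch;

{$mode objfpc}{$H+}

uses
  SysUtils, fphttpclient;

const
  // Hypothetical values, adjust them to the site you are reading.
  BaseUrl   = 'http://example.com/index.php?topic=';
  FirstPage = 100;   // first page of the range for this run
  LastPage  = 105;   // last page of the range for this run
  DelayMs   = 3000;  // pause between two requests, in milliseconds

var
  PageNo: Integer;
  Html: String;
begin
  for PageNo := FirstPage to LastPage do
  begin
    try
      // Download one page as a string.
      Html := TFPHTTPClient.SimpleGet(BaseUrl + IntToStr(PageNo));
      WriteLn(PageNo, ': ', Length(Html), ' bytes');
    except
      on E: Exception do
        WriteLn(PageNo, ': failed (', E.Message, ')');
    end;
    // Wait before the next request so the server does not see a burst.
    Sleep(DelayMs);
  end;
end.
```

A fixed delay is the simplest choice. How long it needs to be depends on the site, so treat the 3 seconds as a placeholder.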
The fetched data is stored in a dbf file. I chose dbf because it needs zero installation and is capable enough for what I want to do. On each run, the program downloads only a certain range of pages and updates any pages that were already stored. Then I wrote some code to find all the outbound links in the stored data and compare them with my own blacklist and whitelist keywords. Finally, it shows a report of the pages that contain suspicious spam links.
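The link checking can be sketched with plain string functions. This is only an illustration, not my real code: SampleHtml, the two keyword lists and the helper names ExtractLinks and ContainsAny are invented for the example. It only looks for href="..." written with double quotes, and it does not separate outbound links from internal ones.

```pascal
program LinkCheck;

{$mode objfpc}{$H+}

uses
  SysUtils, Classes, StrUtils;

const
  // Hypothetical sample data and keyword lists.
  SampleHtml = '<a href="http://wiki.freepascal.org/">wiki</a> ' +
               '<a href="http://casino-win.example/">win</a> ' +
               '<a href="http://somewhere.example/">other</a>';
  Whitelist: array[0..1] of String = ('freepascal.org', 'wikipedia.org');
  Blacklist: array[0..1] of String = ('casino', 'cheap-pills');

// Collect the values of all href="..." attributes found in Html.
procedure ExtractLinks(const Html: String; Links: TStrings);
const
  Marker = 'href="';
var
  Lower: String;
  StartPos, EndPos: Integer;
begin
  Lower := LowerCase(Html);  // same length as Html, so positions match
  StartPos := PosEx(Marker, Lower, 1);
  while StartPos > 0 do
  begin
    Inc(StartPos, Length(Marker));         // first character of the URL
    EndPos := PosEx('"', Html, StartPos);  // closing quote
    if EndPos = 0 then
      Break;
    Links.Add(Copy(Html, StartPos, EndPos - StartPos));
    StartPos := PosEx(Marker, Lower, EndPos + 1);
  end;
end;

// True if S contains at least one of the (lowercase) keywords.
function ContainsAny(const S: String; const Keywords: array of String): Boolean;
var
  i: Integer;
begin
  Result := False;
  for i := Low(Keywords) to High(Keywords) do
    if Pos(Keywords[i], LowerCase(S)) > 0 then
      Exit(True);
end;

var
  Links: TStringList;
  i: Integer;
begin
  Links := TStringList.Create;
  try
    ExtractLinks(SampleHtml, Links);
    for i := 0 to Links.Count - 1 do
      if ContainsAny(Links[i], Whitelist) then
        WriteLn('ok       ', Links[i])
      else if ContainsAny(Links[i], Blacklist) then
        WriteLn('SUSPECT  ', Links[i])
      else
        WriteLn('unknown  ', Links[i]);
  finally
    Links.Free;
  end;
end.
```

Checking the whitelist first means a trusted domain is never reported, even if its URL happens to contain a blacklisted word. Links that match neither list are shown as unknown so they can be reviewed by hand.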
It sounds simple, just two paragraphs to explain it. But it actually took me months to write, and it became too hard for me, so I took a break.
The program I've been writing is for detecting spam on this forum.
I do not have a license for the data, but it's only for personal use, so I think it is okay.