Author Topic: logon to a website using fphttpclient  (Read 10535 times)

Nitorami

  • Hero Member
  • *****
  • Posts: 605
Re: logon to a website using fphttpclient
« Reply #45 on: July 30, 2025, 06:54:10 pm »
I'm back, been offline for a few days.

Now with FPC 3.2.4 and rvk's latest code example, it works. I can batch download and store forum pages into local html files. Brilliant!

What remains to do for me - a single html page is around 200k in size, 99% of which is page layout, formatting, links and stuff I don't need. I want to get rid of that and extract the relevant information as plain text. Probably will write a simple filter program to do this. Or is there an HTML processing unit to assist?

Anyway, thanks a lot for the support ! Frankly I had not hoped to get that far.

rvk

  • Hero Member
  • *****
  • Posts: 6991
Re: logon to a website using fphttpclient
« Reply #46 on: July 30, 2025, 08:12:25 pm »
What remains to do for me - a single html page is around 200k in size, 99% of which is page layout, formatting, links and stuff I don't need. I want to get rid of that and extract the relevant information as plain text. Probably will write a simple filter program to do this. Or is there an HTML processing unit to assist?
For new posts on the forum (and subforums) you can use /external?type=xml or rss/rss2.
For example:
https://forum.vbulletin.com/external?type=xml
https://forum.vbulletin.com/external?type=xml&nodeid=40
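A minimal sketch of reading such a feed with fphttpclient and the fcl-xml DOM units. It assumes the `type=rss2` variant and the usual RSS 2.0 layout (`item` elements that each contain a `title`); the actual feed may be structured differently, and the helper name `ItemTitles` is made up for this example. On a Cloudflare-protected forum the request may be answered with a challenge page instead of XML, in which case `ReadXMLFile` raises an exception.

```pascal
program FeedTitles;
{$mode objfpc}{$H+}

uses
  Classes, SysUtils, DOM, XMLRead,
  fphttpclient, opensslsockets; // opensslsockets provides https support in FPC 3.2+

const
  // Assumption: the rss2 variant of the feed URL mentioned above.
  FeedUrl = 'https://forum.vbulletin.com/external?type=rss2';

// Hypothetical helper: returns the <title> of every <item>, one per line.
function ItemTitles(const Xml: String): String;
var
  Src: TStringStream;
  Doc: TXMLDocument;
  Items: TDOMNodeList;
  TitleNode: TDOMNode;
  Titles: TStringList;
  i: Integer;
begin
  Doc := nil;
  Items := nil;
  Src := TStringStream.Create(Xml);
  Titles := TStringList.Create;
  try
    ReadXMLFile(Doc, Src);
    Items := Doc.GetElementsByTagName('item');
    for i := 0 to Integer(Items.Count) - 1 do
    begin
      // FindNode only looks at direct children, so the channel title is skipped.
      TitleNode := Items[i].FindNode('title');
      if TitleNode <> nil then
        Titles.Add(UTF8Encode(TitleNode.TextContent));
    end;
    Result := Titles.Text;
  finally
    Items.Free; // the list returned by GetElementsByTagName is freed by the caller
    Doc.Free;
    Titles.Free;
    Src.Free;
  end;
end;

begin
  Write(ItemTitles(TFPHTTPClient.SimpleGet(FeedUrl)));
end.
```

At runtime the OpenSSL libraries have to be available for the https request to work.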

Unfortunately you can't do the same with the topics themselves. So yes, you need to scrape them yourself (including next pages etc).
Sometimes there is a plugin for "clean print" (it was installed by default in vBulletin up to version 3 but is not standard in later versions).

You can parse the HTML file the usual way in FPC. (search for html dom here on the forum).
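As a rough sketch of what that could look like with the SAX_HTML and DOM_HTML units from fcl-xml: load the saved page into a DOM tree, walk it, and keep only the text nodes. The names `CollectText` and `HtmlToText` are invented for this example, and real forum pages may contain markup the parser does not handle well, so treat this as a starting point rather than a finished filter.

```pascal
program HtmlText;
{$mode objfpc}{$H+}

uses
  Classes, SysUtils, DOM, DOM_HTML, SAX_HTML;

// Recursively collects the non-empty text nodes below Node.
procedure CollectText(Node: TDOMNode; Lines: TStrings);
var
  Child: TDOMNode;
  S: String;
begin
  if Node.NodeType = TEXT_NODE then
  begin
    S := Trim(UTF8Encode(Node.NodeValue));
    if S <> '' then
      Lines.Add(S);
    Exit;
  end;
  // Do not descend into script and style elements.
  if (Node.NodeType = ELEMENT_NODE) and
     (SameText(String(Node.NodeName), 'script') or
      SameText(String(Node.NodeName), 'style')) then
    Exit;
  Child := Node.FirstChild;
  while Child <> nil do
  begin
    CollectText(Child, Lines);
    Child := Child.NextSibling;
  end;
end;

// Hypothetical helper: parses an HTML string and returns its text, one node per line.
function HtmlToText(const Html: String): String;
var
  Src: TStringStream;
  Doc: THTMLDocument;
  Lines: TStringList;
begin
  Doc := nil;
  Src := TStringStream.Create(Html);
  Lines := TStringList.Create;
  try
    ReadHTMLFile(Doc, Src);
    CollectText(Doc, Lines);
    Result := Lines.Text;
  finally
    Lines.Free;
    Doc.Free;
    Src.Free;
  end;
end;

var
  Page: TStringList;
begin
  if ParamCount < 1 then
  begin
    WriteLn('Usage: htmltext <file.html>');
    Halt(1);
  end;
  Page := TStringList.Create;
  try
    Page.LoadFromFile(ParamStr(1));
    Write(HtmlToText(Page.Text));
  finally
    Page.Free;
  end;
end.
```

This still returns menu and footer text. To keep only the posts, `CollectText` could be started from the element that wraps a post instead of from the document root; which element that is depends on the forum's markup.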

Nitorami

  • Hero Member
  • *****
  • Posts: 605
Re: logon to a website using fphttpclient
« Reply #47 on: August 02, 2025, 10:49:10 am »
Now the following happened:

I scraped a few pages - not a lot really, I am very careful not to trigger any alarms.

When I then tried to visit the winamp forum via firefox, I got a cloudflare "confirm you are human" captcha for the first time ever.

Well, I had not changed the useragent ID in my program, so I thought cloudflare may have detected two different IDs from the same IP. So I asked google for my browser's ID, and copied that into the code:

Code: Pascal
  'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:128.0) Gecko/20100101 Firefox/128.0'

But surprisingly, that gives me error 403 consistently. But it works with rvk's ID. What could be the reason for that ?

EDIT: I just noted that now I am getting this captcha each time I try to login to the forum from my browser... it seems that my browser's ID is on cloudflare's list
« Last Edit: August 02, 2025, 12:05:06 pm by Nitorami »

rvk

  • Hero Member
  • *****
  • Posts: 6991
Re: logon to a website using fphttpclient
« Reply #48 on: August 02, 2025, 01:14:49 pm »
EDIT: I just noted that now I am getting this captcha each time I try to login to the forum from my browser... it seems that my browser's ID is on cloudflare's list
It could be that in time that captcha goes away for you in the browser.

But as I stated in the beginning... protection through Cloudflare is really difficult to beat.
It could be that the 403 you're getting is the captcha page. In that case the program should switch to 'human' mode and tell Cloudflare it's human (which of course is a fantasy) 😂

https://www.zenrows.com/blog/bypass-cloudflare#fortified-headless-browsers

You could try to find out the original IP of forums.winamp.com (not the Cloudflare IP).
(See method #3 in above article)
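To check whether a 403 really is the challenge page, the program can accept the 403 instead of letting fphttpclient raise an exception, and then inspect the response headers. The sketch below relies on Cloudflare marking challenge responses with a `cf-mitigated: challenge` header; the helper names are made up, and the URL and user agent are simply the ones mentioned in this thread.

```pascal
program ChallengeCheck;
{$mode objfpc}{$H+}

uses
  Classes, SysUtils, fphttpclient, opensslsockets;

// True if a raw header line reads "cf-mitigated: challenge" (case-insensitive).
function IsChallengeHeaderLine(const Line: String): Boolean;
var
  P: Integer;
begin
  P := Pos(':', Line);
  Result := (P > 0) and
    SameText(Trim(Copy(Line, 1, P - 1)), 'cf-mitigated') and
    SameText(Trim(Copy(Line, P + 1, MaxInt)), 'challenge');
end;

// True if any of the response headers marks the response as a challenge.
function IsChallenge(Headers: TStrings): Boolean;
var
  i: Integer;
begin
  Result := False;
  for i := 0 to Headers.Count - 1 do
    if IsChallengeHeaderLine(Headers[i]) then
      Exit(True);
end;

var
  Client: TFPHTTPClient;
  Body: TStringStream;
begin
  Client := TFPHTTPClient.Create(nil);
  Body := TStringStream.Create('');
  try
    Client.AllowRedirect := True;
    Client.AddHeader('User-Agent',
      'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:128.0) Gecko/20100101 Firefox/128.0');
    try
      // Listing 403 as allowed keeps the response body and headers available.
      Client.HTTPMethod('GET', 'https://forums.winamp.com/', Body, [200, 403]);
      if IsChallenge(Client.ResponseHeaders) then
        WriteLn('Status ', Client.ResponseStatusCode, ': Cloudflare challenge page')
      else
        WriteLn('Status ', Client.ResponseStatusCode, ', ', Body.Size, ' bytes received');
    except
      on E: Exception do
        WriteLn('Request failed: ', E.Message);
    end;
  finally
    Body.Free;
    Client.Free;
  end;
end.
```

This only detects the challenge. It does not solve it, so the scraper can at most back off and retry later.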

Nitorami

  • Hero Member
  • *****
  • Posts: 605
Re: logon to a website using fphttpclient
« Reply #49 on: August 02, 2025, 02:41:22 pm »
The funny thing is, I am not trying to do anything illegal, nor am I planning a DDoS campaign or anything that would have any noticeable impact on the server infrastructure.
The forum these days attracts around 5 legit members, old farts like me, while the user stats page usually shows 200 ... 20000 "guests". According to NJK these are mainly AI scrapers, google and amazon crawlers. And around 1000 new spammers per month, but, while a nuisance, these hardly contribute to the active user stats.
What is the point of cloudflare if they just manage to annoy harmless tinkerers like me, while letting through the large-style professional scrapers ?

rvk

  • Hero Member
  • *****
  • Posts: 6991
Re: logon to a website using fphttpclient
« Reply #50 on: August 02, 2025, 03:05:09 pm »
The funny thing is, I am not trying to do anything illegal, nor am I planning a DDoS campaign or anything that would have any noticeable impact on the server infrastructure.
forums.winamp.com is hosted by vbulletin.net (as are many other forums). You can see that because winamp.vbulletin.net exists as hostname (also pointing to cloudflare). There are hundreds of these subdomains (one per hosted forum).

So the whole cloudflare thing is handled (or ordered) by vbulletin.net.

You say you're not doing anything against the rules when scraping the forum. But the chances are this does violate their terms of service (at least the ones from vbulletin.net).

So winamp might be a small forum... it falls under the whole umbrella of vbulletin and they have implemented strict access (via cloudflare).

BTW. The professional scrapers also have problems... or they are (individually) allowed by cloudflare because of searchability. And yes, with millions of scrapers... sometimes a thousand get through... a small percentage.


Nitorami

  • Hero Member
  • *****
  • Posts: 605
Re: logon to a website using fphttpclient
« Reply #51 on: August 02, 2025, 03:39:53 pm »
Ok. But oddly, they have my browser under suspicion - on almost every page I try to access, I now get a captcha - while my "scraping" program still works (with your user agent ID).

So this seems not related to winamp.com, and maybe this is just coincidence. The cloudflare action might have been triggered by something entirely different outside my reach.

 
