
Author Topic: Google indexing of Lazarus forum posts  (Read 1734 times)

440bx

  • Hero Member
  • *****
  • Posts: 6364
Google indexing of Lazarus forum posts
« on: March 01, 2025, 06:19:18 am »
Hello,

Most of the time I have used Google to search for Lazarus forum posts of interest, because it has been mentioned that the forum "search" feature has a fair performance impact on the server.

Using Google as a replacement for the forum "search" worked fairly well until about 3 weeks ago. Since then, Google seems to have discarded quite a few forum posts from its indexes, making it unsuitable as a replacement for the forum "search" function.

My question is: is this entirely Google's doing, or has there been some change in how the forum is managed that may have caused it (e.g., restrictions on crawlers or something else that affects search engines)?



FPC v3.2.2 and Lazarus v4.0rc3 on Windows 7 SP1 64bit.

TRon

  • Hero Member
  • *****
  • Posts: 4377
Re: Google indexing of Lazarus forum posts
« Reply #1 on: March 01, 2025, 04:57:53 pm »
Do you use a general Google search, or do you specifically ask Google to search this site?

If I use DuckDuckGo with the proper search terms, your (this) thread shows up first.
Today is tomorrow's yesterday.

440bx

  • Hero Member
  • *****
  • Posts: 6364
Re: Google indexing of Lazarus forum posts
« Reply #2 on: March 01, 2025, 09:31:05 pm »
Depending on what I was searching for, it was occasionally necessary to tell Google to limit the search to the Lazarus forum.

In the great majority of cases, that doesn't seem to make any difference anymore. IOW, it doesn't even know the posts exist, i.e., they are not indexed.

Maybe I'll switch to DuckDuckGo.

I'm somewhat concerned that Google no longer seems to index the Lazarus forum posts, because most people use Google, and its failure to index the forum posts lessens the visibility of the forum and, by extension, that of Lazarus and FPC.
FPC v3.2.2 and Lazarus v4.0rc3 on Windows 7 SP1 64bit.

Joanna

  • Hero Member
  • *****
  • Posts: 1440
Re: Google indexing of Lazarus forum posts
« Reply #3 on: March 01, 2025, 11:15:54 pm »
440bx It’s a sad state of affairs to become so dependent upon Google...when Google has no incentive to make this project a success.

The whole idea of some third party determining what you will find is not a good thing, even if it started off as easy and convenient. Who knows what other things Google is not finding?

Also I’ve heard of a growing trend in which AI is used to harvest content from websites so that it is no longer necessary to even visit the websites at all. The website content creators will become anonymous suppliers for a bot.

440bx

  • Hero Member
  • *****
  • Posts: 6364
Re: Google indexing of Lazarus forum posts
« Reply #4 on: March 02, 2025, 12:04:24 am »
Quote
440bx It’s a sad state of affairs to become so dependent upon Google...when Google has no incentive to make this project a success.
All internet users depend to some degree on search engines, Google or others.

It's kind of like, if you drive a car, you depend on the existence of roads (otherwise the car's usefulness is severely hampered.)

The above doesn't mean Google isn't misusing/"unethically using" the data it collects. I don't have proof either way, but Google's ethical principles seem to be "flexible" (particularly when money is involved, which is most of the time).

Anyway, their no longer indexing this forum's posts won't benefit Lazarus nor FPC. As you pointed out, that part is sad.
FPC v3.2.2 and Lazarus v4.0rc3 on Windows 7 SP1 64bit.

Josh

  • Hero Member
  • *****
  • Posts: 1455
Re: Google indexing of Lazarus forum posts
« Reply #5 on: March 02, 2025, 12:11:51 am »
Quick Diag.
The robots.txt in the root is blocking all search engines from indexing.
Code:
user-agent: *
disallow: /index.php?*type=rss
crawl-delay: 30

user-agent: YandexBot
disallow: /index.php?*type=rss
crawl-delay: 30

user-agent: Amazonbot
disallow: /

Not sure why it has a crawl delay set for the named agents though; if a bot obeys the * agent entry, a separate per-agent delay has no effect.

The same goes for the entries after it, as those bots will be disallowed anyway because all agents are disallowed with *.

I would allow Google and a few others and disallow the rest (Slurp is Yahoo).

Code:
User-agent: *
Disallow: /

User-agent: Applebot
Allow: /
Crawl-delay: 30

User-agent: baiduspider
Allow: /
Crawl-delay: 30

User-agent: Bingbot
Allow: /
Crawl-delay: 30

User-agent: DuckDuckBot
Allow: /
Crawl-delay: 30

User-agent: Facebot
Allow: /
Crawl-delay: 30

User-agent: Googlebot
Allow: /
Crawl-delay: 30

User-agent: msnbot
Allow: /
Crawl-delay: 30

User-agent: Naverbot
Allow: /
Crawl-delay: 30

User-agent: seznambot
Allow: /
Crawl-delay: 30

User-agent: Slurp
Allow: /
Crawl-delay: 30

User-agent: teoma
Allow: /
Crawl-delay: 30

User-agent: Twitterbot
Allow: /
Crawl-delay: 30

User-agent: Yandex
Allow: /
Crawl-delay: 30

User-agent: Yeti
Allow: /
Crawl-delay: 30


On pages like login/register/etc., add a 'noindex' robots meta tag to the page's head section; this tells search engines not to index (and, with 'nofollow', not to follow links on) those pages. The effect is not immediate, as search engines have to re-crawl your entire site, and some engines can take a few complete passes until they obey the change, so be patient.
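As a sketch of that tag (the exact template file to edit depends on the forum software in use):

```html
<!-- Placed inside the <head> of pages that should stay out of search
     results (login, register, etc.). "noindex, nofollow" asks compliant
     engines neither to index the page nor to follow its links. -->
<meta name="robots" content="noindex, nofollow">
```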

Note that a delay of 30, given the number of pages on the forum, could still stress the server; you could try increasing it to 100 or 300. The index will not be as up to date, but it should help forum speed.

 
# Block specific pages from being crawled, if these are fixed htm/html files
User-agent: *
Disallow: /contactus.htm

# Block directories that you don't want crawled, e.g.
User-agent: *
Disallow: /cgi-bin/
Disallow: /forum-user-attachments/
Disallow: /tmp/

Hope the above makes sense.

Note that robots.txt works fine for well-behaved spiders; not all of them obey everything, and some just completely ignore it.

If using Apache, I would suggest updating the .htaccess file in the root to block these bots/spiders by creating a rewrite rule. Obviously, check the logs to see which bot it is and create a rewrite rule for it, e.g.:
RewriteCond %{HTTP_USER_AGENT} ^BlackWidow [NC,OR]
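A complete rule needs a RewriteRule after the condition chain, and the last condition must not carry the [OR] flag. A sketch ("WebCopier" is just a hypothetical second bad-bot name for illustration):

```apache
RewriteEngine On

# Return 403 Forbidden to any user agent matching one of these patterns.
# "WebCopier" is a hypothetical second entry for illustration.
RewriteCond %{HTTP_USER_AGENT} ^BlackWidow [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^WebCopier [NC]
RewriteRule ^ - [F,L]
```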

I would not block IPs or ranges; legit bots stick to their IPs, bad bots do not, and generally they go through proxies, so they can change their IP at will.

Addition:
If you're using a sitemap.xml and it's set to update every 24 hrs, increase that to 72 hrs.
I had issues a while back where a log file became so large that updates to it used so much resources the site stalled. The file was over 30 GB, even though it was deleted every 7 days; this was not anticipated, as normally it would be 400 MB at most, but it grew by about 9000% due to hacker attempts being logged. This was helped by using fail2ban, logging to AbuseIPDB, and using AbuseIPDB as a blacklist check. It took a bit of configuring and fine tuning; I think I ended up with 10 failed login attempts causing a block, and 3 assorted hack attempts causing a block.
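Thresholds like those would live in a fail2ban jail definition. A minimal sketch, assuming a custom filter named "forum-login" exists; the filter name and log path are hypothetical and must match your own setup:

```ini
# /etc/fail2ban/jail.local (sketch; "forum-login" filter and the
# log path are hypothetical and must match your configuration)
[forum-login]
enabled  = true
port     = http,https
filter   = forum-login
logpath  = /var/log/apache2/access.log
# 10 failed logins within findtime seconds trigger a ban of bantime seconds
maxretry = 10
findtime = 600
bantime  = 86400
```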

baiduspider - Baidu (China's leading search engine)
Applebot - Apple
Bingbot - Bing
DuckDuckBot - DuckDuckGo
Facebot - Facebook
Googlebot - Google
msnbot - MSN
Naverbot - Naver (South Korea)
seznambot - Seznam (Czech Republic)
Slurp - Yahoo
teoma - Ask search engine
Twitterbot - Twitter
Yandex - Yandex (Russian search engine)
Yeti - Naver (South Korea)

Added DuckDuckBot to the allowed list.
« Last Edit: March 02, 2025, 02:45:53 am by Josh »
The best way to get accurate information on the forum is to post something wrong and wait for corrections.

dbannon

  • Hero Member
  • *****
  • Posts: 3777
    • tomboy-ng, a rewrite of the classic Tomboy
Re: Google indexing of Lazarus forum posts
« Reply #6 on: March 02, 2025, 02:30:11 am »
Nice write up Josh !

I too use DuckDuckGo, and I can find any phrase from 440bx's first post. That does sound to me (given Josh's words) as if DDG ignores robots.txt? That's a touch disturbing, I have to say.

I routinely use DDG to search the forum, as its searching tends to be better. I don't really care about it not finding content less than 24h old; it's easier to browse the sidebar for recent stuff.

Davo
Lazarus 3, Linux (and reluctantly Win10/11, OSX Monterey)
My Project - https://github.com/tomboy-notes/tomboy-ng and my github - https://github.com/davidbannon

GAN

  • Sr. Member
  • ****
  • Posts: 389
Re: Google indexing of Lazarus forum posts
« Reply #7 »
Hi @440bx take a look: https://duckduckgo.com/?t=lm&q=Google+indexing+of+Lazarus+forum+posts&ia=web
Linux Mint Mate (always)
Zeos 7̶.̶2̶.̶6̶ 7.1.3a-stable - Sqlite - LazReport

440bx

  • Hero Member
  • *****
  • Posts: 6364
Re: Google indexing of Lazarus forum posts
« Reply #8 on: March 02, 2025, 03:14:35 am »
Quote
Nice write up, Josh!
I second that!

Quote
I too use DuckDuckGo, and I can find any phrase from 440bx's first post. That does sound to me (given Josh's words) as if DDG ignores robots.txt? That's a touch disturbing, I have to say.
Yes, it's nice the posts are indexed, but it is a bit worrisome that it may be ignoring robots.txt.

Quote
I routinely use DDG to search the forum, as its searching tends to be better. I don't really care about it not finding content less than 24h old; it's easier to browse the sidebar for recent stuff.
Yes, another reason I often used Google instead of the forum search was that the results felt a bit better and it was usually faster too (in addition to not loading the web server with searches).

Quote
Hi @440bx take a look: https://duckduckgo.com/?t=lm&q=Google+indexing+of+Lazarus+forum+posts&ia=web
Thank you GAN, I'll be using DDG for forum searches from now on.




I remain a bit concerned that Google isn't indexing the posts, because that may negatively affect the visibility of Lazarus and FPC. I'm not sure if there is anything the web admins can do about that.
FPC v3.2.2 and Lazarus v4.0rc3 on Windows 7 SP1 64bit.

Joanna

  • Hero Member
  • *****
  • Posts: 1440
Re: Google indexing of Lazarus forum posts
« Reply #9 on: March 02, 2025, 01:05:08 pm »
Quote
All internet users depend on some degree on search engines, Google or others. It's kind of like, if you drive a car, you depend on the existence of roads (otherwise the car's usefulness is severely hampered.)
440bx, with the car analogy, it would be more like having a map that only contains the roads the map maker benefits from people traveling down. If you want to find the roads not on the map, you have to get out of your car and walk around looking for them, which of course most people are very unwilling to do, because they believe the map is accurate  :D

Josh

  • Hero Member
  • *****
  • Posts: 1455
Re: Google indexing of Lazarus forum posts
« Reply #10 on: March 02, 2025, 01:18:07 pm »
Hi

@dbannon, 440bx thanks.

Recently, bad bots have been using the indexes/links etc. from legit bots to scrape sites; this allows them to scrape as if robots.txt were not present, a kind of back door in.

You will not stop all of them all of the time, but you can make it hard for them.

The .htaccess can be very useful. Below, I pieced together rules that only allow the bots from the earlier example, and then block even those bots from using the search query (so as not to overload the database with queries) and from certain folders.

Code:
RewriteEngine On

# Block crawlers by default unless they are on the allow list below.
# (The bot/crawl/spider guard keeps ordinary browsers, whose user
# agents match none of these patterns, from being blocked as well.)
RewriteCond %{HTTP_USER_AGENT} (bot|crawl|spider|slurp|teoma) [NC]
RewriteCond %{HTTP_USER_AGENT} !(baiduspider|Applebot|Bingbot|DuckDuckBot|Facebot|Googlebot|msnbot|Naverbot|seznambot|Slurp|teoma|Twitterbot|Yandex|Yeti) [NC]
RewriteRule ^ - [F,L]

# Block the allowed bots (DuckDuckBot, Googlebot, Bingbot, etc.) from
# /downloads/, /profiles/, /admin/ and /users/stuff/, incl. sub folders
RewriteCond %{HTTP_USER_AGENT} (baiduspider|Applebot|Bingbot|DuckDuckBot|Facebot|Googlebot|msnbot|Naverbot|seznambot|Slurp|teoma|Twitterbot|Yandex|Yeti) [NC]
RewriteCond %{REQUEST_URI} ^/(downloads|profiles|admin|users/stuff)/ [NC]
RewriteRule ^ - [F,L]

# Block the allowed bots from search pages and query strings starting with s=
RewriteCond %{HTTP_USER_AGENT} (baiduspider|Applebot|Bingbot|DuckDuckBot|Facebot|Googlebot|msnbot|Naverbot|seznambot|Slurp|teoma|Twitterbot|Yandex|Yeti) [NC]
RewriteCond %{QUERY_STRING} ^s= [OR]
RewriteCond %{REQUEST_URI} ^/search/ [NC]
RewriteRule ^ - [F,L]

Note: you need to check the bot names.
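As a quick sanity check of the user-agent allowlist (an illustrative sketch, not part of the forum's actual setup), the same pattern can be mirrored in Python and fed user agents taken from the server logs:

```python
import re

# Case-insensitive allowlist, mirroring the RewriteCond pattern above.
ALLOWED_BOTS = re.compile(
    r"baiduspider|Applebot|Bingbot|DuckDuckBot|Facebot|Googlebot|msnbot|"
    r"Naverbot|seznambot|Slurp|teoma|Twitterbot|Yandex|Yeti",
    re.IGNORECASE,
)

def passes_allowlist(user_agent: str) -> bool:
    """True if this user agent would NOT be caught by the block rule."""
    return ALLOWED_BOTS.search(user_agent) is not None

print(passes_allowlist("Mozilla/5.0 (compatible; Googlebot/2.1)"))  # True
print(passes_allowlist("BlackWidow"))                               # False
```

Running real user agents from the logs through such a check makes it easy to spot a misspelled bot name before it silently blocks a legitimate crawler.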
« Last Edit: March 02, 2025, 02:11:09 pm by Josh »
The best way to get accurate information on the forum is to post something wrong and wait for corrections.

Joanna

  • Hero Member
  • *****
  • Posts: 1440
Re: Google indexing of Lazarus forum posts
« Reply #11 on: March 02, 2025, 02:50:35 pm »
I wonder how AI-related bots choose which websites to scrape to begin with... what is attracting them here?

Marc

  • Administrator
  • Hero Member
  • *
  • Posts: 2680
Re: Google indexing of Lazarus forum posts
« Reply #12 on: March 02, 2025, 10:34:10 pm »
Quote
Quick Diag.
The robots.txt in the root is blocking all search engines from indexing.
Code:
user-agent: *
disallow: /index.php?*type=rss
crawl-delay: 30

user-agent: YandexBot
disallow: /index.php?*type=rss
crawl-delay: 30

user-agent: Amazonbot
disallow: /

Not sure why it has a crawl delay set for the named agents though; if a bot obeys the * agent entry, a separate per-agent delay has no effect.

The same goes for the entries after it, as those bots will be disallowed anyway because all agents are disallowed with *.


No, the first section is not disallowing all spiders; it only disallows indexing the RSS feed.

Yandex doesn't honour the wildcard so it has to be listed explicitly.
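That reading can be double-checked with Python's standard urllib.robotparser, fed the robots.txt quoted above (an illustrative sketch; the topic URL is made up):

```python
from urllib.robotparser import RobotFileParser

# The robots.txt quoted above, verbatim.
ROBOTS_TXT = """\
user-agent: *
disallow: /index.php?*type=rss
crawl-delay: 30

user-agent: YandexBot
disallow: /index.php?*type=rss
crawl-delay: 30

user-agent: Amazonbot
disallow: /
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())
rp.modified()  # mark as freshly loaded so can_fetch()/crawl_delay() answer

# Ordinary topic pages are crawlable under the '*' section...
print(rp.can_fetch("Googlebot", "https://forum.lazarus.freepascal.org/index.php?topic=1.0"))  # True
# ...while Amazonbot is shut out entirely, and the '*' crawl delay applies.
print(rp.can_fetch("Amazonbot", "https://forum.lazarus.freepascal.org/index.php"))            # False
print(rp.crawl_delay("Googlebot"))                                                            # 30
```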
//--
{$I stdsig.inc}
//-I still can't read someones mind
//-Bugs reported here will be forgotten. Use the bug tracker

Marc

  • Administrator
  • Hero Member
  • *
  • Posts: 2680
Re: Google indexing of Lazarus forum posts
« Reply #13 on: March 02, 2025, 10:45:10 pm »
Google happened to get on the blocklist. I need to find out why.
//--
{$I stdsig.inc}
//-I still can't read someones mind
//-Bugs reported here will be forgotten. Use the bug tracker

 
