Author Topic: Coding a Web Crawler  (Read 7308 times)

Jishaxe

  • Full Member
  • ***
  • Posts: 103
  • Hobbist Programmer
Coding a Web Crawler
« on: July 04, 2011, 01:43:59 pm »
Hello my friends.
I was thinking about creating a Web Crawler, just for fun.
The idea is that I start the program with seed URLs; it parses each page for other URLs and adds those to a queue. It then repeats the process with the first page in the queue, and as it goes it appends every URL it visits to a text file. It's going to be designed to run indefinitely, but of course it won't process the entire Web in my lifetime :L
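For illustration, the loop described above could be sketched roughly like this in Python. This is only a sketch under assumptions: the names crawl, get_links, log_path and max_pages are made up for the example, and get_links stands in for whatever fetches a page and returns the URLs found on it.

```python
from collections import deque


def crawl(seeds, get_links, log_path, max_pages=100):
    """Visit URLs in queue order, appending each visited URL to a text file.

    get_links is a hypothetical callable: it takes a URL and returns the
    URLs found on that page. max_pages is an assumed safety limit so the
    sketch terminates; drop it to run indefinitely.
    """
    queue = deque()
    seen = set()  # every URL ever queued, so no page is visited twice
    for url in seeds:
        if url not in seen:
            seen.add(url)
            queue.append(url)

    visited = []
    with open(log_path, "a", encoding="utf-8") as log:
        while queue and len(visited) < max_pages:
            url = queue.popleft()      # first page in the queue
            log.write(url + "\n")      # record the visit in the text file
            visited.append(url)
            for link in get_links(url):
                if link not in seen:
                    seen.add(link)
                    queue.append(link)
    return visited
```

Keeping a set of already-queued URLs next to the queue is one design choice among several; without it the crawler would revisit pages that link to each other.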
I can do a lot of this, but what I need help with is parsing the page to find references to other URLs.
If someone could give me an algorithm to do this, that would be great (but that would be lazy, I know), so it would be nice if someone could advise me on the best libraries and methods to achieve this goal, please ^^
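One possible approach, shown here only as a sketch, is to use the HTML parser from the Python standard library and resolve every href against the page's own URL. The names LinkParser and extract_links are invented for this example, and the choice to keep only http/https links and to drop #fragments is an assumption. Third-party parsers such as Beautiful Soup or lxml are common alternatives.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class LinkParser(HTMLParser):
    """Collects absolute http(s) URLs from <a href="..."> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url  # URL of the page being parsed
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag and attribute names in lowercase.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links, then drop any #fragment.
                absolute, _fragment = urldefrag(urljoin(self.base_url, value))
                # Assumption: skip mailto:, javascript: and similar schemes.
                if absolute.startswith(("http://", "https://")):
                    self.links.append(absolute)


def extract_links(html, base_url):
    """Return the links found in an HTML string, in document order."""
    parser = LinkParser(base_url)
    parser.feed(html)
    return parser.links
```

Calling extract_links(page_html, page_url) on a downloaded page gives a list that can be fed straight into a queue.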
Thank you,
Jishaxe
PS. Yes, I am aware of the legal issues; the software will abide by robots.txt :L
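As a rough sketch of the robots.txt check, Python's standard library has urllib.robotparser. The helper name build_robots_checker and the user agent string "MyCrawler" are assumptions for the example; the robots.txt text is passed in as a string so that downloading it stays a separate step.

```python
from urllib.robotparser import RobotFileParser


def build_robots_checker(robots_txt, user_agent="MyCrawler"):
    """Return a function that says whether a URL may be fetched.

    robots_txt is the text of a site's robots.txt file; user_agent is a
    hypothetical name for the crawler.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return lambda url: parser.can_fetch(user_agent, url)
```

A crawler would build one checker per site and skip any URL for which it returns False.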
Linux Mint 12
Windows 7 Home Premium
______________________
Definition of programmer: An organism that converts caffeine into software.

mica

  • Full Member
  • ***
  • Posts: 196
Re: Coding a Web Crawler
« Reply #1 on: July 04, 2011, 05:55:39 pm »

Jishaxe

  • Full Member
  • ***
  • Posts: 103
  • Hobbist Programmer
Re: Coding a Web Crawler
« Reply #2 on: July 04, 2011, 05:57:20 pm »
Hm, I'll take a look. Thanks.


 
