Hello my friends.
I was thinking about creating a Web Crawler, just for fun.
The idea is that I start the program with some seed URLs. It parses each page for other URLs and adds those to a queue, then repeats the process with the first page in the queue. As it goes, it appends every URL it visits to a text file. It's designed to run indefinitely, but of course it won't process the entire Web in my lifetime :L
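To make the idea concrete, here is a rough Python sketch of the loop I have in mind. It is only an outline under some assumptions: `fetch` and `extract_links` are hypothetical callables I would still have to write (download a page, and pull the URLs out of it), and `max_pages` is a cap I added so a test run actually stops.

```python
from collections import deque


def crawl(seeds, fetch, extract_links, log_path, max_pages=100):
    """Breadth-first crawl: pop a URL, record it, queue the links it contains.

    fetch(url) and extract_links(url, page) are placeholders supplied by
    the caller; they are assumptions, not part of any library.
    """
    queue = deque(dict.fromkeys(seeds))  # de-duplicated seeds, order kept
    seen = set(queue)                    # every URL ever queued
    visited = []
    with open(log_path, "a", encoding="utf-8") as log:
        while queue and len(visited) < max_pages:
            url = queue.popleft()
            try:
                page = fetch(url)
            except Exception:
                continue  # skip pages that fail to download
            visited.append(url)
            log.write(url + "\n")
            for link in extract_links(url, page):
                if link not in seen:
                    seen.add(link)
                    queue.append(link)
    return visited
```

The `seen` set is there so the same URL never gets queued twice, otherwise two pages linking to each other would loop forever.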
I can do a lot of this, but what I need help with is parsing the page to find references to other URLs.
If someone could give me an algorithm for this, that would be great (though I know that would be lazy of me). Failing that, I'd appreciate advice on the best libraries and methods for the job, please ^^
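For reference, the closest I can picture so far is something like the sketch below, assuming Python and only its standard library (`html.parser` and `urllib.parse`). The names `LinkExtractor` and `extract_links` are my own, and it only looks at `<a href="...">` tags, so it may well miss things. Third-party parsers such as Beautiful Soup are a common alternative.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag, urlparse


class LinkExtractor(HTMLParser):
    """Collects absolute http(s) URLs from <a href="..."> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page URL, drop "#fragment"
                absolute, _fragment = urldefrag(urljoin(self.base_url, value))
                # Ignore mailto:, javascript:, and other non-web schemes
                if urlparse(absolute).scheme in ("http", "https"):
                    self.links.append(absolute)


def extract_links(base_url, html):
    """Return the links found in html, in document order."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links
```

Calling `extract_links(page_url, page_source)` would then give a list of absolute URLs ready to go into the queue.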
Thank you,
Jishaxe
PS. Yes, I am aware of the legal issues; the software will abide by robots.txt :L
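For the robots.txt part, this is roughly what I plan to do, assuming Python's standard `urllib.robotparser` module. The helper name `make_robots_checker` and the user agent string `ExampleCrawler` are placeholders of mine.

```python
from urllib.robotparser import RobotFileParser


def make_robots_checker(robots_txt, user_agent="ExampleCrawler"):
    """Build an allowed(url) function from the text of a robots.txt file."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())

    def allowed(url):
        return parser.can_fetch(user_agent, url)

    return allowed
```

Here the robots.txt text is passed in directly so the rules can be checked offline. In the real crawler I would expect to use the parser's `set_url()` and `read()` methods to download the file from each site first.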