Web Crawling

Thapar Peon

#1

As part of my engineering project, I have to design a tourism website that gets all its data by crawling several (4 or 5) sites and automatic databases. My website's data should be updated automatically whenever the crawled sites are modified.

There is no constraint on the language to be used, but I would prefer either PHP or Perl.
Please suggest which of the two I should use (that is, which does the job quickly and is easier to pick up), and please mention some good resources where I can get an idea about this.

Regards,
DT (a programming amateur)

Thapar, Oct 18, 2009 IP

JAY6390 Peon

#2

Personally, I'd recommend sticking with PHP and MySQL. If you want some quick tutorials on PHP, head over to w3schools.com or tizag.com. Both are excellent. If you'd like video tutorials on PHP, I'd recommend http://www.phpvideotutorials.com/

JAY6390, Oct 18, 2009 IP

Thapar Peon

#3

I have already used PHP for all the validations on the signup page and for the session ID handling.

This is the configuration I am currently using:

Apache web server
MySQL (database end)
PHP
JavaScript

The next phase is concerned with crawling 4 or 5 websites and getting the data. That is what I need an idea about: how to start!

Thapar, Oct 19, 2009 IP

AsHinE Well-Known Member

#4

Take a look here.
There are some general ideas and hints about web crawling there.

AsHinE, Oct 19, 2009 IP

FCM Well-Known Member

#5

I prefer ColdFusion.

Also, crawling a site for content would be harder than just using RSS or XML feeds. I would try those first, as your content would be more accurate and easier to manage, manipulate and store.
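
For the PHP setup mentioned earlier in the thread, reading a feed could look roughly like the sketch below. It assumes the source site offers an RSS 2.0 feed; the feed text, titles and example.com links are made up for illustration, and a real script would load the string from the feed URL instead.

```php
<?php
// Hypothetical RSS 2.0 feed; in practice this string would come from the site's feed URL.
$rss = <<<XML
<rss version="2.0">
  <channel>
    <title>Example tourism feed</title>
    <item><title>Kerala backwaters</title><link>http://example.com/kerala</link></item>
    <item><title>Rajasthan forts</title><link>http://example.com/rajasthan</link></item>
  </channel>
</rss>
XML;

// Parse the feed; simplexml_load_string() returns false if the XML is malformed.
$feed = simplexml_load_string($rss);

$items = array();
foreach ($feed->channel->item as $item) {
    // Cast SimpleXMLElement nodes to plain strings before storing them.
    $items[] = array(
        'title' => (string) $item->title,
        'link'  => (string) $item->link,
    );
}

print_r($items);
```

Each entry in `$items` is then a plain array that can be inserted into MySQL without any HTML parsing.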

FCM, Oct 19, 2009 IP

Thapar Peon

#6

I had a look at the page you referred to, and I now have a basic idea of how to crawl a website.
But I think what that person is trying to do is somewhat different from what I am trying to do.

I don't have to parse the links appearing on that HTML page. Rather, this is what I need:

There is a political map of India on my site. When a user clicks on a particular state, I have to crawl the 4 or 5 predecided websites, searching them (most probably with BFS/DFS) for the keyword, i.e. the name of the state the user clicked, and then extract only the REQUIRED data corresponding to that keyword. Do I need to use regex, preg_match and that kind of stuff, or can it be done with something else (easier) too?

Thapar, Oct 19, 2009 IP

Thapar Peon

#7

FCM, even I know that crawling a website for content is not everybody's cup of tea, but I have been asked to do so in my Minor Project, so I have no other option.

Thapar, Oct 19, 2009 IP

AsHinE Well-Known Member

#8

I guess there are two ways of extracting content:
1. regexes
2. DOMDocument and XPath queries.
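
A minimal sketch of both options in PHP, run against a made-up HTML snippet. The markup, the `place` class and the keyword "Punjab" are assumptions for illustration only and are not taken from any real site; a real crawler would fetch the page over HTTP first.

```php
<?php
// Hypothetical page fragment standing in for a crawled page.
$html = '<html><body>'
      . '<div class="place"><h2>Punjab</h2><p>Golden Temple, Amritsar</p></div>'
      . '<div class="place"><h2>Goa</h2><p>Beaches and forts</p></div>'
      . '</body></html>';
$keyword = 'Punjab';

// 1. Regex: short, but it breaks easily when the markup changes.
$pattern  = '~<h2>' . preg_quote($keyword, '~') . '</h2>\s*<p>(.*?)</p>~s';
$viaRegex = preg_match($pattern, $html, $m) ? $m[1] : null;

// 2. DOMDocument + XPath: query the parsed tree instead of the raw text.
libxml_use_internal_errors(true);   // real-world HTML is rarely valid; keep warnings quiet
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// Note: $keyword is pasted into the query as-is, so it must not contain double quotes.
$nodes    = $xpath->query('//div[@class="place"][h2="' . $keyword . '"]/p');
$viaXpath = $nodes->length > 0 ? $nodes->item(0)->textContent : null;

echo $viaRegex, "\n", $viaXpath, "\n";
```

Both variables end up holding the paragraph that follows the matching heading. The XPath version usually copes better with extra attributes or whitespace in the page, at the cost of a little more setup.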

AsHinE, Oct 19, 2009 IP

Thapar Peon

#9

Aah, DOMDocument and XPath (hearing of them for the very first time)!
Any resources I can look at?
I have also heard about a class named PHPCrawler that is used for web crawling.
Do you have any idea about it? If yes, is it useful in the context I am looking for?

Thapar, Oct 19, 2009 IP

AsHinE Well-Known Member

#10

About DOMDocument and XPath: how to get links from a webpage with XPath and DOMDocument.
I've never heard of PHPCrawler.
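
On the DOMDocument and XPath side, collecting the links from a page might be sketched as follows. The HTML string and its URLs are made up for illustration; this is only one possible way to do it, not the code from the linked article.

```php
<?php
// Made-up markup for illustration; a crawler would load the fetched page here.
$html = '<p><a href="/states/punjab">Punjab</a> '
      . '<a href="/states/goa">Goa</a> '
      . '<a name="top">no href</a></p>';

libxml_use_internal_errors(true);   // tolerate sloppy HTML without warnings
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// Select only anchors that actually carry an href attribute.
$links = array();
foreach ($xpath->query('//a[@href]') as $a) {
    $links[$a->getAttribute('href')] = trim($a->textContent);
}

print_r($links);
```

The resulting array maps each URL to its link text, which is a convenient starting point for a BFS/DFS queue of pages still to visit.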

AsHinE, Oct 19, 2009 IP
