As part of my engineering project, I have to design a tourism website for which I get all the data by crawling various (4 or 5) sites and automated databases. My website's data should be updated automatically whenever there is a modification in the sites being crawled. There is no constraint on the language to be used, but I would prefer either PHP or Perl. Please suggest which of the two I should use (as in, which gets the job done quickly and is easy to pick up), and do mention some valid resources where I can get an idea about the same. Regards - DT (a programming amateur)
I'd recommend sticking with PHP and MySQL personally. If you want some quick tutorials on PHP, head over to w3schools.com or take a look at tizag.com; both are excellent. If you'd like video tutorials on PHP, I'd recommend http://www.phpvideotutorials.com/
I have already used PHP for all those validations on the signup page and the session ID stuff. This is the configuration I am currently using: Apache web server, MySQL (database end), PHP, JavaScript. The next phase is concerned with crawling 4 or 5 websites and getting the data. That is what I need an idea about. How do I start?
I prefer ColdFusion. Also, crawling a site for content would be harder than just using RSS or XML feeds. I would try those first, as your content would be more accurate and easier to manage, manipulate and store.
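To show what I mean: if one of the target sites exposes an RSS feed, reading it is only a few lines of PHP with SimpleXML. This is just a sketch — the feed content below is a made-up inline sample; in the real project you'd point simplexml_load_file() at the site's actual feed URL.

```php
<?php
// Sketch: pulling items out of an RSS 2.0 feed with SimpleXML.
// The feed below is an invented inline sample so the example runs as-is;
// replace it with simplexml_load_file('http://.../feed.rss') in practice.
$feedXml = <<<XML
<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example Tourism Feed</title>
    <item>
      <title>Rajasthan travel guide</title>
      <link>http://example.com/rajasthan</link>
      <description>Forts, deserts and palaces.</description>
    </item>
    <item>
      <title>Kerala backwaters</title>
      <link>http://example.com/kerala</link>
      <description>Houseboats and lagoons.</description>
    </item>
  </channel>
</rss>
XML;

$rss = simplexml_load_string($feedXml);

// Each <item> is already structured data: no scraping or regex needed.
foreach ($rss->channel->item as $item) {
    echo $item->title, " => ", $item->link, "\n";
}
```

That structure is exactly why feeds beat crawling: the title, link and description come pre-separated, so storing them in MySQL is trivial.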
I had a look at the page you referred to and got some basic idea of how to crawl a website from it. BUT what that person is trying to do and what I am trying to do are, I think, kind of different. I don't have to parse the links appearing on that HTML page. Rather, this is what I need: there is a political map of India on my site; when a user clicks on a particular state, I have to crawl the pre-decided 4 or 5 websites, searching (most probably BFS/DFS) these sites for the keyword, i.e. the name of the state the user clicked on, and then get only the REQUIRED data corresponding to that keyword. Do I need to use regex, preg_match and that kind of stuff, or can it be done with something else (easier) too?
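Just to make my intent concrete, here is the kind of thing I imagine doing on each fetched page — a rough sketch only, with made-up inline HTML standing in for a page that would really come from file_get_contents() or cURL, and plain stripos() instead of regex:

```php
<?php
// Sketch: given the HTML of one of the pre-decided sites, find the
// paragraphs that mention a state name. Inline HTML keeps this runnable;
// the real HTML would be fetched from the target site.
$html = '<html><body>
  <p>Kerala is famous for its backwaters.</p>
  <p>Goa has beautiful beaches.</p>
  <p>Kerala also has hill stations like Munnar.</p>
</body></html>';

$state = 'Kerala';   // the state the user clicked on the map

$doc = new DOMDocument();
@$doc->loadHTML($html);          // @ suppresses warnings on sloppy real-world HTML
$xpath = new DOMXPath($doc);

// Walk every <p> and keep those whose text contains the keyword.
// A case-insensitive substring check (stripos) is enough here; no regex.
$matches = array();
foreach ($xpath->query('//p') as $p) {
    if (stripos($p->textContent, $state) !== false) {
        $matches[] = trim($p->textContent);
    }
}
print_r($matches);
```

Is this roughly the right approach, or is there a better way to narrow a page down to just the keyword-related content?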
FCM, even I know crawling a website for content is not everybody's cup of tea, but the thing is I have been asked to do exactly that in my minor project, so I have got no other option.
Aah, DOMDocument and XPath (hearing of them for the very first time)! Any resources to look at? I have also heard about a class named PHPCrawler that is used for web crawling. Do you have any idea about it, and if yes, whether it is useful in the context I am looking for?
About DOMDocument and XPath: see "how to get links from a webpage with XPath and DOMDocument". I've never heard of PHPCrawler.
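The basic idea from that page boils down to something like this — a minimal sketch, with inline HTML standing in for a fetched page:

```php
<?php
// Sketch: extracting every link (href + link text) from a page
// with DOMDocument + DOMXPath. Real pages would be loaded from a URL.
$html = '<html><body>
  <a href="http://example.com/kerala">Kerala guide</a>
  <a href="http://example.com/goa">Goa guide</a>
</body></html>';

$doc = new DOMDocument();
@$doc->loadHTML($html);          // tolerate imperfect markup
$xpath = new DOMXPath($doc);

// '//a[@href]' selects every anchor element that has an href attribute.
$links = array();
foreach ($xpath->query('//a[@href]') as $a) {
    $links[$a->getAttribute('href')] = $a->textContent;
}
print_r($links);
```

Once you have the hrefs you can queue them up and fetch each one in turn, which is all a crawler really is.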