Web Crawling

Discussion in 'PHP' started by Thapar, Oct 18, 2009.

  1. #1
    As part of my engineering project, I have to design a tourism website that gets all its data by crawling various (4 or 5) sites and automatic databases. My website's data should be updated automatically whenever there is a modification in the sites being crawled.

    There is no constraint on the language to be used, but I would prefer either PHP or Perl.
    Please suggest which of the two I should use (i.e. which does the job quickly and is easy to pick up :)), and do mention some good resources where I can learn more about it.

    Regards -
    DT ( a programming amateur :))
     
    Thapar, Oct 18, 2009 IP
  2. JAY6390

    #2
    Personally, I'd recommend sticking with PHP and MySQL. If you want some quick tutorials on PHP, head over to w3schools.com or tizag.com and take a look. Both are excellent. If you'd like video tutorials on PHP, I'd recommend http://www.phpvideotutorials.com/
     
    JAY6390, Oct 18, 2009 IP
  3. Thapar

    #3
    I have already used PHP for the validations on the signup page and for the session ID handling.

    This is the configuration I am currently using:

    Apache web server
    MySQL (database back end)
    PHP
    JavaScript

    The next phase is crawling 4 or 5 websites and getting the data. That is what I need an idea about.
    How do I start?
     
    Thapar, Oct 19, 2009 IP
  4. AsHinE

    #4
    Take a look here.
    There are some general ideas and hints about web crawling.
     
    AsHinE, Oct 19, 2009 IP
  5. FCM

    #5
    I prefer ColdFusion.

    Also, crawling a site for content would be harder than just using RSS or XML feeds. I would try that first, as your content would be more accurate and easier to manage, manipulate and store.
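
    To illustrate the feed route, here is a minimal sketch using PHP's SimpleXML extension. The feed contents, the example.com URLs and the function name parseRssItems are made up for the example; a real feed would be fetched from the site's own feed URL and may use a different structure (Atom, namespaces, and so on).

```php
<?php
// Hypothetical RSS 2.0 snippet standing in for a real feed. In practice the
// XML would be downloaded from the site's feed URL instead of hard-coded.
$rss = <<<XML
<rss version="2.0">
  <channel>
    <title>Example Tourism News</title>
    <item>
      <title>Kerala backwaters season opens</title>
      <link>http://example.com/kerala-backwaters</link>
      <description>Houseboat bookings open for the season.</description>
    </item>
    <item>
      <title>Rajasthan desert festival dates</title>
      <link>http://example.com/rajasthan-festival</link>
      <description>Dates announced for the desert festival.</description>
    </item>
  </channel>
</rss>
XML;

/**
 * Parse an RSS 2.0 document and return its items as plain arrays.
 * Returns an empty array if the XML cannot be parsed.
 */
function parseRssItems(string $xml): array
{
    $feed = simplexml_load_string($xml);
    if ($feed === false) {
        return [];
    }

    $items = [];
    foreach ($feed->channel->item as $item) {
        // Cast to string: SimpleXML returns element objects, not text.
        $items[] = [
            'title'       => (string) $item->title,
            'link'        => (string) $item->link,
            'description' => (string) $item->description,
        ];
    }
    return $items;
}

$items = parseRssItems($rss);
foreach ($items as $item) {
    echo $item['title'], ' -> ', $item['link'], PHP_EOL;
}
```

    The resulting arrays can go straight into a MySQL table, which is a large part of why feeds are easier to manage than scraped HTML.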
     
    FCM, Oct 19, 2009 IP
  6. Thapar

    #6
    I had a look at the page you referred to, and I now have a basic idea of how to crawl a website :)
    BUT I think what that person is trying to do and what I am trying to do are somewhat different.

    I don't have to parse the links appearing on that HTML page. Rather, this is what I need:

    There is a political map of India on my site. When a user clicks on a particular state, I have to crawl the predecided 4 or 5 websites, searching them (most probably with BFS/DFS) for the keyword, i.e. the name of the state the user clicked on, and then get only the REQUIRED data corresponding to that keyword. Do I need to use regex, preg_match and that kind of stuff, or can it be done with something else (easier) too?
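
    The keyword search described above could be sketched roughly as follows with preg_match. This is only an illustration under assumptions: the page contents, the example.com URLs and the function name sentencesMentioning are hypothetical, the pages are hard-coded rather than downloaded, and no BFS/DFS link traversal is shown.

```php
<?php
/**
 * Return the plain-text sentences of an HTML page that mention $keyword.
 * Tags are stripped first, so only visible text is searched.
 */
function sentencesMentioning(string $html, string $keyword): array
{
    // Put a space before every tag so text from adjacent elements
    // does not get glued together once the tags are removed.
    $text = strip_tags(str_replace('<', ' <', $html));
    $text = html_entity_decode($text, ENT_QUOTES, 'UTF-8');
    $text = trim(preg_replace('/\s+/', ' ', $text));

    // Naive sentence split: break after ., ! or ? followed by whitespace.
    $sentences = preg_split('/(?<=[.!?])\s+/', $text, -1, PREG_SPLIT_NO_EMPTY);

    // preg_quote stops characters in the keyword being read as regex syntax;
    // \b restricts matches to whole words, and /i ignores case.
    $pattern = '/\b' . preg_quote($keyword, '/') . '\b/i';

    $matches = [];
    foreach ($sentences as $sentence) {
        if (preg_match($pattern, $sentence) === 1) {
            $matches[] = $sentence;
        }
    }
    return $matches;
}

// Hypothetical page contents keyed by URL, standing in for downloaded pages.
$pages = [
    'http://example.com/south' =>
        '<html><body><p>Kerala is known for its backwaters.</p>'
        . '<p>Tamil Nadu has many temples.</p></body></html>',
    'http://example.com/north' =>
        '<html><body><p>Rajasthan has desert forts. Visit Kerala in winter!</p></body></html>',
];

$results = [];
foreach ($pages as $url => $html) {
    $found = sentencesMentioning($html, 'Kerala');
    if ($found !== []) {
        $results[$url] = $found;
    }
}
print_r($results);
```

    A regex over stripped text like this only finds where the keyword occurs. Picking out specific fields (prices, addresses and so on) depends on each site's markup.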
     
    Thapar, Oct 19, 2009 IP
  7. Thapar

    #7
    FCM, I know crawling a website for content is not everybody's cup of tea :), but I have been asked to do so in my Minor Project, so I have no other option :)
     
    Thapar, Oct 19, 2009 IP
  8. AsHinE

    #8
    I guess there are two ways of extracting content:
    1. regexes
    2. DOMDocument and XPath queries.
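
    As a rough sketch of the second option, the snippet below loads a page into DOMDocument and uses DOMXPath to pull out one block. The markup (div class "destination", p class "summary") and the function name summaryForState are invented for the example; every real site needs its own queries, written after inspecting its HTML.

```php
<?php
// Hypothetical markup standing in for a downloaded page.
$html = <<<HTML
<html><body>
  <div class="destination">
    <h2>Kerala</h2>
    <p class="summary">Backwaters, beaches and hill stations.</p>
  </div>
  <div class="destination">
    <h2>Rajasthan</h2>
    <p class="summary">Desert forts and palaces.</p>
  </div>
</body></html>
HTML;

/**
 * Return the summary text of the block whose heading matches $state,
 * or null if no such block exists.
 */
function summaryForState(string $html, string $state): ?string
{
    $doc = new DOMDocument();

    // Real-world HTML is rarely valid, so collect parser warnings
    // internally instead of printing them, then restore the old setting.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);

    // Walk every destination block and compare its heading in PHP, which
    // avoids building an XPath string out of user-supplied input.
    foreach ($xpath->query('//div[@class="destination"]') as $block) {
        $heading = $xpath->query('./h2', $block)->item(0);
        $summary = $xpath->query('./p[@class="summary"]', $block)->item(0);
        if ($heading !== null && $summary !== null
            && strcasecmp(trim($heading->textContent), $state) === 0) {
            return trim($summary->textContent);
        }
    }
    return null;
}

echo summaryForState($html, 'Kerala'), PHP_EOL;
```

    Compared with regexes, this approach follows the document structure, so it tends to survive small changes in whitespace or attribute order that would break a pattern match.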
     
    AsHinE, Oct 19, 2009 IP
  9. Thapar

    #9
    Aah, DOMDocument and XPath (hearing of them for the very first time) :)
    Any resources where I can read up on them?
    I have also heard about a class named PHPCrawler that is used for web crawling.
    Do you have any idea about it, and if yes, is it useful in the context I am looking for?
     
    Thapar, Oct 19, 2009 IP
  10. AsHinE

    #10
    AsHinE, Oct 19, 2009 IP