How do I Scrape Content Off Another Site?

Discussion in 'PHP' started by Shadowplay, Oct 26, 2008.

  1. #1
    I don't have any PHP experience, so I can't write my own script to gather text from another site. What should I use to do this? Is there a program for someone like me without much programming experience?
     
    Shadowplay, Oct 26, 2008
  2. Sillysoft

    Sillysoft Active Member

    #2
    The Snoopy class is a good choice for scraping content from another site.
     
    Sillysoft, Oct 26, 2008
  3. mehdi

    mehdi Peon

    #3
    To get the HTML of another website, use file_get_contents(); it's pretty easy.

    For example:
    
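    <?php
    $siteurl = "http://anysite.com";
    // file_get_contents() returns false on failure, so check before printing
    $getsite = file_get_contents($siteurl);
    if ($getsite === false) {
        die("Could not fetch $siteurl");
    }
    echo $getsite;
    ?>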
    
    Hope it helps.
     
    mehdi, Oct 27, 2008
  4. six.sigma

    six.sigma Peon

    #4
    There are a lot of functions, such as the cURL functions, that let you download the HTML of a given site.
    You can then work with that HTML inside a variable: filter, modify, or extract information into several other variables, and print the results on your own site, following your layout and design.
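
The idea of pulling pieces of the downloaded HTML into separate variables and printing them in your own layout could look roughly like the sketch below. The sample markup, the helper name first_tag_text() and the "scraped" CSS class are made up for illustration; a regular expression like this is only good enough for simple, well-formed pages.

```php
<?php
// Hypothetical sample markup standing in for a page fetched with cURL or file_get_contents().
$html = '<html><head><title>Example Shop</title></head>'
      . '<body><h1>Daily Deals</h1><p>Welcome!</p></body></html>';

// Return the inner text of the first <$tag> element, or null when none is found.
// (first_tag_text is an illustrative helper, not a built-in PHP function.)
function first_tag_text(string $html, string $tag): ?string
{
    $name = preg_quote($tag, '/');
    $pattern = '/<' . $name . '\b[^>]*>(.*?)<\/' . $name . '>/is';
    if (preg_match($pattern, $html, $m) === 1) {
        return trim(strip_tags($m[1]));
    }
    return null;
}

// Extract the pieces into separate variables.
$title   = first_tag_text($html, 'title');
$heading = first_tag_text($html, 'h1');

// Print them inside your own layout, escaping the scraped text first.
echo '<div class="scraped">';
echo '<h2>' . htmlspecialchars((string) $title) . '</h2>';
echo '<p>' . htmlspecialchars((string) $heading) . '</p>';
echo '</div>';
```

Escaping with htmlspecialchars() matters here because the text comes from a site you do not control.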
     
    six.sigma, Oct 31, 2008
  5. happpy

    happpy Well-Known Member

    #5
    Learn the following functions and you can already achieve a lot:

    file_get_contents()
    explode()
    eregi_replace() (removed in PHP 7; preg_replace() with the i modifier is the replacement)

    Make yourself familiar with what arrays are and how to reverse and sort them.

    PHP is a very big language, but you can achieve a lot with some of the most basic commands.
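
As a rough illustration of how far explode() and the basic array functions can go, here is a minimal sketch. The sample list markup and the helper name list_items() are assumptions made up for this example; in practice $html would come from file_get_contents().

```php
<?php
// Hypothetical sample markup standing in for a fetched page.
$html = '<ul><li>pear</li><li>apple</li><li>fig</li></ul>';

// Collect the text of every <li> using nothing but explode().
// (list_items is an illustrative helper, not a built-in PHP function.)
function list_items(string $html): array
{
    $items = [];
    $chunks = explode('<li>', $html);
    array_shift($chunks);                  // drop everything before the first <li>
    foreach ($chunks as $chunk) {
        $parts = explode('</li>', $chunk); // keep only the text before the closing tag
        $items[] = trim($parts[0]);
    }
    return $items;
}

$items = list_items($html);
sort($items);                              // sorts the array in place, A to Z
$reversed = array_reverse($items);         // returns a new array, Z to A

print_r($items);
print_r($reversed);
```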
     
    happpy, Oct 31, 2008
  6. Conello

    Conello Member

    #6
    cURL is well worth learning, because many hosts do not allow file functions such as file_get_contents($siteurl) to open remote URLs (the allow_url_fopen setting is switched off).

    A simple curl application to grab the page:

    
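    <?php
      $url_l = 'http://google.com';
      $c2 = curl_init();
      // Pass along the visitor's user agent; fall back to a generic one when it is not set (e.g. on the command line)
      $agent = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : 'Mozilla/5.0';
      curl_setopt($c2, CURLOPT_USERAGENT, $agent);
      curl_setopt($c2, CURLOPT_FOLLOWLOCATION, true);  // follow redirects
      curl_setopt($c2, CURLOPT_HEADER, false);         // keep the response headers out of the output
      curl_setopt($c2, CURLOPT_URL, $url_l);
      curl_setopt($c2, CURLOPT_RETURNTRANSFER, true);  // return the page instead of printing it
      $output2 = curl_exec($c2);
      if ($output2 === false) {
        die('cURL error: ' . curl_error($c2));
      }
      curl_close($c2);

      // do some extraction using explode, strstr, str_replace, preg_replace, etc
      echo $output2;
    ?>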
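
Since some hosts allow only one of the two approaches, a script can check what is available before fetching. The sketch below is one possible way to do that; pick_fetch_method() and fetch_page() are hypothetical helper names, and preferring cURL over file_get_contents() is just a choice made for this example.

```php
<?php
// Decide which fetch method is usable on this host.
// Kept as a pure function so the decision can be checked without any network access.
function pick_fetch_method(bool $urlFopenAllowed, bool $curlAvailable): ?string
{
    if ($curlAvailable) {
        return 'curl';
    }
    if ($urlFopenAllowed) {
        return 'file_get_contents';
    }
    return null;
}

// Fetch a page with whatever the host allows; returns the body, or false on failure.
function fetch_page(string $url)
{
    $method = pick_fetch_method(
        filter_var(ini_get('allow_url_fopen'), FILTER_VALIDATE_BOOLEAN),
        function_exists('curl_init')
    );

    if ($method === 'curl') {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
        $body = curl_exec($ch);
        curl_close($ch);
        return $body;
    }
    if ($method === 'file_get_contents') {
        return file_get_contents($url);
    }
    return false; // neither method is available on this host
}
```

Calling fetch_page('http://example.com/') would then return the page source as a string, or false when the request fails or no method is available.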
     
    Conello, Nov 1, 2008
  7. exodus

    exodus Well-Known Member

    #7
    Check this out. It's called htmlSQL; it lets you query HTML pages with a syntax similar to MySQL queries, and it is very useful for scraping information from other websites.

    http://www.jonasjohn.de/lab/htmlsql.htm
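
This is not htmlSQL itself, but the same "query the page" idea can be sketched with PHP's built-in DOM extension and an XPath expression. The sample markup and the helper name link_hrefs() are assumptions made up for this example; the XPath //a[@href] plays roughly the role of "select every link that has an href".

```php
<?php
// Hypothetical sample markup standing in for a fetched page.
$html = '<p><a href="/about">About</a> <a name="top">Top</a> '
      . '<a href="http://example.com/">Example</a></p>';

// Return the href of every <a> element that has one.
// (link_hrefs is an illustrative helper, not part of htmlSQL or PHP.)
function link_hrefs(string $html): array
{
    if ($html === '') {
        return [];
    }
    // Real-world HTML is rarely valid, so collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $hrefs = [];
    foreach ($xpath->query('//a[@href]') as $a) {
        $hrefs[] = $a->getAttribute('href');
    }
    return $hrefs;
}

print_r(link_hrefs($html));
```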
     
    exodus, Nov 1, 2008
  8. techcone

    techcone Banned

    #8
    Curl is the master of all scraping languages :)
     
    techcone, Nov 1, 2008
  9. Calon

    Calon Peon

    #9
    You don't even need PHP to do it, but you can use curl.
     
    Calon, Nov 2, 2008
  10. happpy

    happpy Well-Known Member

    #10
    curl is not a language; curl is a tool :)

    curl can fetch anything and mimic a real user's browser, but to pick out the content you have to use PHP or some other scripting language or tool that can handle strings.
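
The "weeding out" step needs nothing more than basic string functions once the page is in a variable. A minimal sketch, assuming the part you want sits between two known markers; the sample page and the helper name text_between() are made up for illustration:

```php
<?php
// Return the text found between $start and $end, or null when either marker is missing.
// (text_between is an illustrative helper, not a built-in PHP function.)
function text_between(string $haystack, string $start, string $end): ?string
{
    $from = strpos($haystack, $start);
    if ($from === false) {
        return null;
    }
    $from += strlen($start);               // skip past the start marker itself
    $to = strpos($haystack, $end, $from);  // look for the end marker after it
    if ($to === false) {
        return null;
    }
    return substr($haystack, $from, $to - $from);
}

// Hypothetical page source, as curl would hand it over.
$page = '<div id="nav">Home | Contact</div>'
      . '<div id="content"><p>Fresh <b>news</b> today.</p></div>';

$fragment = text_between($page, '<div id="content">', '</div>');
$text = $fragment === null ? '' : trim(strip_tags($fragment));

echo $text;
```

Markers like these break as soon as the target site changes its markup, so expect to adjust them from time to time.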
     
    happpy, Nov 2, 2008