php crawler

Discussion in 'PHP' started by ssimon171078, Nov 7, 2014.

  1. #1
    I want to build a PHP crawler that extracts links from a website. I wrote code for this page:
    https://www.tradebit.com/filesharing.php/1010-Documents-eBooks-Audio-Books-Teaching
    It returns links like:
    https://www.tradebit.com/filedetail.php/276643585-the-ultimate-plr-firesale-oto
    I want to do the same for all links on https://www.tradebit.com. How should I change this code?
    <?php
    // Parser for the tradebit website: prints every "filedetail" link found
    // on the category page and on its numbered sub-pages.

    $i = 1;
    $base = "https://www.tradebit.com/filesharing.php/1010-Documents-eBooks-Audio-Books-Teaching";
    $website = $base;
    $filename = "w.txt"; // not used yet

    while ($website) {
        $content = file_get_contents($website);
        if ($content === false) {
            break; // stop when a page cannot be fetched
        }

        // Keep only the <a> tags so the regex has less markup to scan.
        $stripped_file = strip_tags($content, "<a>");

        preg_match_all("/<a[\s]+[^>]*?href[\s]?=[\s\"\']+"."(.*?)[\"\']+.*?>"."([^<]+|.*?)?<\/a>/", $stripped_file, $matches, PREG_SET_ORDER);
        foreach ($matches as $match) {
            $href = $match[1];

            // strpos() returns false when "filedetail" is absent.
            if (strpos($href, "filedetail") !== false) {
                echo $href . "<br>";
            }
        }

        // Build the next page URL from the base instead of appending to the previous URL.
        $website = $base . "/" . $i++;

        sleep(5);
    }

    ?>
     
    ssimon171078, Nov 7, 2014
  2. NetStar

    #2
    I'm not sure exactly what you are trying to do, but the easiest way to scrape links is as follows:

    
    
    <?php
    
    $html = <<<EOF
    this is a test
    <a
    title="search" href="http://www.google.com">Google</a> this is a test
    this is a test
    EOF;
    
    $dom = new DOMDocument();
    
    $dom->loadHTML($html);
    
    $links = $dom->getElementsByTagName("a");
    
    foreach ($links as $link)
    {
      print $link->getAttribute("href"). "\n";
    }
    ?>
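
    Applied to the original question, the same DOMDocument approach can be wrapped in a small function that keeps only the links containing a given substring such as "filedetail". This is a minimal sketch, assuming PHP 7 or later; the function name extractLinks and the sample HTML are hypothetical and not from the thread.

```php
<?php
// Hypothetical helper: return the unique href values of all <a> elements
// in $html whose URL contains $needle (an empty $needle keeps every link).
function extractLinks(string $html, string $needle = ''): array
{
    if (trim($html) === '') {
        return []; // loadHTML() rejects empty input
    }

    // Real-world pages are rarely valid HTML; collect parser warnings
    // internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $hrefs = [];
    foreach ($dom->getElementsByTagName('a') as $link) {
        $href = $link->getAttribute('href');
        // strpos() returns false when the needle is absent, so test with !== false.
        if ($href !== '' && ($needle === '' || strpos($href, $needle) !== false)) {
            $hrefs[] = $href;
        }
    }
    return array_values(array_unique($hrefs));
}

// Sample markup standing in for a downloaded page.
$html = '<p><a href="/filedetail.php/1-example">One</a> '
      . '<a title="x" href="/about.php">About</a></p>';
print_r(extractLinks($html, 'filedetail'));
```

    For a live page, the string returned by file_get_contents($url) could be passed in place of the sample markup, after checking that it is not false.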
    
    NetStar, Nov 7, 2014
  3. Anveto

    #3
    I would go with NetStar's example, but here is the part of your code you would need, although your preg_match_all pattern perhaps matches a bit more than it should:

    
    $website = "https://www.tradebit.com/filesharing.php/1010-Documents-eBooks-Audio-Books-Teaching";
    $content = file_get_contents($website);
    preg_match_all("/<a[\s]+[^>]*?href[\s]?=[\s\"\']+"."(.*?)[\"\']+.*?>"."([^<]+|.*?)?<\/a>/", $content, $matches, PREG_SET_ORDER);
    foreach ($matches as $match) {
      $href = $match[1];
      // strpos() returns false when "filedetail" is absent, so compare against false, not 0.
      if (strpos($href, "filedetail") !== false) {
        echo $href . "<br>";
      }
    }
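
    To follow links across the whole site rather than a single page, one option is a breadth-first crawl that stays on the starting host and remembers which URLs it has already seen. The sketch below is only an illustration under stated assumptions: PHP 7 or later, a hypothetical function name crawlSite, a fetcher passed in as a callable, and made-up example.test pages so it runs without network access. It resolves only root-relative links ("/path") and skips other relative forms.

```php
<?php
// Hypothetical sketch: breadth-first crawl restricted to the start URL's host.
// $fetch is any callable that takes a URL and returns HTML, or false on failure.
// Returns every same-host URL discovered, including ones not yet fetched.
function crawlSite(string $startUrl, callable $fetch, int $maxPages = 50, int $delaySeconds = 0): array
{
    $host    = parse_url($startUrl, PHP_URL_HOST);
    $scheme  = parse_url($startUrl, PHP_URL_SCHEME);
    $queue   = [$startUrl];
    $seen    = [$startUrl => true];
    $fetched = 0;

    while ($queue && $fetched < $maxPages) {
        $url  = array_shift($queue);
        $html = $fetch($url);
        $fetched++;
        if (!is_string($html) || trim($html) === '') {
            continue; // fetch failed or page was empty
        }

        $previous = libxml_use_internal_errors(true);
        $dom = new DOMDocument();
        $dom->loadHTML($html);
        libxml_clear_errors();
        libxml_use_internal_errors($previous);

        foreach ($dom->getElementsByTagName('a') as $a) {
            $href = trim($a->getAttribute('href'));
            // Turn root-relative links into absolute ones.
            if ($href !== '' && $href[0] === '/' && substr($href, 0, 2) !== '//') {
                $href = "$scheme://$host$href";
            }
            // Skip external links and anything without a matching host.
            if (parse_url($href, PHP_URL_HOST) !== $host) {
                continue;
            }
            $href = explode('#', $href, 2)[0]; // drop the fragment
            if (!isset($seen[$href])) {
                $seen[$href] = true;
                $queue[]     = $href;
            }
        }

        if ($delaySeconds > 0 && $queue) {
            sleep($delaySeconds); // be polite to the server
        }
    }
    return array_keys($seen);
}

// In-memory pages standing in for a real site.
$pages = [
    'https://example.test/'      => '<a href="/a.php">A</a> <a href="https://other.test/">ext</a>',
    'https://example.test/a.php' => '<a href="/">home</a> <a href="/b.php#top">B</a>',
    'https://example.test/b.php' => '<p>no links</p>',
];
$fetch = function (string $url) use ($pages) {
    return $pages[$url] ?? false;
};
print_r(crawlSite('https://example.test/', $fetch));
```

    Against a real site, 'file_get_contents' could be passed as the fetcher together with a non-zero delay and a small page limit. The returned list can then be filtered for "filedetail" with strpos() as in the snippet above.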
    
    Anveto, Nov 7, 2014