php crawler

ssimon171078 Well-Known Member

#1

I want to build a PHP crawler to extract links from a website. I wrote code for this page:
https://www.tradebit.com/filesharing.php/1010-Documents-eBooks-Audio-Books-Teaching
It gives me links like:
https://www.tradebit.com/filedetail.php/276643585-the-ultimate-plr-firesale-oto
I want to collect these links from all of https://www.tradebit.com. How should I change this code?
<?php
// Parser for the tradebit website

$i = 1;
$base = "https://www.tradebit.com/filesharing.php/1010-Documents-eBooks-Audio-Books-Teaching";
$website = $base;
$filename = "w.txt"; // not used yet

while ($website) {
    $content = file_get_contents($website);
    if ($content === false) {
        break; // stop when a page cannot be fetched
    }

    // Keep only the <a> tags
    $stripped_file = strip_tags($content, "<a>");

    preg_match_all("/<a[\s]+[^>]*?href[\s]?=[\s\"\']+"."(.*?)[\"\']+.*?>"."([^<]+|.*?)?<\/a>/", $stripped_file, $matches, PREG_SET_ORDER);
    foreach ($matches as $match) {
        $href = $match[1];

        // strpos() returns false when nothing is found and 0 for a match at the start
        if (strpos($href, "filedetail") !== false) {
            echo $href . "<br>";
        }
    }

    // Next page: append the page number to the base URL, not to the previous URL
    // (the "/N" suffix is the pattern this script assumes)
    $website = $base . "/" . $i++;

    sleep(5);
}

?>

ssimon171078, Nov 7, 2014

NetStar Notable Member

#2

Not sure exactly what you are looking to do, but the easiest way to scrape links is as follows:



<?php

$html = <<<EOF
this is a test
<a
title="search" href="http://www.google.com">Google</a> this is a test
this is a test
EOF;

// Parse the HTML and print the href of every <a> element
$dom = new DOMDocument();
$dom->loadHTML($html);

$links = $dom->getElementsByTagName("a");

foreach ($links as $link)
{
  print $link->getAttribute("href") . "\n";
}
?>
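
Applied to the original question, the same DOMDocument approach can be narrowed to the filedetail links. This is only a minimal sketch, assuming the page HTML is already in a string. The function name extract_links and the sample markup are made up for illustration.

```php
<?php
// Hypothetical helper: return every href in $html that contains $needle.
function extract_links(string $html, string $needle): array
{
    $dom = new DOMDocument();
    // Real pages are rarely valid HTML, so collect parser warnings
    // internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $found = [];
    foreach ($dom->getElementsByTagName("a") as $link) {
        $href = $link->getAttribute("href");
        // strpos() returns false when the needle is absent, so compare with !== false.
        if (strpos($href, $needle) !== false) {
            $found[] = $href;
        }
    }
    return $found;
}

// Sample markup (made up for this example).
$html = '<p><a href="/filedetail.php/1-sample">One</a> <a href="/about.php">About</a></p>';
foreach (extract_links($html, "filedetail") as $href) {
    print $href . "\n";
}
```

For a live page, the string could come from file_get_contents($website), as in the first post.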

NetStar, Nov 7, 2014

Anveto Well-Known Member

#3

I would go with NetStar's example, but here is what you would need from your own code, although your preg_match_all pattern perhaps matches a bit more than it should:


$website="https://www.tradebit.com/filesharing.php/1010-Documents-eBooks-Audio-Books-Teaching";
$content=file_get_contents($website);
preg_match_all("/<a[\s]+[^>]*?href[\s]?=[\s\"\']+"."(.*?)[\"\']+.*?>"."([^<]+|.*?)?<\/a>/", $content, $matches, PREG_SET_ORDER );
foreach($matches as $match){
  $href = $match[1];
  // strpos() returns false when "filedetail" is not in the link,
  // so test against false rather than 0
  if (strpos($href,"filedetail")!==false) {
    echo $href . "<br>";
  }
}
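
The code above reads a single page. To cover a whole site, the usual idea is to keep a queue of pages still to visit and a list of pages already seen. Below is a rough sketch of that idea, not a finished crawler. The names crawl, $fetch and $maxPages are made up, the demo site under example.com is fake, and only absolute and root-relative links are followed.

```php
<?php
// Breadth-first crawl sketch. $fetch is any callable that returns the HTML
// for a URL, or false on failure.
function crawl(string $start, callable $fetch, int $maxPages = 50): array
{
    $host = parse_url($start, PHP_URL_HOST);
    $scheme = parse_url($start, PHP_URL_SCHEME);
    $queue = [$start];   // pages still to visit
    $visited = [];       // pages already fetched
    $details = [];       // filedetail links found so far

    while ($queue && count($visited) < $maxPages) {
        $url = array_shift($queue);
        if (isset($visited[$url])) {
            continue;
        }
        $visited[$url] = true;

        $html = $fetch($url);
        if ($html === false || $html === "") {
            continue;
        }

        $dom = new DOMDocument();
        $previous = libxml_use_internal_errors(true);
        $dom->loadHTML($html);
        libxml_clear_errors();
        libxml_use_internal_errors($previous);

        foreach ($dom->getElementsByTagName("a") as $link) {
            $href = $link->getAttribute("href");
            // Simplification: turn root-relative paths into absolute URLs.
            if (strpos($href, "/") === 0 && strpos($href, "//") !== 0) {
                $href = "$scheme://$host$href";
            }
            if (parse_url($href, PHP_URL_HOST) !== $host) {
                continue; // other sites, mailto:, fragments, etc.
            }
            if (strpos($href, "filedetail") !== false) {
                $details[$href] = true; // collect, but do not crawl further
            } elseif (!isset($visited[$href])) {
                $queue[] = $href;
            }
        }
    }
    return array_keys($details);
}

// Offline demo with a made-up two-page site, so nothing is requested over the network.
$pages = [
    "https://example.com/" => '<a href="/list">List</a> <a href="https://other.example/x">Ext</a>',
    "https://example.com/list" => '<a href="/filedetail.php/1-a">A</a> <a href="/">Home</a>',
];
$fetch = function (string $url) use ($pages) {
    return $pages[$url] ?? false;
};
print_r(crawl("https://example.com/", $fetch));
```

For a real run, $fetch could wrap file_get_contents() together with the sleep(5) from the first post, so the site is not hit too quickly. The $maxPages limit keeps the loop from running forever.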

Anveto, Nov 7, 2014
