Hi, I am trying to get info from some pages, but I am not having much luck. It seems the GET parameters are not getting passed to my cURL script to call the right page.

<?php
$start = microtime(true);
require_once('db_config.php');

$url = 'http://www.somedomain.com/links.php';
$type = 'free';
$page_num = 3;
$url = $url.'?type='.$type.'&page='.$page_num;

$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$get_page = curl_exec($ch);
curl_close($ch);

file_put_contents('page_'.$page_num.'.txt', $get_page);

$end = microtime(true);
$total = $end - $start;
echo '<br>'.$total;
?>
Code (markup):

When I echo the $url variable, it looks right on the screen, but maybe I am not quite understanding something about the cURL functions, and I can't just attach the GET parameters to the end of the URL. I need to do it this way, because I will need a loop to cycle through different parameter values. When I do type it all out and pass it to cURL in the $url variable, it works fine. I also do not get anything if I try to echo the $get_page variable right after the cURL commands. Any suggestions? Thanks, Michael
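(A side note on building the URL: rather than concatenating the query string by hand, PHP's http_build_query() assembles and URL-encodes it for you. A minimal sketch, reusing the placeholder host and parameter names from the snippet above:)

```php
<?php
// Sketch: building the GET URL with http_build_query() instead of manual
// concatenation, so each value is percent-encoded automatically.
// The host and parameter names mirror the post above and are placeholders.
$base   = 'http://www.somedomain.com/links.php';
$params = ['type' => 'free', 'page' => 3];
$url    = $base . '?' . http_build_query($params);
echo $url; // http://www.somedomain.com/links.php?type=free&page=3
```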
I tested it on my private server, and it seems to work ok. Are you sure that no POST variables need to be passed?
I am positive there are no POST parameters that need to be sent. I can do

$url = 'http://www.somesite.com/links.php?type=free&pagenumber=4';
Code (markup):

and it will go to that URL. Is it possible the script is moving too fast, and I need to put a sleep() at the end of each loop? Thanks, Michael
We won't be able to help you without the actual URL you are working with. Try adding curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1) and see if anything changes.
Okay, I will try that and post back with the results. I didn't know if I should post the actual URL I am working with or not, but if your idea doesn't work, I could post the complete script; maybe I am just missing something else. Thanks, Michael
Hi, That didn't help. It still only goes to the first page of the links I am trying to scrape from the site. Below is the complete code. Maybe it is something small I am just not catching; it has happened before.

<?php
$start = microtime(true);
set_time_limit(0);
ignore_user_abort();
require_once('db_config.php');

$site = '';
$add = '';
$url = 'http://www.onewaytextlink.com/links.php';
$type = 'free';
$page_num = 1;
$url = $url.'?type='.$type.'&page='.$page_num;

$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($ch, CURLOPT_TIMEOUT, 30);
$scraped = curl_exec($ch);
curl_close($ch);

file_put_contents('scraped_page_'.$page_num.'.txt', $scraped);

preg_match_all('/<a href="\/links.php\?type=free\&pagenum=(.*)">(\d)<\/a>/', $scraped, $pages, PREG_SET_ORDER);

$newArr = array();
foreach ($pages as $val) {
    $newArr[$val[2]] = $val;
}
$pages = array_values($newArr);
$page_count = count($pages);
$page_count++;

for ($row1 = 0; $row1 < $page_count; $row1++) {
    $url = $url.'?type='.$type.'&page='.$page_num;

    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
    curl_setopt($ch, CURLOPT_TIMEOUT, 30);
    $scraped = curl_exec($ch);
    curl_close($ch);

    file_put_contents('scraped_page_'.$page_num.'.txt', $scraped);

    preg_match_all('/url=(.*)" target="_blank">(.*)<\/a>/i', $scraped, $links, PREG_SET_ORDER);

    $rowcount = count($links);
    for ($row2 = 0; $row2 < $rowcount; $row2++) {
        unset($links[$row2][0]);
        if (!preg_match('/^http:\/\//i', $links[$row2][1])) {
            $links[$row2][1] = 'http://'.$links[$row2][1];
        }
        $links[$row2][1] = preg_replace('/%2f/i', '/', $links[$row2][1]);
        $urls = $links[$row2][1];
        $title = $links[$row2][2];

        $sql = "SELECT * FROM website_directory WHERE url='$urls'";
        $q = mysqli_query($link, $sql);
        if (mysqli_fetch_assoc($q) == 0) {
            $insert = "INSERT INTO website_directory (title, url) VALUES ('$title', '$urls')";
            $q = mysqli_query($link, $insert);
            $add++;
        }
        $site++;
    }
    $page_num++;
}

echo $add.' sites have been added to the database.<br>';
echo $site.' have been scanned.<br>';
$end = microtime(true);
$total = $end - $start;
echo '<br>'.$total;
?>
Code (markup):

Thanks for any assistance. Michael
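(One way to debug scrapes like this is to run the pagination regex against a hand-written sample of the target markup, with no network involved. A sketch, with invented sample HTML; note the non-greedy (.*?), which keeps a match from spilling across neighbouring links when several sit on one line:)

```php
<?php
// Sketch: sanity-checking the pagination regex offline against a small,
// invented sample of the kind of markup the site is assumed to emit.
$sample = '<a href="/links.php?type=free&pagenum=2">2</a>'
        . '<a href="/links.php?type=free&pagenum=3">3</a>';

// Non-greedy (.*?) so each link is matched separately.
preg_match_all('/<a href="\/links.php\?type=free\&pagenum=(.*?)">(\d)<\/a>/',
               $sample, $pages, PREG_SET_ORDER);

print_r($pages); // two matches expected, page numbers in groups 1 and 2
```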
There you go:

<?php
$start = microtime(true);
set_time_limit(0);
ignore_user_abort();
require_once('db_config.php');

$site = '';
$add = '';
$url = 'http://www.onewaytextlink.com/links.php';
$type = 'free';
$page_num = 1;
$url = $url.'?type='.$type.'&pagenum='.$page_num;

$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($ch, CURLOPT_TIMEOUT, 30);
$scraped = curl_exec($ch);
curl_close($ch);

file_put_contents('scraped_page_'.$page_num.'.txt', $scraped);

preg_match_all('/<a href="\/links.php\?type=free\&pagenum=(.*)">(\d)<\/a>/', $scraped, $pages, PREG_SET_ORDER);

$newArr = array();
foreach ($pages as $val) {
    $newArr[$val[2]] = $val;
}
print_r($pages);
$pages = array_values($newArr);
$page_count = count($pages);
$page_count++;

for ($row1 = 0; $row1 < $page_count; $row1++) {
    $url = $url.'?type='.$type.'&pagenum='.$page_num;

    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
    curl_setopt($ch, CURLOPT_TIMEOUT, 30);
    $scraped = curl_exec($ch);
    curl_close($ch);

    file_put_contents('scraped_page_'.$page_num.'.txt', $scraped);

    preg_match_all('/url=(.*)" target="_blank">(.*)<\/a>/i', $scraped, $links, PREG_SET_ORDER);

    $rowcount = count($links);
    for ($row2 = 0; $row2 < $rowcount; $row2++) {
        unset($links[$row2][0]);
        if (!preg_match('/^http:\/\//i', $links[$row2][1])) {
            $links[$row2][1] = 'http://'.$links[$row2][1];
        }
        $links[$row2][1] = preg_replace('/%2f/i', '/', $links[$row2][1]);
        $urls = $links[$row2][1];
        $title = $links[$row2][2];

        $sql = "SELECT * FROM website_directory WHERE url='$urls'";
        echo $sql."\n";
        $q = mysqli_query($link, $sql);
        if (mysqli_fetch_assoc($q) == 0) {
            $insert = "INSERT INTO website_directory (title, url) VALUES ('$title', '$urls')";
            $q = mysqli_query($link, $insert);
            $add++;
        }
        $site++;
    }
    $page_num++;
}

echo $add.' sites have been added to the database.<br>';
echo $site.' have been scanned.<br>';
$end = microtime(true);
$total = $end - $start;
echo '<br>'.$total;
?>
PHP:
The problem is fairly simple:

$url='http://www.onewaytextlink.com/links.php';
// url = http://www.onewaytextlink.com/links.php
PHP:

$url=$url.'?type='.$type.'&pagenum='.$page_num;
// url = http://www.onewaytextlink.com/links.php?type=free&pagenum=1
PHP:

And again:

$url=$url.'?type='.$type.'&pagenum='.$page_num;
// url = http://www.onewaytextlink.com/links.php?type=free&pagenum=1?type=free&pagenum=2
PHP:

The GET variables are appended after the GET variables from the previous request. Simple fix: keep the bare URL in $baseurl and rebuild $url from it each time.

<?php
$start = microtime(true);
set_time_limit(0);
ignore_user_abort();
require_once('db_config.php');

$site = '';
$add = '';
$baseurl = 'http://www.onewaytextlink.com/links.php';
$type = 'free';
$page_num = 1;
$url = $baseurl.'?type='.$type.'&pagenum='.$page_num;

$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($ch, CURLOPT_TIMEOUT, 30);
$scraped = curl_exec($ch);
curl_close($ch);

file_put_contents('scraped_page_'.$page_num.'.txt', $scraped);

preg_match_all('/<a href="\/links.php\?type=free\&pagenum=(.*)">(\d)<\/a>/', $scraped, $pages, PREG_SET_ORDER);

$newArr = array();
foreach ($pages as $val) {
    $newArr[$val[2]] = $val;
}
print_r($pages);
$pages = array_values($newArr);
$page_count = count($pages);
$page_count++;

for ($row1 = 0; $row1 < $page_count; $row1++) {
    $url = $baseurl.'?type='.$type.'&pagenum='.$page_num;

    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
    curl_setopt($ch, CURLOPT_TIMEOUT, 30);
    $scraped = curl_exec($ch);
    curl_close($ch);

    file_put_contents('scraped_page_'.$page_num.'.txt', $scraped);

    preg_match_all('/url=(.*)" target="_blank">(.*)<\/a>/i', $scraped, $links, PREG_SET_ORDER);

    $rowcount = count($links);
    for ($row2 = 0; $row2 < $rowcount; $row2++) {
        unset($links[$row2][0]);
        if (!preg_match('/^http:\/\//i', $links[$row2][1])) {
            $links[$row2][1] = 'http://'.$links[$row2][1];
        }
        $links[$row2][1] = preg_replace('/%2f/i', '/', $links[$row2][1]);
        $urls = $links[$row2][1];
        $title = $links[$row2][2];

        $sql = "SELECT * FROM website_directory WHERE url='$urls'";
        echo $sql."\n";
        $q = mysqli_query($link, $sql);
        if (mysqli_fetch_assoc($q) == 0) {
            $insert = "INSERT INTO website_directory (title, url) VALUES ('$title', '$urls')";
            $q = mysqli_query($link, $insert);
            $add++;
        }
        $site++;
    }
    $page_num++;
}

echo $add.' sites have been added to the database.<br>';
echo $site.' have been scanned.<br>';
$end = microtime(true);
$total = $end - $start;
echo '<br>'.$total;
PHP:
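(The append bug is easy to reproduce without cURL or any network access. A minimal sketch contrasting the two patterns, using the thread's URLs purely as strings:)

```php
<?php
// Sketch of the bug described above: appending to $url inside the loop
// stacks query strings, while rebuilding from an untouched base does not.
$baseurl = 'http://www.onewaytextlink.com/links.php';

// Buggy pattern: $url still carries the previous iteration's query string.
$url = $baseurl;
for ($page_num = 1; $page_num <= 2; $page_num++) {
    $url = $url . '?type=free&pagenum=' . $page_num;
}
echo $url . "\n"; // ...links.php?type=free&pagenum=1?type=free&pagenum=2

// Fixed pattern: always start from the bare base URL.
for ($page_num = 1; $page_num <= 2; $page_num++) {
    $url = $baseurl . '?type=free&pagenum=' . $page_num;
}
echo $url . "\n"; // ...links.php?type=free&pagenum=2
```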
Hi, Thanks to both of you for finding the problem. It took me a while to figure out what nont had changed, but after I got it, well, it just shows how something so simple can be overlooked. Thanks, ssmm987, for figuring out that the GET parameters were being appended after the previous GET parameters. I should have caught that mistake too. Works great now, thanks again. Michael
If I put a special character in a GET parameter to authenticate a user, it does not work. Example password: Camilie& — the & in the GET URL causes a big problem. Sorry for my English, and thanks for any reply.
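(A literal & in a GET value splits the query string, because & separates parameters. The value has to be percent-encoded (& becomes %26), e.g. with urlencode() or http_build_query(). A minimal sketch; the parameter names here are just placeholders:)

```php
<?php
// Sketch: a raw & in a GET value would start a new parameter, so the
// value must be percent-encoded before it goes into the URL.
$password = 'Camilie&';
echo urlencode($password) . "\n"; // Camilie%26

// http_build_query() encodes every value for you:
echo http_build_query(['user' => 'bob', 'pass' => $password]) . "\n";
// user=bob&pass=Camilie%26
```

On the receiving side, PHP decodes %26 back to & automatically, so $_GET['pass'] would contain the original value.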