Hi, I am trying to get info from some pages, but I am not having much luck. It seems the GET parameters are not getting passed to my cURL script to call the right page.

<?php
$start = microtime(true);
require_once('db_config.php');

$url = 'http://www.somedomain.com/links.php';
$type = 'free';
$page_num = 3;
$url = $url.'?type='.$type.'&page='.$page_num;

$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$get_page = curl_exec($ch);
curl_close($ch);

file_put_contents('page_'.$page_num.'.txt', $get_page);

$end = microtime(true);
$total = $end - $start;
echo '<br>'.$total;
?>
Code (markup):

When I echo the $url variable, it looks right on the screen, but maybe I am not quite understanding something about the cURL functions, and I can't just attach the GET parameters to the end of the URL. I need to do it this way, because I will need a loop to cycle through different parameter values. When I do type it all out and pass it to cURL in the $url variable, it works fine. I also do not get anything if I try to echo the $get_page variable right after the cURL commands. Any suggestions? Thanks, Michael
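(A side note on building the URL: rather than concatenating the query string by hand, PHP's http_build_query() assembles and URL-encodes it for you. A minimal sketch, reusing the placeholder host and parameter names from the snippet above:)

```php
<?php
// Sketch: building the GET URL with http_build_query() instead of manual
// concatenation, so each value is percent-encoded automatically.
// The host and parameter names mirror the post above and are placeholders.
$base   = 'http://www.somedomain.com/links.php';
$params = ['type' => 'free', 'page' => 3];
$url    = $base . '?' . http_build_query($params);
echo $url; // http://www.somedomain.com/links.php?type=free&page=3
```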
I tested it on my private server, and it seems to work ok. Are you sure that no POST variables need to be passed?
I am positive there are no POST parameters that need to be sent. I can do

$url = 'http://www.somesite.com/links.php?type=free&pagenumber=4';
Code (markup):

and it will go to that URL. Is it possible the script is moving too fast, and I need to put a sleep() at the end of each loop? Thanks, Michael
We won't be able to help you without the actual URL you are working with. Try adding curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1) and see if anything changes.
Okay, I will try that and post back with the results. I didn't know if I should post the actual URL I am working with or not, but if your idea doesn't work, I could post the complete script; maybe I am just missing something else. Thanks, Michael
Hi, That didn't help. It still only goes to the first page of the links I am trying to scrape from the site. Below is the complete code. Maybe it is something small I am just not catching; it has happened before.

<?php
$start = microtime(true);
set_time_limit(0);
ignore_user_abort();
require_once('db_config.php');

$site = '';
$add = '';
$url = 'http://www.onewaytextlink.com/links.php';
$type = 'free';
$page_num = 1;
$url = $url.'?type='.$type.'&page='.$page_num;

$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($ch, CURLOPT_TIMEOUT, 30);
$scraped = curl_exec($ch);
curl_close($ch);

file_put_contents('scraped_page_'.$page_num.'.txt', $scraped);

preg_match_all('/<a href="\/links.php\?type=free\&pagenum=(.*)">(\d)<\/a>/', $scraped, $pages, PREG_SET_ORDER);

$newArr = array();
foreach ($pages as $val) {
    $newArr[$val[2]] = $val;
}
$pages = array_values($newArr);
$page_count = count($pages);
$page_count++;

for ($row1 = 0; $row1 < $page_count; $row1++) {
    $url = $url.'?type='.$type.'&page='.$page_num;

    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
    curl_setopt($ch, CURLOPT_TIMEOUT, 30);
    $scraped = curl_exec($ch);
    curl_close($ch);

    file_put_contents('scraped_page_'.$page_num.'.txt', $scraped);

    preg_match_all('/url=(.*)" target="_blank">(.*)<\/a>/i', $scraped, $links, PREG_SET_ORDER);

    $rowcount = count($links);
    for ($row2 = 0; $row2 < $rowcount; $row2++) {
        unset($links[$row2][0]);
        if (!preg_match('/^http:\/\//i', $links[$row2][1])) {
            $links[$row2][1] = 'http://'.$links[$row2][1];
        }
        $links[$row2][1] = preg_replace('/%2f/i', '/', $links[$row2][1]);
        $urls = $links[$row2][1];
        $title = $links[$row2][2];

        $sql = "SELECT * FROM website_directory WHERE url='$urls'";
        $q = mysqli_query($link, $sql);
        if (mysqli_fetch_assoc($q) == 0) {
            $insert = "INSERT INTO website_directory (title, url) VALUES ('$title', '$urls')";
            $q = mysqli_query($link, $insert);
            $add++;
        }
        $site++;
    }
    $page_num++;
}

echo $add.' sites have been added to the database.<br>';
echo $site.' have been scanned.<br>';
$end = microtime(true);
$total = $end - $start;
echo '<br>'.$total;
?>
Code (markup):

Thanks for any assistance. Michael
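(One way to debug scrapes like this is to run the pagination regex against a hand-written sample of the target markup, with no network involved. A sketch, with invented sample HTML; note the non-greedy (.*?), which keeps a match from spilling across neighbouring links when several sit on one line:)

```php
<?php
// Sketch: sanity-checking the pagination regex offline against a small,
// invented sample of the kind of markup the site is assumed to emit.
$sample = '<a href="/links.php?type=free&pagenum=2">2</a>'
        . '<a href="/links.php?type=free&pagenum=3">3</a>';

// Non-greedy (.*?) so each link is matched separately.
preg_match_all('/<a href="\/links.php\?type=free\&pagenum=(.*?)">(\d)<\/a>/',
               $sample, $pages, PREG_SET_ORDER);

print_r($pages); // two matches expected, page numbers in groups 1 and 2
```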
There you go:

<?php
$start = microtime(true);
set_time_limit(0);
ignore_user_abort();
require_once('db_config.php');

$site = '';
$add = '';
$url = 'http://www.onewaytextlink.com/links.php';
$type = 'free';
$page_num = 1;
$url = $url.'?type='.$type.'&pagenum='.$page_num;

$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($ch, CURLOPT_TIMEOUT, 30);
$scraped = curl_exec($ch);
curl_close($ch);

file_put_contents('scraped_page_'.$page_num.'.txt', $scraped);

preg_match_all('/<a href="\/links.php\?type=free\&pagenum=(.*)">(\d)<\/a>/', $scraped, $pages, PREG_SET_ORDER);

$newArr = array();
foreach ($pages as $val) {
    $newArr[$val[2]] = $val;
}
print_r($pages);
$pages = array_values($newArr);
$page_count = count($pages);
$page_count++;

for ($row1 = 0; $row1 < $page_count; $row1++) {
    $url = $url.'?type='.$type.'&pagenum='.$page_num;

    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
    curl_setopt($ch, CURLOPT_TIMEOUT, 30);
    $scraped = curl_exec($ch);
    curl_close($ch);

    file_put_contents('scraped_page_'.$page_num.'.txt', $scraped);

    preg_match_all('/url=(.*)" target="_blank">(.*)<\/a>/i', $scraped, $links, PREG_SET_ORDER);

    $rowcount = count($links);
    for ($row2 = 0; $row2 < $rowcount; $row2++) {
        unset($links[$row2][0]);
        if (!preg_match('/^http:\/\//i', $links[$row2][1])) {
            $links[$row2][1] = 'http://'.$links[$row2][1];
        }
        $links[$row2][1] = preg_replace('/%2f/i', '/', $links[$row2][1]);
        $urls = $links[$row2][1];
        $title = $links[$row2][2];

        $sql = "SELECT * FROM website_directory WHERE url='$urls'";
        echo $sql."\n";
        $q = mysqli_query($link, $sql);
        if (mysqli_fetch_assoc($q) == 0) {
            $insert = "INSERT INTO website_directory (title, url) VALUES ('$title', '$urls')";
            $q = mysqli_query($link, $insert);
            $add++;
        }
        $site++;
    }
    $page_num++;
}

echo $add.' sites have been added to the database.<br>';
echo $site.' have been scanned.<br>';
$end = microtime(true);
$total = $end - $start;
echo '<br>'.$total;
?>
PHP:
The problem is fairly simple:

$url='http://www.onewaytextlink.com/links.php';
// url = http://www.onewaytextlink.com/links.php
PHP:

$url=$url.'?type='.$type.'&pagenum='.$page_num;
// url = http://www.onewaytextlink.com/links.php?type=free&pagenum=1
PHP:

And again:

$url=$url.'?type='.$type.'&pagenum='.$page_num;
// url = http://www.onewaytextlink.com/links.php?type=free&pagenum=1?type=free&pagenum=2
PHP:

The GET variables are appended after the GET variables from the previous request. Simple fix: keep the bare URL in $baseurl and rebuild $url from it each time.

<?php
$start = microtime(true);
set_time_limit(0);
ignore_user_abort();
require_once('db_config.php');

$site = '';
$add = '';
$baseurl = 'http://www.onewaytextlink.com/links.php';
$type = 'free';
$page_num = 1;
$url = $baseurl.'?type='.$type.'&pagenum='.$page_num;

$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($ch, CURLOPT_TIMEOUT, 30);
$scraped = curl_exec($ch);
curl_close($ch);

file_put_contents('scraped_page_'.$page_num.'.txt', $scraped);

preg_match_all('/<a href="\/links.php\?type=free\&pagenum=(.*)">(\d)<\/a>/', $scraped, $pages, PREG_SET_ORDER);

$newArr = array();
foreach ($pages as $val) {
    $newArr[$val[2]] = $val;
}
print_r($pages);
$pages = array_values($newArr);
$page_count = count($pages);
$page_count++;

for ($row1 = 0; $row1 < $page_count; $row1++) {
    $url = $baseurl.'?type='.$type.'&pagenum='.$page_num;

    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
    curl_setopt($ch, CURLOPT_TIMEOUT, 30);
    $scraped = curl_exec($ch);
    curl_close($ch);

    file_put_contents('scraped_page_'.$page_num.'.txt', $scraped);

    preg_match_all('/url=(.*)" target="_blank">(.*)<\/a>/i', $scraped, $links, PREG_SET_ORDER);

    $rowcount = count($links);
    for ($row2 = 0; $row2 < $rowcount; $row2++) {
        unset($links[$row2][0]);
        if (!preg_match('/^http:\/\//i', $links[$row2][1])) {
            $links[$row2][1] = 'http://'.$links[$row2][1];
        }
        $links[$row2][1] = preg_replace('/%2f/i', '/', $links[$row2][1]);
        $urls = $links[$row2][1];
        $title = $links[$row2][2];

        $sql = "SELECT * FROM website_directory WHERE url='$urls'";
        echo $sql."\n";
        $q = mysqli_query($link, $sql);
        if (mysqli_fetch_assoc($q) == 0) {
            $insert = "INSERT INTO website_directory (title, url) VALUES ('$title', '$urls')";
            $q = mysqli_query($link, $insert);
            $add++;
        }
        $site++;
    }
    $page_num++;
}

echo $add.' sites have been added to the database.<br>';
echo $site.' have been scanned.<br>';
$end = microtime(true);
$total = $end - $start;
echo '<br>'.$total;
PHP:
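(The append bug is easy to reproduce without cURL or any network access. A minimal sketch contrasting the two patterns, using the thread's URLs purely as strings:)

```php
<?php
// Sketch of the bug described above: appending to $url inside the loop
// stacks query strings, while rebuilding from an untouched base does not.
$baseurl = 'http://www.onewaytextlink.com/links.php';

// Buggy pattern: $url still carries the previous iteration's query string.
$url = $baseurl;
for ($page_num = 1; $page_num <= 2; $page_num++) {
    $url = $url . '?type=free&pagenum=' . $page_num;
}
echo $url . "\n"; // ...links.php?type=free&pagenum=1?type=free&pagenum=2

// Fixed pattern: always start from the bare base URL.
for ($page_num = 1; $page_num <= 2; $page_num++) {
    $url = $baseurl . '?type=free&pagenum=' . $page_num;
}
echo $url . "\n"; // ...links.php?type=free&pagenum=2
```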
Hi, Thanks to both of you for finding the problem. It took me a while to figure out what nont had changed, but after I got it, well, it just shows how something so simple can be overlooked. Thanks, ssmm987, for figuring out that the GET parameters were being appended after the previous GET parameters. I should have caught that mistake too. Works great now, thanks again. Michael
If I put a special character in a GET parameter to authenticate a user, it does not work. Example password: Camilie& — the & in the GET URL causes a big problem. Sorry for my English, and thanks for any reply.
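(A literal & in a GET value splits the query string, because & separates parameters. The value has to be percent-encoded (& becomes %26), e.g. with urlencode() or http_build_query(). A minimal sketch; the parameter names here are just placeholders:)

```php
<?php
// Sketch: a raw & in a GET value would start a new parameter, so the
// value must be percent-encoded before it goes into the URL.
$password = 'Camilie&';
echo urlencode($password) . "\n"; // Camilie%26

// http_build_query() encodes every value for you:
echo http_build_query(['user' => 'bob', 'pass' => $password]) . "\n";
// user=bob&pass=Camilie%26
```

On the receiving side, PHP decodes %26 back to & automatically, so $_GET['pass'] would contain the original value.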