Google indexed pages checker
terryuk
Dec 3rd 2007, 2:28 pm
Hey,
I'm having trouble with a bit of PHP code I'm trying to use to return the number of indexed pages a website has...
Here's the code:
<?php $data = implode('', file("http://www.google.com/search?q=site:www.$name"));
preg_match_all("|Results <b>[0-9]+</b> - <b>[0-9]+</b> of [a-z ]*<b>([0-9]*)</b>|U",
$data,
$out, PREG_PATTERN_ORDER);
$results = intval($out[1][0]);
$nowww["google"] = $results; echo($nowww[google]);?>
Anyone got ideas? It just keeps coming out as 0 :o
hogan_h
Dec 3rd 2007, 3:05 pm
Try this:
<?php $data = implode('', file("http://www.google.com/search?q=site:www.$name"));
preg_match_all("|Results <b>[0-9]+</b> - <b>[0-9]+</b> of [a-z ]*<b>([0-9,]*)</b>|U",
$data,
$out, PREG_PATTERN_ORDER);
$results = intval(str_replace(",","",$out[1][0]));
$nowww["google"] = $results;
echo($nowww["google"]); ?>
Explanation:
For some sites I got results that contained "," inside the number, so you need to account for that too. Maybe it's a localization issue and you need to consider "." as well, but adding "," to the pattern and then stripping it afterwards worked for me.
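The comma handling described above can be sketched as a small helper. This is only an illustration: parse_indexed_count() is a hypothetical name, not something from the thread, and the pattern assumes the "Results <b>1</b> - <b>10</b> of about <b>N</b>" markup that the scripts here rely on. It accepts both "," and "." as thousands separators and strips them before converting.

```php
<?php
// Hypothetical helper (not from the thread): pull the total out of a
// results line such as "Results <b>1</b> - <b>10</b> of about <b>31,200</b>".
function parse_indexed_count(string $html): ?int
{
    // Allow digits plus "," and "." inside the last bold number.
    $pattern = '|Results <b>[0-9]+</b> - <b>[0-9]+</b> of [a-z ]*<b>([0-9.,]+)</b>|';
    if (preg_match($pattern, $html, $m) !== 1) {
        return null; // no match: the markup differs or the request was blocked
    }
    // Strip both kinds of thousands separator before converting.
    return (int) str_replace([',', '.'], '', $m[1]);
}
```

Returning null instead of 0 when nothing matches makes a failed match distinguishable from a site that really has zero indexed pages; intval() on a missing match quietly yields 0.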
terryuk
Dec 3rd 2007, 3:33 pm
Thanks for the reply, but it's just returning 0 for me :\
hogan_h
Dec 3rd 2007, 3:40 pm
What site are you trying?
When I use, for example, $name="cnn.com" with the modified script from above, I get:
291000
terryuk
Dec 3rd 2007, 3:50 pm
Well, I just tried it with cnn.com too, but it comes up with 0.
hogan_h
Dec 3rd 2007, 3:59 pm
Ok, then we need to debug it :)
Take this and let me know what results you get:
<?php
$name="spiegel.com";
$data = implode('', file("http://www.google.com/search?q=site:www.$name"));
echo "<pre>";
print_r($data);
echo "</pre>";
echo "<hr />";
preg_match_all("|Results <b>[0-9]+</b> - <b>[0-9]+</b> of [a-z]* <b>([0-9,]*)</b>|U",
$data,
$out, PREG_PATTERN_ORDER);
$results = intval(str_replace(",","",$out[1][0]));
echo "<pre>";
print_r($out);
echo "</pre>";
$nowww["google"] = $results; echo($nowww['google']);
?>
I get the results page, and the string of interest is this:
Results 1 - 10 of about 31,200 from www.spiegel.com. (0.03 seconds)
And then at the bottom I get:
Array
(
    [0] => Array
        (
            [0] => Results 1 - 10 of about 31,200
        )
    [1] => Array
        (
            [0] => 31,200
        )
)
31200
What do you get with this same script ("string of interest" + Array content)?
terryuk
Dec 3rd 2007, 4:12 pm
It seemed like it was about to work, but I think Google may limit the number of remote queries per day, as I just got a 403 Forbidden page.
hogan_h
Dec 3rd 2007, 4:17 pm
If you have another server, try it there...
Otherwise, if you are testing locally and have a dynamic IP, you could try reconnecting...
I find it rather strange that you are getting a 403 error...
terryuk
Dec 4th 2007, 1:40 am
I'm getting this message:
We're sorry...
... but your query looks similar to automated requests from a computer virus or spyware application. To protect our users, we can't process your request right now.
hogan_h
Dec 4th 2007, 1:47 am
I'm sorry, man... The code is correct: the same code works for me locally, and when I tested it on an external server yesterday, it worked there too. If you sent too many queries, it's possible your server got "flagged". Maybe you could try a proxy site + cURL combo...
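A rough sketch of the cURL idea, assuming the curl extension is available. fetch_page() and looks_blocked() are hypothetical names, the proxy address is whatever "host:port" you have access to, and the block check only covers the two symptoms reported in this thread (an HTTP error status such as 403, and the "automated requests" notice).

```php
<?php
// Hypothetical helper: fetch a URL with cURL, optionally through a proxy.
// Returns [HTTP status, body]; the body is '' if the request failed.
function fetch_page(string $url, ?string $proxy = null): array
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return body as a string
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);
    if ($proxy !== null) {
        curl_setopt($ch, CURLOPT_PROXY, $proxy);    // e.g. "host:port"
    }
    $body = curl_exec($ch);
    $status = (int) curl_getinfo($ch, CURLINFO_HTTP_CODE);
    return [$status, $body === false ? '' : $body];
}

// Hypothetical helper: does the response look like a refusal rather than results?
function looks_blocked(int $status, string $body): bool
{
    if ($status >= 400) {
        return true; // e.g. the 403 Forbidden page
    }
    // The "We're sorry..." page mentions automated requests.
    return stripos($body, 'automated requests') !== false;
}
```

Checking looks_blocked() before running the regular expression keeps a refused request from being reported as 0 indexed pages.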
terryuk
Dec 4th 2007, 2:33 am
Thanks for your help, hogan_h. I just found a solution using the Google API (http://www.useseo.com/google-api-demo.php)
hogan_h
Dec 4th 2007, 3:24 am
You're welcome ;)
Just for your information: if you only use the Google API casually, you will be fine, but it has a limited number of daily queries (1000). If that becomes a problem for you, take a look at the Google Ajax API.
http://code.google.com/apis/soapsearch/api_faq.html#gen12
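One way to stay under a daily limit like that is to count queries on your own side before sending them. The class below is a minimal in-memory sketch: DailyQuota is a hypothetical name, the default of 1000 is taken from the figure above, and a real script would need to persist the counter (file or database) between requests.

```php
<?php
// Hypothetical in-memory counter for a per-day query budget.
final class DailyQuota
{
    private int $limit;
    private string $day = '';
    private int $used = 0;

    public function __construct(int $limit = 1000)
    {
        $this->limit = $limit;
    }

    // Counts one query against $today (e.g. date('Y-m-d')) and returns true,
    // or returns false without counting if the budget is used up.
    public function tryConsume(string $today): bool
    {
        if ($today !== $this->day) { // first query of a new day: reset
            $this->day = $today;
            $this->used = 0;
        }
        if ($this->used >= $this->limit) {
            return false;
        }
        $this->used++;
        return true;
    }
}
```

Usage would be along the lines of: create one DailyQuota, call tryConsume(date('Y-m-d')) before each API request, and skip or defer the request when it returns false.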
ognos
Dec 4th 2007, 9:35 am
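Here's my code for this:
$fetch_url = "http://www.google.pl/search?hl=pl&q=site:".$site."&btnG=Szukaj&lr=";
// Fetch the page as text. include_once() on a remote URL would execute
// the response as PHP code, so file_get_contents() is used instead.
$page = file_get_contents($fetch_url);
// Drop thousands separators so each number is a single run of digits.
$page = str_replace(',','',$page);
preg_match_all('/<b>(\d+)/', $page, $wynik );
// Third bold number on the results page; group 1 holds the digits only.
echo $wynik[1][2];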