
How to check with PHP if a string is found in Google's index?

Discussion in 'PHP' started by KISS, Dec 15, 2009.

  1. KISS

    #1
    Is it possible to check with PHP whether a string is found in Google? In other words, whether the result count is zero, false, or something like that. I am thinking about the Google AJAX Search API, but I can't find anything about this in the docs.
     
    KISS, Dec 15, 2009 IP
  2. AlexKey (Peon)

    #2
    The Google AJAX Search API won't help here, because it is client-side only and only meant for embedding a live Google search into your web page.

    You can use the PHP cURL extension to do a GET request to the Google search page and parse the result. But this is a "black hat" way :)
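    A rough sketch of that cURL request; the user-agent string is my own illustrative addition, not something from this post:

    <?php
    // Sketch: fetch a Google results page with cURL for later parsing.
    $query = 'example search term';
    $ch = curl_init('http://www.google.com/search?hl=en&q=' . urlencode($query));
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);     // return the body as a string
    curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0'); // Google tends to reject empty user agents
    $html = curl_exec($ch);
    curl_close($ch);
    // $html now holds the raw results page, ready to be parsed
    PHP: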
     
    AlexKey, Dec 15, 2009 IP
  3. taminder (Peon)

    #3
    Scrape the results, and if you find "did not match any documents." on the page, that means you don't have any results.
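    Assuming the page HTML is already in a variable (say $html, fetched as in the cURL sketch above), the check can be a single strpos call:

    // true when Google reported zero results for the query
    $noResults = (strpos($html, 'did not match any documents.') !== false);
    PHP: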
     
    taminder, Dec 15, 2009 IP
  4. KISS (Active Member)

    #4
    Thanks. I found PHP code using cURL that returns the source code of a given URL. This works for Yahoo and Bing, but not for Google. Well, it works partially: it does not return the results of the search, just some code I don't need. I think I must have some sort of key from Google to access them with cURL. Is anyone familiar with this?
     
    KISS, Dec 17, 2009 IP
  5. juust (Peon)

    #5
    A Google search SOAP key (for their discontinued search web service) is limited to 1000 requests; the keys are scarce and rather expensive.

    The alternative is scraping. I would not encourage anyone to scrape Google SERPs, or to use elite proxies if you scrape more than 2000 pages a day, but if you did, you could do a simple strpos on 'did not match'. Google stuffs the 'did not match' remark in JavaScript in the head section, so you should check for it in the body section only, like this:

    
    <?php
    // default: assume the term is not in Google's index
    $result = " is not present";
    
    $query = 'jughihiust';
    
    // fetch the results page ('goooogle' is deliberately misspelled here;
    // use the real domain if you actually run this)
    $haystack = file_get_contents('http://www.goooogle.com/search?hl=en&q=' . urlencode($query));
    $needle = 'did not match';
    
    // start searching after 'body', since Google also puts the
    // 'did not match' string in JavaScript inside the head section
    if (strpos($haystack, $needle, strpos($haystack, 'body')) === false) {
        $result = " is present";
    }
    
    echo $query . $result;
    PHP:
    Some would find the number of results more interesting; that is the first string after the first occurrence of 'of about ':

    
    <?php
    $query = 'google';
    
    // strip_tags flattens the page so the result count appears as plain text
    $haystack = strip_tags(file_get_contents('http://www.goooogle.com/search?hl=en&q=' . urlencode($query)));
    $needle = 'of about ';
    
    if (strpos($haystack, $needle) === false) {
        $result = "no";
    } else {
        // the count is the first word after 'of about ' (9 characters long)
        $start = strpos($haystack, $needle) + 9;
        $end = strpos($haystack, ' ', $start);
        $result = substr($haystack, $start, $end - $start);
    }
    
    echo $query . ': ' . $result . ' results';
    PHP:
    something like that.
     
    juust, Dec 17, 2009 IP
  6. KISS (Active Member)

    #6
    Wow, thanks juust! It works.
    I am interested in the first code you wrote. Is this forbidden or something? If I use it many times a day, will they ban the IP or penalize my site?
    And is there a reason you wrote goooogle.com instead of google.com? Because with goooogle.com it does not work.

    EDIT: Yes, it seems this is forbidden :( So they can scrape ALL the information from billions of sites without permission, even copyrighted material, but nobody can scrape them. Wow, what a policy; it is great (for them).
     
    Last edited: Dec 17, 2009
    KISS, Dec 17, 2009 IP
  7. ghprod (Active Member)

    #7
    hahhahha .... they're GOD on the internet :p

    But I've seen a Flash app built on Google AJAX Search ... maybe that can be useful for you :)
     
    ghprod, Dec 18, 2009 IP
  8. juust (Peon)

    #8
    Google mentions automated queries as a breach of their T.o.S., but the sanctions are mild.

    IP bans:
    If your PC (or the webhost you run code on) requests 2000 pages in a short time, Google temporarily bans the IP from accessing their servers. They lift the ban after a while and do not penalize the site. Most scrapers use elite proxies, so Google cannot trace the origin of the scraper request. Some use Tor/Vidalia when working from a desktop install.

    PR penalties:
    If you put automated queries (like http://www.goooogle.com/?q=term) on a webpage, every time a crawler enters the page it triggers the query on Google. That messes up their stats, including the AdWords stats. If Google catches you listing a lot of search URLs, they can reduce the site's PageRank to 0, but they do not deindex the site for that.
     
    Last edited: Dec 19, 2009
    juust, Dec 19, 2009 IP
  9. KISS (Active Member)

    #9
    Well, in my case the queries won't be triggered by bots. The idea is to check whether a string is in Google's index after someone submits a form with a captcha to keep bots out.
    So if I don't use a proxy, how many queries can I do: 1000 a day, or more?
    And if I do use a proxy, could you explain a little how it is done, and whether it is forbidden by hosting companies or anything else I should know about?
    Thanks!
     
    KISS, Dec 20, 2009 IP
  10. juust (Peon)

    #10
    I used to take 2000 queries a day as the limit and keep it below 1500.

    Using PHP cURL to grab pages, you can specify a proxy IP and port, and it will route the request over the proxy (see the sketch below). I scraped my proxy lists off Samair, which usually gave me 10 to 20 proxies. I found it slow and unreliable: proxies can take 3 to 5 seconds to return a page, since the proxy owners replace ads with their own ads and run on shared hosting accounts.
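    A minimal sketch of the proxy routing with cURL; the proxy address and port below are placeholders, not a real proxy:

    <?php
    // Sketch: route the search request over an HTTP proxy with cURL.
    $query = 'example term';
    $ch = curl_init('http://www.google.com/search?hl=en&q=' . urlencode($query));
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_PROXY, '127.0.0.1');  // proxy IP (placeholder)
    curl_setopt($ch, CURLOPT_PROXYPORT, 8080);     // proxy port (placeholder)
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);         // free proxies are slow; cap the wait
    $html = curl_exec($ch);
    curl_close($ch);
    PHP: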

    So I built a tiny script (like the one above) to grab pages off Google and echo the result back to the main server. I put copies on some free hosted accounts and routed requests over them; that saved 3 hours per 5000 requests (under 1 second per request) and was more reliable.
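    A hedged sketch of what such a relay could look like; the file name relay.php and the q parameter are hypothetical, not from this post:

    <?php
    // relay.php - hypothetical remote grabber, placed on a free hosted account.
    // The main server calls e.g. http://freehost.example/relay.php?q=term
    // and gets the raw Google results page echoed back.
    $query = isset($_GET['q']) ? $_GET['q'] : '';
    if ($query !== '') {
        echo file_get_contents('http://www.google.com/search?hl=en&q=' . urlencode($query));
    }
    PHP: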

    Hosts can consider that abuse of their server resources; it depends on their T.o.S. Some allow proxies, and then it's okay.
     
    juust, Dec 20, 2009 IP