MSN is crashing my PHP script - need help with debugging

Discussion in 'PHP' started by Christian Little, Mar 5, 2008.

  1. #1
    I'm building an SEO script to check pagerank, backlinks, and various other items for any given URL. For the most part it's coming together nicely, but I've hit a wall.

    First off, this is the script: http://christianlittle.com/dev/PR/

    Right now if you enter a domain, it will pull up the pagerank and backlinks in the big 3 engines for you.

    This is the code for the function that pulls the data:

    
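    function webpageRegex($url, $match) {
        // Fetch the page at $url and return the first capture group of $match.
        $site = fopen($url, 'r');
        if ($site === false) {
            return null; // the URL could not be opened
        }
        $total = ''; // initialize the buffer before appending to it
        while ($cont = fread($site, 1024657)) {
            $total .= $cont;
        }
        fclose($site);
        if (preg_match($match, $total, $matches)) {
            return $matches[1];
        }
        return null; // no match found
    }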
    
    You pass in $url as any URL, and $match is a regular expression that the function searches the fetched page for. I've set up a list of URLs and regex patterns in a database.

    Here are the three domains that are checked (I'm adding more soon, but I want to get the big 3 working first):

    Google Backlinks
    $url = "http://www.google.com/search?q=link%3A##URL##";
    $match = "/of about <b>(.*)<\/b> linking to/Us";

    Yahoo Backlinks
    $url = "http://siteexplorer.search.yahoo.com/search?p=http%3A%2F%2F##URL##&bwm=i&bwmf=u&bwms=p&fr2=seo-rd-se";
    $match = "/of about <strong>(.*)<\/strong>/Us";

    MSN Backlinks
    $url = "http://search.msn.com/results.aspx?q=link%3A+##URL##";
    $match = "/of (.*)<\/span>/Us";

    Before webpageRegex is called, $url is parsed through a simple ereg_replace call to change ##URL## into the domain that is entered in the web form.
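    The placeholder substitution described above can be sketched roughly as follows. This is only an illustration: the $engines array and the buildSearchUrl helper are hypothetical names (the thread stores the templates in a database), and str_replace stands in for ereg_replace, which was removed in PHP 7.

```php
<?php
// Hypothetical engine list; the thread keeps these rows in a database.
$engines = array(
    'Google' => 'http://www.google.com/search?q=link%3A##URL##',
    'MSN'    => 'http://search.msn.com/results.aspx?q=link%3A+##URL##',
);

// Hypothetical helper: swap the ##URL## placeholder for the entered domain.
// urlencode() keeps characters such as "/" or "?" from breaking the query string.
function buildSearchUrl($template, $domain) {
    return str_replace('##URL##', urlencode($domain), $template);
}

// Print the final search URL for each engine.
foreach ($engines as $name => $template) {
    echo $name . ': ' . buildSearchUrl($template, 'example.com') . "\n";
}
```

    Since the placeholder is a fixed string, a plain string replace is enough here and no regular expression is needed for this step.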

    Pretty straightforward, and if you try out the link above you'll see that this works nicely.

    However, MSN seems to have an infrequent problem, which is making it a real pain to figure out. About 20% of the time, instead of getting the backlinks from MSN, my results page gets the following chunk of code from MSN:

    Since this script works about 75-80% of the time with MSN, I'm assuming it's something on their end, maybe some bad JavaScript or something else on their pages. But can anybody help me figure this out? Is there something wrong with my code, or is this strictly a problem on msn.com?

    Again, this works perfectly fine for Yahoo and Google thankfully. It's just MSN, so I'm wondering if my regex pattern is bad. If it is my regex pattern, can somebody help me refine it to get rid of this problem?

    Thanks :)
     
    Christian Little, Mar 5, 2008
  2. Christian Little

    #2
    I figured it out, here's what happened in case anybody else has this problem.

    When you do a backlink search on MSN, the result text is very different if there are no links (whereas Google and Yahoo just say 0 links). What I found was that when the page returned results, the matched text was 0 words long (it was just a number), but when MSN couldn't find backlinks, the match was a chunk of error code about 400 words long.

    So I added a check on the number of words in the resulting match. If the count is 0, the match is a plain number, so the function returns it. If the count is 1 or higher, the function returns 0 links found, and this seems to work.
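    To see why the word count separates the two cases, here is a small sketch. The helper name looksLikeCount and the sample strings are made up for illustration. str_word_count only counts runs of letters (plus apostrophes and hyphens), so a string of digits and commas has zero words.

```php
<?php
// Hypothetical helper: true when the captured text contains no words,
// which is what a bare result count such as "1,230" looks like.
function looksLikeCount($captured) {
    return str_word_count($captured) === 0;
}

var_dump(looksLikeCount('1,230'));                       // digits and commas only
var_dump(looksLikeCount('We did not find any results')); // ordinary words
```

    Note that an empty string also has zero words, so the capture still needs to be checked for emptiness separately.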

    Here's the code if anybody wants it:
    
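    function webpageRegex($url, $match) {
        // Pulls specific data from the first page of a search engine results page
        $site = fopen($url, 'r');
        if ($site === false) {
            return 0; // the URL could not be opened
        }
        $total = ''; // initialize the buffer before appending to it
        while ($cont = fread($site, 1024657)) {
            $total .= $cont;
        }
        fclose($site);
        preg_match($match, $total, $matches);
        if (empty($matches[1])) {
            return 0; // nothing was captured
        }
        if (str_word_count($matches[1])) {
            return 0; // captured text contains words, so it is not a count
        }
        return $matches[1];
    }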
    
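    A possible alternative to the word-count check is to tighten the pattern itself so that it can only capture digits and commas. This is a sketch under assumptions: the sample HTML strings are invented, the real MSN markup may differ, and extractCount is a hypothetical helper name.

```php
<?php
// Hypothetical helper: capture only digits and commas between "of " and </span>,
// so a long block of text can never be mistaken for a result count.
function extractCount($html) {
    if (preg_match('/of ([\d,]+)<\/span>/', $html, $matches)) {
        // Strip thousands separators and return the count as an integer.
        return (int) str_replace(',', '', $matches[1]);
    }
    return 0; // no count on the page
}

// Invented sample markup, for illustration only.
echo extractCount('<span>Page 1 of 1,230</span>') . "\n";
echo extractCount('<span>A lot of words and no count</span>') . "\n";
```

    With (.*) and the U modifier, the original pattern accepts any text between "of " and the next closing span tag, which is how a long chunk of page code could end up in the match. Restricting the character class avoids that without a second check.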
     
    Christian Little, Mar 5, 2008