How to Scrape Specific Information from a Website using PHP

Discussion in 'PHP' started by WPC, Oct 31, 2011.

  1. #1


    First, open Notepad2 or whatever editor or IDE you use. I chose Notepad2 because it's small and I prefer to hand-code everything myself.
    Second, we need a website to scrape information from, so for this tutorial's purposes we will use "http://new.whois.net/".


    This website displays all the whois information for a given domain, e.g. "http://new.whois.net/domainhere.com".


    So here is what we need to start off with:


    
    <?php
    /**
    * @author Vick@CoderzSpot.com
    * @website www.CoderzSpot.com
    * @version 1.0
    * @date(25-10-2011)
    * @contact vick@hotmail.com.au
    **/
    
    
    ?>
    
    Code (markup):

    That's pretty much a template I tend to use, but you get the point: it's there so that other people who use your script know who to contact for support.


    Next, here is how we write a function that uses a regex to scrape the "Registrar" of the domain.


    
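        function GetRegistrar($url)
        {
            // $url is a bare domain name such as "coderzspot.com".
            $output = file_get_contents("http://new.whois.net/whois/$url");
            if ($output === false) {
                echo "Could not fetch whois data for $url";
                return;
            }
            $regex = '/Registrar: (.+?) Whois/';
            // $match[1] holds the text captured by (.+?).
            if (preg_match($regex, $output, $match)) {
                echo "Registrar: ".$match[1];
            }
        }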
    
    Code (markup):

    In the code above, we pass the function GetRegistrar($url) the $url variable, which holds the domain to look up, for example "coderzspot.com". The function requests the whois page for coderzspot.com so it can read its registrar information.
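    As a rough sketch of the "pass in a domain" step, the URL building and the download could be pulled into their own small helpers. The names buildWhoisUrl() and fetchWhoisPage() and the 10 second timeout are my own assumptions, not part of the original script, and the code assumes PHP 7.1 or later. The URL pattern is the one used in GetRegistrar() above.

```php
<?php
// Build the lookup URL for a bare domain name.
// buildWhoisUrl() and fetchWhoisPage() are hypothetical helpers; the URL
// pattern is copied from the GetRegistrar() function in the post.
function buildWhoisUrl(string $domain): string
{
    // Trim whitespace, lowercase, and escape characters that are not URL-safe.
    return 'http://new.whois.net/whois/' . rawurlencode(strtolower(trim($domain)));
}

// Download the whois page, returning null instead of false on failure.
function fetchWhoisPage(string $domain): ?string
{
    // Assumed timeout: give up after 10 seconds instead of hanging.
    $context = stream_context_create(['http' => ['timeout' => 10]]);
    $html = @file_get_contents(buildWhoisUrl($domain), false, $context);
    return $html === false ? null : $html;
}

// Only the URL is printed here, so this line makes no network request.
echo buildWhoisUrl(' CoderzSpot.com '), "\n";
```

    Keeping the download separate from the parsing makes it easier to try the regex on saved text without hitting the site every time.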


    Then we use a regex to find specific information, HTML tags or text. The (.+?) is a capture group that represents what we are looking for, and on both sides of it should be the tags or text that surround what we want. For example, in "dog rabbit rat", if we want rabbit, we put dog and rat on either side because we want the middle one. Likewise, in <body>text</body> we want text, so we need to tell PHP what is on each side of it.
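    The "dog rabbit rat" idea can be tried out directly. This is only a minimal sketch, assuming PHP 7.1 or later; captureBetween() is a made-up helper name, not something from the original script. It builds the pattern from the text on each side and returns whatever (.+?) captured.

```php
<?php
// Return the text found between two literal markers, or null if absent.
// captureBetween() is a hypothetical helper, not part of the original post.
function captureBetween(string $left, string $right, string $subject): ?string
{
    // preg_quote() escapes regex metacharacters in the markers;
    // its second argument also escapes the "/" delimiter.
    // The "s" modifier lets "." match newlines as well.
    $regex = '/' . preg_quote($left, '/') . '(.+?)' . preg_quote($right, '/') . '/s';
    if (preg_match($regex, $subject, $match) === 1) {
        return $match[1];
    }
    return null;
}

echo captureBetween('dog ', ' rat', 'dog rabbit rat'), "\n";
echo captureBetween('<body>', '</body>', '<body>text</body>'), "\n";
```

    The ? in (.+?) makes the group lazy, so it stops at the first place the right-hand text appears rather than the last.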


    With preg_match we pass the regex, the page contents ($output), and $match, which is the array where the results will be stored. Using "var_dump($match)" shows the complete array, so you can see which index to print. We print $match[1] because it holds the text captured by (.+?), which is the information we need. This would print out ENOM, INC. or similar.
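    To see what ends up in $match without touching the network, the same regex can be run against a saved string. This is a sketch under assumptions: extractRegistrar() is a hypothetical name, the code assumes PHP 7.1 or later, and the sample text is invented to fit the pattern rather than copied from the real site.

```php
<?php
// Parse the registrar out of already-fetched whois text.
// extractRegistrar() is a hypothetical name; it returns the value
// instead of echoing it so the caller decides what to do with it.
function extractRegistrar(string $whoisText): ?string
{
    $regex = '/Registrar: (.+?) Whois/';
    // preg_match() returns 1 on a match, 0 on no match, false on error.
    if (preg_match($regex, $whoisText, $match) === 1) {
        // $match[0] is the whole matched text, $match[1] the first capture group.
        return $match[1];
    }
    return null;
}

// Made-up sample text, not real output from the whois site.
$sample = 'Domain: example.com Registrar: EXAMPLE REGISTRAR Whois Server: whois.example.com';

preg_match('/Registrar: (.+?) Whois/', $sample, $match);
var_dump($match); // index 0: full match, index 1: captured registrar name
```

    If the page layout changes and the pattern no longer matches, extractRegistrar() returns null, which is easier to check for than an undefined $match[1].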


    Thanks for reading.


     
    WPC, Oct 31, 2011 IP
  2. ahsan11223

    #2
    Hello friend, you commented on my design.

    I was reading this thread of yours. Why is it dim? Or maybe it's a problem with my browser. LOL, it's about programming. I think you are good at programming. I am going to send you a PM. Thanks.


     
    ahsan11223, Nov 2, 2011 IP
  3. WPC

    #3
    Hi,

    Yeah, Firefox screwed up while posting this. I switched to Chrome instantly.
     
    WPC, Nov 2, 2011 IP